Microsoft has decided to open source the code behind ONE, the system that emulates the global network powering the Azure cloud to help reduce network outages.
What is ONE or the Open Network Emulator?
The system is called the Open Network Emulator (or ONE for short). It works by emulating in software all the devices, both software and hardware, that constitute a network, together with the ways in which they are interconnected. It runs in Docker containers and VMs and, critically, allows network engineers to test changes before they are deployed to production.
When changes are made on the live network, it is imperative that they do not introduce major errors or cause downtime. Tools such as ONE, which help engineers evaluate the potential impact of a change before deploying it live, are extremely important because even a simple DevOps error can cause network downtime. At the scale of Microsoft Azure, such an error could potentially bring down an entire region. If the network is broken, packet loss ensues, followed by downtime that could potentially impact millions of people.
Cloud networks are also constantly changing. They are enormous, highly complex production networks that interconnect hundreds of thousands of servers and tens of thousands of network devices, which are sourced from multiple vendors and deployed worldwide with precise requirements for reliability, security and performance. The pulse of these networks needs to be monitored continually in order to detect anomalies and faults and to drive recovery within milliseconds. In essence, networks are the cloud: they are the core infrastructure that enables cloud services and helps to deliver availability across other fundamental services, from compute to storage.
Until recently, few tools have been available to help cloud providers anticipate the impact of planned updates and changes, or the consequences of a bug in the system. Microsoft therefore decided to build a virtual copy of the entire Azure network infrastructure, i.e. a large-scale, high-fidelity network emulator. This allows engineers to validate planned changes ahead of time, assess the impact of updates and explore different failure scenarios.
Victor Bahl, director of mobility and networking at Microsoft Research, described the way in which the emulator works in a recent podcast, published by Microsoft Research.
“The way the emulator works is that now, networking engineers and operators, when they do changes, they make the change, but they actually don’t even know if they’re making the change to the network. They actually are changing the emulator. Because it mimics the network underneath so amazingly that you can’t tell the difference. So, once you make the changes, the emulator will then try them out and make sure everything is good. Once everything is good, it’s going to go and put in on the network below, and voila, if we did the work right, it should all work. Now, this is how we achieve availability, and it’s been phenomenal.” Bahl also explained why Microsoft had decided to open source the technology: “And so now, we have decided that this is such an important resource for everybody that just hoarding it ourself is not the right thing to do. So, we are making it available to the entire community.”
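The workflow Bahl describes (apply a change to the emulated copy, check it, and only then touch the real network) can be illustrated with a small Python sketch. This is not ONE's code or API: the function names `reachable` and `validate_then_deploy`, the link-set model of a network and the reachability check are all hypothetical simplifications chosen for illustration.

```python
from collections import deque


def reachable(links, src, dst):
    """Breadth-first search over a set of undirected (a, b) links."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def validate_then_deploy(production_links, change, must_reach):
    """Try a change on a copy first; keep production untouched if a check fails.

    `change` is a hypothetical callable that mutates a set of links.
    `must_reach` lists (src, dst) pairs that must stay connected.
    Returns (accepted, resulting_links).
    """
    emulated = set(production_links)  # the "emulator": a copy, not the live network
    change(emulated)
    for src, dst in must_reach:
        if not reachable(emulated, src, dst):
            return False, set(production_links)  # reject: production is unchanged
    return True, emulated  # accept: the validated state is what gets deployed


# Hypothetical four-link topology used only for this example.
production = {
    ("tor1", "spine1"), ("tor1", "spine2"),
    ("spine1", "core"), ("spine2", "core"),
}

# Removing one uplink leaves a redundant path, so the change is accepted.
ok_safe, after_safe = validate_then_deploy(
    production, lambda net: net.discard(("tor1", "spine1")), [("tor1", "core")]
)

# Removing both spine-to-core links isolates tor1, so the change is rejected.
def drop_core_links(net):
    net.discard(("spine1", "core"))
    net.discard(("spine2", "core"))

ok_bad, after_bad = validate_then_deploy(production, drop_core_links, [("tor1", "core")])
```

A real emulator runs actual device software rather than a graph search, but the control flow is the same: the live network only changes after the emulated one has passed its checks.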
Why is Microsoft Open Sourcing ONE?
By opening up public access to ONE, Microsoft intends both to help its large enterprise customers improve uptime in their own networks and to offer students, as well as academic and industrial researchers, a useful tool for simulating the kinds of hyperscale networks that big tech players like Microsoft, Amazon and Google run on (without having actual access to the networks themselves). According to Microsoft, it will also provide networking product vendors with the opportunity to test new control-plane software at scale.
Microsoft first confirmed plans to open source ONE at the Sigcomm Conference back in June.
“Our network is large, heterogeneous, complex and undergoes constant churns. In such an environment, even small issues triggered by device failures, buggy device software, configuration errors, unproven management tools and unavoidable human errors can quickly cause large outages,” Microsoft researchers explained in the description of ONE the company submitted for Sigcomm. “Therefore, the ability to validate the impact of every planned change in a realistic setting, before the change is deployed in production, is crucial to maintaining and improving the reliability of our network”.
However, the open sourcing was officially announced earlier this month at the Microsoft Research Faculty Summit 2018, where it was hailed as “the big announcement” of the event. At the summit, current work and results from Microsoft’s product and research groups are presented to other researchers and leaders in the broad systems research area of computer science. It has not been disclosed when the open sourcing will take place.
CrystalNet First Revealed to the Public
The system was first revealed to the public last November. At that time, it was called CrystalNet – named for a fortuneteller’s crystal ball – as the system was built to reveal the network’s future.
The construction of CrystalNet was a collaboration between Microsoft Azure and Microsoft Research teams. The teams applied two years’ worth of research in order to create the emulator.
At that point, the Azure team had been using CrystalNet in production for six months and had apparently already prevented several potential incidents. For example, before the risky operation of transitioning Azure data centers to new regional backbone networks, they tested their tools and scripts on the emulator and in doing so found over 50 bugs, a number of which could have caused outages.
The final migration plan, having been run and perfected on CrystalNet, didn’t trigger a single incident. There were also no reports of human error, which engineers attributed to being able to run intensive practice sessions on the emulator. The new regional networks bypass Microsoft’s existing data center WAN and instead interconnect data centers within a specific geographic region.
The Azure Networking team was also able to use it as a realistic test environment for building network automation tools and developing Microsoft’s in-house switch operating system, Software for Open Networking in the Cloud (SONiC). In doing so, they detected bugs that unit testing and a small testbed had previously missed. Once the full production environment was emulated using ONE, these bugs quickly showed up.
What Makes ONE So Valuable?
- Scalability: ONE builds virtual network links among emulated devices spread over multiple Virtual Machines (VMs). It scales naturally to emulate larger and growing networks by adding VMs to the emulation cluster.
- Flexibility: ONE supports a wide range of software images of network devices from various vendors. These can take the form of either standalone VMs or Docker containers. To accommodate and manage these uniformly, Microsoft mocks up the physical network with containers and runs heterogeneous device software images on top of its containers’ network namespace. This means that network engineers don’t need to learn new device management tools, as they can access the device images via Telnet or SSH. Real hardware devices can also be transparently included within the Open Network Emulator.
- Cost and Operational Efficiency: ONE automatically identifies boundaries to keep the number of devices that need to be emulated to a minimum. This reduces cost while still ensuring that the behavior of the emulated network is identical to that of the real network.
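
Two of the ideas in the list above, limiting emulation to a bounded set of devices and spreading those devices over several VMs, can be sketched in a few lines of Python. The article does not describe how ONE actually chooses its boundaries or places devices, so everything here is an assumption made for illustration: the boundary is taken to be the immediate neighbors of the devices under test, placement is a simple fixed-capacity fill, and the names `find_boundary`, `place_on_vms` and `cross_vm_links` are hypothetical.

```python
def find_boundary(links, under_test):
    """Devices outside the emulated set that are directly linked to it.

    Assumption for this sketch: the boundary is just the one-hop neighbors.
    ONE's real boundary selection is not described in the article.
    """
    boundary = set()
    for a, b in links:
        if a in under_test and b not in under_test:
            boundary.add(b)
        elif b in under_test and a not in under_test:
            boundary.add(a)
    return boundary


def place_on_vms(devices, per_vm):
    """Assign devices to VM indexes, filling each VM up to `per_vm` devices."""
    ordered = sorted(devices)  # sorted so the placement is deterministic
    return {dev: i // per_vm for i, dev in enumerate(ordered)}


def cross_vm_links(links, placement):
    """Links whose two endpoints are emulated on different VMs.

    In this sketch, these are the links that would need a virtual link
    spanning VMs; links with an endpoint outside `placement` are ignored.
    """
    return sorted(
        (a, b)
        for a, b in links
        if a in placement and b in placement and placement[a] != placement[b]
    )


# Hypothetical topology: two ToR switches, one leaf, then spine, core and WAN.
links = [
    ("tor1", "leaf1"), ("tor2", "leaf1"),
    ("leaf1", "spine1"), ("spine1", "core1"), ("core1", "wan1"),
]
under_test = {"tor1", "tor2", "leaf1"}

boundary = find_boundary(links, under_test)       # only spine1 touches the set
placement = place_on_vms(under_test, per_vm=2)    # three devices need two VMs
spanning = cross_vm_links(links, placement)       # links that cross VM borders
```

In this example, only three of the six devices are emulated, and growing the emulation is a matter of raising the number of VMs that `place_on_vms` fills rather than changing the topology description.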