OpenStack Neutron at scale

“Neutron cannot scale.” This is a pet peeve of many people I meet, whether among customers, vendors or the open-source community. I don’t blame them, because their misgivings are not unfounded. In fact, when some of the OpenStack vendors set out to demonstrate the scalability of OpenStack, they had to disable Neutron; read the recent posts from Mirantis and Canonical.

As I read through those posts, in which they explain one bottleneck after another, I realize how and why the scalability of Neutron is questioned by so many, so frequently. In the Canonical scale tests, the system had so many timeouts and failures that the ‘Neutron security group’ feature had to be turned off, and eventually Neutron itself was turned off completely and replaced with ‘nova networking’. Yes, you read that right! While the world has moved ahead from nova-network to Neutron, these test deployments had to move back because of scale limitations. The reason for this limitation is that, unlike other OpenStack services, the out-of-the-box Neutron server is difficult to scale out. When you enable multiple workers in Neutron, there are race conditions on the MySQL database. The Neutron server and Neutron agent architecture also poses challenges. Refer to the presentation from HP, which talks specifically about these challenges.

OpenContrail, on the other hand, is a system that has been designed for scale-out from day one. With OpenContrail, a layer 3 overlay is created using a vRouter in the kernel of each compute node. Policies are defined centrally at the Controller and enforced in a distributed fashion within the vRouters. By default, OpenContrail delivers all of its services in a distributed fashion (distributed routing, DHCP, floating IP and DNS), and that helps it achieve the scale that default Neutron cannot.

When we saw the scalability concerns from the community, we thought of demonstrating OpenContrail capabilities by running a few basic scale tests. Seeing is believing – so here’s a video of the test that we’ve performed.

Some of the salient features of the tests we conducted are:

  • Contrail cluster scaled up to run 1,000 compute nodes (hence 1,000 vRouters)
  • The cluster uses just 3 instances of the Controller (Control, Analytics and Config Nodes)
  • 55 Tenants (or Projects)
  • 260 Virtual Networks, together hosting 5,394 virtual machines
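
For a rough sense of density, the figures above can be turned into simple averages. The short Python sketch below does only that arithmetic. The variable names are illustrative, and the averages assume an even spread of virtual machines and networks, which the test did not necessarily have.

```python
# Figures quoted in the list above.
compute_nodes = 1000
tenants = 55
virtual_networks = 260
virtual_machines = 5394


def average(total, groups):
    """Return the mean number of items per group, rounded to one decimal."""
    return round(total / groups, 1)


vms_per_node = average(virtual_machines, compute_nodes)
networks_per_tenant = average(virtual_networks, tenants)
vms_per_network = average(virtual_machines, virtual_networks)

print(f"VMs per compute node:        {vms_per_node}")         # 5.4
print(f"Virtual networks per tenant: {networks_per_tenant}")  # 4.7
print(f"VMs per virtual network:     {vms_per_network}")      # 20.7
```

In other words, on average each vRouter served about five virtual machines, and each virtual network hosted about twenty.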

What we demonstrated is the creation of several security groups to allow traffic between the different virtual networks within a tenant. While traffic was running, some security group rules were modified and traffic to the affected virtual networks stopped, without disturbing other ongoing traffic flows. When those security group rules were added back, normal traffic resumed.
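
The behaviour described above, where changing a rule affects only the flows that depended on it, can be illustrated with a toy model. The sketch below is plain Python, not OpenContrail or Neutron code. The `Rule` class, the `allowed` function, and the network names and ports are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical allow-rule: traffic from src_net to dst_net on one port."""
    src_net: str
    dst_net: str
    port: int


def allowed(flow, rules):
    """Default deny: a flow passes only if some rule matches it exactly."""
    src, dst, port = flow
    return any(
        r.src_net == src and r.dst_net == dst and r.port == port
        for r in rules
    )


# Two allow-rules and two ongoing flows (all names are made up).
db_rule = Rule("app", "db", 3306)
rules = {Rule("web", "app", 8080), db_rule}
flows = [("web", "app", 8080), ("app", "db", 3306)]

# Remove one rule: only the flow that relied on it is blocked.
rules_after_removal = rules - {db_rule}
state_after_removal = [allowed(f, rules_after_removal) for f in flows]

# Add the rule back: both flows are permitted again.
rules_restored = rules_after_removal | {db_rule}
state_restored = [allowed(f, rules_restored) for f in flows]

print(state_after_removal)  # [True, False]
print(state_restored)       # [True, True]
```

The point of the model is only that each flow is evaluated against the rules independently, so editing one rule leaves unrelated flows untouched.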

What you will find particularly interesting in the demo is that, despite the large number of compute nodes, virtual networks and virtual machines at play, the policies defined at the Neutron level are propagated to and enforced on all the compute nodes immediately. Second, and more importantly, modifying some of the security group rules did not bring down all flows (which happens quite often in other implementations). Finally, the network footprint of the OpenContrail components is really small, which highlights the efficiency of the implementation. Below is a quick snapshot of the 1,000 compute nodes in action.

[Image: snapshot of the 1,000 compute nodes in action]

So, to conclude, “Neutron cannot scale” is just a belief, and one that can be overcome. It all depends on how you implement the Neutron backend. We have tested it with just 1,000 vRouters, and we expect this number to scale to several times that without affecting the performance of the system. We will try to demo a further scaled-up environment under higher load in a future post. Meanwhile, I would encourage you to take OpenContrail for a scale test drive: run large numbers of virtual networks and virtual machines, send heavy traffic and see how the system responds. And yes, do report the results back to the community, so that others can learn from and replicate them.