Evaluating OpenContrail Virtual Router Performance

The OpenContrail solution uses overlays for network virtualization. Packets between tenant virtual machines are encapsulated in a tunnel on the IP fabric connecting the compute servers in the network.

A common concern with using tunnels to encapsulate packets is whether performance will match the non-tunneled scenario. Many server NICs in common use today do not support performance optimizations, such as segmentation offload, for tunneled packets. Although some vendors have recently announced NICs with optimizations for overlay networks, these are not yet in common use in data center networks.

As a result, performance has to be optimized for the lowest common denominator, i.e., without making any assumptions about the hardware capabilities of the server NICs. In addition, one of the underlying design principles of the OpenContrail solution is to use only standard protocols and encapsulations, in order to interoperate with existing hardware (or virtual) network devices and to leverage years of experience with proven protocols. Inventing a new protocol or encapsulation to optimize for performance was therefore not an attractive option.

OpenContrail has a module called vRouter that performs data forwarding in the kernel. The vRouter module is an alternative to the Linux bridge in the kernel, and one of its functions is to perform tunnel encapsulation and decapsulation in software. Comparing the forwarding performance of vRouter with that of the Linux bridge therefore gives a good indication of the overhead of software tunneling.

THE TEST

The setup used to evaluate performance consists of 2 servers connected using Intel 10G NICs with an MTU of 1500. The servers have 2 CPU sockets each, with 6 cores per socket and 2 threads per core. The processor is an Intel Xeon running at 2.5 GHz. The servers have 128 GB of memory each and run CentOS 6.4 as the host operating system. A virtual network is created and a virtual machine (VM) is instantiated on each server in this virtual network. Each VM has 1 VCPU and 2 GB of memory, and runs Ubuntu 12.04 as the guest operating system. A TCP streaming test is run between the VMs on the virtual network.

As shown in Figure 1 below, the netperf client application on VM1 sends a TCP stream to the netserver application running on VM2. Packets are sent over a virtual interface (vif1) from the guest into the vRouter module on the sending host, where they are encapsulated in a tunnel before being transmitted on the wire. On the receiving host, the vRouter module decapsulates the packets and forwards them to the guest over vif2. The tests with Linux bridge are exactly the same except that vRouter is replaced by the Linux bridge module.

Figure 1: vRouter performance test setup
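The encapsulation step described above can be illustrated with a small sketch. It is not vRouter code: it only builds and parses the 4-byte base GRE header (protocol type 0x8847, MPLS unicast) and a single 4-byte MPLS label entry in front of an inner packet. The outer IP header is left out, and the function names, the default TTL, and the label value are assumptions made for illustration.

```python
import struct

GRE_PROTO_MPLS_UNICAST = 0x8847  # EtherType carried in the GRE protocol field


def mpls_label_entry(label, tc=0, bottom_of_stack=True, ttl=64):
    """Pack one 32-bit MPLS label stack entry: label(20) | TC(3) | S(1) | TTL(8)."""
    if not 0 <= label < (1 << 20):
        raise ValueError("MPLS label must fit in 20 bits")
    word = (label << 12) | (tc << 9) | (int(bottom_of_stack) << 8) | ttl
    return struct.pack("!I", word)


def encapsulate(inner_packet, label):
    """Prepend a base GRE header (no optional fields) and one MPLS label."""
    gre = struct.pack("!HH", 0, GRE_PROTO_MPLS_UNICAST)
    return gre + mpls_label_entry(label) + inner_packet


def decapsulate(payload):
    """Strip the GRE and MPLS headers; return (label, inner_packet)."""
    flags, proto = struct.unpack("!HH", payload[:4])
    if flags != 0 or proto != GRE_PROTO_MPLS_UNICAST:
        raise ValueError("not a base-header MPLS-over-GRE payload")
    (word,) = struct.unpack("!I", payload[4:8])
    return word >> 12, payload[8:]


if __name__ == "__main__":
    tunneled = encapsulate(b"inner IP packet bytes", label=17)  # hypothetical label
    print(len(tunneled), decapsulate(tunneled))
```

In this sketch the label is what would let the receiving side pick the destination virtual interface; how vRouter actually assigns and looks up labels is not covered here.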

The measured throughput varies somewhat, with Linux bridge as well as with vRouter, depending on which CPU cores the guest VM and the vHost thread are scheduled on. This is exacerbated in a NUMA system by the overhead of accessing memory on a remote NUMA node. To avoid this variability between test runs, the guest VM and the vHost thread are each pinned to a CPU core. This results in consistent numbers between test runs and allows an apples-to-apples comparison.
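The article does not say which tool was used for pinning. As a minimal sketch of the idea, assuming a Linux host, the following uses Python's os.sched_setaffinity to restrict a task to one core; the helper name pin_to_core is hypothetical, and in the test setup the ids passed in would be those of the guest VM process and the vHost thread.

```python
import os


def pin_to_core(pid, core):
    """Restrict the task `pid` (0 means the caller) to a single CPU core.

    Returns the resulting affinity set so the caller can verify it.
    """
    allowed = os.sched_getaffinity(pid)
    if core not in allowed:
        raise ValueError(f"core {core} is not in the allowed set {sorted(allowed)}")
    os.sched_setaffinity(pid, {core})
    return os.sched_getaffinity(pid)


if __name__ == "__main__":
    # Pin this process to the lowest-numbered core it is allowed to run on.
    first_core = min(os.sched_getaffinity(0))
    print(pin_to_core(0, first_core))
```

On a NUMA system, choosing cores on the same node as the memory and the NIC matters as much as the pinning itself, which is consistent with the variability described above.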

On this setup, the unidirectional throughput measured with vRouter using MPLS over GRE as the encapsulation is 9.18 Gbps. The CPU consumption is 128% (1.28 CPU cores) on the sender and 166% on the receiver. This figure includes the processing needed to push packets to and from the guest, but not the CPU consumed by the guest itself. With bidirectional traffic (one TCP stream in each direction), the aggregate throughput is 13.1 Gbps and the CPU consumption is 188% on both ends.

The table below compares the throughput and CPU consumption of vRouter with the Linux bridge numbers for a unidirectional TCP streaming test.

Forwarding module   Throughput   Sender CPU   Receiver CPU
Linux bridge        9.41 Gbps    85%          125%
vRouter             9.18 Gbps    128%         166%

Table 1: TCP unidirectional streaming test
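One way to read Table 1 is to relate throughput to the host CPU spent on forwarding. The sketch below does only that arithmetic on the published numbers; the "Gbps per core" metric and the dictionary layout are choices made here for illustration, not something reported by the test.

```python
# Numbers copied from Table 1 (unidirectional TCP streaming test).
TABLE_1 = {
    "Linux bridge": {"gbps": 9.41, "sender_cpu": 85, "receiver_cpu": 125},
    "vRouter": {"gbps": 9.18, "sender_cpu": 128, "receiver_cpu": 166},
}


def gbps_per_core(gbps, cpu_percent):
    """Throughput delivered per fully used CPU core (100% = one core)."""
    return gbps / (cpu_percent / 100.0)


def relative(metric):
    """vRouter value of `metric` divided by the Linux bridge value."""
    return TABLE_1["vRouter"][metric] / TABLE_1["Linux bridge"][metric]


if __name__ == "__main__":
    for name, row in TABLE_1.items():
        print(name,
              round(gbps_per_core(row["gbps"], row["sender_cpu"]), 2),
              round(gbps_per_core(row["gbps"], row["receiver_cpu"]), 2))
    print("throughput ratio:", round(relative("gbps"), 3))
    print("sender CPU ratio:", round(relative("sender_cpu"), 3))
```

Read this way, vRouter delivers roughly 98% of the Linux bridge throughput while using about one and a half times the sender-side CPU, which is the cost of doing encapsulation in software.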

The table below compares the numbers for a bidirectional TCP streaming test. The throughput shown is the aggregate of the throughput measured at each end. The CPU consumption is the same on both servers because the traffic is bidirectional.

Forwarding module   Throughput   CPU consumption
Linux bridge        13.9 Gbps    128%
vRouter             13.1 Gbps    188%

Table 2: TCP bidirectional streaming test

To measure latency, a TCP request-response test was run between the 2 servers. The table below compares the number of request-response transactions completed with Linux bridge and with vRouter.

Forwarding module   Request-response transactions
Linux bridge        11050
vRouter             10800

Table 3: TCP request-response test
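The transaction counts in Table 3 can be turned into an approximate round-trip time. The sketch below assumes that the figures are transactions per second with a single transaction outstanding at a time; the article does not state the unit, so treat the result as an estimate under that assumption.

```python
def mean_round_trip_us(transactions_per_second):
    """Average time per request-response transaction, in microseconds.

    Assumes one outstanding transaction at a time, so each transaction
    takes 1 / rate seconds.
    """
    return 1_000_000 / transactions_per_second


if __name__ == "__main__":
    bridge_us = mean_round_trip_us(11050)   # Linux bridge, from Table 3
    vrouter_us = mean_round_trip_us(10800)  # vRouter, from Table 3
    print(round(bridge_us, 1), round(vrouter_us, 1),
          round(vrouter_us - bridge_us, 1))
```

Under that assumption the difference between the two forwarding paths is on the order of 2 microseconds per round trip.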

Data center networks often enable jumbo frames for better performance. The following table compares the performance of vRouter and Linux bridge with a jumbo MTU on the 10G interface. The guest application was modified to use sendfile() instead of send() in order to avoid a copy from user space to the kernel. Without this change, the single-threaded guest application could not achieve a bidirectional throughput higher than 14 Gbps.
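The difference between the two send paths can be sketched in Python. This is not the modified guest application: the function names are hypothetical, and socket.sendfile() is used as a stand-in that calls the operating system's sendfile where available and otherwise falls back to ordinary sends.

```python
import os
import socket
import tempfile


def send_with_copy(sock, path, chunk_size=65536):
    """Read the file into user space, then send() each chunk (extra copy)."""
    total = 0
    with open(path, "rb") as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            sock.sendall(data)
            total += len(data)
    return total


def send_with_sendfile(sock, path):
    """Let the kernel move file data to the socket without a user-space buffer."""
    with open(path, "rb") as f:
        return sock.sendfile(f)


if __name__ == "__main__":
    # Small self-contained demo over a local socket pair.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"x" * 4096)
    sender, receiver = socket.socketpair()
    print(send_with_sendfile(sender, tmp.name), len(receiver.recv(65536)))
    sender.close()
    receiver.close()
    os.unlink(tmp.name)
```

The payload is identical either way; what changes is how much CPU the sending application spends copying data, which is why the change mattered for a single-threaded sender.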

Forwarding module   Throughput   CPU consumption
Linux bridge        18 Gbps      125%
vRouter             17.4 Gbps    120%

Table 4: TCP bidirectional streaming test (jumbo MTU)

CONCLUSION

As can be seen from the above tables, vRouter achieves throughput and latency comparable to Linux bridge. The throughput is slightly lower with vRouter because of the additional bytes sent on the wire for the tunnel headers. The latency is slightly higher with vRouter because it uses multiple CPU cores to process packets, which adds delay. However, the big advantage is that VMs running on the compute nodes can communicate directly with any hardware gateway router (such as the Juniper MX) or with networking services such as firewalls.
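The effect of tunnel header bytes on throughput can be estimated with simple arithmetic. The sketch below rests on several assumptions that the article does not state: the 1500-byte MTU applies to the outer packet, the inner packet is carried directly after the MPLS label, TCP uses the 12-byte timestamp option, and the MPLS over GRE overhead is 28 bytes (20-byte outer IPv4 header, 4-byte GRE header, 4-byte MPLS label).

```python
# Per-frame Ethernet cost on the wire: preamble+SFD, header, FCS, inter-frame gap.
ETH_WIRE_OVERHEAD = 8 + 14 + 4 + 12
IPV4_HEADER = 20
TCP_HEADER_WITH_TIMESTAMPS = 20 + 12
# Assumed MPLS-over-GRE overhead: outer IPv4 + base GRE + one MPLS label.
MPLS_OVER_GRE_OVERHEAD = 20 + 4 + 4


def tcp_goodput_gbps(link_gbps, mtu, tunnel_overhead=0):
    """Upper bound on TCP payload throughput for full-sized packets."""
    payload = mtu - tunnel_overhead - IPV4_HEADER - TCP_HEADER_WITH_TIMESTAMPS
    wire_bytes = mtu + ETH_WIRE_OVERHEAD
    return link_gbps * payload / wire_bytes


if __name__ == "__main__":
    plain = tcp_goodput_gbps(10, 1500)
    tunneled = tcp_goodput_gbps(10, 1500, MPLS_OVER_GRE_OVERHEAD)
    print(round(plain, 2), round(tunneled, 2), round(tunneled / plain, 4))
```

Under these assumptions the estimate is about 9.41 Gbps without a tunnel and about 9.23 Gbps with MPLS over GRE, close to the 9.41 Gbps and 9.18 Gbps measured in Table 1, so header bytes plausibly account for most of the gap.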

vRouter also supports other standard encapsulations: VXLAN and MPLS over UDP. Using MPLS over UDP allows the server NIC to verify checksums in hardware, which reduces the CPU consumption on the receiver by about 15% while achieving the same throughput.

In summary, the OpenContrail solution achieves close to line-rate forwarding on a 10G link without depending on any performance optimizations in the server NICs. This throughput is achieved using standard protocols and encapsulations that allow for interoperability. Hence, tunneling packets in an overlay network does not add any significant overhead compared to the non-tunneled scenario.