Debugging and System Visibility in an SDN Environment

The benefits of SDN and the advantages of cloud computing have been well documented.  However, when moving to this new distributed environment, there had better be a way to get clear visibility into how the network is performing.  When issues arise, it is essential to have the diagnostic and troubleshooting capability to solve them before they cause business impact to any tenant.

Looking through /var/log/messages alone just won’t cut it anymore.  An Operational CLI with auto-completion features isn’t enough either.  A distributed, horizontally scalable SDN Controller needs to provide more…

Data-center Monitoring Software integrates the monitoring of compute, storage, networking and applications in the data center. As compute and storage have been virtualized, more powerful tools have been developed for managing and orchestrating them. Software and application management frameworks such as Chef and Puppet also help with managing distributed applications. There are customizable log collection and analysis tools that help monitor applications as well.

But as we start looking at network virtualization, it becomes evident that the traditional abstractions and mechanisms for monitoring the network are fragmented. There are centralized software applications that monitor network elements using interfaces such as NetFlow and SNMP. But these mechanisms can suffer from inconsistent implementations across equipment, and often do not expose information at the right granularity or abstraction. When troubleshooting is needed, we can find ourselves correlating operational state and counters using CLIs across many individual network elements.

SDN Analytics calls for presenting a seamless single logical view of the network in one place.

SDN controllers represent some fundamental changes to how networking software and systems are built: we need system-wide abstractions for expressing services, and horizontal scalability of the software itself.  The debugging and visibility needed for these systems must present information along the axes of basic user abstractions (Virtual Networks, Virtual Machines and their connectivity).  It’s vital that monitoring the modern multi-tenant virtualized datacenter be no more complex than the user-abstractions themselves.  That’s the principal charter for SDN Analytics.

Abstractions for Analytics:

One of the fundamental concepts of SDN is the ability to express user configuration in a centralized way, instead of as a mixture of lower-level configuration on multiple network elements of different types. In terms of user configuration, let’s take the example of a tenant with Virtual Networks that have VMs inside them and interact with other Virtual and Physical networks according to certain forwarding policies.  You have software processing elements that translate the Virtual Network configuration into networking protocol constructs such as BGP Route Targets or VXLAN Identifiers. You have other software elements that implement the control protocols, and yet others that serve as the data-forwarding plane and possibly enforce ACL rules.

For monitoring and troubleshooting, it’s important to manage this complexity appropriately. In this case, we need an aggregate view of what’s happening with this Virtual Network, even though we are dealing with state that is distributed across multiple processing elements of different types.   If there are connectivity problems, we may need to check for consistency of state across processing elements.  This must also be tied to the topology of the data-forwarding elements.  Besides state information and configuration information, we need packet counters at each point, and a way to visualize and analyze the topology.
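
As a rough illustration of that aggregate view, here is a minimal Python sketch. It assumes each processing element can report, per Virtual Network, the route target it is using and a packet counter. The ElementReport type, the element names and the route-target values are all hypothetical, made up for this example; they are not the API of any particular controller.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementReport:
    """One element's view of one Virtual Network (hypothetical schema)."""
    element: str          # e.g. "control-1" (made-up name)
    kind: str             # "config", "control" or "forwarding"
    virtual_network: str
    route_target: str     # route target this element associates with the VN
    packets: int          # packets counted here (0 for non-forwarding elements)


def aggregate(reports):
    """Fold per-element reports into one entry per Virtual Network."""
    view = defaultdict(lambda: {"elements": [], "route_targets": set(), "packets": 0})
    for r in reports:
        entry = view[r.virtual_network]
        entry["elements"].append(r.element)
        entry["route_targets"].add(r.route_target)
        entry["packets"] += r.packets
    return dict(view)


def inconsistent(view):
    """Virtual Networks whose elements disagree on the route target."""
    return sorted(vn for vn, e in view.items() if len(e["route_targets"]) > 1)


reports = [
    ElementReport("config-1", "config", "blue", "target:64512:100", 0),
    ElementReport("control-1", "control", "blue", "target:64512:100", 0),
    ElementReport("forwarder-1", "forwarding", "blue", "target:64512:100", 1200),
    ElementReport("forwarder-2", "forwarding", "blue", "target:64512:101", 300),
    ElementReport("control-1", "control", "red", "target:64512:200", 0),
    ElementReport("forwarder-1", "forwarding", "red", "target:64512:200", 50),
]

view = aggregate(reports)
print(inconsistent(view))       # Virtual Networks that need a closer look
print(view["blue"]["packets"])  # total packets seen for "blue"
```

In this made-up data set, forwarder-2 holds a different route target for "blue" than the other elements do, which is exactly the kind of state mismatch worth surfacing when connectivity breaks.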

State, Stats and History from a Distributed System:

Reporting useful state from a distributed system is a matter of having the right abstractions, and having aggregation and syncing mechanisms. Syncing mechanisms have always been present in some form in network protocols; we just need to apply a similar level of rigor to SDN Analytics. We can think of the processing elements of an SDN system themselves as Network Elements forming a network. In that case, having a state aggregation mechanism is akin to a traditional NMS correlating the operational state of Network Elements. This state aggregation enables us to view the current state of the overall system.

But troubleshooting a system is not just a matter of looking at the current state of the system; we are often interested in how we got there. Having more granular data is an asset, as long as you can present it in a timely and actionable way. Take an example of a multi-tier application with a web front-end and database back-end. If the application is not able to serve its customers within the right SLAs (e.g. transaction latency is much higher than expected), network troubleshooting starts with traffic flows that are attributable to the application. We’ll need the ability to slice-and-dice traffic statistics for these flows end-to-end between Virtual Network boundaries and even VMs. This can span multiple data-forwarding elements. And we need to do it in real-time. Doing this at scale, for a large data-center, is challenging. Horizontal software scaling techniques, such as those used in NoSQL database paradigms, can help.
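
To make "slice-and-dice" concrete, here is a small Python sketch that sums byte counts over whichever dimensions the operator picks. It assumes flow records have already been collected from the data-forwarding elements, and that each flow is reported once, by the element where the traffic enters. The FlowRecord fields and all the names in the sample data are hypothetical. A real deployment would run this kind of query against a horizontally scaled store rather than an in-memory list.

```python
from collections import Counter
from typing import NamedTuple


class FlowRecord(NamedTuple):
    """One flow as reported by one forwarding element (hypothetical schema)."""
    src_vn: str
    dst_vn: str
    src_vm: str
    dst_vm: str
    forwarder: str
    byte_count: int


def slice_by(records, *fields):
    """Sum byte counts grouped by the given FlowRecord field names."""
    totals = Counter()
    for rec in records:
        key = tuple(getattr(rec, f) for f in fields)
        totals[key] += rec.byte_count
    return totals


flows = [
    FlowRecord("web", "db", "web-1", "db-1", "fwd-a", 4000),
    FlowRecord("web", "db", "web-2", "db-1", "fwd-b", 1000),
    FlowRecord("db", "web", "db-1", "web-1", "fwd-c", 2500),
    FlowRecord("db", "web", "db-1", "web-2", "fwd-c", 700),
]

by_vn = slice_by(flows, "src_vn", "dst_vn")      # Virtual Network to Virtual Network
by_vm = slice_by(flows, "src_vm", "dst_vm")      # VM to VM
by_forwarder = slice_by(flows, "forwarder")      # per data-forwarding element

for key, total in by_vn.most_common():
    print(key, total)
```

The same records answer three different questions here without being collected again, which is the point: the grouping is chosen at query time, along the user's abstractions.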

The Open Software Environment:

Managing the datacenter’s compute, storage, networking and applications is a fast-moving field. Architectures and solutions evolve along with data-center functional requirements, economics and available technology. SDN Analytics must provide northbound interfaces (the interface to provide information) that allow for easy integration with other orchestration and monitoring software applications. Customers may be using 3rd-party vendor applications, and/or developing software internally under their own DevOps model.

Complex software systems need to support intelligent, proactive monitoring. For example, consider a software bug in routing that results in excessive memory being used by a control plane process. By automatically tracking the memory use of a particular processing element over time, or comparing it to other similar processing elements, we can display outliers that deserve further troubleshooting analysis. SDN Analytics needs to support such detection natively, as well as provide northbound APIs for other software to do this.
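
One simple way to compare an element against its peers is a modified z-score based on the median and the median absolute deviation (MAD). The Python sketch below is only an illustration of that general technique, not a description of how any specific product detects outliers. The process names, the memory figures and the 3.5 threshold (a commonly used rule of thumb) are assumptions made for the example.

```python
from statistics import median


def memory_outliers(usage_mb, threshold=3.5):
    """Names of elements whose memory use is far above their peers.

    usage_mb maps an element name to its current memory use in MB.
    Uses the modified z-score: 0.6745 * (x - median) / MAD.
    Only unusually HIGH usage is flagged.
    """
    values = list(usage_mb.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All peers (or at least half of them) are identical; this score
        # cannot rank them, so report nothing rather than divide by zero.
        return []
    return sorted(
        name
        for name, v in usage_mb.items()
        if 0.6745 * (v - med) / mad > threshold
    )


# Hypothetical snapshot of four similar control plane processes.
snapshot = {
    "control-1": 510,
    "control-2": 495,
    "control-3": 520,
    "control-4": 1900,
}

print(memory_outliers(snapshot))
```

The median and MAD are used instead of the mean and standard deviation because a single runaway process would otherwise inflate the very baseline it is being compared against. The same function could be applied to one element's own history to cover the "over time" case.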

Open APIs are required for southbound interfaces (the interface to gather information) as well. SDN controllers don’t operate in a vacuum. The network will have a variety of virtual and physical routers, gateways and network services appliances from different vendors. SDN Analytics should provide a consolidated view of all these elements.  For example, we should be able to correlate events reported by a physical router with events happening in the SDN control plane.  It’s possible to extend this further, and correlate with other applications operating in the data-center. Support for standards such as NetFlow and syslog is helpful. In all these cases, the ability to slice-and-dice the information and search through the vast sea of data is vital.
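
The simplest form of such correlation is matching events by time. The Python sketch below pairs each router event with any control plane event that occurred within a few seconds of it. It assumes both sources have already been parsed into a common shape with synchronized clocks; the Event type, the sources, the messages and the five-second window are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A normalized event (hypothetical schema)."""
    timestamp: float  # seconds since the epoch
    source: str
    message: str


def correlate(router_events, controller_events, window_s=5.0):
    """Pair events from the two sources that fall within window_s seconds."""
    pairs = []
    for r in router_events:
        for c in controller_events:
            if abs(r.timestamp - c.timestamp) <= window_s:
                pairs.append((r, c))
    return pairs


router_events = [
    Event(100.0, "router-1", "interface down"),
    Event(300.0, "router-1", "interface up"),
]
controller_events = [
    Event(102.0, "control-1", "BGP peer down"),
    Event(200.0, "control-1", "configuration change"),
]

for r, c in correlate(router_events, controller_events):
    print(f"{r.source}: {r.message}  <->  {c.source}: {c.message}")
```

The nested loop is fine for a handful of events. At data-center scale the events would be sorted or indexed by time first, and matching would likely also consider which elements are adjacent in the topology, not just when the events happened.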

In future blog entries we will discuss the OpenContrail Analytics solution, and examples of how it addresses these challenges and opportunities. SDN enables powerful solutions for getting agility, security, utilization and performance from the multi-tenant datacenter. Strong Analytics is necessary for humans to actually operate it.