Statistics in OpenContrail Analytics

A wealth of operational data is available in a distributed, multi-tier cloud infrastructure deployment, from the activity of VMs, to the flow of networking traffic, to the performance metrics of the applications themselves. Ideally, we could use and correlate all this data to debug known problems, anticipate potential issues, and build closed-loop control systems to manage infrastructure and applications. This requires the ability to easily report and retrieve different kinds of metrics. With the right APIs, we can build rich applications that deliver on all these promises. This blog post describes the OpenContrail approach for achieving these goals.

Let's take the example of networking traffic flowing between Virtual Networks. For a given VN, we want to analyze the packets and bytes flowing between this VN and other VNs over a given time period. We need to slice and dice this information according to the VNs involved, and according to the VRouters that host the VMs that sit on the VNs.

VRouterAgent processes on the VRouters periodically report these traffic stats to the collector, which stores them in the Cassandra database. The user can query for these stats via the analytics API. The schema of the data drives the treatment of this information on both the storage end and the query end.

Storage of Stats

The Sandesh framework is used to both express the schema and to generate the code to transport the information to the collector. (See this blog entry for details: Sandesh – A SDN Analytics Interface)

In this case, we send the information using the Virtual Network Sandesh UVE message. The Sandesh definition is as follows:

struct InterVnStats {
    1: string                   other_vn;
    2: string                   vrouter;
    3: u64                      in_tpkts;
    4: u64                      in_bytes;
    5: u64                      out_tpkts;
    6: u64                      out_bytes;
}
struct UveVirtualNetworkAgent {
    1: string                   name (key="ObjectVNTable")
    2: optional bool            deleted
    …
    23: optional list<InterVnStats> vn_stats (tags=".other_vn,.vrouter")
    …
}
uve sandesh UveVirtualNetworkAgentTrace {
    1: UveVirtualNetworkAgent               data;
}

The “tags” annotation identifies the “vn_stats” attribute as a list of statistic samples. Each stat sample is stored against multiple tags, which makes it efficient to retrieve and aggregate the samples that match given tags. The UVE key is always used as a tag called “name”. The source of the message (the hostname where the sending process is running) is also always used as a tag called “Source”. All the tags listed in the “tags” annotation are used as well. The “tags” annotation is expected to be a comma-separated list. Any entry starting with “.” refers to an attribute of the stat list struct (InterVnStats in this case). Other entries refer to attributes of the UVE struct (UveVirtualNetworkAgent in this case).
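As a rough illustration of the tag rules just described, the following Python sketch derives the tag set for a single stat sample. The function name resolve_tags, the dict-based representation of the UVE and the sample, and the example values are all assumptions made for illustration; this is not the collector's actual code.

```python
def resolve_tags(tags_annotation, uve, sample, source, list_attr="vn_stats"):
    """Return {tag_name: value} for one sample of a stat list.

    Hypothetical helper that mirrors the rules described in the text;
    it is not the real collector implementation.
    """
    # The UVE key and the message source are always used as tags.
    tags = {"name": uve["name"], "Source": source}
    for entry in tags_annotation.split(","):
        entry = entry.strip()
        if not entry:
            continue
        if entry.startswith("."):
            # A leading "." refers to an attribute of the stat list struct.
            attr = entry[1:]
            tags[f"{list_attr}.{attr}"] = sample[attr]
        else:
            # Any other entry refers to an attribute of the UVE struct.
            tags[entry] = uve[entry]
    return tags


# Made-up example data, shaped like UveVirtualNetworkAgent / InterVnStats.
uve = {"name": "default-domain:demo:vn1"}
sample = {
    "other_vn": "default-domain:demo:vn2",
    "vrouter": "nodea8",
    "in_tpkts": 10, "in_bytes": 840,
    "out_tpkts": 12, "out_bytes": 1008,
}
tags = resolve_tags(".other_vn,.vrouter", uve, sample, source="nodea8")
print(tags)
```

With the annotation tags=".other_vn,.vrouter", the sample ends up with four tags: name, Source, vn_stats.other_vn and vn_stats.vrouter.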

The VRouterAgent can send multiple samples of this stat in the same Sandesh message (note that vn_stats is a list). All these stat samples will share the same timestamp (the timestamp of the Sandesh message), but each stat sample will be assigned its own UUID.

At the current time, the VRouterAgent reports the inter-VN stats for all the VNs present on it every 30 seconds. All packet counts and byte counts are reported in terms of the change in the counter since the last report. This allows us to easily query for the aggregate of these counts over an arbitrary time period.
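To see why delta reporting makes arbitrary time windows easy, consider this small Python sketch. Because each sample already holds the change since the previous report, the total for any window is simply the sum of the samples whose timestamps fall inside it. The sample values and the helper name total_in_window are made up for illustration.

```python
REPORT_INTERVAL_US = 30 * 1_000_000  # one report every 30 seconds, in microseconds


def total_in_window(samples, start_us, end_us, field):
    """Sum a delta-encoded counter over the half-open window [start_us, end_us)."""
    return sum(s[field] for s in samples if start_us <= s["T"] < end_us)


# Five consecutive reports; each value is the change since the previous report.
samples = [
    {"T": i * REPORT_INTERVAL_US, "out_bytes": delta}
    for i, delta in enumerate([100, 250, 0, 400, 50])
]

# Window covering the reports at 30 s, 60 s and 90 s.
total = total_in_window(samples, 30_000_000, 120_000_000, "out_bytes")
print(total)  # 250 + 0 + 400 = 650
```

Had the agent reported cumulative counters instead, the same query would need the counter value at both ends of the window and would have to cope with counter resets.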

Querying of Stats

OpenContrail’s Analytics API provides a SQL-like interface for querying any time-series information, such as statistics. (See this blog entry for details: OpenContrail Analytics Query API)

The table name for the query depends on the UVE struct name and the stat attribute name. In this case, it will be StatTable.UveVirtualNetworkAgent.vn_stats

The following schema is exposed for this stats query:

http://10.84.25.31:8081/analytics/table/StatTable.UveVirtualNetworkAgent.vn_stats/schema

{
  type: "STAT",
  columns: [
    { datatype: "string", index: true, name: "name" },
    { datatype: "string", index: true, name: "Source" },
    { datatype: "int", index: false, name: "T" },
    { datatype: "int", index: false, name: "T=" },
    { datatype: "uuid", index: false, name: "UUID" },
    { datatype: "int", index: false, name: "COUNT(vn_stats)" },
    { datatype: "string", index: true, name: "vn_stats.other_vn" },
    { datatype: "string", index: true, name: "vn_stats.vrouter" },
    { datatype: "int", index: false, name: "vn_stats.in_tpkts" },
    { datatype: "int", index: false, name: "SUM(vn_stats.in_tpkts)" },
    { datatype: "int", index: false, name: "vn_stats.in_bytes" },
    { datatype: "int", index: false, name: "SUM(vn_stats.in_bytes)" },
    { datatype: "int", index: false, name: "vn_stats.out_tpkts" },
    { datatype: "int", index: false, name: "SUM(vn_stats.out_tpkts)" },
    { datatype: "int", index: false, name: "vn_stats.out_bytes" },
    { datatype: "int", index: false, name: "SUM(vn_stats.out_bytes)" },
  ]
}
 
All the column names listed above can be used in the “select” clause. Those that have index=true can also be used in the “where” clause.

The select clause controls the columns that appear in the query output. Let's look at how we support retrieval of stat samples and aggregation of these samples.

The field “T” refers to the timestamp in microseconds. “T=” refers to a rounded-down timestamp.  For example, T=60 will report the timestamp rounded down to a number divisible by 60 seconds. This capability is used to group together all samples that belong to a 60 second time period. This feature is called “binning”.
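The rounding behind binning can be sketched in a few lines of Python. This assumes the binned value stays in microseconds, like “T”; the helper name bin_timestamp is hypothetical and the timestamps are made up.

```python
US_PER_SECOND = 1_000_000


def bin_timestamp(t_us, bin_seconds):
    """Round a microsecond timestamp down to the start of its bin."""
    bin_us = bin_seconds * US_PER_SECOND
    return t_us - (t_us % bin_us)


# Two samples inside the same 60-second window land in the same bin...
a = bin_timestamp(125_500_000, 60)
b = bin_timestamp(179_999_999, 60)
# ...while a sample exactly on the boundary starts the next bin.
c = bin_timestamp(180_000_000, 60)
print(a, b, c)
```

Here a and b both map to 120000000 (the window starting at 120 seconds), and c maps to 180000000.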

The fields that start with SUM and COUNT are aggregate fields. These aggregate fields are provided for every numerical attribute. Except for the aggregate fields, the combination of other fields is guaranteed to be unique in each row of the output.

Using this uniqueness property, in conjunction with binning and aggregation, gives us powerful ways of slicing-and-dicing data, as the examples will illustrate.
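The uniqueness property amounts to a group-by: every distinct combination of the non-aggregate fields in the select clause yields one output row, and the SUM and COUNT columns are computed over the samples in that group. A minimal Python sketch, using made-up samples and a hypothetical aggregate helper rather than the actual query engine:

```python
def aggregate(samples, key_fields, sum_fields, count_name="COUNT(vn_stats)"):
    """Group samples by key_fields; add SUM(...) and COUNT columns per group."""
    rows = {}
    for s in samples:
        key = tuple(s[f] for f in key_fields)
        row = rows.get(key)
        if row is None:
            # First sample for this combination of non-aggregate fields.
            row = dict(zip(key_fields, key))
            row[count_name] = 0
            for f in sum_fields:
                row[f"SUM({f})"] = 0
            rows[key] = row
        row[count_name] += 1
        for f in sum_fields:
            row[f"SUM({f})"] += s[f]
    return list(rows.values())


samples = [
    {"vn_stats.other_vn": "vn2", "vn_stats.out_bytes": 100},
    {"vn_stats.other_vn": "vn3", "vn_stats.out_bytes": 40},
    {"vn_stats.other_vn": "vn2", "vn_stats.out_bytes": 60},
]
result = aggregate(samples, ["vn_stats.other_vn"], ["vn_stats.out_bytes"])
for row in result:
    print(row)
```

Selecting only vn_stats.other_vn as the non-aggregate field gives one row per peer network; adding vn_stats.vrouter to key_fields would split each of those rows further by vRouter.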

Query Examples

Queries can be issued via the Analytics API:

POST to http://10.84.25.31:8081/analytics/query
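For readers who prefer to call the API directly, here is a minimal Python sketch of such a POST using only the standard library. The field names in the request body follow the query JSON that contrail-stats prints in the examples in this post. The helper names build_query and post_query, the Content-Type header, the timeout, and the assumption that the response body is JSON are illustrative choices, not a documented client.

```python
import json
import urllib.request

ANALYTICS_QUERY_URL = "http://10.84.25.31:8081/analytics/query"  # address used in this post


def build_query(table, select_fields, where, start_time, end_time="now", sort_fields=()):
    """Assemble a stats query body (hypothetical helper)."""
    return {
        "table": "StatTable." + table,
        "start_time": start_time,
        "end_time": end_time,
        "select_fields": list(select_fields),
        "sort_fields": list(sort_fields),
        "where": where,
    }


def post_query(query, url=ANALYTICS_QUERY_URL, timeout=30):
    """POST the query as JSON and return the decoded response (assumed to be JSON)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


query = build_query(
    table="UveVirtualNetworkAgent.vn_stats",
    select_fields=["vn_stats.other_vn", "SUM(vn_stats.out_bytes)"],
    where=[[{"name": "name", "value": "default-domain:demo:vn1",
             "op": 1, "value2": None, "suffix": None}]],
    start_time="now-1h",
)
print(json.dumps(query, indent=2))
# result = post_query(query)  # requires a reachable analytics API server
```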

The “contrail-stats” command can also be used from the command line:

# contrail-stats --help
 usage: contrail-stats [-h] [--opserver-ip OPSERVER_IP]
 [--opserver-port OPSERVER_PORT]
 [--start-time START_TIME] [--end-time END_TIME]
 [--last LAST]
 [--table {AnalyticsCpuState.cpu_info,ConfigCpuState.cpu_info,ControlCpuState.cpu_info,ComputeCpuState.cpu_info,SandeshMessageStat.msg_info,GeneratorDbStats.table_info,GeneratorDbStats.errors,FieldNames.fields,FieldNames.fieldi,QueryPerfInfo.query_stats,UveVirtualNetworkAgent.vn_stats,DatabasePurgeInfo.stats}]
 [--dtable DTABLE] [--select SELECT [SELECT ...]]
 [--where WHERE [WHERE ...]] [--sort SORT [SORT ...]]

1. Query for samples

What are the raw samples reported from vRouter nodea8 for the virtual network default-domain:demo:vn1 over the last 60 seconds?

# contrail-stats --table UveVirtualNetworkAgent.vn_stats --where "name=default-domain:demo:vn1 AND vn_stats.vrouter=nodea8" --select T vn_stats.other_vn UUID vn_stats.out_bytes vn_stats.in_bytes --last 1m
 {"start_time": "now-1m", "sort_fields": [], "end_time": "now", "select_fields": ["T", "vn_stats.other_vn", "UUID", "vn_stats.out_bytes", "vn_stats.in_bytes"], "table": "StatTable.UveVirtualNetworkAgent.vn_stats", "where": [[{"suffix": null, "value2": null, "name": "name", "value": "default-domain:demo:vn1", "op": 1}, {"suffix": null, "value2": null, "name": "vn_stats.vrouter", "value": "nodea8", "op": 1}]]}

[Image: sept8_post_image1 (query output)]
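The JSON printed by contrail-stats in this example shows how the --where string maps onto the “where” field of the query: a list containing one list of match terms, one term per condition. The Python sketch below reproduces that mapping for the simple case of equality terms joined by AND. The name parse_where is hypothetical, reading "op": 1 as equality is inferred from the printed query, and this is not the parser that contrail-stats itself uses.

```python
OP_EQUAL = 1  # appears as "op": 1 in the printed query; assumed to mean equality


def parse_where(expr):
    """Turn 'a=b AND c=d' into the nested where list seen in the query JSON."""
    and_terms = []
    for term in expr.split(" AND "):
        name, sep, value = term.partition("=")
        if not sep or not name.strip():
            raise ValueError(f"expected name=value, got {term!r}")
        and_terms.append({
            "name": name.strip(),
            "value": value.strip(),
            "op": OP_EQUAL,
            "value2": None,
            "suffix": None,
        })
    # This sketch only ever produces a single group of AND-ed terms.
    return [and_terms]


where = parse_where("name=default-domain:demo:vn1 AND vn_stats.vrouter=nodea8")
print(where)
```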

2. Query for total aggregates

What is the total traffic exchanged between Virtual Network default-domain:demo:vn1 and every other Virtual Network over the last 1 hour?

# contrail-stats --table UveVirtualNetworkAgent.vn_stats --where "name=default-domain:demo:vn1" --select vn_stats.other_vn "SUM(vn_stats.out_bytes)" "SUM(vn_stats.in_bytes)" "COUNT(vn_stats)" --last 1h
 {"start_time": "now-1h", "sort_fields": [], "end_time": "now", "select_fields": ["vn_stats.other_vn", "SUM(vn_stats.out_bytes)", "SUM(vn_stats.in_bytes)", "COUNT(vn_stats)"], "table": "StatTable.UveVirtualNetworkAgent.vn_stats", "where": [[{"suffix": null, "value2": null, "name": "name", "value": "default-domain:demo:vn1", "op": 1}]]}

[Image: sept8_post_image2 (query output)]

Each row of the query output above represents a unique value of vn_stats.other_vn. Furthermore, SUM(vn_stats.out_bytes) for a given row is obtained by adding together all the samples that had the given value of vn_stats.other_vn. Also, COUNT(vn_stats) tells us the number of samples that this output row represents. (Since there is a single vRouter in this setup sending one sample every 30 seconds for each output row of the query above, we see 3600 / 30 = 120 samples per row for a 1-hour query.)

3. Query for binning

What is the traffic between Virtual Network default-domain:demo:vn1 and Virtual Network default-domain:demo:vn2 over the last 1 hour, with binning period of 300 seconds?

# contrail-stats --table UveVirtualNetworkAgent.vn_stats --where "name=default-domain:demo:vn1 AND vn_stats.other_vn=default-domain:demo:vn2" --select T=300 "SUM(vn_stats.out_bytes)" "SUM(vn_stats.in_bytes)" --last 1h
 {"start_time": "now-1h", "sort_fields": [], "end_time": "now", "select_fields": ["T=300", "SUM(vn_stats.out_bytes)", "SUM(vn_stats.in_bytes)"], "table": "StatTable.UveVirtualNetworkAgent.vn_stats", "where": [[{"suffix": null, "value2": null, "name": "name", "value": "default-domain:demo:vn1", "op": 1}, {"suffix": null, "value2": null, "name": "vn_stats.other_vn", "value": "default-domain:demo:vn2", "op": 1}]]}

[Image: sept8_post_image3 (query output)]

This is a “binning” query. The time period is specified using T=<x> in the select clause. We want to sum up 300 seconds worth of samples in each output row. This is useful for drawing graphs, with T on the X axis and the aggregated traffic on the Y axis. Each point on the graph would represent 300 seconds of aggregated traffic.

Since the T= column reports the timestamp rounded down to 300 seconds, all samples within a 300-second window get mapped into a single output row. The SUM(vn_stats.out_bytes) column for the output row is then derived by adding together the vn_stats.out_bytes values of all those samples.

Conclusion

OpenContrail provides a generic, schema-driven way of reporting and querying for statistics. The slicing-and-dicing algorithms are based on an abstract binning-and-aggregation model, and do not depend on the semantics of the data itself. Only the agent generating the samples and the application making the query need to understand the semantics of the data. This makes it possible to add new sample-generating agents and analytics applications in an agile way by simply adding schemas; modification of core OpenContrail Analytics software is not needed.