A fine-grained network traffic analysis with Millisampler

What the research is:

Millisampler is one of Meta’s latest characterization tools. It lets us observe, characterize, and debug network performance efficiently at fine-grained timescales. This lightweight traffic characterization tool runs continually and operates at fine, configurable timescales. It collects time series of ingress and egress traffic volumes, the number of active flows, incoming ECN marks, and ingress and egress retransmissions. Millisampler can also distinguish in-region traffic from cross-region traffic (which has a longer RTT). It runs on our server fleet, collecting short, periodic snapshots of this data at 100μs, 1ms, and 10ms time granularities, stores them on local disk, and keeps them available for several days for on-demand analysis. Since the data is only aggregated flow-level header information, it does not contain any personally identifiable information (PII). Even with the minimal amount of information it collects, Millisampler data has proven very useful in practice, particularly when combined with existing coarser-grained data: we can see clearly how switch buffers or host NICs, for example, might be unable to handle the ingress traffic pattern.

How it works:

Millisampler comprises userspace code to schedule runs, store data, and serve data, and an eBPF-based tc filter that runs in the kernel to collect fine-timescale data. The user code attaches the tc filter and enables data collection. A tc filter is among the first programmable steps on receipt of a packet and near the last step on transmission. On ingress, this means that the eBPF code executes on the CPU core that is processing the softirq (bottom half) as the packet is directed toward the owning socket. Because processing happens on many CPU cores, we avoid locks by using per-CPU variables, which increases the memory requirement but eliminates the risk of contention. To minimize overhead, we sample periodically and for short periods of time. Userspace therefore configures two parameters in Millisampler: the sampling interval and the number of samples. We schedule runs with three sampling intervals (10ms, 1ms, and 100μs), with the number of samples fixed at 2,000 for all sampling intervals. This means that our observation periods range from 200ms (100μs sampling interval) to 20s (10ms sampling interval), allowing us to observe events at sub-RTT to cross-region RTT time scales. At the same time, it fixes the memory footprint of each run at 2,000 64-bit counters per CPU core for each value we measure.
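To make the arithmetic and the per-CPU layout concrete, here is a minimal Python model of one sampling run. It is only a sketch under stated assumptions, not Millisampler's code: the real collector is an eBPF tc filter, and the names `Run`, `on_ingress`, and `aggregate` are hypothetical. The one detail taken from outside the post is the ECN encoding from RFC 3168, where the Congestion Experienced (CE) codepoint is `0b11` in the low two bits of the IP TOS / Traffic Class byte.

```python
NUM_SAMPLES = 2_000   # fixed sample count stated in the text
COUNTER_BYTES = 8     # one 64-bit counter per sample
ECN_MASK = 0b11       # ECN field: low two bits of the TOS / Traffic Class byte
ECN_CE = 0b11         # Congestion Experienced codepoint (RFC 3168)


def observation_window_us(interval_us, num_samples=NUM_SAMPLES):
    """Length of one run in microseconds: interval times sample count."""
    return interval_us * num_samples


def bytes_per_cpu_per_metric(num_samples=NUM_SAMPLES):
    """Memory for one measured value on one CPU core."""
    return num_samples * COUNTER_BYTES


class Run:
    """Hypothetical model of one sampling run (illustration only)."""

    def __init__(self, interval_us, num_cpus, start_us=0, num_samples=NUM_SAMPLES):
        self.interval_us = interval_us
        self.start_us = start_us
        self.num_samples = num_samples
        # One independent array per CPU and per metric: each core only ever
        # writes to its own array, so no lock is needed on the packet path.
        self.ingress_bytes = [[0] * num_samples for _ in range(num_cpus)]
        self.ecn_bytes = [[0] * num_samples for _ in range(num_cpus)]

    def on_ingress(self, cpu, now_us, length, tos):
        """Account one received packet; returns False outside the window."""
        idx = (now_us - self.start_us) // self.interval_us
        if not 0 <= idx < self.num_samples:
            return False
        self.ingress_bytes[cpu][idx] += length
        if (tos & ECN_MASK) == ECN_CE:
            self.ecn_bytes[cpu][idx] += length
        return True

    @staticmethod
    def aggregate(per_cpu):
        """Sum the per-CPU arrays into one time series (done after the run)."""
        return [sum(column) for column in zip(*per_cpu)]


run = Run(interval_us=1_000, num_cpus=2)
run.on_ingress(cpu=0, now_us=500, length=1500, tos=0x03)   # CE-marked
run.on_ingress(cpu=1, now_us=900, length=100, tos=0x00)
run.on_ingress(cpu=1, now_us=1_500, length=200, tos=0x02)  # ECT(0), not CE
ingress_series = Run.aggregate(run.ingress_bytes)
ecn_series = Run.aggregate(run.ecn_bytes)
```

With these numbers, a 100μs interval gives a 200,000μs (200ms) window and a 10ms interval gives a 20s window, and each measured value costs 16,000 bytes per core. Summing the per-CPU arrays only after the run is what lets the packet path stay lock-free.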

Millisampler collects a variety of metrics. It computes ingress and egress total bytes and ingress ECN-marked bytes from the lengths and CE bits of the packets. Millisampler also counts TTL-marked retransmits. It uses a 128-bit sketch to estimate the number of active (incoming and outgoing) connections. The sketch yields an approximate connection count that is precise up to a dozen connections and saturates at around 500 connections per sampling interval. Although there is room for additional precision, in practice the exact number of connections has mattered less than the qualitative variation between a few connections and dozens or hundreds of connections, which has helped us distinguish traffic patterns with many connections (heavy incast) from heavier traffic over fewer connections.
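The post does not say which estimator the 128-bit sketch uses. As an illustration only, the sketch below assumes a linear-counting bitmap: each connection hashes to one of 128 bits, and the count is estimated from the fraction of bits still zero. The hash choice and the names `flow_bit`, `add_flow`, and `estimate_flows` are assumptions made for this example.

```python
import hashlib
import math

SKETCH_BITS = 128  # sketch size stated in the text


def flow_bit(flow):
    """Map a flow identifier (e.g. a 5-tuple) to a bit position.

    The hash function is an arbitrary choice for this example.
    """
    digest = hashlib.blake2b(repr(flow).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % SKETCH_BITS


def add_flow(sketch, flow):
    """Return the sketch with this flow's bit set; repeats are idempotent."""
    return sketch | (1 << flow_bit(flow))


def estimate_flows(sketch):
    """Linear-counting estimate: n ~= -m * ln(zero_bits / m).

    When every bit is set the formula is undefined, so the estimate is
    clamped to its value with one zero bit left (m * ln(m), about 621).
    """
    zero_bits = SKETCH_BITS - bin(sketch).count("1")
    zero_bits = max(zero_bits, 1)
    return -SKETCH_BITS * math.log(zero_bits / SKETCH_BITS)


sketch = 0
for port in range(10):
    flow = ("10.0.0.1", 40_000 + port, "10.0.0.2", 443, "tcp")
    sketch = add_flow(sketch, flow)
    sketch = add_flow(sketch, flow)  # a second packet of the same flow
approx = estimate_flows(sketch)
```

Under this assumption the behavior is qualitatively the one described: with few connections, collisions are rare and the estimate tracks the true count closely, while with hundreds of connections most bits are set and the estimate flattens out. The ceiling of this particular estimator (about 621) is of the same order as the saturation point quoted above but not identical, so the production sketch may well differ.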

Why it matters:

Millisampler is a powerful tool for troubleshooting and performance analysis. Two contrasting network performance faults that we solved at Meta in the last few years required a fine-grained view of traffic. The first problem featured synchronized traffic bursts at fine time scales, and seeing this motivated us to build and deploy Millisampler to catch it quickly if it happened again. The second, which an early Millisampler prototype helped root-cause, featured a NIC driver bug that caused the NIC to stop delivering packets for milliseconds at a time, thereby proving the value of Millisampler in complex investigations. While Millisampler (or Millisampler-like data) played an important role in these investigations, it did so only as part of our rich ecosystem of data collection tools that track a dizzying array of metrics across hosts and the network.

Beyond such incidents, Millisampler data has also proven useful in characterizing and analyzing the traffic of services, allowing us to design and deploy a range of solutions to help improve their performance. For example, we have been able to characterize the nature of bursts across a number of services in order to understand the intensity of incast and tune transport performance accordingly. We have also been able to look at complex interactions between short-RTT and long-RTT flows and understand how bursts of either affect fairness for the other. In a following post, we will look at an extension of Millisampler, Syncmillisampler, where we run Millisampler synchronously across all hosts in a rack and use that data to identify buffer contention in the top-of-rack ASICs.