The Facebook web tier serves billions of PHP requests every day across thousands of servers. Because new code and other changes to the site are pushed out frequently, it is critical to have near real-time performance data that is both representative of production traffic and rich enough to pinpoint regressions down to specific functional areas. This blog post describes our solution to this problem: a lightweight but powerful tool called XHProfLive. Profiling is a challenge for real-time services both large and small, so Facebook is happy to announce that XHProf, the lightweight, feature-rich profiler that powers XHProfLive, is now available as open source.
I’d like to go through how we got to this point, starting from the problem of profiling billions of web requests intelligently. A basic solution would be to collect, on a sampled basis, some key statistics about each request, such as generation time, CPU time, the number of calls to backend services like the database and memcache, and the time spent in those calls. This approach requires instrumenting specific parts of our PHP code (e.g., the database and memcache libraries) to log the necessary statistics. The performance data gathered this way is also fairly coarse-grained: if there is a regression in a code path that hasn’t been explicitly instrumented, we have little insight into the problem. We wanted to build a system that would collect performance stats for *all* functions running in production.
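To make the limitation of the basic approach concrete, here is a minimal sketch of sampled, hand-instrumented statistics. It is written in Python rather than PHP purely for brevity, and every name in it (`RequestStats`, `start_request`, the `"db"` label) is hypothetical rather than taken from our codebase.

```python
import random
import time
from contextlib import contextmanager


class RequestStats:
    """Coarse per-request counters for explicitly instrumented backends."""

    def __init__(self, sampled):
        self.sampled = sampled
        self.calls = {}    # backend name -> number of calls
        self.seconds = {}  # backend name -> total time spent in those calls

    @contextmanager
    def timed(self, backend):
        # Unsampled requests skip all bookkeeping.
        if not self.sampled:
            yield
            return
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.calls[backend] = self.calls.get(backend, 0) + 1
            self.seconds[backend] = self.seconds.get(backend, 0.0) + elapsed


def start_request(sample_rate, rng=random):
    """Decide once per request whether to collect statistics."""
    return RequestStats(sampled=rng.random() < sample_rate)


# Only code wrapped in `timed` is visible; everything else is a blind spot.
stats = start_request(sample_rate=1.0)
with stats.timed("db"):
    pass  # a database call would go here
```

Anything that is not wrapped in `timed` never shows up in the data, which is exactly the gap described above.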
However, it was clear that per-function cost alone wouldn’t be very useful without context. For example, just knowing that our queryf() database function accounted for 10% of the total wall time wouldn’t be as helpful as knowing how that cost breaks down by caller. In addition to execution times, we also wanted to collect statistics about memory usage. These considerations prompted us to build XHProfLive, a performance monitoring system that continually gathers hierarchical (or callgraph) profiles from production.
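As a rough illustration of why caller context matters, the sketch below (Python, for brevity) stores cost per caller/callee edge and then splits the wall time of queryf by caller. The `caller==>callee` key layout, the function names other than queryf, and all of the numbers are assumptions made for this example, not production data.

```python
from collections import defaultdict

# Hypothetical profile of one request: one entry per caller==>callee edge.
# "ct" is the call count, "wt" the inclusive wall time in microseconds.
raw_profile = {
    "main()": {"ct": 1, "wt": 1000},
    "main()==>render_feed": {"ct": 1, "wt": 600},
    "main()==>render_profile": {"ct": 1, "wt": 300},
    "render_feed==>queryf": {"ct": 4, "wt": 80},
    "render_profile==>queryf": {"ct": 1, "wt": 20},
}


def breakdown_by_caller(profile, function):
    """Return {caller: fraction of `function`'s wall time spent under that caller}."""
    per_caller = defaultdict(int)
    for edge, metrics in profile.items():
        caller, sep, callee = edge.partition("==>")
        if sep and callee == function:
            per_caller[caller] += metrics["wt"]
    total = sum(per_caller.values())
    if total == 0:
        return {}
    return {caller: wt / total for caller, wt in per_caller.items()}


shares = breakdown_by_caller(raw_profile, "queryf")
```

In this made-up request queryf accounts for 100 of 1000 microseconds, the 10% from the example above, and the breakdown attributes most of it to the hypothetical `render_feed` caller.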
Traditionally, performance monitoring systems use sampling-based profilers in production because of their low overhead, but the profiles they produce are not very detailed. XHProfLive instead uses XHProf, a lightweight, instrumentation-based callgraph profiler for PHP developed in-house. XHProf reports function-level call counts and inclusive/exclusive metrics such as elapsed time, CPU time and memory usage. A function’s profile can be broken down by callers or callees.
XHProf has a simple HTML-based user interface, which makes it easy to view profiler results in a browser or share them with peers. XHProf can compare two runs (a.k.a. “diff” reports) or aggregate data from multiple runs. Diff and aggregate reports, much like single-run reports, offer “flat” as well as “hierarchical” views of the profile. [Note: Although implemented for PHP, the methodology used by XHProf is also well suited to other dynamic languages such as Python and Ruby.]
XHProfLive continually gathers function-level profiles from production by running a sample of requests under XHProf. It then aggregates (rolls up) these individual profiles along various dimensions such as time, page and data center, and can help answer a variety of questions such as: “What is the function-level profile for a specific page?”, “How expensive is a function across the entire site, or on a specific page?”, “What functions regressed most in the last hour/day/week?”, and so on.
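The difference between inclusive and exclusive metrics, and the idea behind a “diff” report, can be sketched in a few lines. This is an illustration in Python under simplifying assumptions (a hypothetical `caller==>callee` edge format, wall time only, no recursion), not XHProf’s actual implementation.

```python
from collections import defaultdict


def flat_profile(edges):
    """Collapse caller==>callee edges into per-function inclusive/exclusive wall time."""
    inclusive = defaultdict(int)      # time in the function and everything it calls
    in_children = defaultdict(int)    # time attributed to the function's callees
    for edge, metrics in edges.items():
        caller, sep, callee = edge.partition("==>")
        if sep:
            inclusive[callee] += metrics["wt"]
            in_children[caller] += metrics["wt"]
        else:
            inclusive[caller] += metrics["wt"]  # root entry has no caller
    return {
        fn: {"incl": wt, "excl": wt - in_children.get(fn, 0)}
        for fn, wt in inclusive.items()
    }


def diff_exclusive(before, after):
    """Per-function change in exclusive wall time between two flat profiles."""
    functions = set(before) | set(after)
    return {
        fn: after.get(fn, {}).get("excl", 0) - before.get(fn, {}).get("excl", 0)
        for fn in functions
    }


# Two made-up runs of the same page; only queryf itself got slower in run_b.
run_a = {
    "main()": {"wt": 1000},
    "main()==>render_feed": {"wt": 600},
    "render_feed==>queryf": {"wt": 100},
}
run_b = {
    "main()": {"wt": 1150},
    "main()==>render_feed": {"wt": 750},
    "render_feed==>queryf": {"wt": 250},
}

flat_a = flat_profile(run_a)
delta = diff_exclusive(flat_a, flat_profile(run_b))
```

Inclusive time grows for every function on the path to the slowdown, while the exclusive diff is nonzero only for the function whose own cost changed, which is why both views are useful.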
XHProfLive has a rich browser-based UI for tracking trends in various metrics, viewing the corresponding XHProf reports, generating aggregate XHProf reports based on custom filters, and plotting histograms of execution time in a function or page. A bot periodically compares current XHProfLive data with historical data and generates regression alerts. Each alert contains a URL to the corresponding XHProf diff report, which makes problem diagnosis far more straightforward. Because XHProfLive provides an accurate, detailed and real-time view of our entire site, it has become one of the primary tools used for performance diagnosis and optimization at Facebook.
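A regression check of the kind the bot performs could look roughly like the following Python sketch. The thresholds, the function name `find_regressions`, and the sample numbers are all hypothetical; the real alerting rules are not described in this post.

```python
def find_regressions(baseline, current, min_ratio=1.2, min_delta_us=50):
    """Flag functions whose average cost grew noticeably versus a historical baseline.

    `baseline` and `current` map function name -> average wall time in microseconds.
    A function is flagged only if it grew by at least `min_delta_us` AND by at least
    a factor of `min_ratio`, so that tiny functions do not trigger noisy alerts.
    """
    alerts = []
    for fn, now in current.items():
        before = baseline.get(fn)
        if not before:
            continue  # no history yet; a real system would likely handle this separately
        if now - before >= min_delta_us and now / before >= min_ratio:
            alerts.append((fn, before, now))
    # Largest absolute regression first.
    return sorted(alerts, key=lambda alert: alert[2] - alert[1], reverse=True)


baseline = {"queryf": 100, "render_feed": 500, "tiny_helper": 2}
current = {"queryf": 250, "render_feed": 510, "tiny_helper": 6}
alerts = find_regressions(baseline, current)
```

Requiring both an absolute and a relative increase is one simple way to keep small fluctuations from producing alerts; other schemes would work as well.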
Again, XHProf, the lightweight, feature-rich profiler behind XHProfLive, is now open source. Check out this and other open source developer tools from Facebook, like codemod, on our open source page.