Fighting spam with pure functions

Like any popular Internet site, Facebook is a target for abuse. Our Site Integrity engineers rely on FXL, a domain-specific language forged in the fires of spam fighting at Facebook, to quash this abuse before it can affect our users. Feature eXtraction Language (FXL) evolved in response to our need for a fast, flexible, safe way to write rules for identifying spam.

Spam threats to Facebook’s site integrity change on a daily, or even hourly, basis. Attackers peddling a “free iPad 5” scam one day might tempt users with false promises of various gift certificates the next. Fortunately, FXL provides us with the capabilities to keep pace with constantly evolving threats. FXL offers two key advantages: it is simple and easy to write, yet extremely efficient for Facebook-sized workloads.

Do we really need another programming language?

Building your own language is almost always a bad idea. We know this. In actuality, FXL is not a novel language. It’s better described as a narrowly-optimized implementation of a well-chosen subset of Standard ML (with some customized syntax). We tried hard to tread no new language ground, but instead aggressively optimize FXL for our needs.

Specifically, our use case requires that FXL fetch large numbers of data objects across the graph. Detecting and responding to spam attacks requires data from a multitude of sources, and FXL is ruthlessly efficient at fetching this data. This primary purpose gives FXL its name: Feature eXtraction Language.

Consider a few contrived spam fighting rules, expressed in FXL, for catching dangerous URLs:

If (Reputation(SharedUrl)

If (Reputation(SharedUrl) == MALWARE) Then [BlockAction, LogRequest] Else []

If (Average(Map(Reputation, PreviousSharedUrls(User, 5)))

These rules retrieve the user’s URL sharing history and fetch data from a URL reputation service. While they coherently express business logic for detecting spam, these rules are poor expressions of the optimal data fetching logic. A conventional implementation would evaluate this code top to bottom, left to right. We would fetch data sequentially, conducting an excessive number of network round trips between the machine executing FXL and the reputation service. This is a classic problem of large computer systems: naively mixing business logic with data fetching logic, resulting in pathologically bad performance. A more sophisticated approach would find a way to batch these data fetches in a single network round trip. FXL was designed to do precisely this and automate these data fetches.

Pure functions to the rescue

By making certain assumptions about the state of the environment in which we execute FXL, we are able to treat FXL as a “pure” language with no side effects. Whenever we need to run a set of rules on a piece of content, we assume that the data in our infrastructure does not change during this classification. FXL functions themselves have no side effects and do not update the data in our infrastructure. This has some important consequences:

All features and functions can be safely memoized…

1. F(X) will always be Y, no matter how many times we compute it

2. Random() is not pure, therefore not memoizable (and not allowed inFXL)

or executed lazily…

1. “False && F(X)” can safely skip F(X)

2. “If False Then True Else F(X)” as well

or safely reordered.

1. “G(F(x), F(y), …)” will give the same result, no matter which F is executed first.

2. “A(x) + B(x) + C(x)” as well

We aggressively use these properties to automatically optimize the execution of FXL.

Automatic Batching

Let’s take another look at this snippet from our example above:

Map(Reputation, PreviousSharedUrls(User, 5))

This snippet will make up to five requests to our fictional URL reputation service. Luckily for us, FXL will batch all five of these requests together and perform them simultaneously. As a result, the time to make all five requests is about the same as the time to make just one request. This is not a special property of the Map() function, as this optimization is performed across all expressions of all rules.

FXL is able to batch requests together because the order in which it evaluates these function calls has no bearing on their results (this follows from their lack of side effects). FXL will actually halt the execution of one function, begin executing a second function, and only later return to complete executing the first function.

In the call to Map() above, FXL makes five calls to Reputation(). FXL begins executing the first call to Reputation(), then halts its execution at the point it would need to fetch data from the URL reputation service. FXL then begins executing the second call to Reputation(), halting again before fetching any data.

FXL repeats this begin-and-halt procedure on the third, fourth, and fifth calls to Reputation() as well. At this point, no functions remain which have not been partially executed. No function can proceed without fetching data, so FXL fetches all the data needed by these functions in a single batch. Having obtained the URL reputation data, it can resume execution of all five calls to Reputation().

We have estimated that FXL’s batched fetching is responsible for a factor of twenty speedup when compared to a naïve execution model that fetches data eagerly.

Memoization

Because all data fetching is delayed as long as possible, it actually becomes quite easy to eliminate duplicate requests for the same data (at least within a given round of data fetching). We actually take this one step further and memoize all common FXL expressions. In other words, if two features contain two identical expressions within them, due to the properties of pure functions, those two expressions will result in an identical answer. We execute that common subexpression only once, sharing the result in both places.

Summary

FXL is a remarkably simple language that allows engineers and analysts alike to write rules to deal with abuse on the site. We crafted FXL to satisfy two constraints: 1) expressively codify the business logic of fighting spam and 2) fetch data as efficiently as possible. The automatic, and aggressive, data fetching optimizations are a direct consequence of the pure execution model.