Meta’s Anti-Scraping team focuses on preventing unauthorized scraping as part of our ongoing work to combat data misuse. To protect Meta’s ever-evolving codebase from scraping attacks, we have introduced static analysis tools into our workflow. These tools allow us to detect potential scraping vectors at scale across our Facebook, Instagram, and even parts of our Reality Labs codebases.
What is scraping?
Scraping is the automated collection of data from a website or app, and it can be either authorized or unauthorized. Unauthorized scrapers commonly hide themselves by mimicking the ways people normally use a product, which makes unauthorized scraping difficult to detect. At Meta, we take a number of steps to combat scraping and have developed methods to distinguish unauthorized automated activity from legitimate usage.
Proactive detection
Meta’s Anti-Scraping team learns about scrapers (entities attempting to scrape our systems) through many different sources. For example, we investigate suspected unauthorized scraping activity and take actions against such entities, including sending cease-and-desist letters and disabling accounts.
Part of our strategy is to further develop proactive measures that mitigate the risk of scraping over and above our reactive approaches. One way we do this is by turning our attack vector criteria into static analysis rules that run automatically on our entire codebase. These static analysis tools, Zoncolan for Hack and Pysa for Python, are built in-house, which lets us customize them for Anti-Scraping purposes. This approach identifies potential issues early and ensures product development teams have an opportunity to remediate prior to launch.
Static analysis tools enable us to apply learnings across events to systematically prevent similar issues from existing in our codebase. They also help us create best practices when developing code to combat unauthorized scraping.
Developing static analysis rules
Our static analysis tools (like Zoncolan and Pysa) focus on tracking data flow through a program.
Engineers define classes of issues using the following:
- Sources are where the data originates. For potential scraping issues, these are mostly user-controlled parameters, since these are the avenues through which scrapers control the data they can receive.
- Sinks are where the data flows to. For scraping, the sink is usually the point where the data flows back to the user.
- An Issue is found when our tools detect a possible data flow from a source to a sink.
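To make this concrete, the sketch below shows how such a rule might be declared in Pysa’s taint.config file, which defines named sources and sinks and the rules connecting them. The taint names, rule code, and messages here are hypothetical placeholders for illustration, not our production configuration:

{
  "sources": [
    { "name": "UserControlled", "comment": "data a client can set, such as request parameters" }
  ],
  "sinks": [
    { "name": "ReturnedToUser", "comment": "data written back into a response" }
  ],
  "rules": [
    {
      "name": "Potential scraping vector",
      "code": 9001,
      "sources": [ "UserControlled" ],
      "sinks": [ "ReturnedToUser" ],
      "message_format": "User-controlled data may flow back to the requester"
    }
  ]
}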
For example, take the “source” to be the user-controlled “count” parameter that determines the number of results loaded, and the “sink” to be the data returned to the user. Here, the user-controlled “count” parameter is an entry point for a scraper, who can manipulate its value to extract more data than the application intended. When our tools suspect that there is a code flow between such sources and sinks, they alert the team for further triage.
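As an illustration of how such sources and sinks might be registered for Django-style request handlers, a Pysa model file could annotate the relevant framework types roughly as follows. This is a sketch: UserControlled and ReturnedToUser are the hypothetical taint names from above, and production models may differ:

# followers.pysa (hypothetical model file)
# Query parameters are attacker-controllable, so treat them as sources:
django.http.request.HttpRequest.GET: TaintSource[UserControlled]

# Data passed into a response body flows back to the user, so treat it as a sink:
def django.http.response.HttpResponse.__init__(
    self,
    content: TaintSink[ReturnedToUser]
): ...

With models like these in place, any path where data from HttpRequest.GET influences what is written into a response becomes a candidate Issue for triage.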
An example of static analysis
Building on the example above, consider the mock code excerpt below, which loads the followers of a page:
# views/followers.py
from django.http import HttpRequest, HttpResponse, HttpResponseForbidden

async def get_followers(request: HttpRequest) -> HttpResponse:
    viewer = int(request.GET['viewer_id'])
    target = int(request.GET['target_id'])
    # User-controlled: the client decides how many results are loaded.
    count = int(request.GET['count'])
    if can_see(viewer, target):
        followers = await load_followers(target, count)
        return HttpResponse(followers)
    return HttpResponseForbidden()

# controller/followers.py
async def load_followers(target_id: int, count: int):
    ...
In the example above, the mock endpoint backed by get_followers is a potential scraping attack vector, since the “target_id” and “count” parameters control whose information is loaded and how many followers are returned. Under usual circumstances, the endpoint would be called with parameters that match what the user is browsing on screen. However, scrapers can abuse such an endpoint by specifying arbitrary users and large counts, which can result in a user’s entire follower list being returned in a single request. By doing so, scrapers can try to evade rate limiting systems, which limit how many requests a user can send to our systems in a defined timeframe and are set in place to stop scraping attempts at a high level.
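To make the contrast concrete, here is a hypothetical illustration of normal use versus abuse of such an endpoint; the URL and parameter values are invented for this sketch:

import requests

ENDPOINT = "https://www.example.com/followers"  # placeholder URL

# A regular client requests roughly what is rendered on screen:
requests.get(ENDPOINT, params={"viewer_id": 1, "target_id": 42, "count": 25})

# A scraper supplies an arbitrary target and an enormous count, pulling an
# entire follower list in one request while staying far below any
# per-request rate limit:
requests.get(ENDPOINT, params={"viewer_id": 1, "target_id": 77, "count": 10_000_000})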
Because our static analysis systems run automatically on our codebase, the Anti-Scraping team can identify such scraping vectors proactively and remediate them before the code reaches our production systems. For example, the recommended fix for the code above is to cap the maximum number of results that can be returned in a single request:
# views/followers.py
from django.http import HttpRequest, HttpResponse, HttpResponseForbidden

async def get_followers(request: HttpRequest) -> HttpResponse:
    viewer = int(request.GET['viewer_id'])
    target = int(request.GET['target_id'])
    # Cap the user-controlled parameter at a server-side maximum.
    count = min(int(request.GET['count']), MAX_FOLLOWERS_RESULTS)
    if can_see(viewer, target):
        followers = await load_followers(target, count)
        return HttpResponse(followers)
    return HttpResponseForbidden()

# controller/followers.py
async def load_followers(target_id: int, count: int):
    ...
Following the fix, the maximum number of results retrieved by each request is limited to MAX_FOLLOWERS_RESULTS. Such a change does not affect regular users; it only interferes with scrapers, forcing them to send orders of magnitude more requests, which would then trigger our rate limiting systems.
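For illustration, a per-viewer sliding-window rate limiter of the kind such traffic would now run into might look like the following sketch; the window size, budget, and function names are hypothetical:

import time
from collections import defaultdict

WINDOW_SECONDS = 60            # length of the sliding window
MAX_REQUESTS_PER_WINDOW = 100  # per-viewer request budget

_recent_requests: dict[int, list[float]] = defaultdict(list)

def allow_request(viewer_id: int) -> bool:
    """Return False once a viewer exceeds the request budget for the window."""
    now = time.monotonic()
    # Keep only the timestamps that are still inside the window.
    timestamps = [t for t in _recent_requests[viewer_id] if now - t < WINDOW_SECONDS]
    if len(timestamps) < MAX_REQUESTS_PER_WINDOW:
        timestamps.append(now)
        _recent_requests[viewer_id] = timestamps
        return True
    _recent_requests[viewer_id] = timestamps
    return False

With results capped, retrieving N followers now takes at least N / MAX_FOLLOWERS_RESULTS requests, so a large scrape exhausts a budget like this quickly while regular browsing stays well inside it.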
The limitations of static analysis in combating unauthorized scraping
Static analysis tools are not designed to catch every possible unauthorized scraping issue. Because unauthorized scrapers can mimic the legitimate ways that people use Meta’s products, we cannot fully prevent all unauthorized scraping without affecting people’s ability to use our apps and websites the way they enjoy. And since unauthorized scraping is both a common and complex challenge, we take a holistic approach to staying ahead of scraping actors.