Today, the internet (like most digital infrastructure in general) relies heavily on the security offered by public-key cryptosystems such as RSA, Diffie-Hellman (DH), and elliptic curve cryptography (ECC). But the advent of quantum computers has raised real questions about the long-term privacy of data exchanged over the internet. In the future, significant advances in quantum computing will make it possible for adversaries to decrypt stored data that was encrypted using today’s cryptosystems.
Existing algorithms have reliably secured data for a long time. However, Shor’s algorithm can efficiently break these cryptosystems given a sufficiently large quantum computer. Although large quantum computers are not a reality yet, there’s an immediate quantum-related threat that needs to be addressed: the “store now, decrypt later” (SNDL) attack, in which attackers intercept and store encrypted data today with the intention of decrypting it at a later date when a sufficiently powerful quantum computer becomes available. This makes transitioning to quantum-resistant cryptography a high priority.
To address this issue, the cryptography community has been working on a new class of cryptosystems known as post-quantum cryptography (PQC). These cryptosystems are expected to withstand quantum attacks but can be less efficient than their classical counterparts, particularly in terms of communication bandwidth. The US National Institute of Standards and Technology (NIST) is close to publishing its new PQC standards (expected to be released this summer). Meta cryptographers are actively contributing to this and other PQC standardization processes (co-authoring the BIKE and Classic McEliece submissions to NIST, and co-editing the ISO/IEC 14888-4 standard).
How Meta is approaching the migration to PQC
Meta’s applications are used by billions of people every day. Given our focus on maintaining user privacy and security, Meta continuously raises its security bar to deploy the most advanced security and cryptographic protection techniques. As part of this continuous effort, we’ve created a workgroup to migrate to PQC, spanning from our internal infrastructure to user-facing apps. This is a highly complex multi-year effort and identifying where to first place PQC protections wasn’t trivial.
After careful analysis, protecting components that are susceptible to the SNDL attack, and where we control both endpoints, has been identified as our first priority (given their migration urgency and lack of external dependencies). In particular, protecting our internal communication traffic was the most sensitive use case that checked both boxes and thus became our first migration target.
But a direct migration to PQC wouldn’t be the most sensible approach. Migrating systems to different cryptosystems always carries some risks such as interoperability issues and security vulnerabilities. For the PQC migration specifically, the risks are even greater because some of these cryptosystems are comparatively new and/or have not experienced a long period of field testing. To reduce such risks, Meta has started transitioning to using hybrid key exchange for TLS, which combines existing classical cryptographic algorithms with a PQC algorithm. In this way, we ensure that our systems remain protected against existing attacks while also providing protection against future threats.
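One way to picture the hybrid construction is as a concatenation of the classical and post-quantum shared secrets before they enter the TLS 1.3 key schedule, which is the approach described in the IETF hybrid key exchange design draft. The sketch below is a minimal illustration under that assumption, not Fizz's actual code: the two secrets are random placeholders rather than outputs of real X25519 or Kyber operations, the salt is a dummy value, and `combine_hybrid_secret` and `hkdf_extract` are hypothetical helper names.

```python
import hashlib
import hmac
import os


def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    """HKDF-Extract (RFC 5869) with SHA-256: an HMAC keyed by the salt."""
    return hmac.new(salt, ikm, hashlib.sha256).digest()


def combine_hybrid_secret(classical: bytes, post_quantum: bytes) -> bytes:
    """Concatenate two 32-byte secrets (assumed combiner, hypothetical name)."""
    if len(classical) != 32 or len(post_quantum) != 32:
        raise ValueError("expected two 32-byte shared secrets")
    return classical + post_quantum


# Placeholders standing in for real X25519 and Kyber shared secrets.
x25519_secret = os.urandom(32)
kyber_secret = os.urandom(32)

hybrid_secret = combine_hybrid_secret(x25519_secret, kyber_secret)
salt = bytes(32)  # dummy value; TLS 1.3 derives the salt from its key schedule
handshake_input = hkdf_extract(salt, hybrid_secret)
```

Because both secrets feed the derivation, an attacker would need to recover both of them to reconstruct the resulting keys, which is the property the hybrid approach relies on.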
For our deployment, we have chosen Kyber with X25519 in a hybrid setting. Kyber is the only key encapsulation mechanism selected by NIST for standardization so far. Kyber comes in different parameterizations: Kyber512, Kyber768, and Kyber1024. Larger parameterizations provide stronger security but also require more computational resources and communication bandwidth. We aim to use Kyber768 by default, while using Kyber512 in some cases where larger parameterizations lead to prohibitive performance impact, to accelerate the deployment of PQC hybrid key exchange.
How Meta is enabling PQC
Meta’s TLS protocol library, Fizz, is designed for high security, reliability, and performance. Early work on Fizz helped standardize TLS 1.3 (RFC 8446). Fizz now supports a range of features including various handshake modes, PSK resumption, Diffie-Hellman key exchange authenticated with a pre-shared key for forward secrecy, async I/O, zero copy encryption, client authentication, and HelloRetryRequest. Using our own implementation has allowed us to react quickly to new features in the TLS protocol.
Fizz is mostly built on top of three libraries: Folly, OpenSSL, and Sodium. To support PQC, we make use of liboqs, an open source library led by world-renowned PQC experts that has received attention from both academia and industry. The liboqs library implements post-quantum algorithms for key encapsulation and digital signatures, including Kyber. Additionally, we extended Fizz with hybrid key exchange functionality, which can make use of the new post-quantum key exchange mechanisms provided by liboqs alongside existing classical mechanisms.
Challenges
Large packet size
One of the main challenges is the size of the Kyber768 public key share, which is 1184 bytes. This is close to the typical TCP/IPv6 maximum segment size (MSS) of 1440 bytes, but the ClientHello of a full TLS handshake still fits in a single packet.
However, the key size becomes an issue during TLS resumption. Internally, we do Ephemeral Diffie-Hellman key exchange to achieve forward secrecy, so key exchange still happens on resumption. There will also be a pre-shared key (PSK) for authentication. These PSKs are 200-300 bytes long, and the remaining ClientHello fields can run up to 200 bytes, causing the resumption ClientHello to exceed the MSS for one packet.
This poses some challenges given significant usage of TCP Fast Open (TFO) for internal traffic. With TFO, the entire ClientHello could previously ride along with the TCP SYN packet, allowing the server’s TLS implementation to start processing and have its ServerHello ready to send right after its TCP SYN-ACK packet. However, when the ClientHello is too large to fit in the first packet, TFO still happens but the ClientHello is only partially sent. The client then has to wait for the TCP handshake to complete before sending the rest of the ClientHello, and needs to wait again for the ServerHello. This adds an extra round trip time (RTT) to the whole handshake process before any application data can be sent.
After evaluating various alternatives and workarounds, and given the prohibitive key size of Kyber768, we opted to use Kyber512 in internal communications affected by this problem for now, allowing us to accelerate the PQC deployment. Kyber512’s 800-byte public keys allow the ClientHello to fit into a single TCP packet, while still being considered secure by NIST. This choice ensures both security and efficient communication. In the future, an increase in MTU, or the use of QUIC, which allows for multiple initial packets, may allow for larger ClientHellos without an additional round trip.
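The packet-size reasoning above can be checked with a little arithmetic. The sketch below uses the figures quoted in this post (a 1440-byte MSS, PSKs of up to 300 bytes, and up to 200 bytes of other ClientHello fields) and assumes that the hybrid key share is the 32-byte X25519 public key concatenated with the Kyber public key. It ignores TCP options and record framing, so it is a rough worst-case budget rather than an exact wire size, and `client_hello_size` is a hypothetical helper.

```python
MSS = 1440               # typical TCP/IPv6 maximum segment size quoted above
X25519_SHARE = 32        # X25519 public key size in bytes
KYBER_PUBLIC_KEY = {"Kyber512": 800, "Kyber768": 1184}
PSK_MAX = 300            # upper end of the 200-300 byte PSK range
OTHER_FIELDS_MAX = 200   # remaining ClientHello fields, worst case


def client_hello_size(kyber: str, resumption: bool) -> int:
    """Worst-case ClientHello estimate in bytes (hypothetical helper)."""
    size = X25519_SHARE + KYBER_PUBLIC_KEY[kyber] + OTHER_FIELDS_MAX
    if resumption:
        size += PSK_MAX  # the PSK is only present on resumption
    return size


for kyber in KYBER_PUBLIC_KEY:
    for resumption in (False, True):
        size = client_hello_size(kyber, resumption)
        verdict = "fits" if size <= MSS else "exceeds MSS"
        mode = "resumption" if resumption else "full handshake"
        print(f"{kyber:8} {mode:14} {size:4} bytes  {verdict}")
```

Under these assumptions, only the Kyber768 resumption case (1716 bytes) overflows a single segment, while the Kyber512 resumption ClientHello (1332 bytes) still fits.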
Multithreading problem with liboqs
After we rolled out post-quantum hybrid key exchange to our fleet, one of our internal teams started experiencing intermittent but persistent segmentation fault crashes, with liboqs code near the top of the stack trace. Here is an example stack trace:
#0 0x0000000000000000 in ?? ()
#1 <signal handler called>
#2 0x0000000000000000 in ?? ()
#3 0x0000556ea1ed5eac in keccak_x4_inc_absorb.constprop ()
We determined the problem to be a race condition that caused a call through a null function pointer (address 0). The issue was filed with liboqs. To explain briefly, the race condition was in the Keccak_Dispatch function, where Keccak_Initialize_ptr would be set before some other function pointers. Crucially, whether Keccak_Initialize_ptr is set is what the caller of Keccak_Dispatch uses to decide whether to call it at all. In a multi-threaded environment, one thread could call Keccak_Dispatch, set Keccak_Initialize_ptr, and pause there. Another thread could then take the same code path, see that Keccak_Initialize_ptr is non-zero, and opt not to call Keccak_Dispatch, then call one of the other function pointers that is still zero, leading to a segfault. (The same is true of the Keccak_X4_Dispatch function.)
Although liboqs is being used by a growing number of products and companies, it appears that we were the first to encounter and report this issue, possibly due to the scale of our trial deployment. We fixed it by calling Keccak_Dispatch with pthread_once on POSIX platforms. The fix has since been submitted and merged upstream.
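The shape of this fix can be illustrated outside of C as well. liboqs is a C library and the fix described above relies on pthread_once; the Python sketch below is only an analogue of that idea, using a lock to guarantee that the dispatch routine runs once and that no caller proceeds until every entry is populated. The class, its methods, and the placeholder "function pointers" are hypothetical and do not mirror the liboqs code.

```python
import threading


class DispatchTable:
    """Hypothetical stand-in for a lazily filled table of function pointers."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._ready = False
        self.initialize = None
        self.absorb = None
        self.dispatch_calls = 0

    def _dispatch(self) -> None:
        # Populate every entry; a real library would select CPU-specific routines.
        self.dispatch_calls += 1
        self.initialize = lambda: bytearray(8)
        self.absorb = lambda state, data: state.extend(data)

    def ensure_dispatched(self) -> None:
        # Every caller takes the lock, so none can observe a half-filled table.
        with self._lock:
            if not self._ready:
                self._dispatch()
                self._ready = True


table = DispatchTable()
errors = []


def worker() -> None:
    try:
        table.ensure_dispatched()
        state = table.initialize()
        table.absorb(state, b"data")
    except TypeError as exc:  # raised if an entry were still None when called
        errors.append(exc)


threads = [threading.Thread(target=worker) for _ in range(16)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The important design point is that readiness is published only after the whole table is filled, rather than being inferred from one pointer that happens to be set first.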
Cross-domain resumption handshake thrash
We rolled out post-quantum hybrid key exchange progressively, with the decision driven by the client. For instance, we started with connections between different data centers, then moved on to traffic within the data center.
Internally, we scope TLS sessions by “service” name. This allows a client to perform cross-host resumption to different servers in the same service. This includes the ability to resume from a server with which the client decides to use hybrid key exchange to one where the client does not, and vice versa, which runs into a small problem with Fizz.
As previously mentioned, we do Ephemeral Diffie-Hellman key exchange on resumption. To facilitate efficient use of computation resources, the client will send only the minimally required default keyshares, which in the resumption case means the keyshare for the previously negotiated named group. This means that when a client connects to a particular server and negotiates a classical named group, then subsequently resumes on a server with which the client should use a hybrid named group, the client would advertise the hybrid named group but send only the keyshare for the classical named group. This leads to the server negotiating the hybrid named group and replying with a HelloRetryRequest to ask the client for the hybrid keyshare, resulting in an additional round trip to perform the key exchange.
To address this, we had the client split each service into different TLS session scopes: one using classical key exchange, and one using hybrid key exchange. Each session scope thus uses only one named group, avoiding the keyshare thrashing behavior described above. The tradeoff is space consumption due to having to store more session tickets, but this has been acceptable given the small size of each session ticket (a few hundred bytes).
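The effect of splitting session scopes can be sketched with a ticket cache keyed by both the service name and the kind of key exchange. This is an illustrative model rather than Fizz's session cache: the class, the scope-key layout, the "storage" service name, and the ticket values are all assumptions made for the example.

```python
from typing import Dict, Optional, Tuple

ScopeKey = Tuple[str, str]  # (service name, "classical" or "hybrid")


class SessionCache:
    """Hypothetical ticket cache with one scope per service and key exchange kind."""

    def __init__(self) -> None:
        self._tickets: Dict[ScopeKey, Tuple[bytes, str]] = {}

    def store(self, service: str, kind: str, ticket: bytes, group: str) -> None:
        # Remember the ticket together with the named group it was negotiated with.
        self._tickets[(service, kind)] = (ticket, group)

    def resumption_group(self, service: str, kind: str) -> Optional[str]:
        """Named group whose keyshare a resuming client would send, if any."""
        entry = self._tickets.get((service, kind))
        return entry[1] if entry else None


cache = SessionCache()
# A classical session with one server of the service...
cache.store("storage", "classical", b"ticket-a", "x25519")
# ...and a hybrid session with another server of the same service.
cache.store("storage", "hybrid", b"ticket-b", "X25519_kyber768")

# A resumption that should use hybrid key exchange only ever sees the hybrid
# ticket, so the keyshare sent matches the group the server will negotiate.
hybrid_group = cache.resumption_group("storage", "hybrid")
classical_group = cache.resumption_group("storage", "classical")
```

With a single scope per service, the second `store` call would have overwritten the first, which is the situation that leads to a mismatched keyshare and a HelloRetryRequest.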
The computational cost of Kyber key exchange
Meta currently uses X25519 in Elliptic Curve Diffie-Hellman key exchange. During the initial rollout of hybrid key exchange with the hybrid named group X25519_kyber768, we observed a roughly 40 percent increase in CPU cycles. Although this may seem like an undesirable result, it is actually encouraging: the hybrid exchange performs both an X25519 and a Kyber768 key exchange, so an increase of only about 40 percent over X25519 alone indicates that a standalone Kyber768 key exchange is faster than X25519, which lines up with results others have found.
Current status and future plans
Meta has deployed post-quantum hybrid key exchange for most internal service communication to protect against the SNDL threat. Since internal service communication traffic occurs within our internal network and is fully under our control, this was the logical starting point for implementing this advanced security countermeasure, even as we await NIST’s publication of the PQC standards.
Extending post-quantum hybrid key exchange to external public internet traffic poses several additional challenges, such as dependency on browsers’ TLS implementations and crypto libraries’ PQC readiness, increased communication bandwidth due to larger payloads, and more. We are looking forward to industry standardization and adoption by major browsers, and we’ll keep working across Meta to harden our systems as well. We look forward to sharing more as we continue our efforts in this space.
Acknowledgements
We thank the current and past members of Meta’s Service Encryption team, particularly Isaac Elbaz, Fred Qui, Keyu Man, Puneet Mehra, Forrest Mertens, Srinivas Murri, Ameya Shedarkar, and Mingtao Yang.