Open-sourcing FBGEMM and QNNPACK, foundational tools for AI performance

WHAT IT IS:

QNNPACK and FBGEMM are high-performance kernel libraries that enable mobile devices and servers to run the latest AI models more efficiently. Both libraries have been deployed to production at Facebook, where they are improving the performance of computer vision models on mobile devices and speeding up computer vision, machine translation, and other services running on our servers. We are open-sourcing both libraries so others can boost the performance of their deep learning models, as well as contribute back any performance improvements they make.

HOW IT WORKS:

Deep learning frameworks (such as PyTorch) commonly use higher-precision floating-point numbers (e.g., 32-bit floating point) to represent the weights and activations of a neural network during training. But after model training is finished, higher-precision floating-point representations and calculations become overkill. Many types of models can be adapted to use low-precision integer arithmetic for inference without noticeable accuracy loss.
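To make the idea concrete, here is a minimal sketch of affine quantization, the basic mapping from 32-bit floats to 8-bit integers plus a scale and zero point. The function names and the uint8 scheme are illustrative assumptions for this example, not part of the QNNPACK or FBGEMM APIs:

    # Illustrative sketch of affine quantization; names are hypothetical,
    # not QNNPACK/FBGEMM APIs.
    import numpy as np

    def quantize_uint8(x):
        # Map float values onto [0, 255] using a scale and zero point.
        x_min, x_max = float(x.min()), float(x.max())
        scale = (x_max - x_min) / 255.0
        zero_point = int(round(-x_min / scale))
        q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
        return q, scale, zero_point

    def dequantize(q, scale, zero_point):
        # Recover an approximation of the original float values.
        return (q.astype(np.float32) - zero_point) * scale

    weights = np.random.randn(4, 4).astype(np.float32)
    q, scale, zp = quantize_uint8(weights)
    print(np.abs(weights - dequantize(q, scale, zp)).max())  # small quantization error

Because the integer values carry a shared scale and zero point, matrix multiplications can be performed almost entirely in 8-bit integer arithmetic, with the float reconstruction deferred to the end of the computation.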

QNNPACK and FBGEMM enable these high-performance, low-precision calculations for operations such as matrix multiplication and convolution, which are important in state-of-the-art deep learning architectures.
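As a rough illustration of how these kernels are reached from a framework, the sketch below uses PyTorch's dynamic quantization path, assuming a PyTorch build that exposes FBGEMM (x86 servers) or QNNPACK (ARM/mobile) as quantized backends; the toy model is an assumption for the example:

    # Minimal sketch: reduced-precision inference via PyTorch's quantized
    # backends, assuming a build with FBGEMM/QNNPACK support.
    import torch

    torch.backends.quantized.engine = "fbgemm"   # or "qnnpack" on ARM/mobile

    model = torch.nn.Sequential(
        torch.nn.Linear(128, 256),
        torch.nn.ReLU(),
        torch.nn.Linear(256, 10),
    ).eval()

    # Dynamic quantization converts Linear weights to int8; the resulting
    # low-precision matrix multiplications dispatch to the selected backend.
    qmodel = torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )

    x = torch.randn(1, 128)
    print(qmodel(x).shape)  # torch.Size([1, 10])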

WHY IT MATTERS:

QNNPACK and FBGEMM are now publicly available, so the AI research and engineering community can use them to improve performance for reduced-precision on-CPU inference. At Facebook, QNNPACK helps mobile devices deliver a real-time implementation of Mask R-CNN, a model for semantic segmentation and keypoint estimation. FBGEMM has delivered encouraging server-side results on language-translation models, recommendation systems, and models for text understanding in images and videos.

In general, as deep learning models grow wider and deeper in search of better accuracy, they are increasingly reliant on the operations that QNNPACK and FBGEMM optimize. Furthermore, the deep learning community is continuing to move toward low-precision models, as evidenced by the newer generations of GPUs, CPUs, and specialized tensor processors that all natively support lower-precision compute primitives. This indicates that optimization through quantized inference will continue to be important for deploying new AI products.
