The world's fastest unified matrix multiplication

April 20, 2023

Abdul Dakkak

AI Compiler Engineer

Chad Jarvis

AI Performance Engineer

Eric Johnson

Product Lead

Hengjie Wang

AI Performance Engineer

Ian Tramble

AI Performance Engineer

"Matmul", a microcosm of AI performance

In our previous blog post, we described why AI needs to solve its compute fragmentation problem to reach its full potential and how matrix multiplication ("matmul") exemplifies why this remains an unsolved problem. In this post, we describe Modular’s approach to solving this problem and its game-changing benefits, including a new standard in state-of-the-art (SOTA) performance on CPU as compared to existing solutions.

Before we get there, however, let’s recap where existing implementations fall short and why building a generalizable solution is so difficult. Remember, the AI industry today is bound by hardware performance and memory capacity. The result has been a plethora of diverse, parallel hardware architectures, each backed by highly optimized kernel libraries from its hardware vendor.

The problem for AI developers is that these kernel libraries are monolithic “point solutions” that each support only a small subset of the industry's hardware and use cases. They are often written in assembly to maximize performance, but as a result, they sacrifice composability, hackability, and portability to multiple hardware architectures. And they are large in code size due to their need to specialize on specific shapes and data types.

A novel approach

Capitalizing on years of experience building AI infrastructure that has scaled to billions of users, Modular has developed a novel approach to solving this industry-wide problem. To do so, we rethought the entire stack from first principles and built something that is truly differentiated in the industry today.

Instead of following a traditional approach of writing hard-coded kernels or a matmul compiler, we built a much more general and extensible technology that combines the best features of both approaches. This technology enables kernel authors to quickly develop high-performance kernels that span shapes, layouts, data types, and hardware architectures. Our event on May 2 (you should tune in!) and a future blog post will talk more about how our technology works, while this post focuses on the benefits and contributions of our approach.

Unification starts with a single source of truth

If you dig into the source code for libraries such as OneDNN, you find many implementations of matmul – each hard-coded and specialized for a different use case. You’ll find one for each data type (FP16, FP32, FP64, Int8), for various memory layouts (transposed or non-transposed), for special aspect ratios (square or tall-and-skinny), for different instruction set features, and more.

These fragmented point solutions make it difficult for engineers to improve the library for all possible use cases because there is too much code. This also leads to problems for users because these libraries take up a lot of disk space, swelling containers and distributions. For example, OneDNN is 51MB, MKL is 14MB, and cuBLAS is 150MB.

Modular combines what would typically be many bespoke hardware-specific implementations into a “Single Source of Truth.” As a result, expert kernel authors can build a single composable, extensible, and portable code base across architectures and use cases. And this approach enables rapid reuse of patterns and code, applicability to optimized sub-variants of problems, and easy adoption of exotic hardware features in special cases. The Modular implementation of matrix multiplication is typically less than 100 KB of machine code, which makes it practical to use in many different use cases, including mobile, web, and IoT.

Performance portability

Implementing a performant matmul for any individual chip is challenging, as we discussed in the previous post. The challenge is compounded, however, by the adoption of heterogeneous hardware in the AI industry, including various flavors of CPUs, GPUs, TPUs, and so much more. Yet, today’s kernel libraries only natively support a very limited number of target architectures.

For example, while OneDNN supports ARM cores, it is implemented as a wrapper around ARM’s own hardware-specific software library, the ARM Compute Library (ACL). Meanwhile, OneDNN is not optimized for AMD, which has its own fork of OneDNN called ZenDNN that leverages the AMD Optimizing CPU Libraries (AOCL). Mobile is another can of worms, where libraries such as Google’s Ruy are often used.

All these bespoke libraries pose a significant problem for AI frameworks and, ultimately, for the users of those frameworks. Frameworks get fragmented with different variants and forks, and users must mix and match different versions with different bugs and tradeoffs. This can introduce a big gap between theoretical performance and achieved performance because users often don’t know (or don’t want to know!) about this level of software.

At Modular, we love all the world’s hardware, and the generality of our approach extends well to many kinds of architectures. This allows us to provide a unified solution that defragments the framework software above the kernels. This also makes it much faster to implement high-performance support for new hardware types, with a comparably tiny engineering team and significantly less cost.

Dynamism

Some systems use kernels that are compiled “just in time” (JIT) or “ahead of time” (AOT) by advanced AI compilers, including Google’s XLA, Apache TVM, and OneDNN Graph. These compilers generate kernels specialized for specific matrix sizes, which reduces the code size of the distribution but still requires many kernels to exist at execution time. Other libraries, such as MKL, special-case the individual matrix sizes used by popular models directly in their kernel library.

These challenges have become even more problematic given the rise of dynamically shaped models like BERT (and countless large language models, segmentation models, object detectors, and so on!), which need to work on inputs (text or images) of arbitrary size. The challenge is that the system only knows the input size at inference time, not at model training or compilation time. Systems based on static shapes require padding the input or swapping between many versions of a model specialized for different sizes, often leading to low performance, large code size, and model management frustration. Some systems try to solve this with JIT code generation, but this introduces problems with unpredictably long tail latencies.
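
For instance, a static-shape system handling variable-length token sequences has to pad every request up to a fixed bucket size and spend compute on the padding, while a dynamic-shape system can simply run the matmul at the actual sequence length. A minimal NumPy sketch of the difference (the sizes are hypothetical):

import numpy as np

hidden = 768
max_seq_len = 512                       # static bucket the model was compiled for
tokens = np.random.rand(37, hidden)     # a request with only 37 tokens
weights = np.random.rand(hidden, hidden)

# Static-shape path: pad up to the bucket and run a 512x768x768 matmul,
# even though most of the rows are just zeros.
padded = np.zeros((max_seq_len, hidden))
padded[: tokens.shape[0]] = tokens
out_static = padded @ weights

# Dynamic-shape path: run only the 37x768x768 matmul that is actually needed.
out_dynamic = tokens @ weights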

The Modular approach completely eliminates these problems by fully supporting dynamism. Modular’s matrix multiplication and other kernels are fully dynamic shape friendly (without JIT or AOT specialization) and support other forms of dynamism (e.g., irregular control flow, unusual data types, etc.) that many existing systems struggle with. This delivers a much simpler and more predictable system overall.

Composability

In a neural network, a matmul is seldom performed in isolation. Typically, there are other operations (e.g., activation functions or elementwise operations) that are done before and after it. It is well known that “fusing” the code for these other operations into the matmul can produce significant performance benefits by improving memory locality and reducing dispatch overhead. There are two common approaches to address this problem – providing a limited number of pre-fused special cases or providing a domain-specific compiler to perform fusion.
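
Before comparing these approaches, it helps to see what fusion buys at the kernel level. The pure-Python schematic below (an illustration, not how any production kernel is written) shows that an unfused matmul followed by ReLU writes the full MxN intermediate to memory and then reads it back, while the fused version applies the activation while each output element is still at hand:

def matmul_then_relu(A, B):
    # Unfused: materialize the full MxN intermediate, then make a second pass over it.
    M, K, N = len(A), len(B), len(B[0])
    C = [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)] for i in range(M)]
    return [[max(c, 0.0) for c in row] for row in C]

def fused_matmul_relu(A, B):
    # Fused: apply the activation as each output element is produced, so the
    # intermediate never makes a round trip through memory.
    M, K, N = len(A), len(B), len(B[0])
    return [[max(sum(A[i][k] * B[k][j] for k in range(K)), 0.0)
             for j in range(N)] for i in range(M)]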

The first approach is the best known and is widely used by both TensorFlow and PyTorch, as well as many other specialized frameworks (ONNXRuntime, TFLite, TensorRT, etc.). This approach is very powerful and flexible because researchers and domain experts who are not compiler engineers can extend the system. The challenge is that there are a vast number of potential combinations of operators (this is one of the reasons why TensorFlow and PyTorch have thousands of kernels!). Hand-fusing these operators further exacerbates the code size and maintainability problems discussed above.

The second approach provides a different point in the tradeoff space – AI compilers like OneDNN Graph, XLA, and NVFuser provide a wide range of kernel fusions without having to special-case them all. Unfortunately, they force you to choose from a small fixed operator set without extensibility. Also, while novel fusions can provide great benefits, these compilers often don’t meet the performance of traditional human-authored fused kernel libraries.

The Modular approach provides both benefits – it supports generalized fusions with a wide range of operators without having to manually write and maintain variants. More importantly, the Modular approach allows generality and extensibility without having to recompile the system and without having to be a compiler engineer. We think it will enable major new research avenues and applications by experts who may not know compiler internals.

Unparalleled performance

While flexibility, generality, and usability sound great, they aren’t worth anything if they come at the expense of performance. Performance costs directly drive operational costs, and all businesses want to be more efficient. We’re excited to share some of our early results on CPU (GPUs are coming soon!), even though they are just the beginning for the Modular system, and we have a lot of work left to do.

For our analysis, we decided to look at a range of comparable systems available today in AWS, specifically Intel Skylake (c5.4xlarge), AMD Zen 2 (c5a.4xlarge), and Amazon Graviton 2 (c6g.4xlarge). This covers two completely different instruction sets (Intel and AMD are x86-64, Graviton is ARM AArch64) with three major vector designs (AVX-512, AVX2, and NEON, respectively) at three different vector lengths (512, 256, and 128 bits).

We measure the Modular approach against the best-known SOTA libraries on the corresponding systems – MKL and OneDNN on Intel, AOCL on AMD (the underlying library for ZenDNN), and ACL and Ruy on ARM. We also include Eigen because it is a widely used kernel library that has been ported to many architectures. We use the latest version of each of these at the time of writing – specifically, we use MKL v2023.1.0, OneDNN v2023.1.0, Eigen v3.4, AOCL v4.0, ACL v23.02.1, and Ruy (pulled from main #363f252).

Methodology

For our evaluation, we followed the same benchmarking methodology as Google Benchmark, where each benchmark is first warmed up and then repeatedly run until 2 seconds have elapsed. For libraries that require extra setup, we perform the setup outside the main benchmarking loop. To avoid interference and improve stability, we ensure that each benchmark invocation starts with a cold cache, and we disable hyperthreading. While the Modular implementation does support pre-packing, not all of the libraries we evaluate support it, so to maintain fairness we do not benchmark our implementation with pre-packing enabled. We note that we have benchmarked our pre-packed implementation against the libraries that do support pre-packing, and it is competitive with them.
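
As a rough sketch of this methodology (not our actual harness; only the warm-up, the 2-second budget, and GFLOP/s reporting are taken from the description above, and the cold-cache and threading controls are omitted), the timing loop looks something like this:

import time
import numpy as np

def benchmark_matmul(M, N, K, min_time_s=2.0):
    # Left-hand side is MxK, right-hand side is KxN (see the shape convention below).
    A = np.random.rand(M, K).astype(np.float32)
    B = np.random.rand(K, N).astype(np.float32)

    A @ B  # warm-up run, excluded from timing

    iters = 0
    start = time.perf_counter()
    while time.perf_counter() - start < min_time_s:
        A @ B
        iters += 1
    elapsed = time.perf_counter() - start

    # A matmul performs 2*M*N*K floating-point operations (a multiply and an add
    # per inner-product term), so throughput in GFLOP/s is:
    return 2.0 * M * N * K * iters / elapsed / 1e9

print(benchmark_matmul(128, 768, 768))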

As we discussed in our previous post, matrix multiplication is used for a wide variety of use cases. For this study, we decided to measure ourselves against the most important shapes in the AI industry, which are most likely to have been optimized by existing libraries. As such, we selected matrix shapes mined from popular AI models such as BERT (with sequence lengths of 128 and 256), GPT, and DLRM. The shapes listed are in the MxNxK form, where the left-hand side operand of matmul has a size of MxK, and the right-hand side has a size of KxN. The shapes are ordered by their importance to the end-to-end model execution.
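
To make the convention concrete, here is a minimal NumPy example (the shape is a hypothetical BERT-like layer, with M=128 tokens and a 768-wide hidden dimension):

import numpy as np

# MxNxK form: M = 128, N = 768, K = 768.
M, N, K = 128, 768, 768

A = np.random.rand(M, K).astype(np.float32)  # left-hand side operand, MxK
B = np.random.rand(K, N).astype(np.float32)  # right-hand side operand, KxN

C = A @ B          # performs 2*M*N*K floating-point operations
assert C.shape == (M, N)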

Finally, while there are many interesting data types like Int4, FP8, and bfloat16, we wanted to keep things simple and comparable. The Modular system can, of course, support any and all types out there, but for this analysis, we focus on FP32, which is widely used and tuned by each implementation we reference.

Performance results

With this in mind, we start by looking at the single-threaded performance on the Intel Skylake (c5.4xlarge) system. Single-threaded performance doesn’t utilize the entire chip but helps normalize results (i.e., removing higher-order factors like multi-processing, false sharing, NUMA issues, etc.). Beyond that, it forms the foundation of multi-threaded performance and is essential for certain use cases in mobile and game engines.

The figure below shows the performance in GigaFLOPs per second (GFLOP/s) of the Modular approach and other SOTA implementations for the Intel system – MKL, OneDNN, and Eigen. The Modular matmul implementation achieves performance that is on par with or better than existing SOTA solutions. In fact, we are roughly 1.5 times faster than OneDNN on this Intel system.

While the single-threaded performance is a useful datapoint, full multi-threaded performance is what typically matters for server use cases. This also puts much more stress on the machine, as a full peak implementation can run into limitations like peak FLOP/s, DRAM bandwidth, cache utilization, etc. Below we show Modular’s performance against these systems – Modular is 1.46 times faster than OneDNN, which is a remarkable achievement given the generality and other benefits we discussed before.

While strong results on one hardware platform are important, a fundamental value proposition of the Modular implementation is that a single source of truth can deliver high performance across a wide range of hardware. Let’s look at Modular’s performance on AMD hardware, when compared to the AOCL library (which is the SOTA on AMD and is the backbone of ZenDNN) and the OneDNN library we saw above.

Looking at the graph below, you can see that the performance of OneDNN does not translate to AMD hardware, and while the AOCL library provides significant uplifts, the Modular approach is approximately 1.6 times faster than SOTA on the AMD system.

We also performed the same experiment on the Amazon Graviton 2 system, this time including the Ruy and ACL libraries. Ruy is the library used by edge frameworks such as TensorFlow Lite, and ACL is the backbone of the OneDNN support for ARM.

Even though ACL and Ruy are both competitive on ARM, the Modular implementation achieves significantly better performance on average – 1.8 times better than ACL and 1.2 times better than Ruy.

In addition to comparing different implementations of matrix multiplication on a given system, it is also interesting to cross-compare the absolute performance of these different systems. These are very complicated machines with a lot of moving parts, but we can look at things at a coarse grain. The Intel system benefits from having 512-bit long vectors instead of shorter 256- or 128-bit vectors. The Graviton 2 system performs well despite a shorter 128-bit vector because it has 16 physical cores, compared to the 8 physical cores on the Intel and AMD systems.
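
As a rough back-of-the-envelope model (illustrative only: the pipe counts and clock speeds below are assumptions about these microarchitectures, not measurements from this post), peak FP32 throughput scales roughly as cores × FP32 lanes per vector × FMA pipes × 2 FLOPs per FMA × clock frequency, which helps explain why Graviton 2’s core-count advantage offsets much of its narrower vector width:

def peak_gflops(cores, vector_bits, fma_pipes, ghz):
    # FP32 lanes per vector register, times 2 FLOPs per fused multiply-add.
    lanes = vector_bits // 32
    return cores * lanes * fma_pipes * 2 * ghz

# Assumed configurations (hypothetical pipe counts and sustained clock speeds).
print(peak_gflops(cores=8,  vector_bits=512, fma_pipes=2, ghz=3.0))   # Skylake-like (AVX-512)
print(peak_gflops(cores=8,  vector_bits=256, fma_pipes=2, ghz=3.3))   # Zen 2-like (AVX2)
print(peak_gflops(cores=16, vector_bits=128, fma_pipes=2, ghz=2.5))   # Graviton 2-like (NEON)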

Kernel fusion aware

Finally, to demonstrate the composability of the implementation, we will look at how matmul can be fused with other operations. We want to compare against common operations that other implementations have highly tuned, so we use a “fully connected” (FC) block, defined by the equation "activation(matmul(A, B)+bias)," where activation is an activation function (we use ReLU below). Libraries such as OneDNN have fused paths for the FC block, and to make the results fair, we only compare against libraries that provide a way to define the FC block in a fusible fashion. Therefore, in this analysis, we compare against OneDNN and Eigen, since they provide a way to express the fusion patterns. For consistency of presentation, we use the same shapes and three hardware configurations as above.
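
For reference, the FC block being benchmarked computes the following (a NumPy sketch of the mathematical definition above, not of any library’s fused kernel):

import numpy as np

def fc_block(A, B, bias):
    # Fully connected block: activation(matmul(A, B) + bias), with ReLU as the activation.
    return np.maximum(A @ B + bias, 0.0)

M, N, K = 128, 768, 768
A = np.random.rand(M, K).astype(np.float32)
B = np.random.rand(K, N).astype(np.float32)
bias = np.random.rand(N).astype(np.float32)
out = fc_block(A, B, bias)  # shape MxN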

Below we can see the performance of the FC block on the Intel Skylake architecture. Despite the generality and flexibility of the Modular approach, it sets a new SOTA, outperforming OneDNN by 1.45 times and Eigen by 1.8 times on average.

We see similar strong results on the AMD system, where the Modular approach delivers a 2.1 times performance advantage over OneDNN and 2.3 times performance improvement over Eigen.

The Amazon Graviton 2 system is a significantly different architecture, and the Modular stack is not as tuned as it is for X86-64. Still, even here, we can see that Modular delivers a 1.3 times performance improvement over Eigen and 1.1 times over Ruy. OneDNN/ACL does not provide a fused FC layer for ARM systems.

As we can see from the data above, kernel fusion can provide significant performance uplifts when implemented right, and the benefits become even more significant as the fusion region grows. Modular’s approach was built to embrace fusion from the beginning, which allows it to support a very general set of fusions (i.e., far beyond elementwise operations, and not limited to matmul). We think that delivering a single-source-of-truth implementation that is performance portable, dynamic, and composable is a key contribution that will enable new research and production use cases.

What’s next

While we are excited about our performance results, the most important thing about them is that we can achieve them without compromising on our original goals, and that they are just the beginning! The Modular matmul has a single source of truth and supports many different architectures, dynamic shapes, and extensible fusions. Beyond delivering many “today” benefits to our users, the generality of our architecture allows us to radically simplify the stack above, produce a more predictable user experience, and enable rapid bring-up of new hardware in a way that people haven’t experienced before.

It is also worthwhile to emphasize that today’s existing SOTA implementations are the product of decades of research and development by many incredibly talented engineers. Modular has a talented but comparatively small team and has been able to deliver strong results quickly because of key technology advances in our stack. But even more than that, it’s also the result of a first-principles rethink and a willingness to truly invest in the “rebuild from the bottom up” approach that Modular was founded on.

This is all part of our broader vision to make AI usable by anyone, anywhere, and to enable AI to truly impact the world in a more meaningful and useful way. By creating novel approaches to AI infrastructure, we imagine a world where we can help the entire industry develop and deploy AI systems faster, more efficiently, and more safely, ultimately making AI more accessible to the whole world. We hope to empower the entire hardware industry to build new and novel compute architectures, driving hardware innovation forward through software.

While we show matmul performance in this blog post, we have applied the same methodology across the stack. Tune in to the launch to see how these improvements translate to end-to-end performance, and to learn about the technology that enables them. We are excited to explain more about how it works soon. Sign up for our upcoming May 2, 2023 launch event at Modular.com to learn more. Additionally, Modular is growing its exceptional team – if you are interested in building and driving forward the state of the art of AI infrastructure, please check out the openings on our careers page.
