How to Be Confident in Your Performance Benchmarking

March 19, 2024

Konstantinos Krommydas

Mojo Kernel Engineer

Mojo as a language offers three main benefits, namely the 3 P’s: Performance, Programmability, and Portability. It enables users to write fast code, do so more easily than in many alternative languages, and run that code across different CPU platforms, with GPU support on the roadmap.

With a growing developer community, Mojicians all over the world have been writing all kinds of applications using Mojo. The Mojo GitHub repo has more than 17K stars and our Modular Discord channel has over 22K members! One of the things new users frequently comment on is how easy it is to learn and - oftentimes to their surprise and disbelief - how fast their applications run compared to prior implementations in other languages!

Mojo performance sounds too good to be true, so…

Here at Modular we are not surprised by the above, knowing the huge capabilities (and even greater potential) of Mojo. However, we understand that different communities can (and should!) show healthy skepticism of performance claims that seem “too good to be true”. This blog post shares 10 fundamental benchmarking practices, both in general and in the context of Mojo, to enable all Mojicians to be confident when presenting their Mojo performance results!

Many of the points in the list below are obvious and well known to many programmers; however, we want to make it easy and welcoming for programmers of all backgrounds and levels to experiment with Mojo and appreciate the speed benefits it brings! We encourage all Mojicians to understand the best ways to benchmark the performance of their Mojo applications and to employ them when sharing performance numbers. Following the spirit of the basic benchmarking principles below also makes it easier for Modular to highlight your project on our social media (X, LinkedIn), or - who knows - even get you invited to a future ModCon to present your project in person!

Performance benchmarking principles

Here are ten basic principles of performance benchmarking. Most of them include a list of things to be aware of with regard to that principle. Benchmarking is a deep field, but we've attempted to cover as much ground as possible within the confines of a blog post!

1. Measure a common use case that matters. Don’t cherry-pick.

Mojicians’ expertise spans many specialized domains. Given the budding popularity of Mojo, it is highly probable that other experts (in AI, physics, bioinformatics, etc.) will come across your benchmark, and the case you pick will be taken into account when they evaluate the results. In that sense, picking a corner case with no practical utility in a domain will be less useful than a case that the majority cares about. Obviously this is domain-specific; not all people care about the same things, and not everyone has time to implement everything! As long as a corner case is not misrepresented as the common case but is clearly noted as such, it is a valid contribution.

Matrix multiplication is a simple example of an important problem. It's used extensively throughout machine learning, and while it has a simple naive implementation, it's still a topic of active performance research. Different use cases dictate what matrix sizes are important to test with. If you haven't already, check out the Mojo documentation on “Matrix Multiplication in Mojo” for an example of incremental optimizations and benchmarking.
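
To make this concrete, here is a minimal naive matrix multiplication sketch in Mojo. The sizes, names, and flat-buffer layout are purely illustrative (this is not the optimized version from that guide), and import paths for types like DTypePointer may differ slightly between Mojo releases:

```mojo
from memory.unsafe import DTypePointer
from memory import memset_zero

alias M = 256
alias K = 256
alias N = 256

# C[M, N] = A[M, K] * B[K, N] over flat, row-major buffers (naive triple loop).
fn matmul_naive(C: DTypePointer[DType.float64],
                A: DTypePointer[DType.float64],
                B: DTypePointer[DType.float64]):
    for i in range(M):
        for j in range(N):
            var acc: Float64 = 0.0
            for k in range(K):
                acc += A.load(i * K + k) * B.load(k * N + j)
            C.store(i * N + j, acc)

fn main():
    var A = DTypePointer[DType.float64].alloc(M * K)
    var B = DTypePointer[DType.float64].alloc(K * N)
    var C = DTypePointer[DType.float64].alloc(M * N)
    for i in range(M * K):
        A.store(i, Float64(i % 7))   # deterministic inputs (see principle #3)
    for i in range(K * N):
        B.store(i, Float64(i % 5))
    memset_zero(C, M * N)
    matmul_naive(C, A, B)
    print(C.load(0))  # use the result so the work isn't optimized out (see #6)
    A.free()
    B.free()
    C.free()
```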

2. Ensure correctness of results.

This one is self-explanatory; results must be compared to some “golden” reference output and match it exactly, or be within a certain numerical threshold (if that applies to the specific domain). Performance is awesome, but getting wrong results faster is of no use!
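
As a rough sketch of what such a check can look like in Mojo (the function name, data layout, and tolerance below are all illustrative, not a standard API):

```mojo
from memory.unsafe import DTypePointer

# Return True if every element of `result` is within `tol` of the golden reference.
fn allclose(result: DTypePointer[DType.float64],
            golden: DTypePointer[DType.float64],
            n: Int, tol: Float64) -> Bool:
    for i in range(n):
        var diff = result.load(i) - golden.load(i)
        if diff < 0.0:
            diff = -diff
        if diff > tol:
            return False
    return True

fn main():
    var n = 4
    var result = DTypePointer[DType.float64].alloc(n)
    var golden = DTypePointer[DType.float64].alloc(n)
    for i in range(n):
        golden.store(i, Float64(i))
        result.store(i, Float64(i) + 1e-12)  # tiny rounding-level difference
    # Validate before reporting any performance numbers.
    print("results match:", allclose(result, golden, n, 1e-9))
    result.free()
    golden.free()
```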

3. Make fair (apples-to-apples) comparisons.

One of the most important aspects of performance benchmarking, when it comes to comparing different implementations, is making sure the comparisons are fair. This is where most discussions occur, as deviation from best practices can make one’s performance claims easy to dismiss. For faster results of a given implementation (the Mojo implementation in our case) to be meaningful, the comparison needs to be apples-to-apples.

  • Make sure you use equivalent optimization flags across implementations; even though flags that enable multiple optimizations at once (like -O3 in C) are not always equivalent to another language’s -O3, make sure you don’t compare something like a debug build against an implementation built with the fastest optimization flags.
  • Make sure that if one implementation has auto-vectorization or automatic multithreading enabled, the same applies to all implementations being compared (unless for a given language one of these performs worse when turned on, in which case one could keep the fastest implementation for comparison purposes).
  • Use the latest (or best) combination of compilers, libraries, etc. — an older compiler version (for example) may perform better for whatever reason; however, it should be considered sufficient to test with the latest stable version. One can test with older or experimental versions if they are so inclined.
  • Use the same input file (if applicable) or same input data. Avoid random data generation that may stress different code paths.
  • Use the same algorithm (if applicable) across all your implementations.
  • Use equivalent error testing as it applies to different domains’ best practices (e.g., input sanitizing, corner case testing).
  • Remove any unnecessary I/O (e.g., writing to file/screen for debug purposes) and keep only what is practically necessary — make sure you do so in a manner that code is not optimized out (see #6)!
  • Try to apply the same level of manual optimization (within reason): if you write multi-threaded/vectorized code in Mojo, you should try to compare it to an equivalent implementation in the other language. There is a case to be made here, however, if the other language does not have such capabilities, or they are so difficult to use that implementing them is beyond what one can reasonably do. This can highlight the programmability aspect of Mojo (or of one language against another more generally), but this fact should be noted so that people can interpret the performance claims in this light.
  • Whether you employ auto-parallelization/auto-vectorization or parallelize/vectorize manually, make sure all compared implementations use the same number of cores/threads/SIMD lanes, as applicable (see the sketch after this list).
  • Ensure any system-level settings remain the same during your benchmarking and across implementations (e.g., pinning threads, NUMA configuration, any performance-related OS or hardware settings) (also see #8).
  • Double-check that you measure the same thing across implementations.
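
The sketch below illustrates two of these bullets in Mojo: it fills the input deterministically and makes the worker count explicit, so the same values can be mirrored in every compared implementation. The names are ours, and we assume the parallelize(work_items, num_workers) form from the stdlib algorithm module:

```mojo
from algorithm import parallelize
from memory.unsafe import DTypePointer

alias N = 1000000
alias NUM_WORKERS = 8  # document this and use the same count in every implementation

fn main():
    var data = DTypePointer[DType.float64].alloc(N)
    var out = DTypePointer[DType.float64].alloc(N)

    # Deterministic input: avoids random data that may stress different code paths.
    for i in range(N):
        data.store(i, Float64(i % 100))

    @parameter
    fn scale(i: Int):
        out.store(i, data.load(i) * 2.0)

    # Explicit worker count instead of a library default.
    parallelize[scale](N, NUM_WORKERS)

    print(out.load(N - 1))  # use the output so the work isn't optimized away
    data.free()
    out.free()
```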

4. Measure the parts that matter. Equally.

This needs to be assessed on a case-by-case basis and is highly application-dependent; a benchmark can often represent a standalone piece of functionality, in which case you may measure time/performance end-to-end. Quite typically, however, a benchmark reflects smaller (but crucial) parts of a larger application, where its input comes from another part of the application and/or its output is used by another part of the application.

  • Both aforementioned parts can be outside the scope of the benchmark, so input/output can be read/written from/to a file, input can be generated at run-time, output can be written to the screen, etc. If these are not actual parts of the application, do not include them in your performance measurements.
  • Similarly, do not measure your correctness test comparisons.
  • In any case, measure the same thing across all your implementations.

For example, the Modular MAX benchmark for Stable Diffusion measures only the UNet component of the model.
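
As a generic illustration, the sketch below times only a stand-in kernel with time.now() and keeps setup, correctness checks, and output outside the measured region (the kernel and numbers are purely illustrative):

```mojo
from time import now

# Stand-in for the part of the application you actually want to measure.
fn kernel(n: Int) -> Float64:
    var acc: Float64 = 0.0
    for i in range(n):
        acc += Float64(i) * 0.5
    return acc

fn main():
    var n = 10000000

    # Setup / input generation / file reading: NOT timed.

    var start = now()            # nanoseconds
    var result = kernel(n)
    var elapsed_ns = now() - start

    # Correctness checks and any output: also NOT timed.
    print("result:", result)
    print("kernel time (ms):", Float64(elapsed_ns) / 1e6)
```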

5. Make your benchmark reproducible.

The projects with the highest impact are the ones that can be validated and withstand the scrutiny of the community. The best way to enable this is to make it easy for anyone to reproduce your results. Obviously the level of detail can range from high-level to extremely detailed; more detail typically lends more credibility to the benchmark and its results. Examples include:

  • GitHub code of all related implementations: if, for instance, you compare Mojo to Python and Rust, include implementations in all three languages — not just Mojo! Keep in mind the “apples-to-apples” parameters discussed in #3.
  • Specify how you compiled the program in each language (as applicable) — this includes specifying all the flags you used; ideally provide the exact commands used to build and/or provide build scripts.
  • Specify how you run the program in each language — specify all run parameters; as in the compilation case, ideally provide run scripts that automate runs.
  • If the nature of your application entails an input file/data, make sure to include it directly or provide a link to download it.
  • Along with your results, don’t forget to list the specs of the system you run your benchmark on — this includes CPU model, RAM, operating system and version, compiler version(s), related library version(s), and any configuration that is non-default and can affect performance.

To help with this, Mojo’s benchmark tool supports MLPerf Benchmark Scenarios, utilizing a common framework with community tested scenarios and results.

6. Make sure code is not optimized out.

Sometimes programmers come across performance that is so fast it seems unreal! It may indeed be unreal if the compiler optimizes major parts of the code away. For example, if your code would normally spend seconds calculating its output, it could appear to take almost zero time if the results are never used.

  • Make sure your code is not optimized out by the compiler. Touch/use what you compute: whatever computation you (think you) do to calculate a value X could be a no-op if you never read X outside of the part where the computation takes place. Hopefully you already print/write the value to ensure correctness anyway (as discussed in #2); you just don’t measure that part.
  • Mojo provides the keep() and clobber_memory() functions in the benchmark package. Both are useful in benchmarking: the former ensures the compiler doesn’t delete code just because its result is never used in a side-effecting way, and the latter ensures the compiler doesn’t optimize away memory writes it deems unnecessary. A short sketch follows.
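
Here is a minimal sketch of the keep() idea, assuming the benchmark.run()/Report API as documented at the time of writing (exact signatures may shift between releases). Both functions do the same arithmetic, but the one without keep() may legally be reduced to (almost) nothing by the compiler:

```mojo
import benchmark
from benchmark import keep

fn without_keep():
    var acc: Float64 = 0.0
    for i in range(100000):
        acc += Float64(i) * 0.5
    # `acc` is never observed: the compiler is free to delete the entire loop,
    # which can make this "benchmark" report a near-zero time.

fn with_keep():
    var acc: Float64 = 0.0
    for i in range(100000):
        acc += Float64(i) * 0.5
    # keep() tells the compiler the value is observed, so the work must happen.
    keep(acc)

fn main():
    print("without keep (s):", benchmark.run[without_keep]().mean())
    print("with keep (s):   ", benchmark.run[with_keep]().mean())
```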

7. Collect results in a statistically correct & meaningful way.

We don’t mean to suggest that measurements need to be super-scientific, but some fundamentals should be followed (and, conversely, common pitfalls should be avoided). This is to account for the inherent variability of computers.

  • Don’t run once and present that one result. This is misleading in most cases.
  • Run multiple times (5, 10, 100, depending on the specifics of your platform and the inherent variability of the benchmark).
  • You can obviously present multiple results using different metrics, but in the most basic case an average is good enough.
  • In many cases, it is suggested that you do a few warm-up runs before starting performance measurement.
  • Mojo provides the benchmark package in its stdlib, which helps with all the above! Head over to the link to see how to measure a Mojo function of interest and print a meaningful performance report.

You can also see how we use the benchmark package for appropriate benchmarking in our Mojo examples: Matrix Multiplication in Mojo and Mandelbrot in Mojo.
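
For a self-contained starting point, here is a hedged sketch using the benchmark package (we assume the run()/Report API and the max_runtime_secs parameter as documented at the time of writing); run() repeats the measurement for you, including warm-up, so you don't have to hand-roll the loop:

```mojo
import benchmark
from benchmark import keep

fn work():
    var acc: Float64 = 0.0
    for i in range(100000):
        acc += Float64(i) * 1.000001
    keep(acc)  # don't let the loop be optimized away (see #6)

fn main():
    # run() times `work` over many iterations and returns a Report;
    # max_runtime_secs caps how long the whole measurement may take.
    var report = benchmark.run[work](max_runtime_secs=1.0)
    report.print()                           # mean, total, iterations, etc.
    print("mean (ms):", report.mean() * 1000)
```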

8. Make your test system as deterministic as possible.

All computer systems contain sources that may introduce performance variability. When benchmarking your code, you need to ensure that all such sources are addressed accordingly. First and foremost, you shouldn’t be running other things on your system while benchmarking, so disable unneeded processes/services to minimize noise. Beyond that, here are some basic concerns that apply to most systems. While the details may differ slightly (e.g., settings for different CPU vendors), the concepts remain the same; below we focus on three of the biggest culprits of performance variability and give examples of how to address them in the context of a Linux system (a more exhaustive list can be found here). Besides the operating-system level, similar settings can also be found in a system’s BIOS.

  • Disable frequency boosting: Turbo Boost is an Intel CPU feature that increases the CPU frequency above the normal maximum based on computation demands and power/thermal constraints. A similar technology in AMD CPUs is called Turbo Core (or Core Performance Boost). Since such capabilities affect performance in a non-deterministic way, you need to disable them:
```bash
# On Intel CPUs
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
```
```bash
# On AMD CPUs
echo 0 > /sys/devices/system/cpu/cpufreq/boost
```
  • Disable Simultaneous Multi-Threading (SMT): Modern CPUs support simultaneous multi-threading (hyper-threading in Intel terminology), where a single physical CPU core presents itself as two logical cores (in the typical implementation). While this may be advantageous for general CPU usage, it typically leads to resource contention, worse performance, and performance variability when benchmarking applications. To disable it you can run the following:
```bash
for cpunum in $(cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | cut -s -d, -f2- | tr ',' '\n' | sort -un); do
  echo 0 > /sys/devices/system/cpu/cpu$cpunum/online
done
```
  • Disable frequency scaling: Frequency scaling allows varying the CPU frequency in order to save power or increase performance, depending on the desired setting. Since we opt for performance, you need to select the performance governor, which runs the CPU cores at their maximum frequency:
```bash
for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
  echo performance > $i
done
```

9. Use an appropriate measurement metric.

For most purposes, reporting time is a safe and straightforward choice. However, if a certain domain has an established metric, use that. For example, when measuring LLM inference, metrics of interest can include throughput, latency, time to first token, and time per output token.
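
For instance, a matrix multiplication result is often reported as GFLOP/s; the arithmetic is a one-liner (the numbers below are made up, purely for illustration):

```mojo
fn main():
    # Illustrative numbers only: a matmul of size M x K x N performs roughly 2*M*N*K FLOPs.
    var M = 512
    var N = 512
    var K = 512
    var mean_secs: Float64 = 0.025  # e.g., the mean() of your benchmark report

    var flops = 2.0 * Float64(M) * Float64(N) * Float64(K)
    print("Throughput (GFLOP/s):", flops / mean_secs / 1e9)
    print("Latency (ms):", mean_secs * 1e3)
```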

10. Be fair and well-intentioned!

This is obvious: one should not approach performance benchmarking with any form of bias, but in an ethical and well-intentioned way. We at Modular are confident in the power of Mojo, and excited about new features to come! We know the community of Mojicians loves Mojo as much as we do, and we should all be respectful of all languages and their communities - which we already are! Different languages may excel at different use cases, and more than one language has a place in a programmer’s arsenal. Let’s not forget that every new language (Mojo included) learns from and builds upon many of the languages that came before it.

As an example of this, consider the dialogue between the Mojo and Rust communities that followed a comparison of Mojo and Rust for DNA sequencing, which led to a larger discussion of how Mojo is designed with performance in mind.

Now you can confidently and proudly show off your Mojo project performance!

Performance claims (as much in programming as in every other aspect of life) are historically contentious, because being faster is not a standalone quality but entails something else being slower… and no one likes to be on the worse side of a comparison! In this blog post we outlined some basic performance benchmarking principles. We encourage all Mojicians to follow these principles when benchmarking their Mojo code against implementations in other languages.

If you have general questions, or specific questions about how the above applies to your Mojo project, don’t hesitate to ask in our Discord channel. Our community is an awesome group of people from all programming backgrounds and levels, always happy to help!

Here are some additional resources to help you get started.

Until next time🔥!

Konstantinos Krommydas, Mojo Kernel Engineer