January 29, 2024

What is loop unrolling? How you can speed up MojošŸ”„ code with @unroll

Open any introductory programming book and youā€™ll find several pages dedicated to structured programming concepts, i.e. making use of loops, conditions and functions extensively for better clarity and maintainability for your code. It helps you express your ideas and solutions neatly and elegantly in code. However, these benefits come at a cost: performance overhead.Ā 

Consider a simple for-loop: each iteration involves condition checks and structural setup, adding to the computational workload. This is especially true for loops with a tiny body. While these overheads usually outweigh the ease-of-use and clarity of code benefits for most use-cases, in some performance-critical situations we might want to trade-off the structural elegance to gain faster performance. Loop unrolling is one such procedure and in this blog post weā€™ll discuss how you can unroll MojošŸ”„ loops to speed up code.

What is loop unrolling?

Loop unrolling is a loop transformation technique that aims at improving the performance of a program at the expense of code size. When a loop is unrolled, instructions that control the loop are eliminated (full unroll) or reduced (by unroll factor) and replaced with multiple copies of the loop body, thereby reducing the computational overhead but increasing the size of the code. A programmer can manually unroll a loop by hand coding the unrolled loop or provide special instructions in code so the Mojo compiler can do it for you. In this blog post, weā€™ll explain how loop unrolling works and how to do it in Mojo.

Understanding loop unrolling

Typical computer programs include both straight-line code and loops for repeated execution of code until a specific condition is met. For any non-trivial program, the percentage of time that is spent executing loops is longer than it is to execute the straight line code. While straight line code is typically executed only once, loops may have to be executed tens, hundreds, or thousands of times or more.Ā 

There are many compiler optimizations techniques to speed up execution of loops and loop unrolling is one such compiler optimization technique. Letā€™s take an example of a simple Mojo loop that prints values from 0 to 4:

Mojo
for i in range(5): print(i)

When compiled, the CPU sees the following pseudo assembly:

Mojo
i = 0 exit_loop = i < 5 loop_body: print(i) i += 1 exit_loop = i < 5 if exit_loop: jump loop_body end_loop: ...

In the code above you can see that for every loop iteration there is arithmetic operation i+=1 one comparison operation i<5 and one conditional jump operation jump loop_body. This is wasteful, since the ratio between useful instructions: print(i) and instructions needed just to control the loop is high.

What if we can adjust that ratio? At compile time, the compiler knows the loop bounds, it could replace the entire loop with unrolled code (this is called full unrolling). In Mojo you can use the @unroll decorator to instruct the compiler to unroll it the loop as follows:

Mojo
@unroll for i in range(5): print(i)

Would generate the following pseudo assembly:

Mojo
print(0) print(1) print(2) print(3) print(4)

Mojo unrolls the loop and replaces 2 lines of loop code with 5 lines of repeated loop body instruction, increasing the overall code size but removing loop control code since there is no loop overhead anymore. Letā€™s dive deeper into Mojo APIs for loop unrolling.

MojošŸ”„ APIs for loop unrolling

MojošŸ”„ offers first class support for manual loop unrolling. In the earlier example, we saw Mojoā€™s @unroll decorator which lets you unroll loops without having to manually write the unrolled version. In Mojo, loop unrolling syntax is part of the core language. We made the very conscious decision that programmers should have more control on the compilerā€™s behavior.Ā 

Traditional compilers such as clang, gcc and others also support automatic loop unrolling that aims to expand the loop and sacrifice code size for more branch free instruction generation. But these are compiler options and outside the control of the programmer. It is often considered ā€œcompiler magicā€ since it is part of black-box compiler optimizations. As with any out-of-the box optimization, they work very well for specific cases and donā€™t work for others and the programmer has no control over it. See Chris Lattnerā€™s CGO2012ā€™s keynote on "Compiler Optimization: Increasing Research Impactā€ for a deeper discussion on this topic.

Mojoā€™s design philosophy is to give mojo programmers more explicit and predictable control on what the compiler does. For example Mojoā€™s @unroll and @unroll(n) are guaranteed syntax that if compilation passes, the decorated loops are guaranteed to be unrolled; if the compiler cannot unroll the loop, compilation fails and tells the user why the loop couldnā€™t be unrolled. The unroll syntax is not just a hint to the compiler, but gives the programmer fine grained control of this compiler behavior. Mojo is still young and evolving rapidly, and over time unroll will provide more control over different optimization levels. If you have feature requests weā€™re all ears, please file requests on GitHub.

MojošŸ”„ supports 3 language features for loop unrolling @unroll, @unroll(n) and unroll[]() function. Letā€™s take a look at an example for each.

@unroll: full loop unrolling

The simplest way to unroll is to use the @unroll decorator which fully unrolls the loop. This is easy and straightforward to use, but increases program code size based on the total number of iterations in the loop. This can be undesirable, particularly for applications where youā€™re trying to keep binary sizes low.

Mojo
@unroll for i in range (0, 3, 1): foo(i) bar(i) # compiler unrolls the loop into something equivalent to the following: foo(0) bar(0) foo(1) bar(1) foo(2) bar(2)

@unroll(n): with specified unroll factor n

In the previous example, we did a full loop unroll, i.e. we eliminated the loop completely. In practice, loop bounds can be too large for full unroll to provide performance gains, so itā€™s efficient to specify an unroll factor instead. Small unrolling factors like 2 or 4 typically see speed improvements and Mojo lets you specify the unroll factor n using @unroll(n) loop decorator. Specifying an unroll factor affords a good balance between the benefit of loop unrolling and potential code size explosion.

The factor by which to unroll can be specified based on potential speed gains against code size increase. For example an unroll by factor 4 looks like this:

Mojo
@unroll(4) for i in range (0, 10, 1): foo(i) bar(i)

Unlike the full unroll in the previous example, this wonā€™t eliminate the loop completely, but would loop fewer times and update the loop body with code repetitions based on unroll factor.

The CPU would see pseudo assembly that looks like this:

Mojo
# compiler unrolls the loop into something equivalent to the following: for i in range (0, 8, 4): # loop body with 4 unrolled iterations. foo(i) bar(i) foo(i + 1) bar(i + 1) foo(i + 2) bar(i + 2) foo(i + 3) bar(i + 3) for i in range (8, 10, 1): # loop body the same as the original loop # to handle the remainder iterations. foo(i) bar(i)

Currently, @unroll and @unroll(n) can only be added to loops with a constant number of iterations that the compiler can infer statically. There is also a restriction on the loops that they need to be ā€œsimple loopsā€, those of which donā€™t have multiple exits in the loop body. We will add more sophisticated support for loop unrolling in the near future and we always welcome feature requests on GitHub.

unroll(): repeatedly evaluate a function

In addition to @unroll decorators, Mojo also offers an unroll() function in the standard library. Rather than unroll a loop, unroll[]() repeatedly evaluates a function parameterized by 1, 2 or 3 integer parameters.Ā 

Mojo
fn unroll[count: Int, func: fn[Int]() capturing -> None]() fn unroll[dim0: Int, dim1: Int, func: fn[Int, Int]() capturing -> None]() fn unroll[dim0: Int, dim1: Int, dim2: Int, func: fn[Int, Int, Int]() capturing -> None]()

This is equivalent to 1, 2 or 3 nested loop with the body of the loop defined in a function func, passed as a compile time parameter to unroll()

Peek under the hood: Performance vs. executable size trade-off

Letā€™s see another simple example of loop unrolling, to analyze the trade-offs between program performance and its executable size.

Mojo
# loop.mojo fn foo(i: Int): print(i * 2) fn bar(i: Int): print(i * 2 + 1) # a for-loop fn main(): for i in range (0, 6, 1): foo(i) bar(i)

The above program loops over different iteration values of [0, 1, 2, 3] and calls foo with these values in each iteration. The output of the program is:

Output
0 1 2 3 4 5 6 7 8 9 10 11

We can also write this program below without using a loop:

Mojo
# no-loop.mojo fn foo(i: int): print(2*i) fn bar(i: int): print(2*i + 1) # manually unroll the for-loop foo(0) bar(0) foo(1) bar(1) foo(2) bar(2) foo(3) bar(3) foo(4) bar(4) foo(5) bar(5)

Comparing these two programs:

  • no-loop.mojo is functionally the same as loop.mojo.
  • no-loop.mojo uses 12 lines of code to accomplish the same thing as loop.mojo which only uses 3 lines of code. If the loop body contains many lines of code L, unrolling the loop will create N * L lines of code if we unroll the loop N times.
  • loop.mojo has a control flow of for in the program , while no-loop.mojo does not have control flow. Control flow requires extra work during execution time, e.g. for needs to check the condition whether the iteration is still in the range or not, and choose either entering the loop body or going to exit the loop.

Manually unrolling a loop is quite tedious when the loop body is large and it can also lead to significant lines of code increase in the program. Unrolling the loop during compilation can avoid tedious manual editing of the program with no source code size explosion.

Now, letā€™s take a look at what happens if we do loop unrolling in the compiler i.e. no need for the programmer to manually expand the source code of the program.

Here is the LLVM IR for the loop version:Ā 

LLVM IR
// (pseudo) LLVM IR with loop define i32 @main(i32 %0, ptr %1) { br label %3 3: ; preds = %8, %2 %4 = phi i64 [ 0, %2 ], [ %9, %8 ] %5 = icmp slt i64 %4, 6 br i1 %5, label %6, label %7 6: ; preds = %3 br label %8 7: ; preds = %3 br label %12 8: ; preds = %6 %9 = add i64 %4, 1 %10 = mul i64 %4, 2 call void @"$stdlib::print"(i64 %10) %11 = add i64 %10, 1 call void @"$stdlib::print"(i64 %11) br label %3 12: ; preds = %7 ret i32 0 }

In this loop version:

  • There are multiple basic blocks [2] (3:, 6:, 7:, 8:, :12) and control flow (br label)
  • Branching can occur at an extra cost during execution in many computer architectures šŸ˜ž
  • The IR size will stay (almost) the same if loop iteration increases so that the generated binary executable size stays the same. šŸ˜Œ

Where as LLVM IR for the unrolled the loop:

LLVM IR
// (psuedo) LLVM IR with unrolled loop define i32 @main(i32 %0, ptr %1) { call void @"$stdlib::print"(i64 0) call void @"$stdlib::print"(i64 1) call void @"$stdlib::print"(i64 2) call void @"$stdlib::print"(i64 3) call void @"$stdlib::print"(i64 4) call void @"$stdlib::print"(i64 5) call void @"$stdlib::print"(i64 6) call void @"$stdlib::print"(i64 7) call void @"$stdlib::print"(i64 8) call void @"$stdlib::print"(i64 9) call void @"$stdlib::print"(i64 10) call void @"$stdlib::print"(i64 11) ret i32 0 }

If the compiler unrolls the loop automatically, in this case, the generated IR:

  • Does not have control flow anymore with only one basic block.
  • Avoid branching cost during execution with a single basic block. Open more possibilities to other compiler optimizations with straight-line code. šŸ˜Œ
  • The IR size will grow proportional to the number of loop iterations which leads to binary executable size increase. šŸ˜ž

Conclusion

Loop unrolling is a simple technique used to optimize a program's execution speed by reducing or eliminating loop control instructions and loop tests. Mojo gives the programmer explicit and fine-grained control on how unrolling should be done on loops. This empowers the programmer to determine the most suitable unrolling factor for a given situation, based on their performance vs. increased code size trade off appetite.Ā 

We hope you enjoyed this explainer, download Mojo, try it out and share your feedback with us! Here are some additional resources to get started:

Until next timešŸ”„!

Weiwei Chen
,
AI Compiler Engineer
Shashank Prasanna
,
AI Developer Advocate

Weiwei Chen

AI Compiler Engineer

Weiwei is interested in building compilers to get the uttermost performance out of computational systems. Before Modular, she has built several compilers from scratch, including MLIR-based compilers and a DSL for dataflow architecture at SambaNova Systems; and a Scala-based compiler for BigData acceleration on FPGAs at BigStream Solutions. She has also worked on concurrent programming library at Qualcomm Research Silicon Valley, and parallel simulation for her graduate study at UC Irvine. Weiwei enjoys playing the fiddle, reading high fantasy and science fiction books, going hiking and folk dancing in New England.

Shashank Prasanna

AI Developer Advocate

Shashank is an engineer, educator and doodler. He writes and talks about machine learning, specialized machine learning hardware (AI Accelerators) and AI Infrastructure in the cloud. He previously worked at Meta, AWS, NVIDIA, MathWorks (MATLAB) and Oracle in developer relations and marketing, product management, and software development roles and hold an M.S. in electrical engineering.