
Blog


Democratizing AI Compute Series

Go behind the scenes of the AI industry with Chris Lattner

News

Engineering

Why LLM Inference Needs a New Kind of Router - Part 3

Most routing stacks ship with a fixed set of algorithms: round-robin, least-requests, consistent hashing, and so on. These are generally independent implementations rather than composable components. As a result, when a customer asks for "consistent hashing with a concurrency cap" or "cache-aware with session stickiness," someone has to write a new algorithm from scratch. Disaggregated prefill/decode multiplies the variants further. Each variant traditionally carries its own HTTP handler, discovery logic, proxy code, and session management, which adds up to hundreds of lines of extra plumbing per variant.

June 5, 2026 / Aayush Deshpande, Deep Dhillon, Alexandr Nikitin, Michael Dunn-OConnor
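The composability point in the Part 3 summary above can be illustrated with a small sketch. This is not Modular's router: `Pod`, `rendezvous_rank`, `concurrency_cap`, and `route` are hypothetical names, and rendezvous hashing stands in for whichever consistent-hashing scheme a real stack uses. The idea is only that "consistent hashing with a concurrency cap" can be one ranker plus one filter rather than a new algorithm.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class Pod:
    name: str
    in_flight: int = 0  # requests currently being served by this pod


# Hypothetical building blocks: a filter drops ineligible pods,
# a ranker orders the remaining pods for a given routing key.
Filter = Callable[[List[Pod]], List[Pod]]
Ranker = Callable[[str, List[Pod]], List[Pod]]


def _hash(value: str) -> int:
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


def rendezvous_rank(key: str, pods: List[Pod]) -> List[Pod]:
    """Rank pods by rendezvous (highest-random-weight) hashing."""
    return sorted(pods, key=lambda p: _hash(f"{key}|{p.name}"), reverse=True)


def concurrency_cap(limit: int) -> Filter:
    """Build a filter that keeps only pods below `limit` in-flight requests."""
    return lambda pods: [p for p in pods if p.in_flight < limit]


def route(key: str, pods: Sequence[Pod], ranker: Ranker,
          filters: Sequence[Filter] = ()) -> Optional[Pod]:
    """Apply every filter, then return the top-ranked survivor (or None)."""
    candidates = list(pods)
    for keep in filters:
        candidates = keep(candidates)
    ranked = ranker(key, candidates)
    return ranked[0] if ranked else None


pods = [Pod("pod-a"), Pod("pod-b"), Pod("pod-c")]
preferred = route("session-42", pods, rendezvous_rank)
preferred.in_flight = 4  # simulate the preferred pod reaching its cap
fallback = route("session-42", pods, rendezvous_rank, [concurrency_cap(4)])
print(preferred.name, "->", fallback.name)
```

Because the cap is just a filter, the same `route` function could take a session-stickiness filter or a different ranker without another handler or proxy path being written.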

News

Engineering

Three trends from MLSys 2026

The shared conclusion of these talks was that agentic engineering requires substantially greater rigor in specification, design, and validation.

May 29, 2026 / Michael Dunn-OConnor, Brian Zhang, Shouzheng Liu

News

Engineering

Why LLM Inference Needs a New Kind of Router - Part 2

To route a request to the pod with the best cached prefix, you need to know which blocks are cached on which pod. That sounds simple until you look at the numbers. You may have hundreds of pods, each with thousands of cached blocks, and that state can change hundreds of times per second. Despite this scale and churn, queries need to return in microseconds because they sit on the critical path of every inference request.

May 21, 2026 / Aayush Deshpande, Deep Dhillon, Alexandr Nikitin, Michael Dunn-OConnor
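To make the lookup problem in the Part 2 summary above concrete, here is a minimal sketch of a block-to-pod index. It is an illustration under stated assumptions, not the data structure the series describes: `BLOCK_SIZE`, `block_hashes`, and `PrefixIndex` are hypothetical, block hashes are chained so that each hash identifies a whole prefix, and ties are broken arbitrarily by pod name.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Set, Tuple

BLOCK_SIZE = 4  # tokens per cache block (illustrative value only)


def block_hashes(tokens: Sequence[int]) -> List[str]:
    """Hash each full block together with its predecessor's hash."""
    hashes: List[str] = []
    prev = ""
    full = len(tokens) - len(tokens) % BLOCK_SIZE  # ignore a trailing partial block
    for i in range(0, full, BLOCK_SIZE):
        block = list(tokens[i:i + BLOCK_SIZE])
        prev = hashlib.sha256(f"{prev}:{block}".encode()).hexdigest()
        hashes.append(prev)
    return hashes


class PrefixIndex:
    """Maps a chained block hash to the set of pods that cache that block."""

    def __init__(self) -> None:
        self._pods_by_block: Dict[str, Set[str]] = defaultdict(set)

    def add(self, pod: str, tokens: Sequence[int]) -> None:
        for h in block_hashes(tokens):
            self._pods_by_block[h].add(pod)

    def best_pod(self, tokens: Sequence[int]) -> Tuple[Optional[str], int]:
        """Return (pod, matched_blocks) for the longest cached prefix."""
        best: Optional[str] = None
        depth = 0
        alive: Optional[Set[str]] = None
        for n, h in enumerate(block_hashes(tokens), start=1):
            holders = self._pods_by_block.get(h, set())
            alive = holders if alive is None else alive & holders
            if not alive:
                break  # no pod holds the prefix up to this block
            best, depth = min(alive), n  # arbitrary but deterministic tie-break
        return best, depth


index = PrefixIndex()
index.add("pod-a", [1, 2, 3, 4, 5, 6, 7, 8])
index.add("pod-b", [1, 2, 3, 4, 9, 9, 9, 9])
print(index.best_pod([1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13]))
```

A dictionary of sets like this is easy to reason about, but it does nothing for the update rate and latency budget the summary mentions; those constraints are what make the real design harder.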

News

Engineering

Why LLM Inference Needs a New Kind of Router - Part 1

HTTP routing has been a solved problem for many years. Round-robin, consistent hashing, least-connections. Pick one, put it in front of a pool of identical servers, and the traffic spreads pretty evenly.

May 8, 2026 / Aayush Deshpande, Deep Dhillon, Alexandr Nikitin, Michael Dunn-OConnor

News

Engineering

TileTensor Part 1 - Safer, More Efficient GPU Kernels

Suppose you want to load a 2D tile of a matrix, where the tile is stored in shared memory in a specific interleaved layout to avoid bank conflicts. This example uses a toy XOR swizzle to illustrate the class of bugs; real kernels use hardware- and layout-specific swizzles and vectorized accesses. Without a layout abstraction, here is how you would launch a kernel with a block size of (32,8):

April 13, 2026 / Lukas Hermann
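For readers unfamiliar with the term, the effect of a toy XOR swizzle mentioned in the TileTensor summary above can be checked with a few lines of host-side Python. This is an independent sketch, not the article's kernel: it assumes 32 four-byte shared-memory banks and a 32x32 tile of 4-byte elements, and the function names are hypothetical.

```python
from collections import Counter

NUM_BANKS = 32  # assumed: 32 banks, each one 4-byte word wide
TILE = 32       # assumed: 32x32 tile of 4-byte elements, stored row-major


def linear_index(row: int, col: int) -> int:
    """Word offset of element (row, col) with no swizzle."""
    return row * TILE + col


def swizzled_index(row: int, col: int) -> int:
    """Word offset after XOR-ing the column with the row (toy swizzle)."""
    return row * TILE + (col ^ row)


def worst_conflict(index_fn, col: int) -> int:
    """Max number of threads hitting one bank when 32 threads read one column."""
    banks = Counter(index_fn(row, col) % NUM_BANKS for row in range(TILE))
    return max(banks.values())


print("linear:  ", worst_conflict(linear_index, col=5))
print("swizzled:", worst_conflict(swizzled_index, col=5))
```

In the linear layout every thread in the column read lands in the same bank, while the XOR spreads the same read across all banks. The swizzle is also a permutation within each row, so no two elements collide in memory.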

News

Engineering

Structured Mojo Kernels Part 4 - Portability and the Road Ahead

GPU portability has a mixed track record. “Write once, run everywhere” usually means “write once, run slowly everywhere.” CUTLASS does not attempt portability beyond NVIDIA hardware, and its kernels are usually tied to a single hardware generation. Triton provides portability, but performance degrades on non-NVIDIA targets. The conventional wisdom is that you have to choose between being portable and being fast.

April 3, 2026 / Fabio Riccardi, Modular Kernel Team

News

Engineering

Software Pipelining for GPU Kernels: Part 1 - The Pipeline Problem

Flash Attention is a simple algorithm: tiled back-to-back matmuls with an online softmax algorithm in between. The algorithm fits in a few dozen lines of pseudocode. Yet Flash Attention 4's production kernel is 2,875 lines, and the hardest part to get right isn't the math. It's the async execution and pipelining synchronization, all hand-derived from a schedule that no standard debugging tool can verify.

March 30, 2026 / Yingbo Ma
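The "online softmax" step named in the software pipelining summary above is compact enough to show directly. The sketch below is a scalar, single-query illustration of the running-max recurrence only, with none of the tiling across matmuls, async execution, or synchronization that the post is about; the function names and the tile size are illustrative assumptions.

```python
import math
from typing import List, Sequence


def online_softmax_weighted_sum(scores: Sequence[float],
                                values: Sequence[float],
                                tile: int = 2) -> float:
    """Compute sum(softmax(scores) * values) one tile at a time."""
    running_max = -math.inf
    denom = 0.0  # running sum of exp(score - running_max)
    acc = 0.0    # running sum of exp(score - running_max) * value
    for start in range(0, len(scores), tile):
        s = scores[start:start + tile]
        v = values[start:start + tile]
        new_max = max(running_max, max(s))
        # Rescale earlier partial sums to the new maximum.
        scale = math.exp(running_max - new_max)
        weights = [math.exp(x - new_max) for x in s]
        denom = denom * scale + sum(weights)
        acc = acc * scale + sum(w * x for w, x in zip(weights, v))
        running_max = new_max
    return acc / denom


def reference(scores: Sequence[float], values: Sequence[float]) -> float:
    """Two-pass softmax-weighted sum that needs all scores up front."""
    top = max(scores)
    weights: List[float] = [math.exp(x - top) for x in scores]
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


scores = [0.5, 2.0, -1.0, 3.5, 0.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
print(online_softmax_weighted_sum(scores, values), reference(scores, values))
```

The tiled version never needs the full score vector at once, which is why it can sit between two tiled matmuls.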

News

Engineering

Structured Mojo Kernels Part 3 - Composition in Practice

This post shows the practical benefit of this modular design. We take two real kernel families, conv2d and block-scaled matmul, and trace exactly how they are built around the matmul foundation. In both cases, a new kernel family requires changing one component while leaving the rest untouched. The conv2d kernel adds roughly 130 lines of new code, while block-scaled matmul adds roughly 200, with no performance degradation.

March 26, 2026 / Fabio Riccardi, Modular Kernel Team

News

Engineering

Structured Mojo Kernels Part 2 - The Three Pillars

This post explains the components of Structured Mojo Kernels: TileIO, TilePipeline, and TileOp. Each component forms a node in a kernel execution pipeline, and the links between them create a logical separation of concerns that makes kernels easier to extend and update. That organization matters because GPU kernels don't stay static. By abstracting hardware-optimized implementations into patterns, the same kernel structure can adapt across NVIDIA and AMD hardware generations with minimal rewriting.

March 11, 2026 / Fabio Riccardi, Modular Kernel Team

News

Engineering

Structured Mojo Kernels Part 1 - Peak Performance, Half the Code

GPU programming has always demanded precision, but the cost of that precision keeps rising. A production matmul kernel written in C++ spans 3,000–5,000 lines of tightly coupled code where a misplaced barrier silently corrupts results. That complexity gatekeeps hardware that should be available to far more developers, and it's a direct product of how GPUs have evolved: with each architecture generation, more of the orchestration burden has shifted onto the programmer.

March 5, 2026 / Fabio Riccardi, Modular Kernel Team

  • Series

    Democratizing Compute

    Go behind the scenes of the AI industry in this blog series by Chris Lattner. Trace the evolution of AI compute, dissect its current challenges, and discover how Modular is raising the bar with the world’s most open inference stack.

    11 part series

  • Series

    Matrix Multiplication on Blackwell

    Learn how to write a high-performance GPU kernel on Blackwell that is competitive with NVIDIA's cuBLAS implementation while using Mojo's special features to keep the kernel as simple as possible.

    4 part series

  • Series

    Structured Mojo Kernels

    Learn how Mojo simplifies GPU programming with modular kernel architecture, compile-time abstractions, and zero-cost performance across modern GPU hardware.

    4 part series

  • Series

    Software Pipelining for GPU Kernels

    Explore software pipelining for GPU kernels from first principles. We formalize dependencies as a graph, solve for the optimal schedule with a constraint solver, and show how it all integrates into MAX via pure Mojo.

    1 part series

  • Series

    Why LLM Inference Needs a New Kind of Router

    This series walks through why traditional HTTP routing breaks down under LLM workloads and how Modular Cloud solves it with a three-layer architecture built for cache-aware routing.

    3 part series

  • Series

    TileTensor

    This series walks through how Modular built TileTensor, a Mojo tensor type that lets kernel authors express complex memory layouts precisely, safely, and efficiently.

    1 part series

