
March 11, 2026

Structured Mojo Kernels Part 2 - The Three Pillars

Fabio Riccardi

Modular Kernel Team

Engineering

In Part 1, we introduced the structured kernel architecture and made the case for why it exists. Conventional GPU kernels (where pipeline coordination, data movement, and compute interleave across thousands of lines) are difficult to reason about, maintain, and extend. Adding block-scaled quantization to a conventional kernel means writing a new kernel. Adding support for a new hardware feature means touching code that has nothing to do with that feature. Structured Mojo Kernels address this by separating concerns into three components with defined interfaces, and we showed the result: 48% less code, identical performance, and a conv2d kernel that reused the entire matmul infrastructure with only ~130 lines of conv-specific code.

This post explains the components of Structured Mojo Kernels: TileIO, TilePipeline, and TileOp. Each component forms a node in a kernel execution pipeline, and the links between them create a logical separation of concerns that makes kernels easier to extend and update. That organization matters because GPU kernels don't stay static. By abstracting hardware-optimized implementations into patterns, the same kernel structure can adapt across NVIDIA and AMD hardware generations with minimal rewrite.

💡
Code: All kernels mentioned in this series are available in the Modular repository.

The Data Flow Model

A structured matmul kernel moves data through three distinct memory levels: global memory (DRAM), shared memory (SMEM/LDS), and registers or tensor memory. The three components form nodes in a data flow and are responsible for the different stages of execution. TileIO moves tiles from DRAM to shared memory. TilePipeline signals when tiles are ready to consume or free to refill. TileOp reads from shared memory, runs MMA instructions, and writes accumulators. This separation of concerns lets you reason about each component in isolation.

Data Flow Model Diagram
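
In pseudocode, one trip through the pipeline looks roughly like this. This is a conceptual sketch, not literal kernel code: `loader`, `pipeline`, and `op` stand in for concrete TileIO, TilePipeline, and TileOp instances.

```mojo
# Conceptual sketch of one pipeline stage handoff (illustrative names).

# Producer side: TileIO fills a shared-memory stage once TilePipeline
# reports it free.
with pipeline.producer().acquire() as tiles:  # wait until the stage is empty
    loader.load(tiles, ...)                   # DRAM -> SMEM via TileIO

# Consumer side: TileOp drains the stage once TilePipeline reports it full.
with pipeline.consumer().acquire() as tiles:  # wait until the stage is full
    op.mma(tiles, ...)                        # SMEM -> registers/TMEM + MMA
```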

TileIO: Data Movement Abstraction

TileIO handles all data movement between memory hierarchies. Its responsibility is to hide the complexity of platform-specific DMA operations behind a clean interface.

What TileIO Encapsulates

| Concern | NVIDIA | AMD | Encapsulated By |
| --- | --- | --- | --- |
| DMA operation | TMA async_copy | load_to_lds | load() |
| Address calculation | TMA descriptors | Buffer resources | Constructor |
| Layout transformation | TMA swizzle modes | Manual swizzle | Configuration |
| Masking/bounds | TMA handles | Manual checks | load() |

NVIDIA: TileLoaderTMA

On NVIDIA Blackwell, TileIO wraps TMA (Tensor Memory Accelerator, explained in a previous blog post) operations. The loader stores a pointer to the TMA descriptor and a multicast mask for cluster distribution:

mojo
struct TileLoaderTMA[
    tma_origin: ImmutOrigin,
    dtype: DType,
    gmem_layout: Layout,
    desc_layout: Layout,
    /,
    *,
    cta_group: Int,
](TrivialRegisterPassable):
    """TMA-based tile loader for Blackwell."""

    comptime TmaOp = TMATensorTile[Self.dtype, Self.gmem_layout, Self.desc_layout]

    # TMA descriptor pointer (referencing grid constant)
    var tma_op: Pointer[Self.TmaOp, Self.tma_origin]
    # Multicast mask for cluster distribution
    var multicast_mask: UInt16

    @always_inline
    def load[tile_layout: Layout, /, alignment: Int = 128](
        self,
        dest: SMemTile[Self.dtype, tile_layout, alignment=alignment],
        ref[AddressSpace.SHARED] barrier: SharedMemBarrier,
        k_coord: UInt,
        row_coord: UInt,
    ):
        """Load a tile asynchronously via TMA hardware."""
        self.tma_op[].async_multicast_load[Self.cta_group](
            dest, barrier, (k_coord, row_coord), self.multicast_mask
        )

The loader is deliberately minimal; it wraps one TMA descriptor and one multicast mask to expose a data movement primitive. The kernel manages all orchestration (k-group iteration, barrier management, expected byte counts), and interacts with the loader through the stage readiness/consumption protocol exposed by TilePipeline, which we'll discuss in "TilePipeline: The Pipeline Coordinator".

AMD: TileLoaderLDS

On AMD, TileIO uses buffer resources for cooperative global-to-LDS loading. The architecture requires a fundamentally different loading strategy:

mojo
struct TileLoaderLDS[
    dtype: DType,
    src_layout: Layout,
    src_tile_layout: Layout,
    num_loading_warps: Int,
    swizzle: Optional[Swizzle] = Optional[Swizzle](),
    load_width: Int = simd_width_of[dtype](),
    use_full_tile_width: Bool = False,
](TrivialRegisterPassable):
    """Cooperative global→LDS tile loader with swizzle support.

    Loading Modes (controlled by use_full_tile_width):
    - False: Interleaved layout for BF16 where MMA_K < BK
    - True: Row-major layout for FP8 where MMA_K == BK
    """

    var buffer: AMDBufferResource
    var thread_row: Int
    var thread_col: Int
    var warp_id: Int

    # Each warp loads a portion of the tile cooperatively
    # AMDBufferResource provides automatic out-of-bounds clamping to zero

Despite the completely different mechanisms, the abstraction shields users of the API from the hardware details, and the call site for both looks the same.
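
Concretely, both loaders are driven through the same load() entry point; only the loader's construction is platform-specific. The snippet below is illustrative (the exact argument lists are elided in this post, so treat these as a sketch):

```mojo
# Illustrative: the kernel's call site has the same shape on both
# platforms; only the construction of the loader differs.
var a_loader = make_a_loader()  # hypothetical factory: wraps a TMA
                                # descriptor on NVIDIA, an AMDBufferResource
                                # on AMD
a_loader.load(a_tile, barrier, k_coord, row_coord)  # same call either way
```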


TilePipeline: The Pipeline Coordinator

TilePipeline manages the producer-consumer pipeline that overlaps loading, computing, and storing. It does not move data and does not run MMA instructions; instead, it just signals when tiles are full or empty.

Sequence diagram of the producer-consumer pipeline

NVIDIA: InputTilePipeline

On NVIDIA Blackwell, TilePipeline uses hardware mbarrier primitives. mbarrier provides hardware-accelerated arrive/wait, automatic byte counting that integrates with TMA, and phase tracking across pipeline stages. Parameterizing the pipeline by payload type means the same synchronization logic works for both standard and block-scaled tile configurations; only the payload differs:

mojo
struct InputTilePipeline[
    Payload: TilePayload,
    num_group_stages: Int,
    k_group_size: Int,
](TrivialRegisterPassable):
    """Tile pipeline with configurable payload type.

    Separates synchronization from tile storage. The Payload parameter
    (e.g., StandardTilePayload or BlockScaledTilePayload) holds tile arrays.
    """

    comptime Pipeline = ProducerConsumerPipeline[Self.num_group_stages]

    var pipeline: Self.Pipeline
    var payload: Self.Payload

    @always_inline
    def __init__(out self, barriers: Self.BarrierArray, payload: Self.Payload):
        self.pipeline = {barriers.ptr}
        self.payload = payload

    def producer(ref [origin]self) -> InputProducer[...]:
        """Get producer role handle."""
        return InputProducer(pipeline_ptr=Pointer(to=self))

    def consumer(ref [origin]self) -> InputConsumer[...]:
        """Get consumer role handle."""
        return InputConsumer(pipeline_ptr=Pointer(to=self))

The InputProducer and InputConsumer role handles provide acquire() methods that return context managers. The with producer.acquire() as tiles: pattern automatically handles barrier waits on entry and stage advancement on exit.
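
Put together, a sketch of the pattern in use (the helper names here are hypothetical):

```mojo
# Producer: wait for a free stage, fill it; exiting the inner `with`
# marks the stage full and advances to the next one.
with input_pipeline.producer() as producer:
    with producer.acquire() as tiles:   # blocks until a stage is free
        fill_tiles(tiles)               # hypothetical helper: issue TMA loads

# Consumer mirrors it: wait until a stage is full, compute, mark it empty.
with input_pipeline.consumer() as consumer:
    with consumer.acquire() as tiles:   # blocks until a stage is full
        consume_tiles(tiles)            # hypothetical helper: feed the MMA
```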

AMD: Atomic Counter Synchronization

AMD lacks the mbarrier operation. As a consequence, kernels on AMD cannot offload barrier byte-counting to hardware; instead, all synchronization is explicit. The design uses a SyncStrategy trait to make the ring buffer logic independent of the synchronization mechanism:

mojo
trait SyncStrategy(TrivialRegisterPassable):
    """Interface for producer-consumer synchronization protocols."""

    def wait_producer_acquire(self, tile_idx: Int, stage: Int, phase: Int32): ...
    def signal_producer_release(mut self, tile_idx: Int, stage: Int): ...
    def wait_consumer_acquire(self, tile_idx: Int, stage: Int, phase: Int32): ...
    def signal_consumer_release(mut self, tile_idx: Int, stage: Int): ...

Two concrete implementations trade off simplicity for reduced contention:

mojo
struct SingleCounterSync[...](SyncStrategy):
    """One atomic counter per tile. Simpler but higher contention."""
    var sync_counter: SMemArray[Int32, total_tiles]

struct SplitCounterSync[...](SyncStrategy):
    """Separate producer/consumer counters. Reduces contention."""
    var producer_counters: SMemArray[Int32, total_tiles]
    var consumer_counters: SMemArray[Int32, total_tiles]

The underlying synchronization primitives are worth examining directly:

mojo
@always_inline
def wait_for_counter(
    counter: UnsafePointer[Int32, address_space = AddressSpace.SHARED],
    threshold: Int32,
):
    """Spin-wait until counter reaches threshold."""
    while Atomic.load(counter) < threshold:
        inlined_assembly["s_sleep 0", ...]()

@always_inline
def increment_counter_if_first_thread(
    counter: UnsafePointer[Int32, address_space = AddressSpace.SHARED],
    increment: Int32,
):
    """Atomically increment counter, but only from first thread in warp."""
    if thread_idx.x % WARP_SIZE == 0:
        _ = Atomic.fetch_add(counter, increment)

Two non-obvious decisions here. First, only the first thread in each warp increments the counter. All threads in a warp execute in lockstep, so having every thread in a warp (64 on AMD) issue fetch_add would produce 64 increments where one is correct. Second, s_sleep 0 yields to the warp scheduler rather than burning compute cycles in a tight spin. On AMD, issuing a zero-cycle sleep hint lets other wavefronts make progress while this one waits.
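
To see how the primitives compose, here is a sketch of a producer-side acquire/release pair built from them. The threshold arithmetic is illustrative: the shipped SingleCounterSync tracks stage indices and a per-pass phase, which this sketch collapses into a single counter value.

```mojo
# Sketch: single-counter handshake for one tile (illustrative encoding).
# The counter advances by 1 on every ownership transfer, so the producer
# owns the tile on even counts and the consumer on odd counts.

fn producer_acquire(
    counter: UnsafePointer[Int32, address_space = AddressSpace.SHARED],
    pass_idx: Int32,
):
    # Wait until the consumer has drained this tile for the current pass.
    wait_for_counter(counter, 2 * pass_idx)

fn producer_release(
    counter: UnsafePointer[Int32, address_space = AddressSpace.SHARED],
):
    # One lane per warp publishes "tile is full" to the consumer.
    increment_counter_if_first_thread(counter, 1)
```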

Despite the completely different primitives, the NVIDIA and AMD TilePipeline implementations expose the same acquire/release interface. The ring buffer logic layered above it is platform-agnostic.


TileOp: The Compute Driver

TileOp encapsulates all matrix computation: loading operands from shared memory to registers (or TMEM), executing MMA/mfma instructions, and managing accumulator storage.

NVIDIA Blackwell: TMEM-Based Accumulator

Blackwell introduces Tensor Memory (TMEM, explained in a previous blog post), a 256KB per-SM scratchpad dedicated to accumulator storage. This matters because moving accumulators out of the register file frees it for other uses and enables larger tile sizes. TileOp manages the full TMEM lifecycle through a context manager:

mojo
struct MmaWarpContext[
    opc: OutputPipelineConfig,
    mma_threads: Int,
    epilogue_threads: Int,
](TrivialRegisterPassable):
    """MMA warp context - owns TMEM lifecycle and output pipeline.

    __enter__: Signals epilogue that TMEM is allocated
    __exit__: Waits for epilogue, deallocates TMEM
    """

    comptime Tmem = TmemAllocation[Self.opc.cta_group]
    comptime Pipeline = OutputTilePipeline[Self.opc]
    comptime Dealloc = TmemDeallocBarrier[Self.opc.cta_group]

    var tmem: Self.Tmem
    var output_pipeline: Self.Pipeline
    var dealloc_barrier: Self.Dealloc

    @always_inline
    def __enter__(self) -> Self:
        Self.Sync.arrive()  # Signal epilogue that TMEM is ready
        return self

    @always_inline
    def __exit__(self):
        self.dealloc_barrier.complete_dealloc(self.tmem)

The context manager matters beyond tidiness. TMEM must be deallocated only after the epilogue warp has finished reading accumulators. If the MMA warp deallocates early, the epilogue reads garbage. If it never deallocates, the next kernel invocation fails to allocate.

The MMA operation itself receives tiles from the input pipeline and executes tcgen05.mma instructions, writing results to TMEM accumulators.

AMD MI300X: Register-Based Accumulator

On AMD, there is no dedicated tensor memory. Accumulators live in VGPRs alongside operands, and the compiler manages their allocation. As a result, we define the TileOperator as:

mojo
struct AmdTileOperator[
    InType: DType,
    OutType: DType,
    mma_shape: IndexList[3],
    ...
](TrivialRegisterPassable):
    """Matrix operations on AMD using mfma instructions."""

    var _a_reg_tile: Self.ARegTileType
    var _b_reg_tile: Self.BRegTileType
    var out_reg_tile: Self.OutRegTileType  # Accumulator in registers

    def load_tile_fragment[k_tile_idx: Int](
        self,
        smem_tile_a: TileTensor,
        smem_tile_b: TileTensor,
    ):
        """Load operands from LDS to registers with swizzle handling."""
        comptime for i in range(num_fragments):
            self._a_reg_tile[i] = smem_tile_a.load[swizzle](i, k_tile_idx)
            self._b_reg_tile[i] = smem_tile_b.load[swizzle](i, k_tile_idx)

    def mma_compute[k_tile_idx: Int](self):
        """Execute mfma instruction."""
        self.out_reg_tile = mfma[mma_shape](
            self._a_reg_tile, self._b_reg_tile, self.out_reg_tile,
        )
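
For context, a sketch of how the operator is driven from the kernel's inner k-loop (the loop bound and tile handles are illustrative):

```mojo
# Sketch: inner k-loop driving AmdTileOperator. smem_tile_a/smem_tile_b
# are the LDS tiles the pipeline handed over; num_k_tiles is illustrative.
comptime for k in range(num_k_tiles):
    tile_op.load_tile_fragment[k](smem_tile_a, smem_tile_b)  # LDS -> VGPRs
    tile_op.mma_compute[k]()                                 # accumulate via mfma
# After the loop, tile_op.out_reg_tile holds this warp's accumulator,
# ready for the epilogue to write out.
```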

The Key Difference

| Aspect | NVIDIA Blackwell | AMD MI300X |
| --- | --- | --- |
| Accumulator storage | Tensor Memory (256KB) | Registers (VGPRs) |
| Instruction | tcgen05.mma | mfma |
| Operand source | Direct from SMEM | Must load to registers first |
| Allocation | Explicit allocate/deallocate | Compiler-managed |

Both TileOp implementations present the same conceptual interface (load fragments, compute MMA, store results) but the resource lifecycle is structurally different. NVIDIA's context manager requirement is a direct consequence of explicit TMEM allocation. AMD needs no such wrapper because the compiler handles register lifetimes automatically.


TileTensor: The Tensor Data Structure

All three components share a common foundation: TileTensor. This abstraction carries compile-time layout information (shape, stride, swizzle patterns), memory address space placement (global memory, shared memory, registers), and type-safe access patterns. TileTensor is what allows TileIO, TilePipeline, and TileOp to hand tiles to each other without any component knowing the other's internal representation.

Example: Thread-to-Element Mapping

mojo
# Map a warp's 32 threads (as an 8x4 grid) to a 16x64 output tile
comptime thread_layout = Layout.row_major(8, 4)

# Each thread handles 2 consecutive elements
var my_elements = (
    output_tile
    .vectorize[1, 2]()                      # View as 2-element vectors
    .distribute[thread_layout](lane_id())   # My portion
)

# Write my portion
comptime for i in range(my_elements.shape[0]()):
    for j in range(my_elements.shape[1]()):
        my_elements[i, j] = accumulator[i * stride + j]

Putting Everything Together: A Complete Pipeline

Here's how the three components interact in the actual Blackwell structured matmul kernel. Each warp role has one responsibility, and the pipeline structure is explicit in the code.

Setup: shared memory and pipeline initialization

mojo
# From the Blackwell structured matmul kernel's run() method

# Setup: shared memory, pipeline, and context
ref smem = external_memory[...].bitcast[Self.SmemType]()[]
var tile_payload = Self.TilePayload(smem.a_tiles(), smem.b_tiles())
var input_pipeline = Self.InputTilePipeline(
    smem.pipelines.input_barriers(), tile_payload)
var ctx = Self.Context(smem.pipelines.tmem_addr())
var work_iter = scheduler.work_iterator()

TMA Load Warp: data movement only

The load warp acquires producer stages, issues TMA loads, and signals completion. It does not touch accumulators or barriers beyond what TilePipeline exposes.

mojo
# TMA Load Warp: data movement specialist
if WarpRole.is_main_load():
    with input_pipeline.producer() as producer:
        while work_iter.has_work():
            with work_iter.next() as current:
                for i in range(0, num_iters, Self.config.k_group_size):
                    with producer.acquire() as tiles:
                        Self.load_input_tiles(a_tma_op, b_tma_op, tiles, ...)
                syncwarp()
        producer.drain()

MMA Warp: compute only

The MMA warp waits on the input pipeline, computes, and signals the output pipeline. mma_ctx owns the TMEM allocation for the lifetime of this block: entering the context signals the epilogue that accumulators are allocated, and exiting the context deallocates only after the epilogue signals it has finished reading.

mojo
# MMA Warp: compute specialist
if WarpRole.is_mma():
    var mma_ctx = Self.MmaCtx.create(...)
    with mma_ctx:
        while work_iter.has_work():
            with work_iter.wait_and_advance():
                if ctx.elect_one_cta:
                    with mma_ctx.output_pipeline.producer() as output_stage:
                        with input_pipeline.consumer() as consumer:
                            for i in range(0, num_iters, Self.config.k_group_size):
                                with consumer.acquire() as input_tiles:
                                    Self.mma(output_stage.tmem, input_tiles, mma_op, ...)

Epilogue Warp: output only

The epilogue warp reads from TMEM through the output pipeline and writes to global memory. It signals the MMA warp when it is done, triggering TMEM deallocation.

mojo
# Epilogue Warp: output specialist
if WarpRole.is_epilogue():
    var epi_ctx = Self.EpilogueCtx(...)
    with epi_ctx:
        while work_iter.has_work():
            with work_iter.next() as current:
                with epi_ctx.output_pipeline.consumer() as output_stage:
                    tile_writer.write_batched(smem.c_tiles(), output_stage, ...)

What’s Next

We've now seen each pillar from the inside: TileIO moves data, TilePipeline coordinates timing, TileOp does the compute. In Part 3, we'll explore the practical benefits of this modular design: we'll take two real kernel families, conv2d and block-scaled matmul, and trace exactly how they are built around the matmul foundation.


TL;DR

  • TileIO owns data movement: hides TMA vs. buffer load differences behind a single load() call
  • TilePipeline owns coordination: mbarrier on NVIDIA, atomic counters on AMD, same acquire/release interface
  • TileOp owns computation: TMEM on NVIDIA (with explicit RAII lifecycle), registers on AMD (compiler-managed)
  • TileTensor is the shared data abstraction: one type across all memory levels and address spaces
  • Component boundaries are enforced by types, not convention: the load warp cannot issue MMA instructions because TilePipeline does not expose that interface

