Modular: Inference from Kernel to Cloud

Your model, any compute, one platform. Run AI across GPUs and CPUs - engineered for the most demanding inference workloads, from kernel to cloud.

Get started

Talk to an AI Engineer

playground

LATEST MODEL 🚀

Try Modular right here 👇

Request API Token

Generate images and text (videos coming soon), then request access for a full account with API token. Request a demo to get priority access.

The Modular Platform

A unified AI inference platform for high-performance, portable compute -  enabling full optimizations from GPU kernel to API endpoint.

2x Performance on a unified stack
Our unified infrastructure optimizes your AI pipeline with full-stack optimizations across text, image and video.
AI on any GPU
Same model, same codebase - seamlessly runs across NVIDIA, AMD, Intel, ARM, and Apple Silicon. True hardware portability.
50% cost saving
Higher GPU utilization, faster model compilation & runtime, and dynamic hardware selection means savings compound at scale.

We’re built different.
A unified AI inference stack giving you total control.

Most AI infrastructure today is assembled from parts that were never designed to work together: one tool for serving, another for optimization, another for custom kernels, something else for scaling. Every layer you add is another place where things break.

We built Modular to fix that.

One unified stack from kernels to cloud, built from the ground up for heterogeneous compute. Scale AI from cloud to edge - CPUs, GPUs, and ASICs.

Cloud
Run AI workloads at production scale with SOTA performance on NVIDIA and AMD GPUs in Modular’s hosted cloud or your VPC. Our full-stack approach enables complete workload customization, performance tuning, and deep observability.  
Serving
Our high-performance, hardware-agnostic serving framework, MAX, automatically optimizes kernels and request execution across accelerators. 2x performance improvement over vLLM on diverse hardware through a single container and OpenAI-compatible API.
Modeling
Run 1000+ models like DeepSeek and Kimi out of the box with MAX. PyTorch-like model APIs and AI coding skills make it easy to port custom models in minutes.
GPU Kernels
100s of SOTA, composable kernels written in our high-performance systems language, Mojo. Extend or write custom GPU kernels for maximum performance across accelerators.
Hardware Compatibility
Modular was built to be natively heterogeneous. Run workloads seamlessly across NVIDIA, AMD, and Apple GPUs as well as Intel, AMD, and ARM CPUs.

Our Cloud, or Yours

MAX Serving APIs

MAX Modeling APIs

Mojo Kernels

Fast inference, deployed your way

Run top open models or your own custom models with flexible deployment — in our managed cloud or directly in your own VPC.

Inference Solutions

Shared endpoints
High-performance inference. No infrastructure to manage, no long-term commitments. Test easily. Per-token pricing.
Dedicated endpoints
Reserved NVIDIA and AMD GPUs. Per-minute pricing that’s easy and flexible.
Custom Models
Bring your own custom or fine-tuned models. Deploy on optimized infrastructure with per-minute pricing.

Get Started

  from openai import OpenAI
  
  client = OpenAI(
      base_url="https://deepseek-v31.{org_name}.api.modular.com/v1",
      api_key="MODULAR_API_KEY",
  )
  
  completion = client.chat.completions.create(
      model="deepseek/deepseek-chat-v3.1",
      messages=[
          {
            "role": "user",
            "content": "Who won the world series in 2020?"
          },
      ],
  )
  
  print(completion.choices[0].message.content)

Deployment solutions

Customer Stories

70%

Faster Time to first audio

Igor Poletaev

Chief Science Officer - Inworld

“Our collaboration with Modular is a glimpse into the future of accessible AI infrastructure.”

Read Story

<500ms

Time to first token (TTFT)

Keep every conversation instant. MAX delivers sub-second mean time to first token (TTFT. Patients get responsive, natural interactions with no perceptible delay.

Read Story

70%

Total cost savings

Darrick Horton

Co-founder & CEO

"We saved up to 70% with Modular, the fastest inference engine on AMD compute"

Read Story

15+

CPU+GPU Architectures

Bratin Saha

VP of Machine Learning & AI services

"The MAX Platform supercharges our mission for our millions of AWS customers, helping them bring the newest GenAI innovations and traditional AI use cases to market faster."

‍

Read Story

Top AI models, or your custom ones

Our forward-deployed engineers optimize every deployment for SOTA performance - whether you're running a top open model or a custom model.

Build with popular models
- Deepseek R1
  Frontier-class models (V3, R1) built for complex reasoning, coding, and math — at dramatically lower inference cost than comparable proprietary models.
  LLM
- KimiK2.5
  Moonshot AI's 1T parameter MoE model optimized for agentic tasks, tool use, and coding.
  LLM
  Vision
- MiniMax
  Large-scale MoE model (456B params) optimized for long-context tasks up to 1M tokens.
  LLM
- Your custom model
  Your models, your kernels, any hardware. Write once in MAX and deploy across GPUs & CPUs with no vendor lock-in.
Build by specific use case
- Coding Agent
  AI copilots, automated refactoring, test generation, and production-ready code synthesis.
- Image Generation
  Text-to-image creation, creative assets, design prototyping, and visual content workflows.
- Text to Audio
  Natural voice synthesis, multilingual narration, real-time speech, and audio generation
- Agentic
  Run faster AI agents anywhere with compiler-optimized inference across NVIDIA, AMD, and Apple Silicon.

Scale to diverse hardware, seamlessly

Avoid vendor lock-in and GPU scarcity when you can deploy on whatever hardware that delivers the best price-performance for your workload — without rewriting code.

Write once, deploy everywhere
Breakthrough compiler technology that automatically generates optimized kernels for any hardware target.
Vendor Independence
Break free from GPU vendor lock-in. Modular delivers peak performance across NVIDIA and AMD.

Custom models are now easy. Try it.

Our agentic tooling and forward-deployed engineers help you port and deploy your models quickly - so you can evaluate performance and start running production workloads immediately.

Talk to an AI Engineer

Try agent skills

Competitor Endpoints

(Other providers)

Easy to deploy
Fast setup and managed infrastructure

but...

Limited control
Generic optimizations
Vendor lock-in
NVIDIA Only

vs.

Self-Hosted Endpoints

(Other providers)

Maximum control
Custom kernels, full visibility, your hardware

but...

Significant operational overhead
Long setup and tuning cycles
You’re on your own

Managed simplicity + Self-hosted control. Pick both.

Modular eliminates the tradeoff, providing the simplicity of managed inference with engineering-level control.

Dedicated endpoints with predictable performance
Forward-deployed engineers optimizing your workloads
Compiler-level optimizations that fuse the entire inference graph
Custom kernel programmability in Mojo & Python
GPU portability across NVIDIA and AMD without rewriting code

Request a demo

No black boxes. No vendor lock-in. No operational burden.

Get started with Modular

Request a demo
Schedule a demo of Modular and explore a custom end-to-end deployment built around your models, hardware, and performance goals.
- Distributed, large-scale online inference endpoints
- Highest-performance to maximize ROI and latency
- Deploy in Modular cloud or your cloud
- View all features with a custom demo
Book a demo
Talk with our sales lead Jay!
30min demo. Evaluate with your workloads. Ask us anything.

Talk to us!
Book a demo for a personalized walkthrough of Modular in your environment. Learn how teams use it to simplify systems and tune performance at scale.
- Custom 30 min walkthrough of our platform
- Cover specific model or deployment needs
- Flexible pricing to fit your specific needs
Book a demo
Talk with our sales lead Jay!
Start using MAX
( FREE )
Run any open source model in 5 minutes, then benchmark it. Scale it to millions yourself (for free!).
Install MAX
What is MAX?
Start using Mojo
( FREE )
Install Mojo and get up and running in minutes. A simple install, familiar tooling, and clear docs make it easy to start writing code immediately.
Install Mojo🔥
What is Mojo🔥?

Developer Approved

amazing achievements

Eprahim

“I'm excited, you're excited, everyone is excited to see what's new in Mojo and MAX and the amazing achievements of the team at Modular.”

12x faster without even trying

svpino

“Mojo destroys Python in speed. 12x faster without even trying. The future is bright!”

actually flies on the GPU

Sanika

"after wrestling with CUDA drivers for years, it felt surprisingly… smooth. No, really: for once I wasn’t battling obscure libstdc++ errors at midnight or re-compiling kernels to coax out speed. Instead, I got a peek at writing almost-Pythonic code that compiles down to something that actually flies on the GPU."

The future is bright!

mytechnotalent

Mojo destroys Python in speed. 12x faster without even trying. The future is bright!

easy to optimize

dorjeduck

“It’s fast which is awesome. And it’s easy. It’s not CUDA programming...easy to optimize.”

Community is incredible

benny.n

“The Community is incredible and so supportive. It’s awesome to be part of.”

one language all the way through

fnands

“Tired of the two language problem. I have one foot in the ML world and one foot in the geospatial world, and both struggle with the 'two-language' problem. Having Mojo - as one language all the way through is be awesome.”

high performance code

jeremyphoward

"Mojo is Python++. It will be, when complete, a strict superset of the Python language. But it also has additional functionality so we can write high performance code that takes advantage of modern accelerators."

performance is insane

drdude81

“I tried MAX builds last night, impressive indeed. I couldn't believe what I was seeing... performance is insane.”

was a breeze!

“Max installation on Mac M2 and running llama3 in (q6_k and q4_k) was a breeze! Thank you Modular team!”

surest bet for longterm

pagilgukey

“Mojo and the MAX Graph API are the surest bet for longterm multi-arch future-substrate NN compilation”

impressed

justin_76273

“The more I benchmark, the more impressed I am with the MAX Engine.”

pure iteration power

Jayesh

"This is about unlocking freedom for devs like me, no more vendor traps or rewrites, just pure iteration power. As someone working on challenging ML problems, this is a big thing."

impressive speed

Adalseno

"It worked like a charm, with impressive speed. Now my version is about twice as fast as Julia's (7 ms vs. 12 ms for a 10 million vector; 7 ms on the playground. I guess on my computer, it might be even faster). Amazing."

works across the stack

scrumtuous

“Mojo can replace the C programs too. It works across the stack. It’s not glue code. It’s the whole ecosystem.”

very excited

strangemonad

“I'm very excited to see this coming together and what it represents, not just for MAX, but my hope for what it could also mean for the broader ecosystem that mojo could interact with.”

feeling of superpowers

Aydyn

"Mojo gives me the feeling of superpowers. I did not expect it to outperform a well-known solution like llama.cpp."

huge increase in performance

Aydyn

"C is known for being as fast as assembly, but when we implemented the same logic on Mojo and used some of the out-of-the-box features, it showed a huge increase in performance... It was amazing."

potential to take over

svpino

“A few weeks ago, I started learning Mojo 🔥 and MAX. Mojo has the potential to take over AI development. It's Python++. Simple to learn, and extremely fast.”

completely different ballgame

scrumtuous

“What @modular is doing with Mojo and the MaxPlatform is a completely different ballgame.”