FLUX.2 Inference: Sub-Second Image Generation on NVIDIA & AMD
Deploy image generation at scale, 4x faster than PyTorch.
The most capable FLUX model, compiled and optimized with MAX for sub-second image generation. Reduced computational overhead, portable across NVIDIA and AMD GPUs, and seamless scaling on Modular Cloud.
FLUX.2 at a glance:
- Generation time: <1s
- vs torch.compile: 4.1x
- Per image on AMD: $0.009
- GPU vendors: 2
Generate images and videos in seconds
Test out our image generation quality, then request access to a full account. Video generation coming soon. Request a demo to get priority access.
Compiled, not wrapped
Modular’s engine (MAX) compiles the full diffusion pipeline (DiT, VAE, text encoder, scheduler) into a single fused execution graph through MLIR. No PyTorch runtime. No ComfyUI shim. Every denoising step is optimized together.
| Resolution | MAX on B200 | torch.compile on B200 | Speedup on B200 | MAX on MI355X | Speedup on MI355X |
|---|---|---|---|---|---|
| 1024 x 1024 | 3.3s | 13.3s | 4.1x | 3.4s | 3.8x |
| 768 x 1360 | 2.8s | 10.7s | 3.8x | 3.0s | 4.0x |
| 1360 x 768 | 2.4s | 8.2s | 3.4x | 2.5s | 3.6x |
5.5x total cost advantage on AMD
4.1x from compiled performance. 1.33x from lower hardware cost. The economics stack.
- Lowest cost: $0.009 / image on MI355X with sub-second generation (MAX on AMD MI355X, 1024 x 1024)
- $0.011 / image on B200 with sub-second generation (MAX on NVIDIA B200, 1024 x 1024)
| Provider / Setup | Cost per image (1024 x 1024) | vs. Modular on AMD |
|---|---|---|
| Nano Banana Pro | $0.134 | 15x more expensive |
| fal.ai (FLUX.2 Dev) | $0.012 | 1.3x more expensive |
| MAX on B200 | $0.011 | 1.2x more expensive |
| MAX on MI355X | $0.009 | Reference |
fal.ai pricing: $0.012/megapixel for FLUX.2 (dev) (1 MP = 1024 x 1024). Source: fal.ai/models/fal-ai/flux2. Modular pricing reflects compute cost with a BFL commercial license.
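The multipliers in the table, and the "5.5x total cost advantage" claim, follow directly from the per-image costs above. A quick sanity check (the table rounds 14.9x up to 15x):

```python
# Reproduce the "vs. Modular on AMD" multipliers from the per-image
# costs in the table above. Reference: MAX on AMD MI355X.
reference = 0.009  # $/image

costs = {
    "Nano Banana Pro": 0.134,
    "fal.ai (FLUX.2 Dev)": 0.012,
    "MAX on B200": 0.011,
}

for provider, cost in costs.items():
    print(f"{provider}: {cost / reference:.1f}x more expensive")

# The stacked AMD advantage: compiled speedup x hardware-cost delta.
print(f"Total AMD advantage: {4.1 * 1.33:.1f}x")  # -> 5.5x
```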
Same quality. 4x the speed.
Identical model weights. Identical output quality. The performance comes from compilation, not approximation.

- MAX: ~1.0s (4.1x faster), 1024 x 1024, 50 steps (top image)
- Diffusers with torch.compile: ~4s (1x baseline), 1024 x 1024, 50 steps (bottom image)
Same prompt, same model weights. Quality tolerance is configurable: trade a fraction of perceptual quality for sub-second generation.
The only FLUX.2 endpoint on NVIDIA and AMD
Every other FLUX.2 provider is locked to NVIDIA. MAX compiles diffusion models natively for both vendors from the same container.
- Cost arbitrage
AMD spot pricing runs 25-40% lower. Shift batch generation workloads without changing a line of code.
- Supply flexibility
GPU availability fluctuates. Two vendors means you’re never capacity-constrained by one.
- Future-proof
MAX targets hardware at the instruction level. No CUDA dependency. New silicon support ships in days.
Full pipeline, one compiled graph
Diffusion inference isn’t a single model call; it’s dozens of denoising steps plus decoding, encoding, and scheduling. Modular’s engine (MAX) compiles all of it together.
- Text Encoder: prompt embedding
- DiT: denoising x N steps
- Scheduler: step optimization
- VAE Decode: latent > pixel
- Image: <1s total
Other platforms run each stage separately through PyTorch, with memory round-trips between them. MAX eliminates that overhead by fusing the entire pipeline into one MLIR-compiled graph. Built with custom Mojo kernels, not a PyTorch wrapper.
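The stage ordering above can be sketched in plain Python. This is a conceptual toy, not Modular's implementation: the function names mirror the pipeline diagram, and the bodies are dummy placeholders standing in for the real encoder, DiT, scheduler, and VAE.

```python
# Toy sketch of the diffusion pipeline stages that MAX fuses into one
# compiled graph. All bodies are placeholders, not real model math.

def text_encode(prompt: str) -> list[float]:
    """Text Encoder: prompt -> embedding (stand-in: character codes)."""
    return [float(ord(c)) for c in prompt]

def denoise_step(latent: list[float], embedding: list[float], t: int) -> list[float]:
    """One DiT denoising step at timestep t (stand-in: simple decay)."""
    scale = 1.0 - 1.0 / (t + 2)
    return [x * scale for x in latent]

def vae_decode(latent: list[float]) -> list[int]:
    """VAE Decode: latent -> pixels (stand-in: clamp to 0..255)."""
    return [min(255, max(0, int(abs(x)))) for x in latent]

def generate(prompt: str, steps: int = 50) -> list[int]:
    embedding = text_encode(prompt)        # Text Encoder
    latent = [128.0] * 16                  # initial noise (toy size)
    for t in range(steps):                 # DiT x N steps, Scheduler order
        latent = denoise_step(latent, embedding, t)
    return vae_decode(latent)              # VAE Decode -> Image

pixels = generate("a red fox", steps=50)
print(len(pixels))  # 16 toy "pixels"
```

Run stage-by-stage through PyTorch, each arrow above is a memory round-trip; fused into one graph, the intermediate tensors never leave the device.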
OpenAI-compatible.
Drop-in replacement.
Change the base URL. That’s it.
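The "drop-in" claim maps to the OpenAI images API shape. A minimal sketch using only the Python standard library; the base URL and model id below are placeholders, not Modular's actual endpoint values:

```python
import json
import urllib.request

# Placeholders: substitute your actual Modular endpoint and API key.
BASE_URL = "https://example-modular-endpoint.invalid/v1"
API_KEY = "YOUR_API_KEY"

# Same request shape as OpenAI's POST /v1/images/generations.
payload = {
    "model": "flux.2",  # placeholder model id
    "prompt": "a red fox in the snow, golden hour",
    "size": "1024x1024",
    "n": 1,
}

request = urllib.request.Request(
    f"{BASE_URL}/images/generations",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    method="POST",
)

# urllib.request.urlopen(request) would send it. Switching providers
# means changing only BASE_URL; the payload and endpoint path stay put.
print(request.full_url)
```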
Model details

- Developer: Black Forest Labs
- Precision: BF16
- API: OpenAI compatible
- Hardware: NVIDIA, AMD, Apple Silicon
- License: developer
- Parameters: 32B
- Modality: Text > Image, Image Editing
- Deployment: Shared, Dedicated, Self-Hosted
- Container: <700MB (90% smaller than vLLM)
FAQ
How fast is FLUX.2 on Modular?
Sub-second at 1024x1024 on NVIDIA B200. That’s 4.1x faster than torch.compile on the same hardware. On AMD MI355X, generation time is within 4% of B200.
How much does FLUX.2 cost on Modular?
Compute cost starts at $0.001/image on AMD MI355X and $0.002/image on NVIDIA B200. Volume pricing is available for dedicated endpoints.
Can FLUX.2 run on AMD GPUs?
Yes. Modular is the only inference platform that runs FLUX.2 natively on both NVIDIA and AMD from the same container. No code changes required.
Is the API OpenAI-compatible?
Yes. Drop-in replacement with the same images generation endpoint format. Switch from any provider by changing the base URL.
Does Modular support LoRA fine-tuned models?
Yes. Bring your own LoRA weights and MAX compiles and serves them with the same performance advantage.
How does this compare to vLLM for image generation?
vLLM and SGLang don’t support diffusion models; they’re LLM inference engines. MAX serves image models natively alongside LLMs in the same container and API.
Start using FLUX.2 with Modular
We'll show you Modular's benchmarks on workloads similar to yours.
Thank you for your submission.
Your report has been received and is being reviewed by the Sales team. A member from our team will reach out to you shortly.
Thank you,
Modular Sales Team