Since the launch of ChatGPT in late 2022, GenAI has reshaped the tech industry—but GPUs didn’t suddenly appear overnight. Hardware companies have spent billions on AI chips for over a decade. Dozens of architectures. Countless engineering hours. And yet—still—NVIDIA dominates.
Why?
Because CUDA is more than an SDK. It’s a fortress of developer experience designed to lock you in—and a business strategy engineered to keep competitors perpetually two years behind. It’s not beloved. It’s not elegant. But it works, and nothing else comes close.
We’ve spent this series tracing the rise and fall of hopeful alternatives—OpenCL and SYCL, TVM and XLA, Triton, MLIR, and others. The pattern is clear: bold technical ambitions, early excitement, and eventual fragmentation. Meanwhile, the CUDA moat grows deeper.
The trillion-dollar question that keeps hardware leaders awake at night is: Given the massive opportunity—and developers desperate for alternatives—why can't we break free?
The answer isn’t incompetence. Hardware companies are filled with brilliant engineers and seasoned execs. The problem is structural: misaligned incentives, conflicting priorities, and an underestimation of just how much software investment is required to play in this arena. You don’t just need a chip. You need a platform. And building a platform means making hard, unpopular, long-term bets—without the guarantee that anyone will care.
In this post, we'll reveal the invisible matrix of constraints that hardware companies operate within—a system that makes building competitive AI software nearly impossible by design.
This is Part 9 of Modular’s “Democratizing AI Compute” series. For more, see:
- Part 1: DeepSeek’s Impact on AI
- Part 2: What exactly is “CUDA”?
- Part 3: How did CUDA succeed?
- Part 4: CUDA is the incumbent, but is it any good?
- Part 5: What about CUDA C++ alternatives like OpenCL?
- Part 6: What about AI compilers (TVM and XLA)?
- Part 7: What about Triton and Python eDSLs?
- Part 8: What about the MLIR compiler infrastructure?
- Part 9: Why do HW companies struggle to build AI software? (this article)
- Part 10: How do we move forward? (coming soon)
My career in HW / SW co-design
I live and breathe innovative hardware. I read SemiAnalysis, EE Times, Ars Technica—anything I can get my hands on about the chips, stacks, and systems shaping the future. Over decades, I’ve fallen in love with the intricate dance of hardware/software co-design: when it works, it’s magic. When it doesn’t… well, that’s what this whole series is about.
A few of my learnings:
- My first real job in tech was at Intel, helping optimize launch titles for the Pentium MMX—the first PC processor with SIMD instructions. There I learned the crucial lesson: without optimized software, a revolutionary silicon speedboat won’t get up to speed. That early taste of hardware/software interplay stuck with me.
- At Apple, I built the compiler infrastructure enabling a transition to in-house silicon. Apple taught me that true hardware/software integration requires extraordinary organizational discipline—it succeeded because, instead of settling for compromises, the teams shared a unified vision that no single business unit could override.
- At Google, I scaled the TPU software stack alongside the hardware and AI research teams. With seemingly unlimited resources and tight HW/SW co-design, we used workload knowledge to deliver the power of specialized silicon — an incredible custom AI racing yacht.
- At SiFive, I switched perspectives entirely—leading engineering at a hardware company taught me the hard truths about hardware business models and organizational values.
Across all these experiences, one thing became clear: software and hardware teams speak different languages, move at different speeds, and measure success in different ways. But there's something deeper at work—I came to see an invisible matrix of constraints that shapes how hardware companies approach software, and that explains why their software teams struggle with AI software in particular.
Before we go further, let's step into the mindset of a hardware executive—where the matrix of constraints begins to reveal itself.
How AI hardware companies think
There’s no shortage of brilliant minds in hardware companies. The problem isn’t IQ—it’s worldview.
The architectural ingredients for AI chips are well understood by now: systolic arrays, TensorCores, mixed-precision compute, exotic memory hierarchies. Building chips remains brutally hard, but it's no longer the bottleneck for scalable success. The real challenge is getting anyone to use your silicon—and that means software.
GenAI workloads evolve at breakneck speed. Hardware companies need to design for what developers will need two years from now, not just what's hot today. But they're stuck in a mental model that doesn't match reality—trying to race in open waters with a culture designed for land.

In the CPU era, software was simpler: build a backend for LLVM and your chip inherited an ecosystem—Linux, browsers, compiled applications all worked. AI has no such luxury. There's no central compiler or OS. You're building for a chaotic, fast-moving stack—PyTorch, vLLM, today’s agent framework of the week—while your customers are using NVIDIA's tools. You're expected to make it all feel native, to just work, for AI engineers who neither understand your chip nor want to.
Despite this, the chip is still the product—and the P&L makes that crystal clear. Software, docs, tooling, community? Treated like overhead. This is the first constraint of the matrix: hardware companies are structurally incapable of seeing a software ecosystem as a standalone product. Execs optimize for capex, BOM cost, and tapeout timelines. Software gets some budget, but it’s never enough—especially as AI software demands scale up. The result is a demo-driven culture: launch the chip, write a few kernels, run some benchmarks, and build a flashy keynote that proves your FLOPS are real.
The result is painfully familiar: a technically impressive chip with software no one wants to use. The software team promises improvement next cycle. But they said that last time too. This isn't about individual failure—it's about systemic misalignment of incentives and resources in an industry structured around silicon, not ecosystems.
Why is GenAI software so hard and expensive to build?
Building GenAI software isn’t just hard—it’s a treadmill pointed uphill, on a mountain that’s constantly shifting beneath your feet. It’s less an engineering challenge than a perfect storm of fragmentation, evolving research, and brutal expectations—each a component of the matrix.
🏃The treadmill of fragmented AI research innovation
AI workloads aren’t static—they’re a constantly mutating zoo. One week it’s Transformers; the next it’s diffusion, MoEs, or LLM agents. Then comes a new quantization trick, a better optimizer, or some obscure operator that a research team insists must run at max performance right now.
It is well known that you must innovate in hardware to differentiate, but it's often forgotten that every hardware innovation multiplies your software burden against a moving target of use cases. Each hardware innovation demands that software engineers deeply understand it—while also tracking the rapidly moving AI research and connecting the two together.
The result? You’re not building a “stack”—you’re building a cross product of models × quantization formats × batch sizes × inference/training × cloud/edge × framework-of-the-week.
It's combinatorially explosive, which is why no one but NVIDIA can keep up. You end up with ecosystem maps that look like this:

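To make the combinatorics concrete, here's a small sketch. The category names and counts below are illustrative assumptions, not real inventory—actual support matrices are far larger and shift weekly—but even these modest numbers produce hundreds of configurations to build, test, and optimize:

```python
from itertools import product

# Hypothetical, illustrative counts for each axis of the support matrix.
models = ["llama", "mixtral", "stable-diffusion", "whisper"]
quant_formats = ["fp16", "bf16", "fp8", "int8", "int4"]
batch_shapes = ["latency-optimized", "throughput-optimized"]
modes = ["inference", "training"]
targets = ["cloud", "edge"]
frameworks = ["pytorch", "vllm", "jax"]

# The "stack" is really the cross product of every axis.
configs = list(product(models, quant_formats, batch_shapes,
                       modes, targets, frameworks))
print(len(configs))  # 4 * 5 * 2 * 2 * 2 * 3 = 480 configurations
```

Add one model family or one new quantization format and the total jumps by a full multiple—which is why each team's "small" request compounds into an unbounded workload.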
🌍 You're competing with an industry, not just CUDA
The real problem isn't just CUDA—it's that the entire AI ecosystem writes software for NVIDIA hardware. Every framework, paper, and library is tuned for their latest TensorCores. Every optimization is implemented there first. This is the compounding loop explored in Part 3: CUDA is a software gravity well that bends the industry’s efforts toward NVIDIA’s hardware.
For alternative hardware, compatibility isn't enough—you have to outcompete a global open-source army optimizing for NVIDIA's chips. First your hardware has to run the workload at all; then the result has to beat the HW+SW combination customers are already using.
🥊 The software team is always outnumbered
No matter how many software engineers you have, it’s never enough to get ahead of the juggernaut: however brilliant and committed, the team is simply outmatched. Their inboxes are full of customer escalations, internal feature requests, and desperate pleas for benchmarks. They're fighting fires instead of building tools to prevent future fires, and they’re exhausted. Each major success just makes it clear how much more there is left to be done.
They have many ideas—they want to invest in infrastructure, build long-term abstractions, define the company’s software philosophy. But they can’t, because they can’t stop working on the current-gen chip long enough to prepare for the next one. Meanwhile, …
💰The business always “chases the whale”
When a massive account shows up with cash and specific requirements, the business says yes. Those customers have leverage, and chasing them always makes short-term sense.
But there’s a high cost: Every whale you reel in pulls the team further away from building a scalable platform. There’s no time to invest in a scalable torso-and-tail strategy that might unlock dozens of smaller customers later. Instead of becoming a product company, your software team is forced to operate like a consulting shop.
It starts innocently, but soon your engineers are implementing hacks, forks, and half-integrations that make one thing fast but break five others. Eventually, your software stack becomes a haunted forest of tech debt and tribal knowledge. It’s impossible to debug, painful to extend, and barely documented—who had time to write docs? And what happens when the one engineer who understood it leaves?
Challenges getting ahead in the hardware regatta
These aren't isolated problems—they're the universal reality of building GenAI software. The race isn't a sprint—it's a regatta: chaotic, unpredictable, and shaped as much by weather as by engineering. Everyone's crossing the same sea, but in radically different boats.

🚤 Speedboats: Startups aim for benchmarks, not generality or usability
Startups are in survival mode. Their goal is to prove the silicon works, that it goes fast, and that someone—anyone—might buy it. That means picking a few benchmark workloads and making them fly, using whatever hacks or contortions it takes. Generality and usability don’t matter—the only thing that matters is showing that the chip is real and competitive today. You’re not building a software stack. You’re building a pitch deck.
⛵ Custom Racing Yachts: Single-chip companies build vertical stacks
The Mag7 and advanced startups take a different tack. They build custom racing yachts—like Google's TPU—designed to win specific races. They can be fast and beautiful, but only with their trained crew, their instruction manual, and often their own models. Because these chips leave GPU assumptions behind, they must build bespoke software stacks from scratch.
They own the entire stack because they have to. The result? More fragmentation for AI engineers. Betting on one of these chips means theoretical FLOPS at a discount—but sacrificing momentum from the NVIDIA ecosystem. The most promising strategy for these companies is locking in a few large customers: frontier labs or sovereign clouds hungry for FLOPS without the NVIDIA tax.
🛳️ Ocean Liners: Giants struggle with legacy and scale
Then come the giants: Intel, AMD, Apple, Qualcomm—companies with decades of silicon experience and sprawling portfolios: CPUs, GPUs, NPUs, even FPGAs. They’ve shipped billions of units. But that scale brings a problem: divided software teams stretched across too many codebases, too many priorities. Their customers can’t keep track of all the software and versions—where to start?
One tempting approach is to just embrace CUDA with a translator. It gets you “compatibility,” but never great performance. Modern CUDA kernels are written for Hopper’s TensorCores, TMA, and memory hierarchy. Translating them to your architecture won’t make your hardware shine.
Sadly, the best-case outcome at this scale is Intel's oneAPI—open, portable, and community-governed, but lacking momentum or soul. It hasn’t gained traction in GenAI for the same reasons OpenCL didn’t: it was designed for a previous generation of GPU workloads, and AI moved too fast for it to keep up. Being open only helps if you also keep up.
🚢 NVIDIA: The carrier that commands the race
NVIDIA is the aircraft carrier in the lead: colossal, coordinated, and surrounded by supply ships, fighter jets, and satellite comms. While others struggle to build software for one chip, NVIDIA launches torpedoes at anyone who might get ahead. While others optimize for a benchmark, the world optimizes for NVIDIA. The weather changes to match their runway.
If you’re in the regatta, you’re sailing into their wake. The question isn’t whether you’re making progress—it’s whether the gap is closing or getting wider.
Breaking out of the matrix
At this point in “Democratizing AI Compute”, we’ve mapped the landscape. CUDA isn't dominant by accident—it’s the result of relentless investment, platform control, and market feedback loops that others simply can’t replicate. Billions have been poured into alternatives: vertically-integrated stacks from Mag7 companies, open platforms from industry giants, and innovative approaches from hungry startups. None have cracked it.
But we’re no longer lost in the fog. We can see the matrix now: how these dynamics work, where the traps lie, why even the most brilliant software teams can't get ahead at hardware companies. The question is no longer why we’re stuck: It’s whether we can break free.

> Child: "Do not try and bend the spoon. That's impossible. Instead... only try to realize the truth."
>
> Neo: "What truth?"
>
> Child: "There is no spoon. Then you'll see that it is not the spoon that bends, it is only yourself."
If we want to Democratize AI Compute, someone has to challenge the assumptions we’ve all been working within. The path forward isn't incremental improvement—it's changing the rules of the game entirely.
Let's explore that together in part 10.
-Chris