GenAI may be new, but GPUs aren’t! Over the years, many have tried to create portable GPU programming models using C++, from OpenCL to SYCL to oneAPI and beyond. These were the most plausible CUDA alternatives that aimed to democratize AI compute, but you may never have heard of them - because they failed to become relevant for AI.
These projects have all contributed meaningfully to compute, but if we are serious about unlocking AI compute for the future, we must critically examine the mistakes that held them back—not just celebrate the wins. At a high level, the problems stem from the challenges of "open coopetition"—where industry players both collaborate and compete—as well as specific management missteps along the way.
Let’s dive in. 🚀
This is Part 5 of Modular’s “Democratizing AI Compute” series. For more, see:
- Part 1: DeepSeek’s Impact on AI
- Part 2: What exactly is “CUDA”?
- Part 3: How did CUDA succeed?
- Part 4: CUDA is the incumbent, but is it any good?
- Part 5: What about CUDA C++ Alternatives? (this article)
- Part 6: What about AI compilers like TVM, XLA, and MLIR? (coming soon)
CUDA C++ Alternatives: OpenCL, SYCL, and More
There are many projects that aimed to unlock GPU programming, but the one I know best is OpenCL. Like CUDA, OpenCL aimed to give programmers a C++-like experience for writing code that ran on the GPU. The history is personal: in 2008, I was one of the lead engineers implementing OpenCL at Apple (it was the first production use of the Clang compiler I was building). After we shipped it, we made the pivotal decision to contribute it to the Khronos Group so it could get adopted and standardized across the industry.
That decision led to broad industry adoption of OpenCL (see the logos), particularly in mobile and embedded devices. Today, it remains hugely successful, powering GPU compute on platforms like Android, as well as in specialized applications such as DSPs. Unlike CUDA, OpenCL was designed for portability from the outset, aiming to support heterogeneous compute across CPUs, GPUs, and other accelerators. OpenCL also inspired other systems like SYCL, Vulkan, SPIR-V, oneAPI, WebCL, and many others.
However, despite its technical strengths and broad adoption, OpenCL never became the dominant AI compute platform. There are several major reasons for this: the inherent tensions of open coopetition, technical problems that flowed from that, the evolving requirements of AI, and NVIDIA’s unified strategy with TensorFlow and PyTorch.
“Coopetition” at Committee Speed
In 2008, Apple was a small player in the PC space and thought that industry standardization would help it reach more developers. However, while OpenCL did gain broad adoption among hardware makers, its evolution quickly ran into a major obstacle: the speed of committee-driven development. For Apple, this slow-moving, consensus-driven process was a dealbreaker: we wanted to move the platform rapidly, add new features (e.g., C++ templates), and express the differentiation of the Apple platform. We faced a stark reality - the downside of a committee standard is that things suddenly moved at committee-consensus speed… which felt glacial.
Hardware vendors recognized the long-term benefits of a unified software ecosystem, but in the short term, they were fierce competitors. This led to subtle but significant problems: instead of telling the committee about the hardware features you’re working on (and giving a competitor a head start), participants would keep innovations secret until after the hardware shipped, and only discuss them once those features became commoditized (shipping vendor-specific extensions in the meantime).

This became a huge problem for Apple, a company that wanted to move fast in secret to make a big splash with product launches. As such, Apple decided to abandon OpenCL: it introduced Metal instead, never brought OpenCL to iOS, and later deprecated it on macOS. Other companies stuck with OpenCL, but these structural challenges continued to limit its ability to evolve at the pace of cutting-edge AI and GPU innovation.
Technical Problems with OpenCL
While Apple boldly decided to contribute the OpenCL standard to Khronos, it wasn’t all-in: it contributed OpenCL as a technical specification—but without a full reference implementation. Though parts of the compiler front-end (Clang) were open source, there was no shared OpenCL runtime, so each vendor had to complete the compiler and maintain its own implementation (a “fork”). Without a shared, evolving reference, OpenCL became a patchwork of vendor-specific forks and extensions. This fragmentation ultimately weakened its portability—the very thing it was designed to enable.
Furthermore, vendors held back differentiated features or isolated them in vendor-specific extensions, which exploded in number and fragmented OpenCL (and its derivatives), eroding its ability to be a unifying, vendor-agnostic platform. These problems were exacerbated by weaknesses in OpenCL’s compatibility and conformance tests. On top of that, it inherited all the “C++ problems” that we discussed before.
Developers want stable, well-supported tools—but OpenCL’s fragmentation, weak conformance tests, and inconsistent vendor support made it an exercise in frustration. One developer summed it up by saying that using OpenCL is “about as comfortable as hugging a cactus”! Ouch.

While OpenCL was struggling with fragmentation and slow committee-driven evolution, AI was rapidly advancing—both in software frameworks and hardware capabilities. This created an even bigger gap between what OpenCL offered and what modern AI workloads needed.
The Evolving Needs of AI Research and AI GPU Hardware
The introduction of TensorFlow and PyTorch kicked off a revolution in AI research - powered by improved infrastructure and a massive influx of BigCo funding. This posed a major challenge for OpenCL. While it enabled GPU compute, it lacked the high-level AI libraries and optimizations necessary for training and inference at scale. Unlike CUDA, it had no built-in support for key operations like matrix multiplication, Flash Attention, or datacenter-scale training.
Cross-industry efforts to extend TensorFlow and PyTorch to use OpenCL quickly ran into fundamental roadblocks (despite the obvious need and incredible demand). The developers who kept hugging the cactus soon discovered a harsh reality: portability to new hardware is meaningless if you can’t unlock its full performance. Without a way to express hardware-specific enhancements portably—and with coopetition crushing collaboration—progress stalled.
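That "full performance" point is easy to make concrete even on a CPU. The sketch below (pure Python, illustrative only) computes the same matrix product two ways: a naive triple loop, and a blocked (tiled) version whose tile size is a hardware-dependent tuning knob—a stand-in for the kind of hardware-specific detail a portable API must be able to express, or performance gets left on the table.

```python
# Naive vs. blocked (tiled) matrix multiplication in pure Python.
# The tile size is a hardware-dependent tuning knob: a portable API
# that cannot express such details leaves performance on the table.

def matmul_naive(A, B, n):
    """Textbook triple-loop matrix multiply: C = A @ B."""
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += A[i][k] * B[k][j]
            C[i][j] = s
    return C

def matmul_blocked(A, B, n, tile=16):
    """Same arithmetic, but iterating tile-by-tile improves locality.
    The best `tile` value depends on the hardware's cache sizes."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + tile, n)):
                            C[i][j] += a * B[k][j]
    return C
```

Both versions produce identical results; only the loop structure differs. Real GPU libraries take this much further (shared-memory tiling, Tensor Core instructions), which is exactly the layer OpenCL had no portable way to reach.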
One glaring example? OpenCL still doesn’t provide standardized support for Tensor Cores—the specialized hardware units that power efficient matrix multiplications in modern GPUs and AI accelerators. This means that using OpenCL often brings a 5x to 10x slowdown compared to CUDA or other vendor-native software. For GenAI, where compute costs are already astronomical, a 5x to 10x slowdown isn’t just inconvenient—it’s a complete dealbreaker.
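To put that multiplier in dollar terms, here is a back-of-the-envelope sketch. The $10M baseline is a hypothetical figure chosen for illustration, not a real training budget; the point is only that compute cost scales roughly linearly with wall-clock GPU time.

```python
# Back-of-the-envelope: what a 5x-10x kernel slowdown does to a
# hypothetical $10M training budget (illustrative numbers only).
baseline_cost = 10_000_000  # USD, assuming well-tuned Tensor Core kernels

for slowdown in (5, 10):
    # Cost scales roughly linearly with GPU-hours consumed
    cost = baseline_cost * slowdown
    print(f"{slowdown}x slowdown -> ${cost:,} (${cost - baseline_cost:,} extra)")
```

At these scales, even a modest constant-factor slowdown dwarfs any savings from avoiding vendor lock-in—which is why "portable but slow" never won.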
NVIDIA’s Strategic Approach with TensorFlow and PyTorch
While OpenCL struggled under the weight of fragmented governance, NVIDIA took a radically different approach—one that was tightly controlled, highly strategic, and ruthlessly effective, as we discussed earlier. It actively co-designed CUDA’s high-level libraries alongside TensorFlow and PyTorch, ensuring they always ran best on NVIDIA hardware. Since these frameworks were natively built on CUDA, NVIDIA had a massive head start—and it doubled down by optimizing performance out of the box.
NVIDIA maintained a token OpenCL implementation—but it was strategically hobbled (e.g., unable to use Tensor Cores)—ensuring that a CUDA implementation would always be necessary. NVIDIA’s continued and rising industry dominance ensured that the CUDA implementations would always be the most heavily invested in. Over time, OpenCL support faded, then vanished—while CUDA cemented its position as the undisputed standard.
What Can We Learn From These C++ GPU Projects?
The history above is well understood by those of us who lived through it, but the real value comes from learning from the past. Based on this, I believe successful systems must:
- Provide a reference implementation, not just a paper specification and “compatibility” tests. A working, adoptable, and scalable implementation should define compatibility—not a PDF.
- Have strong leadership and vision driven by whoever maintains the reference implementation.
- Run with top performance on the industry leader’s hardware—otherwise, it will always be a second-class alternative, not something that can unify the industry.
- Evolve rapidly to meet changing requirements, because AI research isn’t stagnant, and AI hardware innovation is still accelerating.
- Cultivate developer love, by providing great usability, tools, and fast compile times. Also, “C++ like” isn’t exactly a selling point in AI!
- Build an open community, because without widespread adoption, technical prowess doesn’t matter.
- Avoid fragmentation—a standard that splinters into incompatible forks can’t provide an effective unification layer for software developers.
These are the fundamental reasons why I don’t believe committee efforts like OpenCL can ever succeed. It’s also why I’m even more skeptical of projects like Intel’s oneAPI (now the UXL Foundation) that are notionally open but, in practice, controlled by a single hardware vendor competing with all the others.
What About AI Compilers?
At the same time that C++ approaches failed to unify AI compute for hardware makers, the AI industry faced a bigger challenge—even using CUDA on NVIDIA hardware. How can we scale AI compute if humans have to write all the code manually? There are too many chips, too many AI algorithms, and too many workload permutations to optimize by hand.
As AI’s dominance grew, it inevitably attracted interest from systems developers and compiler engineers—including myself. In the next post, we’ll dive into widely known “AI compiler” stacks like TVM, OpenXLA, and MLIR—examining what worked, what didn’t, and what lessons we can take forward. Unfortunately, the lessons are not wildly different from the ones above:
“History may not repeat itself, but it does rhyme.” - Mark Twain
See you next time—until then, may the FLOPS be with you! 👨‍💻
-Chris
What’s Next?
Learn more about the MAX Platform and the Mojo programming language, and join us in building the next wave of AI innovation.
- Get started with MAX - explore our latest platform to accelerate your AI work.
- Check out the Mojo manual and get started with our next-generation language for programming GPUs.
- Share your thoughts on this post and questions for Chris in the Modular forum. Chris will answer questions on this series in the next community meeting on Monday, March 10th.