By and large, software that uses GPU acceleration relies on Nvidia’s CUDA libraries, which work only with Nvidia hardware. OpenCL, an open standard, provides vendor-neutral support across device types, but its performance generally trails that of dedicated solutions like CUDA.
Rather than struggle to bring OpenCL up to snuff, a slow, committee-driven process, AMD has spun up its own open source GPU computing platform: ROCm, the Radeon Open Compute Platform. The theory is that it provides a language- and hardware-independent middleware layer for GPUs, primarily AMD’s own but theoretically any GPU. ROCm can talk to GPUs by way of OpenCL if needed, but it also provides its own direct paths to the underlying hardware.
There’s little question ROCm can provide major performance boosts to machine learning over OpenCL. A port of the Caffe framework to ROCm yielded something like an 80 percent speedup over the OpenCL version. What’s more, AMD is touting how the process of converting code to use ROCm can be heavily automated, another incentive for existing frameworks to try it. Support for other frameworks, like TensorFlow and MXNet, is also being planned.
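To give a rough sense of why such a conversion lends itself to automation, here is a minimal sketch in Python. It assumes the starting point is CUDA code and the target is HIP, the CUDA-like C++ layer in ROCm, where many runtime calls differ from their CUDA counterparts only by prefix. The mapping table and the `translate` helper are illustrative inventions, not AMD's tooling; a real porting tool covers far more of the API and handles cases a simple rename cannot.

```python
import re

# Small, illustrative subset of CUDA-to-HIP name pairs (an assumption for this
# sketch; real porting tools cover far more of the API than this).
CUDA_TO_HIP = {
    "cuda_runtime.h": "hip/hip_runtime.h",
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

# Match whole identifiers only, so names that merely contain a CUDA call
# (for example my_cudaMalloc_wrapper) are left untouched.
_PATTERN = re.compile(
    r"\b(?:"
    + "|".join(re.escape(name) for name in sorted(CUDA_TO_HIP, key=len, reverse=True))
    + r")\b"
)


def translate(source: str) -> str:
    """Rename the known CUDA identifiers in `source` to their HIP equivalents."""
    return _PATTERN.sub(lambda match: CUDA_TO_HIP[match.group(0)], source)


if __name__ == "__main__":
    cuda_snippet = (
        "#include <cuda_runtime.h>\n"
        "cudaMalloc(&d_buf, n);\n"
        "cudaMemcpy(d_buf, h_buf, n, cudaMemcpyHostToDevice);\n"
        "cudaFree(d_buf);\n"
    )
    print(translate(cuda_snippet))
```

A rename pass like this only goes so far. Code that leans on vendor-specific libraries or hand-tuned kernels still needs human attention, which is part of why keeping ports current takes sustained effort.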
AMD is playing the long game
The ultimate goal AMD has in mind isn’t complicated: Create an environment where its GPUs can work as drop-in replacements for Nvidia’s in the machine-learning space. Do that by offering as good, or better, hardware performance for the dollar, and by ensuring the existing ecosystem of machine-learning software will also work with its GPUs.
In some ways, porting the software is the easiest part. It’s mostly a matter of finding manpower enough to convert the needed code for the most crucial open source machine-learning frameworks, and then to keep that code up to date as both the hardware and the frameworks themselves move forward.
What’s likely to be toughest of all for AMD is finding a foothold in the places where GPUs are offered at scale. All the GPUs offered in Amazon Web Services, Azure, and Google Cloud Platform are strictly Nvidia. Demand doesn’t yet support any other scenario. But if the next iteration of machine-learning software becomes that much more GPU-independent, cloud vendors will have one less excuse not to offer Vega or its successors as an option.
Still, any plans AMD has to bootstrap that demand are brave. They’ll take years to get up to speed, because AMD is up against the weight of a world that has for years been Nvidia’s to lose.