When Google first told the world about its Tensor Processing Unit, the strategy behind it seemed clear enough: Speed machine learning at scale by throwing custom hardware at the problem. Use commodity GPUs to train machine-learning models; use custom TPUs to deploy those trained models.
The new generation of Google’s TPUs is designed to handle both of those duties, training and deploying, on the same chip. That new generation is also faster, both on its own and when scaled out with others in what’s called a “TPU pod.”
But faster machine learning isn’t the only benefit from such a design. The TPU, especially in this new form, constitutes another piece of what amounts to Google building an end-to-end machine-learning pipeline, covering everything from intake of data to deployment of the trained model.
Machine learning: A pipeline runs through it
One of the largest obstacles to using machine learning right now is how tough it can be to put together a full pipeline for the data—intake, normalization, model training, model and deployment. The pieces are still highly disparate and uncoordinated. Companies like Baidu have hinted at wanting to create a single, unified, unpack-and-go solution, but so far that’s just a notion.
The most likely place for such a solution to emerge is in the cloud. As time goes by, much more of the data collected for machine learning (and everything else, really) lives there by default. So does the hardware needed to produce actionable results from it. Give people a single end-to-end, in-the-cloud workflow for machine learning, one with only a few knobs on it by default, and they’ll be happy to build on top of it.
Already mostly realized, Google’s vision is that each phase of the pipeline can be executed in the cloud, as close as possible to the data, for the best possible speed. With TPUs, Google’s also seeks to provide many of the phases with custom hardware acceleration that can be scaled out on demand.
The new TPUs are meant to boost pipeline acceleration in several ways. One speedup comes from being able to gang multiple TPUs. Another comes from being able to train and deploy models from the same slab of silicon. With the latter, it’s easier to incrementally retrain models as new data comes in, because the data doesn’t have to be moved around as much.
That optimization—operating on data where it is to speed up operations on it—is also right in line with other machine learning performance improvements in the works, such as some proposed Linux kernel fixes and common APIs for machine learning data access.
Sign up for CIO Asia eNewsletters.