Hadoop, an open source framework that enables distributed computing, has changed the way we deal with big data. Parallel processing with this set of tools can improve performance several times over. The question is, can we make it work even faster? What about offloading calculations from a CPU to a graphics processing unit (GPU) designed to perform complex 3D and mathematical tasks? In theory, if the process is optimized for parallel computing, a GPU could perform calculations 50-100 times faster than a CPU.
This article, written by the R&D team of Altoros Systems, a big data specialist and platform-as-a-service enabler, explores what is possible and how you can try this for your large-scale system.
The idea itself isn't new. For years scientific projects have tried to combine the Hadoop or MapReduce approach with the capacities of a GPU. Mars seems to be the first successful MapReduce framework for graphics processors. The project achieved a 1.5x-16x increase in performance when analyzing Web data (search/logs) and processing Web documents (including matrix multiplication).
Following Mars' groundwork, other scientific institutions developed similar tools to accelerate their data-intensive systems. Use cases included molecular dynamics, mathematical modeling (e.g., the Monte Carlo method), block-based matrix multiplication, financial analysis, image processing, etc.
On top of that, there is BOINC, a fast-evolving, volunteer-driven middleware system for grid computing. Although it does not use Hadoop, BOINC has already become a foundation for accelerating many scientific projects. For instance, GPUGRID relies on BOINC's GPU and distributed computing to perform molecular simulations that help "to understand the function of proteins in health and disease." Most of other BOINC projects related to medicine, physics, mathematics, biology, etc. could be implemented using Hadoop+GPU, too.
So, the demand for accelerating parallel computing systems with GPUs does exist. Institutions invest in supercomputers with GPUs or develop their own solutions. Hardware vendors, such as Cray, have released machines equipped with GPUs and pre-configured with Hadoop. Amazon has also launched Elastic MapReduce (Amazon EMR), which enables Hadoop on its cloud servers with GPUs.
But does one size fit all? Supercomputers provide the highest possible performance, yet cost millions of dollars. Using Amazon EMR is feasible only in projects that last for several months. For larger scientific projects (two to three years), investing in your own hardware may be more cost-effective. Even if you increase the speed of calculations using GPU within your Hadoop cluster, what about performance bottlenecks related to data transfer? Let's explore this in detail.
How it works
Data processing implies data exchange between HDD, DRAM, CPU and GPU. Figure 1 shows how data is transferred when a commodity machine performs computations with a CPU and a GPU.
Sign up for CIO Asia eNewsletters.