Data exchange between components of a commodity computer when executing a task
Arrow A: Transferring data from an HDD to DRAM (a common initial step for both CPU and GPU computing)
Arrow B: Processing data with a CPU (transferring data: DRAM → chipset → CPU)
Arrow C: Processing data with a GPU (transferring data: DRAM → chipset → CPU → chipset → GPU → GDRAM → GPU)
As a result, the total amount of time we need to complete any task includes:
- the amount of time required for a CPU or a GPU to carry out computations
- plus the amount of time spent on data transfer between all of the components
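The two components above can be put into a rough back-of-the-envelope model. The sketch below is only an illustration: the function name, its parameters, and the sample workload (1 TFLOP of work, 4 GB of data) are assumptions made for the example, not measurements from any real system, and it ignores any overlap between computation and transfer.

```python
def total_time(work_flop, device_gflops, bytes_moved, link_gbps):
    """Estimate task time in seconds as compute time plus transfer time.

    work_flop     -- floating-point operations the task needs (assumed known)
    device_gflops -- sustained device throughput in GFLOPS
    bytes_moved   -- bytes transferred between components
    link_gbps     -- link bandwidth in gigabytes per second (GBps)
    """
    compute_s = work_flop / (device_gflops * 1e9)
    transfer_s = bytes_moved / (link_gbps * 1e9)
    return compute_s + transfer_s


# Hypothetical workload: 1e12 FLOP over 4 GB of data.
cpu_s = total_time(1e12, device_gflops=100, bytes_moved=4e9, link_gbps=10)
gpu_s = total_time(1e12, device_gflops=1000, bytes_moved=4e9, link_gbps=1)
print(f"CPU: {cpu_s:.1f} s, GPU: {gpu_s:.1f} s")
```

With these illustrative numbers the GPU computes ten times faster, yet the slower link consumes most of its total time.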
According to Tom's Hardware (CPU Charts 2012), the performance of an average CPU ranges from 15 to 130 GFLOPS. By comparison, the performance of Nvidia GPUs ranges from 100 to more than 3,000 GFLOPS (2012 comparison). All of these measurements are approximate and depend largely on the type of task and the algorithm. Even so, in some scenarios a GPU can speed up computations by roughly 5x to 25x per node. Some developers claim that if your cluster consists of several nodes, performance can be accelerated by 50x-200x. For example, the creators of the MITHRA project report a 254x increase.
However, what about the impact of data transfer? Different types of hardware transfer data at different speeds. Although supercomputers are most likely optimized for working with GPUs, a regular computer or server may be much slower when exchanging data.
While an average CPU exchanges data with the chipset at 10-20 GBps (see Point Y in Figure 1), a GPU exchanges data with DRAM at 1-10 GBps (see Point X). Although some systems may reach up to 10 GBps (PCIe v3), in most standard configurations data flows between the GPU's memory (GDRAM) and the computer's DRAM at about 1 GBps. (We recommend measuring the actual values on real hardware, since the bandwidths at Points X and Y, and the corresponding data transfer rates along Arrows C and B, can be about the same or differ by a factor of 10.)
So although a GPU computes faster, the main bottleneck is the slow data transfer between GPU memory and CPU memory (Point X). For every particular project, you need to weigh the time spent on data transfer to and from the GPU (Arrow C) against the time saved by GPU acceleration. The best approach is to evaluate the actual performance on a small cluster and then estimate how the system will behave at a larger scale.
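To make this trade-off concrete, here is a minimal sketch of the effective speedup once transfer time is counted. The function name and all input values are hypothetical. In practice, the CPU time, the raw GPU speedup, and the transfer time would come from measurements on your own hardware.

```python
def effective_speedup(cpu_seconds, gpu_compute_speedup, transfer_seconds):
    """Return the end-to-end speedup of a GPU run over a CPU-only run.

    cpu_seconds         -- measured CPU-only compute time
    gpu_compute_speedup -- how many times faster the GPU does the pure computation
    transfer_seconds    -- measured time to move data to and from the GPU
    """
    gpu_total = cpu_seconds / gpu_compute_speedup + transfer_seconds
    return cpu_seconds / gpu_total


# Long job: transfer cost is amortized, so the GPU still wins.
long_job = effective_speedup(cpu_seconds=10.0, gpu_compute_speedup=10, transfer_seconds=2.0)

# Short job on the same data: transfer dominates and the GPU run is slower.
short_job = effective_speedup(cpu_seconds=1.0, gpu_compute_speedup=10, transfer_seconds=2.0)

print(f"long job: {long_job:.2f}x, short job: {short_job:.2f}x")
```

A result below 1.0 means the GPU run is slower end to end than the CPU-only run, even though the GPU's raw computation is faster.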