Now that big data technologies like Apache Hadoop are moving into the enterprise, system engineers must start building models that can estimate how much work these distributed data processing systems can do and how quickly they can get their work done.
Having accurate models of big data workloads means organizations can better plan and allocate resources to these jobs, and can confidently assert when the results of this work can be delivered to customers.
Estimating big data jobs, however, is tricky business, and the process cannot rely entirely on traditional modeling tools, according to researchers speaking at the USENIX annual conference on autonomic computing, being held this week in Philadelphia.
"It's almost impossible to be accurate, because you are dealing with a non-deterministic system," said Lucy Cherkasova, a researcher at Hewlett-Packard Labs.
She explained that Hadoop systems are non-deterministic because they have a wide range of variable factors that can contribute to how long it takes for a job to finish.
The average Hadoop system might have up to 190 parameters to set in order to start running, and each Hadoop job may have different requirements for how much computation, bandwidth, memory or other resources it needs.
Cherkasova has been working on models, and associated tools, to estimate how long a large data processing job will take to run on Hadoop or other large data processing systems, in a project called ARIA (Automatic Resource Inference and Allocation for MapReduce Environments).
ARIA aims to answer the question, "How many resources should I allocate to this job, if I want to process this data by this deadline," Cherkasova said.
One might assume that if you double the number of resources of a Hadoop job, the time required to complete the job would be cut in half. "This is not the case" with Hadoop, Cherkasova said.
Job profiles can change in non-linear ways depending on the number of servers being used. The performance bottlenecks in a Hadoop cluster for 66 nodes are different from the bottlenecks found in a Hadoop cluster of 1,000 nodes, she said.
The performance can vary according to the type of job as well. Some of the research Cherkasova carried out involved studying what sized virtual machine would be best suited for Hadoop jobs.
For instance, Amazon Web Services (AWS) offers a range of virtual servers, from small instances with a single processor to larger ones with eight or more processors. Because Hadoop is a distributed system, it was made to run on multiple servers. But would it be more cost-effective to run Hadoop across many smaller instances, or on fewer though larger smaller instances?
Cherkasova found that the answer depends on the workload.
Sign up for CIO Asia eNewsletters.