USENIX researchers get a grip on Hadoop performance

Joab Jackson | June 23, 2014
Modeling Hadoop jobs can be tricky because of all the moving parts, researchers say.

One type of job, Terasort, in which a large amount of data is sorted, can be completed five times faster on a collection of small AWS instances than on large instances.

The performance of another type of job, the Kmeans clustering algorithm, does not vary with the kind of instance used, however. It runs equally well on small, medium, or large instances, meaning the user can run a Kmeans job on the more cost-effective large instances without sacrificing any speed.

Cherkasova's work in this field has been important because to date there have been very few widely cited studies on modeling Hadoop performance, said Anshul Gandhi, an IBM researcher who was on the USENIX organizing committee for the conference.

Studying Hadoop can be a challenge because few researchers have access to large Hadoop systems, which are too costly to build and test, Gandhi said.

Cristina Abad, a computer science Ph.D. candidate at the University of Illinois at Urbana-Champaign, has also been working in this area.

Abad has developed a benchmark, called MimesisBench, designed to model the performance of next-generation storage systems, and has used it to model a workload on a 4,100-node Yahoo cluster running the Hadoop Distributed File System (HDFS).

The benchmark can help determine if a storage system can accommodate an increased workload, which can be valuable information for determining whether to make major architectural changes when increasing the throughput of a data processing system.

The benchmark showed, for instance, that the Yahoo cluster would start experiencing increased latency when handling more than roughly 16,800 operations per second, a higher threshold than expected.

The benchmark could also help in other architectural decisions. For its storage system, Yahoo used a hierarchical namespace, in which files are organized into groups or subdirectories. If Yahoo were to use a flat namespace, where all the files are located in a single list, latency would start spiking at about 10,284 operations per second, the model showed.

 
