E-commerce giant Amazon.com is quickly becoming one of the largest holders of data in the world, with around 450 billion objects stored in its cloud for its customers' and its own storage needs. Alyssa Henry, vice president of storage services at Amazon Web Services, says that translates into about 1,500 objects for every person in the U.S. and one object for every star in the Milky Way galaxy.
Some of the stored objects are fairly massive -- up to 5TB each -- and could be databases in their own right. Henry expects single-object size to get as high as 500TB by 2016. The secret to dealing with massive data, she says, is to split the objects into chunks that can be handled in parallel, a process called parallelization.
In its S3 storage service, Amazon uses its own custom code to split files into 1,000MB pieces. This is a common practice, but what makes Amazon's approach unique is how the file-splitting process occurs in real time. "This always-available storage architecture is a contrast with some storage systems which move data between what are known as 'archived' and 'live' states, creating a potential delay for data retrieval," Henry explains.
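To make the chunking idea concrete, here is a minimal Python sketch of splitting an object into fixed-size pieces and handling the pieces in parallel. It is an illustration only, not Amazon's code: the function names, the worker count and the tiny chunk size in the usage note are assumptions, and `store_chunk` merely stands in for whatever a real system does with each piece.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data: bytes, chunk_size: int) -> list[bytes]:
    """Cut data into consecutive pieces of at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def store_chunk(indexed_chunk: tuple[int, bytes]) -> tuple[int, int]:
    """Hypothetical stand-in for storing one piece; reports (index, size)."""
    index, chunk = indexed_chunk
    return index, len(chunk)


def store_in_parallel(data: bytes, chunk_size: int, workers: int = 4) -> list[tuple[int, int]]:
    """Split data, then hand the pieces to a pool of worker threads."""
    chunks = split_into_chunks(data, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input chunks.
        return list(pool.map(store_chunk, enumerate(chunks)))
```

Calling `store_in_parallel(payload, 1000)` on a 2,560-byte payload yields three pieces of 1,000, 1,000 and 560 bytes. A production system would use far larger pieces, such as the 1,000MB size the article cites.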
Another problem in handling massive data is corrupt files. Most companies don't worry about the occasional corrupt file. Yet when dealing with around 450 billion objects, even low failure rates become challenging to manage, because a tiny fraction of that total is still a large number of damaged objects.
Amazon's custom software analyzes every piece of data for bad memory allocations, calculates the checksums, and analyzes how fast an error can be repaired to deliver the throughput needed for cloud storage.
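The checksum step can be sketched in a few lines of Python. The article does not say which checksum Amazon uses, so the choice of SHA-256 here is an assumption, as are the function names and the in-memory dictionary that plays the role of the storage system.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return a SHA-256 digest of the data (algorithm choice is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def store_with_checksum(store: dict, key: str, data: bytes) -> None:
    """Record the data together with the checksum computed at write time."""
    store[key] = (data, checksum(data))


def find_corrupt(store: dict) -> list[str]:
    """Recompute each checksum and list the keys whose data no longer matches."""
    return [
        key
        for key, (data, expected) in store.items()
        if checksum(data) != expected
    ]
```

Running `find_corrupt` periodically over stored objects flags any whose bytes have changed since they were written, which is the point at which a repair from a healthy copy could begin.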
Mazda Motor Corp., with 900 dealers and 800 employees in the U.S., manages around 90TB of data. Barry Blakeley, infrastructure architect at Mazda's North American operations, says business units and dealers are generating ever-increasing amounts of data analytics files, marketing materials, business intelligence databases, Microsoft SharePoint data and more. "We have virtualized everything, including storage," says Blakeley. The company uses tools from Compellent, now part of Dell, for storage virtualization and Dell PowerVault NX3100 as its SAN, along with VMware systems to host the virtual servers.
The key, says Blakeley, is to migrate "stale" data quickly onto tape. He says 80% of Mazda's stored data becomes stale within months, which means the blocks of data are not accessed at all. To accommodate these usage patterns, the virtual storage is set up in a tiered structure. Fast solid-state disks connected by Fibre Channel switches make up the first tier, which handles 20% of the company's data needs. The rest of the data is archived to slower disks running at 15,000 rpm on Fibre Channel in a second tier and to 7,200-rpm disks connected by serial-attached SCSI in a third tier.
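A tiering policy of this kind can be pictured as a simple rule based on how recently data was last accessed. The Python sketch below is hypothetical: the 30-day and 180-day cutoffs are illustrative assumptions, not Mazda's actual settings, and the function name is invented for this example. Only the three tier descriptions come from the article.

```python
from datetime import datetime, timedelta

# Hypothetical age cutoffs; the article does not give Mazda's thresholds.
TIER_CUTOFFS = (
    (timedelta(days=30), "tier 1: solid-state disks (Fibre Channel)"),
    (timedelta(days=180), "tier 2: 15,000-rpm disks (Fibre Channel)"),
)
COLDEST_TIER = "tier 3: 7,200-rpm disks (serial-attached SCSI)"


def choose_tier(last_access: datetime, now: datetime) -> str:
    """Pick a storage tier from the time elapsed since the last access."""
    age = now - last_access
    for max_age, tier in TIER_CUTOFFS:
        if age <= max_age:
            return tier
    # Anything older than every cutoff is treated as stale.
    return COLDEST_TIER
```

Under these assumed cutoffs, a block touched last week stays on solid-state disks, while one untouched for a year lands on the 7,200-rpm tier. Real storage virtualization products typically apply such rules automatically at the block level.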