NEW YORK— Microsoft today used the first day of O'Reilly Strata Conference + Hadoop World in New York City to announce that its Windows Azure HDInsight Service is now generally available after a year in preview.
The HDInsight Service, designed in partnership with Hadoop specialist Hortonworks, makes standard Apache Hadoop available as a service in Microsoft's Azure cloud, allowing you to deploy Hadoop clusters in minutes and shut them down just as easily.
Integration with the Microsoft data platform means that you can access and analyze your data with PowerPivot, Power View and other Microsoft BI tools, like Microsoft SQL Server Analysis Services (SSAS).
"Hadoop is a cornerstone of big data," says Quentin Clark, corporate vice president, Microsoft Data Platform. "The need for the insights and results and transformations from big data is really there. There are companies talking to us about how they don't feel they can even be competitive without embracing the big data phenomenon."
The goal, Clark says, is to bring Hadoop together with the flexibility of cloud deployment and the security that enterprises require to help customers achieve the competitive edge they need.
DNA Sequencing with HDInsight Service
The use cases are many and varied. For instance, Virginia Polytechnic Institute and State University has been using the HDInsight Service to aid its life sciences research in DNA sequencing.
Leveraging a grant from the National Science Foundation, Virginia Tech computer scientists developed an on-demand, cloud computing model using Windows Azure HDInsight Service that helps locate undetected genes in a massive genome database.
"The estimated 2,000 DNA sequencers worldwide are generating 15 petabytes of genome data every year," says Wu Feng, professor of Computer Science at Virginia Tech. "Many life sciences institutions simply do not have access to the computational and storage resources required to work with data sets of this size. We're generating data faster than we can analyze it."
Feng and his team used the grant to develop two software artifacts: SeqInCloud, which is built on a popular genetic variant pipeline called the Genome Analysis Toolkit (GATK), and CloudFlow, a workflow management framework that uses both client and cloud resources.
SeqInCloud generalizes the GATK pipeline, allowing it to run in the cloud using HDInsight and Azure to maximize portability. Meanwhile, CloudFlow, installed on a researcher's PC, aids interactions with the Windows Azure HDInsight Service.