Photo: P.K. Gupta
Big data analytics, on the other hand, is more experimental and likely to be conducted on an ad hoc basis. The data involved is also mostly semi-structured, may come from external as well as operational sources, and ranges in volume from tens of terabytes to hundreds of petabytes.
But what is big data? According to Gupta, big data consists of large data sets characterised by volume (for example, terabytes in size, and millions of transactions, tables, records and files), velocity (in processing, whether in batch, near-time, real-time, or streaming), and variety (structured, unstructured or semi-structured).
From another perspective, BI and analytics can be understood from the kinds of questions businesses are asking. Past data enables the business to understand "what happened" through reporting and dashboards, and "why did it happen" through forensics and data mining. "Real-time analytics answers 'what is happening', and real-time data mining seeks to answer 'why it is happening'," said Gupta. As the terms suggest, predictive analytics answers "what is likely to happen" and prescriptive analytics answers "what should be done about it".
Gupta highlighted 10 use cases for big data analytics (see table below: 10 Use Cases). Industries embracing big data include retail, financial services, manufacturing, government, advertising and public relations, media and telecom, energy, and healthcare and life sciences.
Quoting research firm Gartner, Gupta said: "Enterprises can gain a competitive advantage by being early adopters of big data analytics. Big data analytics and the Apache Hadoop open source project are rapidly emerging as the preferred solution to address business and technology trends that are disrupting traditional data management and processing."
Simply put, Hadoop is a scalable, fault-tolerant distributed system for data storage and processing. Core Hadoop has two main components: the Hadoop Distributed File System (HDFS), which enables self-healing, high-bandwidth clustered storage; and MapReduce, a fault-tolerant distributed processing framework built on a programming model that maps inputs to outputs and then reduces the outputs of multiple mappers to one or a few answers.
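To make the map-and-reduce idea concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop's API and does not come from Gupta's talk; the function names (map_phase, shuffle, reduce_phase, word_count) and the sample sentences are hypothetical, chosen only to illustrate the programming model. A real cluster would run many mappers and reducers in parallel on data held in HDFS.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper (hypothetical name): emit a (word, 1) pair for each word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group mapper outputs by key, the step a framework performs between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer (hypothetical name): collapse all values for one key into one answer."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Illustrative input only; each line stands in for one mapper's input split.
    sample = ["big data needs big storage", "data drives decisions"]
    print(word_count(sample))
```

Each line is mapped independently, which is what allows the work to be spread across machines; the reduce step then combines the outputs of all mappers into a few answers, here one count per word.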
Deploying public, private and hybrid storage clouds
Thomas Chua, education instructor with SNIA South Asia, talked about deploying public, private and hybrid storage clouds. He pointed out that a cloud has five major characteristics: an on-demand self-service portal, resource pooling, rapid elasticity, measured service, and broad network access. Cloud services fall into three categories: software-as-a-service (SaaS), platform-as-a-service (PaaS) and infrastructure-as-a-service (IaaS). However, which deployment model is chosen, whether public, private, community or hybrid cloud, will usually depend on cost.
Photo: Thomas Chua