As more organizations, small and large, realize the benefits of big data, we are seeing an increase in solutions addressing specific problem domains for big data analysis. One of the challenges of big data is analyzing collections of data distributed across different technology stacks. Apache Spark provides a computational engine that can pull data from multiple sources and analyze it using a common abstraction: the resilient distributed dataset, or RDD. Its core operation types are transformations, which massage your data and convert it from its source format into the form you want to analyze, and actions, which derive your business value.
In this article, we've built a small, locally run Spark application whose purpose is to count words. You've practiced several different transformations and one save action, and had an overview of Spark's programming and execution models. I've also discussed Spark's support for running distributed across multiple machines, using drivers and executors coordinated by a cluster manager such as Hadoop YARN or Apache Mesos. Finally, I discussed extensions built on top of Spark for analyzing data in a SQL database, for processing data from a streaming source, and for applied analysis such as machine learning and graph processing.
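To recap the shape of that word-count pipeline, here is a minimal sketch in plain Python rather than Spark itself. The comments name the RDD operations each step corresponds to (flatMap, map, and reduceByKey are real RDD methods); the structure of the pipeline, not the Spark API, is what this sketch illustrates.

```python
from collections import defaultdict

def word_count(lines):
    # flatMap: split each line into individual words
    words = [w.lower() for line in lines for w in line.split()]
    # map: pair each word with an initial count of 1
    pairs = [(w, 1) for w in words]
    # reduceByKey: sum the counts for each distinct word
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    # In Spark, an action such as saveAsTextFile would materialize
    # these results; here we simply return them.
    return dict(counts)

print(word_count(["to be or not to be"]))
# → {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In Spark, each transformation would be lazy and distributed across executors, with only the final action triggering computation; the local version above collapses those stages into eager list operations.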
We've only scratched the surface of what is possible with Spark, but I hope that this article has both inspired you and equipped you with the fundamentals to start analyzing Internet-scale collections of data using Spark.