The Apache Hadoop project includes four modules: Hadoop Common (utilities), Hadoop Distributed File System (HDFS), Hadoop YARN (scheduler) and Hadoop MapReduce (parallel processing). On top of or instead of these, people often use one or more of the related projects: Ambari (cluster management), Avro (data serialization), Cassandra (multi-master database), Chukwa (data collection), HBase (distributed database), Hive (data warehouse), Mahout (ML and data mining), Pig (execution framework), Spark (compute engine), Tez (data-flow programming framework intended to replace MapReduce), and ZooKeeper (coordination service).
If that isn't complicated enough, factor in Apache Storm (stream processing) and Kafka (message transfer). Now consider the value added by vendors: Amazon (Elastic MapReduce), Cloudera, Hortonworks, Microsoft (HDInsight), MapR, and SAP Altiscale. Confused yet?
Heating up: R language
Who: Data scientists with strong statistics
Data scientists have a number of options for analyzing data using statistical methods. One of the most convenient and powerful is the free R programming language. R is one of the best ways to create reproducible, high-quality analysis, since unlike a spreadsheet, R scripts can be audited and re-run easily. The R language and its package repositories provide a wide range of statistical techniques, data manipulation and plotting, to the point that if a technique exists, it is probably implemented in an R package. R is almost as strong in its support for machine learning, although it may not be the first choice for deep neural networks, which require higher-performance computing than R currently delivers.
R is available as free open source, and is embedded into dozens of commercial products, including Microsoft Azure Machine Learning Studio and SQL Server 2016.
Heating up: Deep neural networks
Who: Data scientists
Some of the most powerful deep learning algorithms are deep neural networks (DNNs), which are neural networks constructed from many layers (hence the term "deep") of alternating linear and nonlinear processing units, and are trained using large-scale algorithms and massive amounts of training data. A deep neural network might have 10 to 20 hidden layers, whereas a typical neural network may have only a few.
The more layers in the network, the more characteristics it can recognize. Unfortunately, the more layers in the network, the longer it will take to calculate, and the harder it will be to train. Packages for creating deep neural networks include Caffe, Microsoft Cognitive Toolkit, MXNet, Neon, TensorFlow, Theano, and Torch.
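To make the layered structure concrete, here is a minimal sketch in plain Python of a forward pass through alternating linear and nonlinear units. It is an illustration only, not how any of the packages above are implemented: the weights are random and untrained, ReLU is an assumed choice of nonlinearity, and the layer sizes and function names (build_network, forward, count_parameters) are hypothetical.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Create one linear unit: an (n_out x n_in) weight matrix and a bias vector."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def linear(x, weights, biases):
    """Linear processing unit: weighted sum of the inputs plus a bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(x):
    """Nonlinear processing unit (ReLU is an assumption made for this sketch)."""
    return [max(0.0, v) for v in x]


def build_network(layer_sizes, seed=0):
    """Build one linear layer for each consecutive pair of sizes."""
    rng = random.Random(seed)
    return [make_layer(a, b, rng)
            for a, b in zip(layer_sizes, layer_sizes[1:])]


def forward(network, x):
    """Alternate linear and nonlinear units; the final layer stays linear."""
    for i, (weights, biases) in enumerate(network):
        x = linear(x, weights, biases)
        if i < len(network) - 1:
            x = relu(x)
    return x


def count_parameters(network):
    """Total weights and biases, a rough proxy for compute and training cost."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in network)


# Illustrative sizes: 4 inputs, 3 outputs, hidden layers of width 8.
shallow = build_network([4, 8, 3])           # 1 hidden layer
deep = build_network([4] + [8] * 10 + [3])   # 10 hidden layers

sample = [1.0, 0.5, -0.5, 2.0]
print(len(forward(shallow, sample)), count_parameters(shallow))
print(len(forward(deep, sample)), count_parameters(deep))
```

Both networks map the same four inputs to three outputs, but the deeper one carries roughly ten times as many weights to compute and train, which is the tradeoff described above.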
Cooling down: IoT
Who: BI/BA pros, data scientists
The Internet of Things (IoT) may be the most-hyped set of technologies, ever. It may also be the worst thing that happened to Internet security, ever.
IoT has been touted for smart homes, wearables, smart cities, smart grids, industrial internet, connected vehicles, connected health, smart retail, agriculture, and a host of other scenarios. Many of these applications would make sense if the implementations were secure, but by and large that hasn't happened.