
Learn to crunch big data with R

Martin Heller | Feb. 12, 2015
Get started using the open source R programming language to do statistical computing and graphics on large data sets.

install.packages("ggplot2")

Note that ggplot2 is a popular advanced graphics package that has more options than R's standard graphics package. Nevertheless, the standard graphics package can do a lot. In addition to the graphics in Figures 2 and 3, consider Figures 4 and 5.

R can do much more in terms of graphics and statistical analysis. Do read Sharon Machlis's tutorial and follow up with her links to additional information. At this point, I want to expand my discussion to how you can analyze big data in R.

R in the cloud

When R programmers talk about "big data," they don't necessarily mean data that goes through Hadoop. They generally use "big" to mean data that can't be analyzed in memory.

The fact is you can easily get 16GB of RAM in a desktop or laptop computer. R running in 16GB of RAM can analyze millions of rows of data with no problem. Times have changed quite a bit since the days when a database table with a million rows was considered big.
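To get a feel for what fits in memory, a back-of-the-envelope estimate helps. The sketch below assumes every value is a numeric (double), which R stores in 8 bytes; the function name estimate_gb and the 10-million-row, 20-column table are made up for illustration, and real data frames carry some extra overhead.

```r
# Rough in-memory size of an all-numeric table, in GiB.
# Assumption: 8 bytes per value (R's double-precision numeric).
estimate_gb <- function(n_rows, n_cols, bytes_per_value = 8) {
  n_rows * n_cols * bytes_per_value / 1024^3
}

# Hypothetical table: 10 million rows by 20 numeric columns.
est <- estimate_gb(1e7, 20)
print(est)

# Sanity-check the 8-bytes-per-value rule against a real vector.
x <- numeric(1e6)
measured_bytes <- as.numeric(object.size(x))
print(measured_bytes)
```

By this estimate the hypothetical table needs roughly 1.5GB, which leaves plenty of headroom in 16GB even allowing for the temporary copies R makes during analysis.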

One of the first steps many developers take when their program needs more RAM is to run it on a bigger machine. You can run R on a server; a common 4U Intel server can hold up to 2TB of RAM. Of course, hogging an entire 2TB server for one personal R instance might be a bit wasteful. So people run large cloud instances for as long as they need them, run VMs on their server hardware, or run the likes of RStudio Server on their server hardware.

RStudio Server comes in Free and Pro editions. Both have the same features for individual analysts, but the Pro version offers more in the way of scale: authorization and security, management visibility, performance tuning, support, and a commercial license. According to Roger Oberg of RStudio, the company's intent is not to create paid-only features for individuals.

RStudio Server Pro is integrated with several big data systems. For example, when I was reviewing the IBM Bluemix PaaS, I noticed that R and RStudio are part of IBM's dashDB service (Figure 6). In fact, this is an installation of RStudio Server Pro on Bluemix and SoftLayer, according to Oberg and Tareef Kawaf of RStudio.

There is an additional strategy for running R against big data: Bring down only the data that you need to analyze. In the spirit of MapReduce, Hadoop, Spark, and Storm, you want to winnow the data as you stream it to make in-memory analysis tractable on the reduced data set. To use Kawaf's example, you may have 100TB of data but need "only" 5 columns and 20 million rows, a mere few hundred megabytes of reduced data.
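On a small scale, the same winnowing idea can be sketched in base R. The example below is only an illustration under stated assumptions: the data, column names, and file are invented stand-ins for a much larger source, and in practice the filtering would more likely happen in the database or cluster before the data reaches R. It relies on the colClasses argument of read.csv, where the class "NULL" tells R to skip a column while reading.

```r
# Build a small stand-in for a wide source file (hypothetical data).
set.seed(1)
wide <- data.frame(
  id     = 1:1000,
  region = sample(c("east", "west"), 1000, replace = TRUE),
  sales  = runif(1000),
  notes  = "unused text",
  extra  = rnorm(1000),
  stringsAsFactors = FALSE
)
path <- tempfile(fileext = ".csv")
write.csv(wide, path, row.names = FALSE)

# One class per column, in file order; "NULL" skips that column entirely.
keep <- c("integer", "character", "numeric", "NULL", "NULL")
reduced <- read.csv(path, colClasses = keep)

# Keep only the rows the analysis needs.
reduced <- reduced[reduced$region == "east", ]

print(names(reduced))
print(nrow(reduced))
```

As a rough check on Kawaf's figure: at 8 bytes per numeric value, 5 columns by 20 million rows comes to about 800MB, and about half that if the columns are 4-byte integers.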

 
