Pencil Banner

# Learn to crunch big data with R

| Feb. 12, 2015
Get started using the open source R programming language to do statistical computing and graphics on large data sets.

A few years ago I was the CTO and co-founder of a startup in the medical practice management software space. One of the problems we were trying to solve was how medical office visit schedules can optimize everyone's time. Too often, office visits are scheduled to optimize the physician's time, and patients have to wait way too long in overcrowded waiting rooms in the company of people coughing contagious diseases out their lungs.

One of my co-founders, a hospital medical director, had a multivariate linear model that could predict the required length for an office visit based on the reason for the visit, whether the patient needs a translator, the average historical visit lengths of both doctor and patient, and other possibly relevant factors. One of the subsystems I needed to build was a monthly regression task to update all of the coefficients in the model based on historical data.

After exploring many options, I chose to implement this piece in R, taking advantage of the wide variety of statistical (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering) and graphical techniques implemented in the R system.

One of the attractions for me was the R scripting language, which makes it easy to save and rerun analyses on updated data sets; another attraction was the ability to integrate R and C++. A key benefit for this project was the fact that R, unlike Excel and other GUI analysis programs, is completely auditable.

Alas, that startup ran out of money not long after I implemented a proof-of-concept Web application, at least partially because our first hospital customer had to declare Chapter 7 bankruptcy. Nevertheless, I continue to favor R for statistical analysis and data science.

Essential R scripting

Sharon Machlis of Computerworld wrote an excellent set of beginner tutorials on R for business intelligence in 2013. It would be silly for me to reinvent those six articles here, so feel free to go read them and come back. The TL;DR version is as follows.

Start by installing R and RStudio on your desktop. Both are free. RStudio is optional, but I like it, and you probably will, too. There are a half-dozen other R IDEs and a dozen editors with some R support, but don't go crazy trying them all.

Try running R from a command shell (Figure 1), the R Console (Figure 2), and RStudio (Figure 3). Familiarize yourself with some of the R tutorials and demos.

The power of R is illustrated by the deceptively simple calls in Figure 3 to do statistical analysis. For example,