Learn to crunch big data with R

Martin Heller | Feb. 12, 2015
Get started using the open source R programming language to do statistical computing and graphics on large data sets.

fm1 <- lm(y ~ x, data=dummy, weight=1/w^2)
summary(fm1)

This says "find the best fit coefficients, fitted values, and residuals for a linear model where y varies with x for the supplied data and weight vectors. Save them in object fm1 and then summarize the results." Earlier in this session we had defined the following:

w <- 1 + sqrt(x) / 2
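For readers who want to run the model on their own machine, here is a minimal self-contained sketch. Only the definition of w and the lm() call come from the text above; the construction of x and dummy, and the random seed, are assumptions made purely for illustration.

```r
# Assumed setup: x, dummy and the seed are illustrative stand-ins,
# not the values used in the original session.
set.seed(42)                         # make the random noise reproducible
x <- 1:20
w <- 1 + sqrt(x) / 2                 # as defined in the article
dummy <- data.frame(x = x, y = x + rnorm(length(x)) * w)

# Weighted least squares: observations with a larger w count for less
fm1 <- lm(y ~ x, data = dummy, weights = 1 / w^2)
summary(fm1)                         # coefficients, residuals, fit statistics
coef(fm1)                            # just the intercept and slope
```

The sketch spells the argument out as weights; the shorter weight in the article works only because R allows partial matching of argument names.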

Reading this code is straightforward. Writing it takes a little study, but it isn't hard, and there's lots of free help available, not to mention dozens of books.

In addition to the R help available on the Web and from the Help menu items in the R Console and RStudio, you can get help from the R command line. For example:

?functionName
help(functionName)
example(functionName)
args(functionName)
help.search("your search term")
??("my search term")

To get data into R, either use its sample data, listed by the data() function, or load it from a file:

mydata <- read.csv("filename.txt")
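Both routes can be sketched in a few lines. The example below assumes the built-in mtcars data set from the datasets package, and uses a temporary file as a stand-in for a real file path.

```r
# List the sample data sets that ship with the attached packages
available <- data()
head(available$results[, c("Package", "Item")])

# Load one of them; mtcars comes with the datasets package
data(mtcars)
str(mtcars)

# Round trip through a file (tempfile() stands in for a real path)
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE)
mydata <- read.csv(csv_path)
dim(mydata)                          # rows and columns read back
```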

R is extremely extensible. The library() and require() functions load and attach add-on packages; require() is designed for use inside other functions. Many add-on packages and the R distributions live in CRAN, the worldwide Comprehensive R Archive Network. The other two common R archives are Omegahat and Bioconductor. Additional packages live in R-Forge.
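The practical difference between the two loaders is how they fail: library() stops with an error, while require() returns FALSE with a warning. A minimal sketch follows, in which has_package() is a hypothetical helper and "notARealPackage" is a made-up name chosen to trigger the failure path.

```r
# library() stops with an error if the package cannot be found
library(stats)

# require() returns TRUE or FALSE instead of stopping, so a function
# can decide what to do when a package is unavailable.
# has_package() is a hypothetical helper, not part of base R.
has_package <- function(pkg) {
  isTRUE(suppressWarnings(
    require(pkg, character.only = TRUE, quietly = TRUE)
  ))
}

has_package("stats")                 # a base package, always installed
has_package("notARealPackage")       # made-up name, so this fails quietly
```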

The R installation copies the base packages and the recommended packages from CRAN into a local library directory, which on a Mac is currently at /Library/Frameworks/R.framework/Versions/3.1/Resources/library/. Running the R library() command without any arguments will list the local packages and the library location. RStudio will also generate the correct library() command to load an installed package when you check its check box in the Packages tab. The command help(package = packageName) will display the functions in the specified package.
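The same information is available programmatically. The sketch below uses the base package stats as its example; the paths and package names it reports will differ from one installation to another.

```r
# Library directories R searches, in order
.libPaths()

# Names of the packages installed in those directories
pkgs <- rownames(installed.packages())
"stats" %in% pkgs

# Objects exported by an attached package (stats is attached by default)
head(ls("package:stats"))
```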

There are R packages and functions to load data from any reasonable source, not only CSV files. Beyond the obvious case of delimiters other than commas, which are handled using the read.table() function, you can copy and paste data tables, read Excel files, connect Excel to R, bring in SAS and SPSS data, and access databases, Salesforce, and RESTful interfaces. See, for example, the foreign package.
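As a rough illustration of the delimiter case, read.table() takes a sep argument. The inline strings below are invented sample data standing in for files; in practice you would pass a file path or URL as the first argument instead of text.

```r
# Tab-delimited input (inline text stands in for a file)
tab_text <- "name\tscore\nAda\t91\nGrace\t88"
scores <- read.table(text = tab_text, sep = "\t", header = TRUE,
                     stringsAsFactors = FALSE)

# Semicolon-delimited input with the same content
semi_text <- "name;score\nAda;91\nGrace;88"
scores2 <- read.table(text = semi_text, sep = ";", header = TRUE,
                      stringsAsFactors = FALSE)

str(scores)                          # a character and an integer column
```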

You don't really need to learn the syntax for standard data imports, as the RStudio Tools|Import Dataset menu item will help you generate the correct commands interactively by looking at the data from a text file or URL and setting the correct conversion options in drop-down lists based on what you see.

You can see a list of the currently available packages by name on CRAN; this list is much more extensive than the list of recommended packages downloaded to your desktop by default. To install a package from one of the default archives, call the install.packages() function with the package name as a quoted string.
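A minimal sketch of such a call is shown here, wrapped in a hypothetical helper, install_if_missing(), so that nothing is downloaded when the package is already present. The base package stats is used only as a stand-in name; substitute the package you actually want.

```r
# install_if_missing() is a hypothetical helper, not part of base R
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)            # downloads from the configured CRAN mirror
  }
  # Report whether the package is now available
  invisible(requireNamespace(pkg, quietly = TRUE))
}

install_if_missing("stats")          # already installed, so nothing is fetched
library(stats)                       # installing does not attach; library() does
```

In a non-interactive session, install.packages() may need an explicit repos argument if no CRAN mirror has been configured.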

