
# Beginner's guide to R: Easy ways to do basic data analysis

June 7, 2013

summary(mydata)

That returns some basic calculations for each column. If the column has numbers, you'll see the minimum and maximum values along with median, mean, 1st quartile and 3rd quartile. If it's got factors such as fair, good, very good and excellent, you'll get the number of each factor listed in the column.

The summary() function also returns stats for a 1-dimensional vector.
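As a rough illustration of both cases, the sketch below builds a small made-up data frame and calls summary() on the whole frame and then on a single column. The name mydata comes from the text above, but its columns and values are invented here purely for the example.

```r
# Hypothetical data: column names and values are invented for illustration
mydata <- data.frame(
  price   = c(12.5, 8.0, 15.25, 9.75, 11.0),
  quality = factor(c("good", "fair", "excellent", "good", "very good"))
)

# Data frame: numeric columns get min, quartiles, median, mean and max;
# factor columns get a count for each level
print(summary(mydata))

# 1-dimensional vector: the same six statistics for one set of numbers
price_summary <- summary(mydata$price)
print(price_summary)
```

The result for a vector is a named object, so individual statistics can be pulled out by name, for example price_summary[["Median"]].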

If you'd like even more statistical summaries from a single command, install and load the psych package. Install it with this command:

install.packages("psych")

You need to run this install only once on a system. Then load it with:

library(psych)

You need to run the library command each time you start a new R session if you want to use the psych package.

Now try the command:

describe(mydata)

and you'll get several more statistics from the data, including standard deviation, "mad" (median absolute deviation), skew (a measure of whether the data distribution is symmetrical) and kurtosis (whether the distribution has a sharper or flatter peak near its mean).
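If you want to see what two of those statistics look like without installing anything, base R has sd() and mad() built in. This is only a sketch with an invented vector, and it does not attempt to reproduce the skew and kurtosis figures that describe() reports.

```r
# Hypothetical vector with one extreme value, invented for illustration
x <- c(1, 2, 3, 4, 100)

# Standard deviation: strongly affected by the outlier
x_sd <- sd(x)

# Median absolute deviation: by default, base R's mad() scales the raw
# median of absolute deviations by a constant of 1.4826
x_mad <- mad(x)

print(c(sd = x_sd, mad = x_mad))
```

Comparing the two numbers shows why a median-based measure is handy when a column contains a few extreme values.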

R has the statistical functions you'd expect, including mean(), median(), min(), max(), sd() [standard deviation], var() [variance] and range(), which you can run on a 1-dimensional vector of numbers. (Several of these functions, such as mean() and median(), will not work on a 2-dimensional data frame.)
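Here is a short sketch of those functions on a vector, followed by one common way to get per-column results from a data frame using sapply(). The vector, the data frame and their values are all invented for the example.

```r
# Hypothetical vector of numbers, invented for illustration
myvector <- c(4, 8, 15, 16, 23, 42)

print(mean(myvector))    # arithmetic mean
print(median(myvector))  # middle value
print(sd(myvector))      # standard deviation
print(var(myvector))     # variance
print(range(myvector))   # smallest and largest value, as a 2-element vector

# For a data frame, apply the function to each column instead
mydata <- data.frame(a = c(1, 2, 3, 4), b = c(10, 20, 30, 60))
column_means   <- sapply(mydata, mean)
column_medians <- sapply(mydata, median)
print(column_means)
print(column_medians)
```

sapply() returns a named vector here, with one entry per column, so column_means[["b"]] gives the mean of column b.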

Oddly, the mode() function returns information about data type instead of the statistical mode; there's an add-on package, modeest, that adds an mfv() function (most frequent value) to find the statistical mode.
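To see the difference, and to find a most frequent value without an add-on package, you could try something like the sketch below. The helper stat_mode() is a hypothetical function written for this example; it is not part of base R or of modeest, and it simply counts values with table().

```r
# Hypothetical vector, invented for illustration
x <- c(3, 7, 7, 2, 7, 3)

# Base R's mode() reports how the data is stored, not the most frequent value
print(mode(x))

# Hypothetical helper (not part of base R): most frequent value(s) via table()
stat_mode <- function(v) {
  counts  <- table(v)
  winners <- names(counts)[counts == max(counts)]
  # table() keeps values as character names, so convert back for numeric input
  if (is.numeric(v)) as.numeric(winners) else winners
}

print(stat_mode(x))
```

If several values tie for most frequent, this helper returns all of them rather than picking one.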

R also contains a load of more sophisticated functions that let you do analyses with one or two commands: probability distributions, correlations, significance tests, regressions, ANOVA (analysis of variance between groups) and more.

As just one example, running the correlation function cor() on a data frame such as:

cor(mydata)

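will give you a matrix of correlations comparing each column of numerical data with every other column of numerical data. Note that cor() expects numeric input, so a data frame that also contains factor or character columns will produce an error unless you select just the numeric columns first.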
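A minimal sketch of that, assuming a made-up data frame with two numeric columns and one factor column (all names and values are invented here), might look like this:

```r
# Hypothetical data: column names and values are invented for illustration
mydata <- data.frame(
  height = c(150, 160, 170, 180, 190),
  weight = c(52, 58, 69, 77, 88),
  group  = factor(c("a", "b", "a", "b", "a"))
)

# Keep only the numeric columns, since cor() expects numeric input
numeric_cols <- mydata[sapply(mydata, is.numeric)]

# One row and one column per numeric variable; the diagonal is always 1
cor_matrix <- cor(numeric_cols)
print(cor_matrix)
```

Each cell holds the correlation between the row variable and the column variable, so cor_matrix["height", "weight"] is the height-to-weight correlation.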

Note: Be aware that you can run into problems when trying to run some functions on data where there are missing values. In some cases, R's default is to return NA even if just a single value is missing. For example, while the summary() function returns column statistics excluding missing values (and also tells you how many NAs are in the data), the mean() function will return NA if even a single value is missing from a vector.

In most cases, adding the argument:

na.rm=TRUE

to NA-sensitive functions will tell that function to remove any NAs when performing calculations, such as:

mean(myvector, na.rm=TRUE)
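To make the difference concrete, here is a small sketch using an invented vector with one missing value. The name myvector matches the example above, but the values are made up.

```r
# Hypothetical vector with one missing value, invented for illustration
myvector <- c(10, 20, NA, 40)

# Without na.rm, a single NA makes the whole result NA
print(mean(myvector))

# With na.rm = TRUE, the NA is dropped before the mean is calculated
print(mean(myvector, na.rm = TRUE))

# summary() skips NAs on its own and reports how many it found
print(summary(myvector))

# Counting the missing values directly
na_count <- sum(is.na(myvector))
print(na_count)
```

Checking the NA count first is a quick way to judge how much data a na.rm = TRUE calculation is silently leaving out.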

If you've got data with some missing values, read a function's help file by typing a question mark followed by the name of the function, such as: