# Beginner's guide to R: Easy ways to do basic data analysis

June 7, 2013

What if you want to select your data by a characteristic of the data, such as "all cars with mpg > 20", rather than by column or row location? If you use the column name notation and add a condition like:

mtcars\$mpg>20

you don't end up with a list of all rows where mpg is greater than 20. Instead, you get a vector showing whether each row meets the condition, such as:

[1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE

[10] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE

[19] TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE

[28] TRUE FALSE FALSE FALSE TRUE
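
That logical vector is useful on its own. As a quick sketch (the variable name is_efficient is just an illustration, not something from the mtcars data set), you can count or locate the matching rows before you select anything:

```r
# Store the TRUE/FALSE result of the test; one value per row of mtcars
is_efficient <- mtcars$mpg > 20

length(is_efficient)  # 32, the same as nrow(mtcars)
sum(is_efficient)     # 14, because each TRUE counts as 1
which(is_efficient)   # the row positions where the test is TRUE
```

Because the vector has exactly one entry per row, R can line it up against the rows of the data frame, which is what the bracket notation relies on.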

To turn that into a listing of the data you want, use that logical test condition and row-comma-column bracket notation. Remember that this time you want to select rows by condition, not columns. This:

mtcars[mtcars\$mpg>20,]

tells R to get all rows from mtcars where mpg > 20, and then to return all the columns.
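
The comma matters. Here is a small sketch of what happens with and without it; the names fast and no_comma are made up for this example, and tryCatch() is used only so the failed attempt doesn't stop the script:

```r
# With the comma: the condition selects rows, and all columns come back
fast <- mtcars[mtcars$mpg > 20, ]
nrow(fast)  # 14 rows pass the test
ncol(fast)  # 11, every column of mtcars

# Without the comma, R treats the condition as a column selector.
# A 32-item logical vector doesn't fit 11 columns, so this fails.
no_comma <- tryCatch(mtcars[mtcars$mpg > 20], error = function(e) e)
inherits(no_comma, "error")  # TRUE
```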

If you don't want to see all the column data for the selected rows but are just interested in displaying, say, mpg and horsepower for cars with an mpg greater than 20, you could use the notation:

mtcars[mtcars\$mpg>20,c(1,4)]

using column locations, or:

mtcars[mtcars\$mpg>20,c("mpg","hp")]

using the column names.
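
If you want to convince yourself that the two forms return the same thing, a quick check along these lines should do it (by_position and by_name are illustrative names):

```r
# Select the same two columns, once by location and once by name
by_position <- mtcars[mtcars$mpg > 20, c(1, 4)]
by_name     <- mtcars[mtcars$mpg > 20, c("mpg", "hp")]

names(mtcars)[c(1, 4)]           # "mpg" "hp"
identical(by_position, by_name)  # TRUE
```

Names tend to be the safer choice, since a selection by location silently picks different columns if the column order ever changes.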

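Why do you need to specify mtcars\$mpg in the row spot but "mpg" in the column spot? It can look like just another R syntax quirk, but the two spots are being given different things: the row spot here takes a logical vector with one TRUE or FALSE per row, which R computes from the actual values in mtcars\$mpg, while the column spot only needs the names (or positions) of the columns to return.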

If you're finding that your selection statement is starting to get unwieldy, you can put your row and column selections into variables first, such as:

mpg20 <- mtcars\$mpg > 20

cols <- c("mpg", "hp")

Then you can select the rows and columns with those variables:

mtcars[mpg20, cols]

making for a more compact select statement but more lines of code.
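
One payoff of storing the selections is that you can reuse them. A minimal sketch, assuming you also want the cars that did not pass the test (selected and the_rest are names invented for this example):

```r
# Row and column selections saved as variables
mpg20 <- mtcars$mpg > 20
cols  <- c("mpg", "hp")

selected <- mtcars[mpg20, cols]   # cars with mpg above 20
the_rest <- mtcars[!mpg20, cols]  # ! flips TRUE and FALSE, giving everything else

dim(selected)   # 14 rows, 2 columns
nrow(the_rest)  # 18
```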

Getting tired of including the name of the data set multiple times per command? If you're using only one data set and you are not making any changes to the data that need to be saved, you can attach and detach a copy of the data set temporarily.

The attach() function works like this:

attach(mtcars)

So, instead of having to type:

mpg20 <- mtcars\$mpg > 20

you can leave out the data set reference and type this instead:

mpg20 <- mpg > 20

After using attach(), remember to call the detach() function when you're finished:

detach()

Some R users advise avoiding attach() because it can be easy to forget to detach(). If you don't detach() the copy, your variables could end up referencing the wrong data set.
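
Here is a small sketch of how that mix-up can happen. It assumes a working copy called my_cars (a made-up name) and that you don't already have your own variable named mpg in your workspace:

```r
my_cars <- mtcars   # hypothetical working copy of the data
attach(my_cars)     # attach() makes its own copy of my_cars

# Change the data frame AFTER attaching it
my_cars$mpg <- my_cars$mpg * 2

attached_mean <- mean(mpg)          # still the old values from the attached copy
current_mean  <- mean(my_cars$mpg)  # the doubled values

detach(my_cars)     # naming the data set makes it clear what is being detached

attached_mean == current_mean  # FALSE: the attached copy is out of date
```

The bare name mpg keeps pointing at the attached copy, not at the data frame you just changed, which is one reason some users prefer to type the data set name each time.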

## Alternative to bracket notation

Bracket syntax is pretty common in R code, but it's not your only option. If you dislike that format, you might prefer the subset() function instead, which works with vectors and matrices as well as data frames. The format is: