# install.packages("tidyverse")
# install.packages("here")
library(tidyverse)
library(here)
<- readRDS(file = here::here("raw_data", "athletes.rds")) athletes
Subsetting
Subsetting means extracting smaller sets of data from a bigger data set. For example, we can extract specific rows from a data frame, or specific values from a vector. Let’s take a look at how that is done in R:
Data frames
In R
we use square brackets [,]
to extract specific rows and columns.
Rows
In front of the ,
we write the rows we want to extract:
# Extract the first and the fourth row
c(1, 4), ] athletes[
NOC ID Name Sex Age Height Weight Team Games
1 AFG 132181 Najam Yahya M NA NA NA Afghanistan 1956 Summer
4 AFG 502 Ahmad Shah Abouwi M NA NA NA Afghanistan 1956 Summer
Year Season City Sport Event Medal Region
...
Columns
Behind it the columns:
# Extract the second and the fourth column:
c(2, 4)] athletes[,
ID Sex
1 132181 M
2 87371 M
3 44977 M
...
# Extract the columns by name:
c("Year", "Sport")] athletes[,
Year Sport
1 1956 Hockey
2 1948 Hockey
3 1980 Wrestling
...
# Or only the column Year (and turn it into a vector right away):
$Year athletes
[1] 1956 1948 1980 1956 1964 1960 1936 1956 1972 1956 1960 1948 1980 1948
[15] 1960 1936 1960 1968 1948 1972 1956 1980 1956 2016 1968 1948 1980 1936
[29] 1988 1948 1956 1988 1956 1972 1960 1980 1972 2004 1980 1960 1972 1980
[43] 1956 1964 1948 2008 1996 1980 1968 1960 1972 1972 1948 1936 2004 1936
...
Always use column names instead of position if possible. This way, your code will still work if the column position changes.
Rows & Columns
And of course we can combine both calls:
c(1, 4), c(2, 4)] athletes[
ID Sex
1 132181 M
4 502 M
c(1, 4), c("Year", "Sport")] athletes[
Year Sport
1 1956 Hockey
4 1956 Hockey
We can also use Boolean values (every row/column must get a value here, so we extract the first 100 rows by repeating TRUE
100 times, and than add FALSE
for the remaining rows):
str(athletes[c(rep(TRUE, 100), rep(FALSE, 271016)), ])
'data.frame': 100 obs. of 16 variables:
$ NOC : chr "AFG" "AFG" "AFG" "AFG" ...
$ ID : int 132181 87371 44977 502 109153 29626 1076 121376 80210 87374 ...
$ Name : chr "Najam Yahya" "Ahmad Jahan Nuristani" "Mohammad Halilula" "Ahmad Shah Abouwi" ...
$ Sex : chr "M" "M" "M" "M" ...
$ Age : int NA NA 28 NA 24 28 28 NA NA NA ...
$ Height: int NA NA 163 NA NA 168 NA NA NA NA ...
$ Weight: num NA NA 57 NA 74 73 NA NA 57 NA ...
$ Team : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
$ Games : chr "1956 Summer" "1948 Summer" "1980 Summer" "1956 Summer" ...
$ Year : int 1956 1948 1980 1956 1964 1960 1936 1956 1972 1956 ...
$ Season: chr "Summer" "Summer" "Summer" "Summer" ...
$ City : chr "Melbourne" "London" "Moskva" "Melbourne" ...
$ Sport : chr "Hockey" "Hockey" "Wrestling" "Hockey" ...
$ Event : chr "Hockey Men's Hockey" "Hockey Men's Hockey" "Wrestling Men's Bantamweight, Freestyle" "Hockey Men's Hockey" ...
$ Medal : chr NA NA NA NA ...
$ Region: chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
Instead of writing 271016
we should actually use the current row number, in case that changes as well:
str(athletes[c(rep(TRUE, 100), rep(FALSE, nrow(athletes) - 100)), ])
'data.frame': 100 obs. of 16 variables:
$ NOC : chr "AFG" "AFG" "AFG" "AFG" ...
$ ID : int 132181 87371 44977 502 109153 29626 1076 121376 80210 87374 ...
$ Name : chr "Najam Yahya" "Ahmad Jahan Nuristani" "Mohammad Halilula" "Ahmad Shah Abouwi" ...
$ Sex : chr "M" "M" "M" "M" ...
$ Age : int NA NA 28 NA 24 28 28 NA NA NA ...
$ Height: int NA NA 163 NA NA 168 NA NA NA NA ...
$ Weight: num NA NA 57 NA 74 73 NA NA 57 NA ...
$ Team : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
$ Games : chr "1956 Summer" "1948 Summer" "1980 Summer" "1956 Summer" ...
$ Year : int 1956 1948 1980 1956 1964 1960 1936 1956 1972 1956 ...
$ Season: chr "Summer" "Summer" "Summer" "Summer" ...
$ City : chr "Melbourne" "London" "Moskva" "Melbourne" ...
$ Sport : chr "Hockey" "Hockey" "Wrestling" "Hockey" ...
$ Event : chr "Hockey Men's Hockey" "Hockey Men's Hockey" "Wrestling Men's Bantamweight, Freestyle" "Hockey Men's Hockey" ...
$ Medal : chr NA NA NA NA ...
$ Region: chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
Conditional filtering
Now the stuff we looked at in logical operators comes in handy! We can filter rows which match some condition. For example, we might want to look at all athletes from Germany:
$Team == "Germany", ] athletes[athletes
NOC ID
107246 GER 7385
107247 GER 114424
107248 GER 112937
107249 GER 107870
107250 GER 9399
107252 GER 9398
107253 GER 47318
107254 GER 96348
107255 GER 127340
...
Take a close look at the comparison before the ,
:
$Team == "Germany" athletes
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
...
athletes$Team
is a vector, so comparing its values to a specified value yields a logical vector with the respective TRUE
and FALSE
values. We can insert this logical vector in front of the ,
to extract all rows corresponding to that condition.
If we want to extract multiple nationalities at once, we need the %in%
operator:
$Team %in% c("Kenya", "Norway"), ] athletes[athletes
NOC ID Name
156375 KEN 60617 Nixon Kiprotich
156376 KEN 85669 Benjamin Ngaruiya
156377 KEN 60620 Wilson Arap Chuma Kiprugut
156378 KEN 87843 Maisiba Obwoge
156379 KEN 20521 Charles Cheruiyot
156380 KEN 20524 Rose Jelagat Cheruiyot (-Kirui)
156381 KEN 60610 Asbel Kipruto Kiprop
156382 KEN 90229 Janet Owino Awour
156383 KEN 89053 Ahmed Rajab Omari
...
By the way, if we want to save our extracted data frame, we can assign it a new name (otherwise it will only get printed into the console, but we can’t go on working with it):
<- athletes[athletes$Team %in% c("Kenya", "Norway"), ]
athletes_team head(athletes_team)
NOC ID Name Sex Age Height Weight Team
156375 KEN 60617 Nixon Kiprotich M 29 185 68 Kenya
156376 KEN 85669 Benjamin Ngaruiya M 24 NA NA Kenya
156377 KEN 60620 Wilson Arap Chuma Kiprugut M NA 178 71 Kenya
156378 KEN 87843 Maisiba Obwoge M 28 175 99 Kenya
156379 KEN 20521 Charles Cheruiyot M 19 165 54 Kenya
156380 KEN 20524 Rose Jelagat Cheruiyot (-Kirui) F 24 154 48 Kenya
Games Year Season City Sport
156375 1992 Summer 1992 Summer Barcelona Athletics
156376 1992 Summer 1992 Summer Barcelona Boxing
...
We can also combine multiple logical vectors using &
(“and”) and |
(“or”). For example, we might want to look at all german athletes before the year 2000:
<- athletes[athletes$Team == "Germany" & athletes$Year < 2000, ]
athletes_2 head(athletes_2)
NOC ID Name Sex Age Height Weight Team
107246 GER 7385 Dirk Peter Balster M 26 195 90 Germany
107247 GER 114424 Kathleen Stark (-Kern) F 16 166 51 Germany
107250 GER 9399 Petra Behle-Schaaf F 23 177 67 Germany
107252 GER 9398 Jochen Friedrich Behle M 37 183 73 Germany
107253 GER 47318 Martin Heinze M 21 172 73 Germany
107254 GER 96348 Ramona Portwich F 25 175 70 Germany
Games Year Season City Sport
107246 1992 Summer 1992 Summer Barcelona Rowing
107247 1992 Summer 1992 Summer Barcelona Gymnastics
107250 1992 Winter 1992 Winter Albertville Biathlon
107252 1998 Winter 1998 Winter Nagano Cross Country Skiing
107253 1960 Summer 1960 Summer Roma Wrestling
107254 1992 Summer 1992 Summer Barcelona Canoeing
Event Medal Region
107246 Rowing Men's Coxless Fours <NA> Germany
107247 Gymnastics Women's Uneven Bars <NA> Germany
107250 Biathlon Women's 7.5 kilometres Sprint <NA> Germany
107252 Cross Country Skiing Men's 10 kilometres <NA> Germany
107253 Wrestling Men's Welterweight, Freestyle <NA> Germany
107254 Canoeing Women's Kayak Doubles, 500 metres Gold Germany
Or at all judo athletes weighting over 100 or under 50 kg:
<- athletes[(athletes$Sport == "Judo") & (athletes$Weight > 100 | athletes$Weight < 50), ]
athletes_3 head(athletes_3)
NOC ID Name Sex Age Height Weight Team Games
NA <NA> NA <NA> <NA> NA NA NA <NA> <NA>
NA.1 <NA> NA <NA> <NA> NA NA NA <NA> <NA>
NA.2 <NA> NA <NA> <NA> NA NA NA <NA> <NA>
NA.3 <NA> NA <NA> <NA> NA NA NA <NA> <NA>
471 ALG 13895 Mohamed Bouaichaoui M 25 178 120 Algeria 2004 Summer
NA.4 <NA> NA <NA> <NA> NA NA NA <NA> <NA>
Year Season City Sport Event Medal Region
NA NA <NA> <NA> <NA> <NA> <NA> <NA>
NA.1 NA <NA> <NA> <NA> <NA> <NA> <NA>
NA.2 NA <NA> <NA> <NA> <NA> <NA> <NA>
NA.3 NA <NA> <NA> <NA> <NA> <NA> <NA>
471 2004 Summer Athina Judo Judo Men's Heavyweight <NA> Algeria
NA.4 NA <NA> <NA> <NA> <NA> <NA> <NA>
Hmm, that looks a bit weird. Some rows only contain NA
values. That’s because there are missing values in the Weight
column. We will look at that closer in the missings chapter and ignore that problem for now.
In the long run, always having to specify the name of the data frame for each column or row with condition can become a bit annoying and clutters the code. Also this code leaves all rows with missing values…
Instead, we can use the filter()
function from the tidyverse:
Rows: Tidyverse
library(tidyverse)
%>%
athletes filter(Sport == "Judo", (Weight > 100 | Weight < 50))
NOC ID Name Sex Age Height Weight
1 ALG 13895 Mohamed Bouaichaoui M 25 178 120.0
2 ALG 82643 Meriem Moussa F 20 150 48.0
3 ALG 80035 Boualem Miloudi M 23 192 106.0
...
Note how we can just write our conditions without connecting them with &
(filter()
does that automatically for us). Also, we don’t have to put the column names into ""
, because filter()
knows that this are column names of the athletes
data frame, which makes coding a bit more pleasant. And finally, missing rows are automatically removed, which makes sense in many cases!
Columns: Tidyverse
For extracting columns, we need select()
:
%>%
athletes select(Year, Sport)
Year Sport
1 1956 Hockey
2 1948 Hockey
3 1980 Wrestling
...
Vectors
Let’s take a quick look at how to extract elements from a vector, which shouldn’t be a problem after already dealing with data frames. It’s pretty straight forward: we just put the position of the element we want to extract behind the vector in square brackets (without a ,
, as we only have a one dimensional object). Let’s quickly define a vector for illustration:
<- athletes$Sport # remember: `$` returns a vector vec_sport
And look at the second element:
2] vec_sport[
[1] "Hockey"
Of course we can also do that for multiple elements:
c(2, 3, 4)] vec_sport[
[1] "Hockey" "Wrestling" "Hockey"
## Or, less to write:
2:4] vec_sport[
[1] "Hockey" "Wrestling" "Hockey"
Another way would be to provide a logical vector, which defines for each position if we want to extract the element or not (like we already did for data frames):
c(rep(TRUE, 100), rep(FALSE, 65))] vec_sport[
[1] "Hockey" "Hockey"
[3] "Wrestling" "Hockey"
[5] "Wrestling" "Wrestling"
[7] "Hockey" "Hockey"
...
Lists
When subsetting lists we have two options:
# Define an example list:
<- list(
show_list "TV-Show" = c("Friends", "How I Met Your Mother"),
"dat" = data.frame(
"name" = c("Monica", "Ted"),
"age" = c(24, 27)
) )
- We can extract a list element. This is done by single square brackets:
str(show_list[2])
List of 1
$ dat:'data.frame': 2 obs. of 2 variables:
..$ name: chr [1:2] "Monica" "Ted"
..$ age : num [1:2] 24 27
Note how the result is still a list
? It’s like taking out a drawer from a closet, but keeping the content inside this drawer.
- We can extract the element that is stored inside the list element. This is done by double square brackets:
str(show_list[[2]])
'data.frame': 2 obs. of 2 variables:
$ name: chr "Monica" "Ted"
$ age : num 24 27
Here the result is the data frame
that was saved inside the list. It’s like taking the content out of the drawer.