# install.packages("tidyverse")
# install.packages("here")
library(tidyverse)
library(here)
athletes <- readRDS(file = here::here("raw_data", "athletes.rds"))
Getting an overview
Before you start working with your data, it is always a good idea to get an overview. Our goal is to answer questions along the lines of:
- Which variables does our data have?
- How many rows/columns does our data frame have? If we have a list, how long is it, what is saved within?
- What types do our variables have (are they numeric, character …)? Do we have to transform them before we can work with them?
- Do we have any missing values?
To answer these questions, we have different tools at our disposal:
View()
View() will open the data set Excel-style in a new window:
View(athletes)
In this window we can sort and filter, which makes it a pretty useful tool.
head()
head() helps you get an overview of the data frame by printing its first six rows to the console:
head(athletes)
  NOC     ID                  Name Sex Age Height Weight        Team
1 AFG 132181           Najam Yahya   M  NA     NA     NA Afghanistan
2 AFG  87371 Ahmad Jahan Nuristani   M  NA     NA     NA Afghanistan
3 AFG  44977     Mohammad Halilula   M  28    163     57 Afghanistan
4 AFG    502     Ahmad Shah Abouwi   M  NA     NA     NA Afghanistan
5 AFG 109153    Shakar Khan Shakar   M  24     NA     74 Afghanistan
6 AFG  29626  Sultan Mohammad Dost   M  28    168     73 Afghanistan
        Games Year Season      City     Sport
1 1956 Summer 1956 Summer Melbourne    Hockey
2 1948 Summer 1948 Summer    London    Hockey
3 1980 Summer 1980 Summer    Moskva Wrestling
4 1956 Summer 1956 Summer Melbourne    Hockey
5 1964 Summer 1964 Summer     Tokyo Wrestling
6 1960 Summer 1960 Summer      Roma Wrestling
                                    Event Medal      Region
1                     Hockey Men's Hockey  <NA> Afghanistan
2                     Hockey Men's Hockey  <NA> Afghanistan
3 Wrestling Men's Bantamweight, Freestyle  <NA> Afghanistan
4                     Hockey Men's Hockey  <NA> Afghanistan
5 Wrestling Men's Welterweight, Freestyle  <NA> Afghanistan
6 Wrestling Men's Welterweight, Freestyle  <NA> Afghanistan
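If six rows are not what we want, head() takes an n argument, and tail() does the same for the end of the data frame. A small sketch, assuming a stand-in data frame athletes_small (a hypothetical name, not part of the original data) built from a few columns of the six rows printed above:

```r
# Hypothetical stand-in for `athletes`: the six rows shown above, reduced to a few columns
athletes_small <- data.frame(
  NOC    = rep("AFG", 6),
  ID     = c(132181L, 87371L, 44977L, 502L, 109153L, 29626L),
  Name   = c("Najam Yahya", "Ahmad Jahan Nuristani", "Mohammad Halilula",
             "Ahmad Shah Abouwi", "Shakar Khan Shakar", "Sultan Mohammad Dost"),
  Age    = c(NA, NA, 28L, NA, 24L, 28L),
  Sport  = c("Hockey", "Hockey", "Wrestling", "Hockey", "Wrestling", "Wrestling"),
  stringsAsFactors = FALSE
)

first_three <- head(athletes_small, n = 3)  # first three rows instead of the default six
last_two    <- tail(athletes_small, n = 2)  # last two rows

print(first_three)
print(last_two)
```

On the real data, the same calls would be head(athletes, n = 3) and tail(athletes, n = 2).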
str()
This one is actually my favorite: for bigger data sets it is often more practical to look only at the structure rather than at the whole data set. The output looks a bit different from what we are used to, though:
str(athletes)
'data.frame': 270767 obs. of 16 variables:
 $ NOC   : chr  "AFG" "AFG" "AFG" "AFG" ...
 $ ID    : int  132181 87371 44977 502 109153 29626 1076 121376 80210 87374 ...
 $ Name  : chr  "Najam Yahya" "Ahmad Jahan Nuristani" "Mohammad Halilula" "Ahmad Shah Abouwi" ...
 $ Sex   : chr  "M" "M" "M" "M" ...
 $ Age   : int  NA NA 28 NA 24 28 28 NA NA NA ...
 $ Height: int  NA NA 163 NA NA 168 NA NA NA NA ...
 $ Weight: num  NA NA 57 NA 74 73 NA NA 57 NA ...
 $ Team  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ Games : chr  "1956 Summer" "1948 Summer" "1980 Summer" "1956 Summer" ...
 $ Year  : int  1956 1948 1980 1956 1964 1960 1936 1956 1972 1956 ...
 $ Season: chr  "Summer" "Summer" "Summer" "Summer" ...
 $ City  : chr  "Melbourne" "London" "Moskva" "Melbourne" ...
 $ Sport : chr  "Hockey" "Hockey" "Wrestling" "Hockey" ...
 $ Event : chr  "Hockey Men's Hockey" "Hockey Men's Hockey" "Wrestling Men's Bantamweight, Freestyle" "Hockey Men's Hockey" ...
 $ Medal : chr  NA NA NA NA ...
 $ Region: chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
Here, the column names are printed on the left side, followed by the type of the column and then the first few values of each column. We can also see at the top that this object is a data frame with 270767 rows and 16 columns.
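If we only need one of these pieces of information, base R has a separate function for each. A minimal sketch on a hypothetical three-row stand-in (toy is an assumed name; the values are taken from the head() output):

```r
# Hypothetical stand-in for `athletes` (three rows, three columns)
toy <- data.frame(
  Name   = c("Mohammad Halilula", "Shakar Khan Shakar", "Sultan Mohammad Dost"),
  Age    = c(28L, 24L, 28L),
  Weight = c(57, 74, 73),
  stringsAsFactors = FALSE
)

n_rows    <- nrow(toy)           # number of rows
n_cols    <- ncol(toy)           # number of columns
col_names <- names(toy)          # column names
col_types <- sapply(toy, class)  # class of each column, as a named character vector

print(dim(toy))    # rows and columns in one call
print(col_names)
print(col_types)
```

The same calls work on athletes; str() simply bundles all of this into one printout.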
summary()
Finally, to get a more thorough overview of our variables, we can use summary():
summary(athletes)
       NOC                ID               Name               Sex
 Length:270767      Min.   :     1   Length:270767      Length:270767
 Class :character   1st Qu.: 34630   Class :character   Class :character
 Mode  :character   Median : 68187   Mode  :character   Mode  :character
                    Mean   : 68229
                    3rd Qu.:102066
                    Max.   :135571
      Age           Height           Weight             Team
 Min.   :10.00   Min.   :127.0   Min.   : 25.00   Length:270767
 1st Qu.:21.00   1st Qu.:168.0   1st Qu.: 60.00   Class :character
 Median :24.00   Median :175.0   Median : 70.00   Mode  :character
 Mean   :25.56   Mean   :175.3   Mean   : 70.71
 3rd Qu.:28.00   3rd Qu.:183.0   3rd Qu.: 79.00
 Max.   :97.00   Max.   :226.0   Max.   :214.00
 NA's   :9462    NA's   :60083   NA's   :62785
      Games             Year            Season              City
 Length:270767      Min.   :1896   Length:270767      Length:270767
 Class :character   1st Qu.:1960   Class :character   Class :character
 Mode  :character   Median :1988   Mode  :character   Mode  :character
                    Mean   :1978
                    3rd Qu.:2002
                    Max.   :2016
      Sport              Event              Medal              Region
 Length:270767      Length:270767      Length:270767      Length:270767
 Class :character   Class :character   Class :character   Class :character
 Mode  :character   Mode  :character   Mode  :character   Mode  :character
For numeric columns we get the minimum and maximum, the median and mean, and the first and third quartiles. If a column has missing values (NAs), their number is printed at the bottom (see, e.g., the Age column). We will look at how to deal with missing values soon, but first we have to talk about subsetting data.
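The NA counts and quartiles that summary() reports can also be computed directly, which is handy when we want the numbers as objects rather than as a printout. A rough sketch, assuming a hypothetical stand-in toy that holds the three numeric columns of the six rows shown by head():

```r
# Hypothetical stand-in: Age, Height and Weight of the first six rows of `athletes`
toy <- data.frame(
  Age    = c(NA, NA, 28L, NA, 24L, 28L),
  Height = c(NA, NA, 163L, NA, NA, 168L),
  Weight = c(NA, NA, 57, NA, 74, 73)
)

# is.na() marks every missing cell with TRUE; colSums() counts the TRUEs per column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# na.rm = TRUE drops the missing values first; without it,
# median() returns NA and quantile() stops with an error
median_weight    <- median(toy$Weight, na.rm = TRUE)
weight_quartiles <- quantile(toy$Weight, probs = c(0.25, 0.75), na.rm = TRUE)
print(median_weight)
print(weight_quartiles)
```

On the full data, colSums(is.na(athletes)) would give the missing-value count for every column at once, including the character columns for which summary() does not print one.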