# install.packages("tidyverse")
# install.packages("here")
library(tidyverse)
library(here)
<- readRDS(file = here::here("raw_data", "athletes.rds")) athletes
Merging data
Data set
In the end, we want to plot the number of gold medals the countries have won on a world map. To do that, we need a data set containing coordinates of the different countries. Luckily, ggplot2
(part of the tidyverse
) provides a fitting data set. Let’s download it and load it into R:
<- readRDS(file = here::here("raw_data", "world_coordinates.rds")) world_coordinates
Before merging
Right now, we have multiple rows for each country in both data sets. This will not be merged easily, so we have to reduce our athletes data first. We need to calculate how many gold medals each country has won in total. Let’s do that quickly, using some tidyverse
functions. It is not especially important you understand and know everything that happens here, but we need it for the next chapters, so here it goes:
<- athletes %>%
medal_counts filter(Medal == "Gold") %>%
group_by(Region) %>%
count(Medal)
medal_counts
# A tibble: 99 × 3
# Groups: Region [99]
Region Medal n
<chr> <chr> <int>
1 Algeria Gold 5
2 Argentina Gold 91
3 Armenia Gold 2
4 Australia Gold 368
5 Austria Gold 108
6 Azerbaijan Gold 7
7 Bahamas Gold 14
8 Bahrain Gold 1
9 Belarus Gold 24
10 Belgium Gold 98
# ℹ 89 more rows
What happens here? We extract all rows containing gold medals, group them by region, so our next operation is performed region wise, and not for the whole data set. Then we count how many gold medals each region got.
Merging
To merge two data frames that include information that belongs together, we need a common column, on which we can combine them. In our case, this is the column containing the country. They are both named region
, but one with an upper case R
. This doesn’t pose a problem, as we can define which columns should be taken from which data frame for merging. Let’s take a quick look before merging to check if there are any countries named differently in both data sets (this simply combines commands we have already looked at in the Basic operations chapter:
$Region[!(medal_counts$Region %in% world_coordinates$region)] medal_counts
[1] "Individual Olympic Athletes"
Looks like all of the countries in our medal_countries
data frame can also be found in our world_coordinates
frame. Only athletes without a country will be lost when merging, but that’s ok for now, as we are interested in the country specific gold medal counts. So let’s merge:
<- merge(
medal_countries x = medal_counts,
y = world_coordinates,
by.x = "Region",
by.y = "region",
all.x = FALSE,
all.y = TRUE
)
head(medal_countries)
Region Medal n long lat group order subregion
1 Afghanistan <NA> NA 74.89131 37.23164 2 12 <NA>
2 Afghanistan <NA> NA 74.84023 37.22505 2 13 <NA>
3 Afghanistan <NA> NA 74.76738 37.24917 2 14 <NA>
4 Afghanistan <NA> NA 74.73896 37.28564 2 15 <NA>
5 Afghanistan <NA> NA 74.72666 37.29072 2 16 <NA>
6 Afghanistan <NA> NA 74.66895 37.26670 2 17 <NA>
Note that we also used the all.x
and all.y
arguments. In this example, we want to take all rows from the second data set, but only those from the first data set, that have a match in the second data set. This is necessary, because we want to plot all countries later on, but only those we have coordinates for, because they won’t show up on the map otherwise.
We can also use the tidyverse for this operation. In order to do that, we first have to rename our region column, as the column names need to be the same over both data sets that are merged. left_join()
means that we will merge onto the first data set (world_coordinates in the code below), like we have done using the all.x
and all.y
arguments in the merge()
function.
<- world_coordinates %>%
medal_countries rename("Region" = region) %>%
left_join(medal_counts)
Joining with `by = join_by(Region)`
head(medal_countries)
long lat group order Region subregion Medal n
1 -69.89912 12.45200 1 1 Aruba <NA> <NA> NA
2 -69.89571 12.42300 1 2 Aruba <NA> <NA> NA
3 -69.94219 12.43853 1 3 Aruba <NA> <NA> NA
4 -70.00415 12.50049 1 4 Aruba <NA> <NA> NA
5 -70.06612 12.54697 1 5 Aruba <NA> <NA> NA
6 -70.05088 12.59707 1 6 Aruba <NA> <NA> NA
Great! Now the information that belongs together is stored together.