# install.packages("tidyverse")
# install.packages("here")
library(tidyverse)
library(here)
Reshaping
In this chapter, we will look at a simpler data set that makes it a bit easier to explain reshaping between the wide and long data format, as our athletes
data set is relatively complex.
Let’s define our own data set:
<- data.frame(
inhabitants_wide country = c("China", "India", "USA"),
inhabitants_2021 = c(1425893465, 1407563842, NA),
inhabitants_2022 = c(1425857720, 1420939232, 338903174)
)
Wide Format
In the Wide Format, we have only one row per unit of analysis. In this example, each country has it’s own row:
head(inhabitants_wide)
country inhabitants_2021 inhabitants_2022
1 China 1425893465 1425857720
2 India 1407563842 1420939232
3 USA NA 338903174
However, the variable inhabitants
is stretched over multiple rows. Depending on the use case it can make sense to reshape the data, so the inhabitants
are all put into one column:
Long Format
This is what happens in the Long Data Format, where each unit of analysis is spread over multiple rows.
head(inhabitants_long)
# A tibble: 6 × 3
country year inhabitants
<chr> <chr> <dbl>
1 China inhabitants_2022 1425857720
2 China inhabitants_2021 1425893465
3 India inhabitants_2022 1420939232
4 India inhabitants_2021 1407563842
5 USA inhabitants_2022 338903174
6 USA inhabitants_2021 NA
Reshaping from Wide to Long Format
To get from Wide Format to Long Format we can use the pivot_longer()
function from the tidyverse
:
<- inhabitants_wide %>%
inhabitants_long pivot_longer(
## Select the columns we want to reshape:
cols = c("inhabitants_2022", "inhabitants_2021"),
## Define a new column where the column names will go to:
names_to = "year",
## Define a new column where the values will go to:
values_to = "inhabitants"
)
head(inhabitants_long)
# A tibble: 6 × 3
country year inhabitants
<chr> <chr> <dbl>
1 China inhabitants_2022 1425857720
2 China inhabitants_2021 1425893465
3 India inhabitants_2022 1420939232
4 India inhabitants_2021 1407563842
5 USA inhabitants_2022 338903174
6 USA inhabitants_2021 NA
Reshaping from Long to Wide Format
In other cases, it might happen that multiple variables are put into the same column, together with an identifier column:
inhabitants_long_2
country variable value
1 China area 9597000
2 China inhabitants_2022 1425857720
3 India area 3287000
4 India inhabitants_2022 1420939232
5 USA area 9834000
6 USA inhabitants_2022 338903174
In that case it can make sense to spread the the distinct variables into two columns:
<- inhabitants_long_2 %>%
inhabitants_wide_2 pivot_wider(
id_cols = "country",
names_from = "variable",
values_from = "value"
)
inhabitants_wide_2
# A tibble: 3 × 3
country area inhabitants_2022
<chr> <dbl> <dbl>
1 China 9597000 1425857720
2 India 3287000 1420939232
3 USA 9834000 338903174