Data Structures

There are five main data structures in R which differ on their dimensions (one dimension, two dimensions, n dimensions) and the type of the elements they are containing (same type, different types):¹

	Homogeneous	Heterogeneous
1d	atomic vector	list
2d	matrix	data.frame
nd	array

Let’s take a closer look at the two we will use mostly throughout this workshop:

Vector

Atomic vectors (from here on only called vectors) contain elements of only the same type:

num_vec <- c(2023, 8, 8)
num_vec

[1] 2023    8    8

char_vec <- c("This", "is", "a", "vec", ".")
char_vec

[1] "This" "is"   "a"    "vec"  "."

log_vec <- c(TRUE, FALSE)
log_vec

[1]  TRUE FALSE

Data types

If we take a look at the structure of the vectors we have just created, we see se a short description of the data type we are dealing with in front of the vector:

str(num_vec)

 num [1:3] 2023 8 8

str(char_vec)

 chr [1:5] "This" "is" "a" "vec" "."

str(log_vec)

 logi [1:2] TRUE FALSE

The first one is num (numeric) so it only stores numeric values. The second one is char (character), so it only can contain strings. And last but not least we have logi (logical) for boolean values. Why is that important? Well, some functions only make sense for specific data types. For example:

mean(char_vec)

Warning in mean.default(char_vec): argument is not numeric or logical:
returning NA

[1] NA

gives us a warning, because the input has the wrong format.

By the way, strings are just ‘words’ combined of multiple characters. We can combine multiple strings by using paste() or paste0() (the first one leaves a space between the words, the second one not):

vec_1 <- "My value"
vec_2 <- "is:"
value <- 10


paste(vec_1, vec_2, value)

[1] "My value is: 10"

paste0(vec_1, vec_2, value)

[1] "My valueis:10"

This will come in handy later when we write our own functions, because it helps us to print variable messages, depending on the input given by the user.

Data frame

A data frame is two dimensional and can store elements of different types.

persons <- data.frame(name = c("Anna", "Alex", "John", "Jessi"),
                      age = c(19, 17, 18, 18),
                      birth_month = c("Jan", "Sep", "Oct", "Mar"),
                      big5_extro = c(3.5, 2, 4.5, 4.2)
                      )

Note that we do nothing else here than combining vectors to a data frame. Each vector will be one column, with an assigned column name.

Adding new columns

Adding new columns to a data frame is pretty straight forward. We just define the column name, and then assign it some input. For example, we could add a column with the neuroticsm ratings for each person:

persons$big5_neuro <- c(1, 3, 2, 4)
persons

   name age birth_month big5_extro big5_neuro
1  Anna  19         Jan        3.5          1
2  Alex  17         Sep        2.0          3
3  John  18         Oct        4.5          2
4 Jessi  18         Mar        4.2          4

Or, using the tidyverse with the help of mutate():

library(tidyverse)

persons <- persons %>% 
  mutate(big5_agree= c(2, 5, 2, 1) )

Tibbles

A special type of data frame are the so called tibbles. Tibbles are a modern version of data frames and the standard data frame type of the tidyverse, as they have some advantageous characteristics (e.g., note the more informative printing of the data frame). So don’t be confused if you run into them, in general they behave like data frames.

persons_tibble <- tibble(
  name = c("Anna", "Alex", "John", "Jessi"),
  age = c(19, 17, 18, 18),
  birth_month = c("Jan", "Sep", "Oct", "Mar"),
  big5_extro = c(3.5, 2, 4.5, 4.2)
)
persons_tibble

# A tibble: 4 × 4
  name    age birth_month big5_extro
  <chr> <dbl> <chr>            <dbl>
1 Anna     19 Jan                3.5
2 Alex     17 Sep                2  
3 John     18 Oct                4.5
4 Jessi    18 Mar                4.2

List

A list is a one dimensional object, which can, unlike a vector, contain elements of different types, but also of different lengths. For example, we can store a vectors of different lengths and data frames in a list, which makes it the most versatile data structure:

personality_rating <- list(
     big5 = data.frame(name = c("Jessi", "John"),
                       extraversion = c(4.3, 2), 
                       openness = c(3.8, NA)),
     rating_type = "self_rating"
     )
personality_rating

$big5
   name extraversion openness
1 Jessi          4.3      3.8
2  John          2.0       NA

$rating_type
[1] "self_rating"

Here, we define the list personality_ratings, which includes a data frame with the personality rating, and some meta information in the form of a character vector, describing the rating type. We won’t use it much in this workshop, but keep in mind it exists, as it quickly becomes necessary for managing more complex tasks.

Matrix & Array

Finally, just for the sake of comprehensiveness (we won’t use them in the following workshop, but that doesn’t mean they are irrelevant):

my_matrix <- matrix(c(1,2,"3",4), 
                    nrow = 2, 
                    ncol = 2
                    )

my_matrix

     [,1] [,2]
[1,] "1"  "3" 
[2,] "2"  "4"

Note how everything gets converted to character (with the “” around it), because we used a "3" instead of 3? That’s because a matrix can only have values of the same type.

Last but not least, just so you have seen it once:

my_array <- array(1:24, dim = c(2, 3, 4))
my_array

, , 1

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

, , 2

     [,1] [,2] [,3]
[1,]    7    9   11
[2,]    8   10   12

, , 3

     [,1] [,2] [,3]
[1,]   13   15   17
[2,]   14   16   18

, , 4

     [,1] [,2] [,3]
[1,]   19   21   23
[2,]   20   22   24

By using the dim argument I specify that each matrix in this array has 2 rows, 3 columns, and that I want 4 matrices.

Footnotes

Table from Advanced R.↩︎