+ - 0:00:00
Notes for current slide
Notes for next slide

Microbes on Earth





R session 02 - Data wrangling

Daniel Vaulot

2021-10-12

1 / 36

Outline

  • Data frames
  • Concept of tidy data
  • Reading data
  • Manipulating data
    • Columns
    • Rows
2 / 36

Data frames

What is it ?

  • Table mixing different types of columns (an Excel table...)
  • However within a column all values are similar, e.g. numeric, logical, character
3 / 36

Data frames

What is it ?

  • Table mixing different types of columns (an Excel table...)
  • However within a column all values are similar, e.g. numeric, logical, character
df <- data.frame(label = letters[1:6],
id = 1:6,
value = rnorm(6, mean = 0, sd = 1),
flag=c(TRUE, FALSE), # recycling rule
stringsAsFactors = FALSE)
df
label id value flag
1 a 1 1.2682572 TRUE
2 b 2 1.8706980 FALSE
3 c 3 -1.0296620 TRUE
4 d 4 0.8709814 FALSE
5 e 5 -1.1437683 TRUE
6 f 6 0.6079145 FALSE
  • We will NOT use factors: stringsAsFactors = FALSE (default in R > 4.0)
3 / 36

Data frames

Useful functions

dim(df) # returns the dimensions of data frame
[1] 6 4
nrow(df) # number of rows
[1] 6
ncol(df) # number of columns
[1] 4
4 / 36

Data frames

Access specific value

  • Use the df[i,j] notation, first index corresponds to row, second index to column
df[5,3]
[1] -1.143768
5 / 36

Data frames

Access specific value

  • Use the df[i,j] notation, first index corresponds to row, second index to column
df[5,3]
[1] -1.143768
  • Specify the name of the column
df[5,"value"]
[1] -1.143768
  • The result is a vector
5 / 36

Data frames

Access specific column

  • Use the df[i,j] notation
df[,3]
[1] 1.2682572 1.8706980 -1.0296620 0.8709814 -1.1437683 0.6079145
df[,"value"]
[1] 1.2682572 1.8706980 -1.0296620 0.8709814 -1.1437683 0.6079145
  • The result is a vector
6 / 36

Data frames

This can be used to access a specific value

  • $ for the column, [i] for the row
df$value[5]
[1] -1.143768
7 / 36

Data frames

Access row

  • Use the df[i,j] notation
df[1,]
label id value flag
1 a 1 1.268257 TRUE
  • The result is a data frame
8 / 36

Data frames

Access specific rows

  • Rows for which the value of id <= 3
df[df$id <= 3,]
label id value flag
1 a 1 1.268257 TRUE
2 b 2 1.870698 FALSE
3 c 3 -1.029662 TRUE

Select lines for which the label is c

9 / 36

Data frames

Access specific rows

  • Rows for which the value of id <= 3
df[df$id <= 3,]
label id value flag
1 a 1 1.268257 TRUE
2 b 2 1.870698 FALSE
3 c 3 -1.029662 TRUE

Select lines for which the label is c

df[df$label == "c",]
label id value flag
3 c 3 -1.029662 TRUE

This syntax is complicated - tidyverse packages make it much more easy to manipulate and remember

9 / 36

Tidy data

Installation and Resources

Packages

  • readxl : Reading Excel files
  • readr : Reading and writing Text files
  • dplyr : Filter and reformat data frames
  • tidyr : Make data "tidy"
  • stringr : Manipulating strings
  • lubridate : Manipulate date
  • unzip R-session-02.zip
  • Open in R data_wrangling.R

Resources

10 / 36

Tidy data

Basic concepts

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.

11 / 36

Tidy data

Load necessary libraries

library("readxl") # Import the data from Excel file
library("readr") # Import the data from Excel file
library("dplyr") # filter and reformat data frames
library("tidyr") # make data tidy
library("stringr") # manipulate strings
library("lubridate") # manipulate date
library("ggplot2") # graphics
12 / 36

Oceanographic data

CARBOM cruise off Brazil

13 / 36

Oceanographic data

CARBOM cruise off Brazil

  • Stations
  • Depth
  • Coordinates
  • Temperature, Salinity
  • Nitrates, Phosphates

13 / 36

Oceanographic data

Microbial populations

  • Flow cytometry :
    • pico-eukaryotes
    • nano-eukaryotes
14 / 36

Oceanographic data

Microbial populations

  • Flow cytometry :
    • pico-eukaryotes
    • nano-eukaryotes

14 / 36

Read data

Text file - TAB delimited

15 / 36

Read data

Reading a text file

samples <- readr::read_tsv("data/CARBOM data.txt")
16 / 36

Read data

Reading a text file

samples <- readr::read_tsv("data/CARBOM data.txt")
sample number transect station date time depth level latitude longitude picoeuks nanoeuks phosphates nitrates temperature salinity
10 1 81 13/11/2013 01:00:00 140 Deep -27.42 -44.72 3278 1232 0.20 0.26 17.3 35.9
11 1 85 13/11/2013 13:30:00 110 Deep -26.80 -45.30 16312 1615 0.29 0.22 21.3 36.5
120 2 96 18/11/2013 23:50:00 5 Surf -27.39 -47.82 1150 75 0.43 0.19 23.1 33.5
121 2 18/11/2013 23:50:00 30 Deep -27.39 -47.82 1737 218 0.43 0.23 22.6 33.7
122 2 18/11/2013 23:50:00 50 Deep -27.39 -47.82 853 234 0.56 0.21 20.3 35.9
125 2 98 18/11/2013 05:00:00 5 Surf -27.59 -47.39 3086 1300 0.29 0.25 23.1 35.7
126 2 18/11/2013 05:00:00 50 Deep -27.59 -47.39 1217 782 0.25 0.20 23.7 37.2
127 2 18/11/2013 05:00:00 85 Deep -27.59 -47.39 3420 226 0.25 0.47 22.9 37.0
13 1 86 13/11/2013 17:00:00 105 Deep -26.33 -45.41 6366 1007 0.34 0.15 20.9 36.3
140 2 101 18/11/2013 12:00:00 5 Surf -27.79 -46.96 500 366 0.29 0.14 23.5 36.5
16 / 36

Read data

Reading a text file

samples <- readr::read_tsv("data/CARBOM data.txt")
sample number transect station date time depth level latitude longitude picoeuks nanoeuks phosphates nitrates temperature salinity
10 1 81 13/11/2013 01:00:00 140 Deep -27.42 -44.72 3278 1232 0.20 0.26 17.3 35.9
11 1 85 13/11/2013 13:30:00 110 Deep -26.80 -45.30 16312 1615 0.29 0.22 21.3 36.5
120 2 96 18/11/2013 23:50:00 5 Surf -27.39 -47.82 1150 75 0.43 0.19 23.1 33.5
121 2 18/11/2013 23:50:00 30 Deep -27.39 -47.82 1737 218 0.43 0.23 22.6 33.7
122 2 18/11/2013 23:50:00 50 Deep -27.39 -47.82 853 234 0.56 0.21 20.3 35.9
125 2 98 18/11/2013 05:00:00 5 Surf -27.59 -47.39 3086 1300 0.29 0.25 23.1 35.7
126 2 18/11/2013 05:00:00 50 Deep -27.59 -47.39 1217 782 0.25 0.20 23.7 37.2
127 2 18/11/2013 05:00:00 85 Deep -27.59 -47.39 3420 226 0.25 0.47 22.9 37.0
13 1 86 13/11/2013 17:00:00 105 Deep -26.33 -45.41 6366 1007 0.34 0.15 20.9 36.3
140 2 101 18/11/2013 12:00:00 5 Surf -27.79 -46.96 500 366 0.29 0.14 23.5 36.5
  • readr::read_tsv() : read tab delimited files
  • readr::read_csv() : read comma delimited files

  • readr::write_tsv() : write tab delimited files

16 / 36

Read data

Excel sheet

17 / 36

Read data

Read the data - read_excel

samples <- readxl::read_excel("data/CARBOM data.xlsx",
sheet = "Samples_boat")
18 / 36

Read data

Read the data - read_excel

samples <- readxl::read_excel("data/CARBOM data.xlsx",
sheet = "Samples_boat")
sample number transect station date time depth level latitude longitude picoeuks nanoeuks phosphates nitrates temperature salinity
10 1 81 2013-11-13 1899-12-31 01:00:00 140 Deep -27.42 -44.72 3278 1232 0.20 0.26 17.3 35.9
11 1 85 2013-11-13 1899-12-31 13:30:00 110 Deep -26.80 -45.30 16312 1615 0.29 0.22 21.3 36.5
120 2 96 2013-11-18 1899-12-31 23:50:00 5 Surf -27.39 -47.82 1150 75 0.43 0.19 23.1 33.5
121 2 2013-11-18 1899-12-31 23:50:00 30 Deep -27.39 -47.82 1737 218 0.43 0.23 22.6 33.7
122 2 2013-11-18 1899-12-31 23:50:00 50 Deep -27.39 -47.82 853 234 0.56 0.21 20.3 35.9
125 2 98 2013-11-18 1899-12-31 05:00:00 5 Surf -27.59 -47.39 3086 1300 0.29 0.25 23.1 35.7
126 2 2013-11-18 1899-12-31 05:00:00 50 Deep -27.59 -47.39 1217 782 0.25 0.20 23.7 37.2
127 2 2013-11-18 1899-12-31 05:00:00 85 Deep -27.59 -47.39 3420 226 0.25 0.47 22.9 37.0
13 1 86 2013-11-13 1899-12-31 17:00:00 105 Deep -26.33 -45.41 6366 1007 0.34 0.15 20.9 36.3
140 2 101 2013-11-18 1899-12-31 12:00:00 5 Surf -27.79 -46.96 500 366 0.29 0.14 23.5 36.5
  • Can also select a range : e.g. A1:Q26
  • Can skip lines
18 / 36

Write data

Text file

  • readr::write_tsv() : write tab delimited files
readr::write_tsv(samples, "data/CARBOM data fixed.tsv")
19 / 36

Write data

Text file

  • readr::write_tsv() : write tab delimited files
readr::write_tsv(samples, "data/CARBOM data fixed.tsv")

Excel file

  • openxlsx::write.xlsx : write tab delimited files

  • Many options: specific sheet, formatting etc...

openxlsx::write.xlsx(samples, "data/CARBOM data fixed.xlsx")
19 / 36

dplyr - Manipulate tables

@allison_horst

20 / 36

Manipulate columns

List columns

colnames(samples)
[1] "sample number" "transect" "station" "date" "time" "depth" "level" "latitude" "longitude" "picoeuks" "nanoeuks"
[12] "phosphates" "nitrates" "temperature" "salinity"
21 / 36

Manipulate columns

Summarize columns

summary(samples$depth)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
5.0 5.0 50.0 56.6 100.0 140.0 3
22 / 36

Manipulate columns

Select specific columns - select

samples_select <- dplyr::select(samples, transect, `sample number`,
station, depth, latitude, longitude,
picoeuks, nanoeuks)
23 / 36

Manipulate columns

Select specific columns - select

samples_select <- dplyr::select(samples, transect, `sample number`,
station, depth, latitude, longitude,
picoeuks, nanoeuks)
transect sample number station depth latitude longitude picoeuks nanoeuks
1 10 81 140 -27.42 -44.72 3278 1232
1 11 85 110 -26.80 -45.30 16312 1615
2 120 96 5 -27.39 -47.82 1150 75
2 121 96 30 -27.39 -47.82 1737 218
2 122 96 50 -27.39 -47.82 853 234
2 125 98 5 -27.59 -47.39 3086 1300
2 126 98 50 -27.59 -47.39 1217 782
2 127 98 85 -27.59 -47.39 3420 226
1 13 86 105 -26.33 -45.41 6366 1007
2 140 101 5 -27.79 -46.96 500 366
  • Column names are not "quoted" (in base R you need to "quote" the column names)
  • Better not to put space in column header because then must enclose column name with ` (back-quote)
23 / 36

Manipulate columns

Select a range of columns - select

samples_select <- dplyr::select(samples, transect:nanoeuks)
24 / 36

Manipulate columns

Select a range of columns - select

samples_select <- dplyr::select(samples, transect:nanoeuks)
transect station date time depth level latitude longitude picoeuks nanoeuks
1 81 2013-11-13 1899-12-31 01:00:00 140 Deep -27.42 -44.72 3278 1232
1 85 2013-11-13 1899-12-31 13:30:00 110 Deep -26.80 -45.30 16312 1615
2 96 2013-11-18 1899-12-31 23:50:00 5 Surf -27.39 -47.82 1150 75
2 96 2013-11-18 1899-12-31 23:50:00 30 Deep -27.39 -47.82 1737 218
2 96 2013-11-18 1899-12-31 23:50:00 50 Deep -27.39 -47.82 853 234
2 98 2013-11-18 1899-12-31 05:00:00 5 Surf -27.59 -47.39 3086 1300
2 98 2013-11-18 1899-12-31 05:00:00 50 Deep -27.59 -47.39 1217 782
2 98 2013-11-18 1899-12-31 05:00:00 85 Deep -27.59 -47.39 3420 226
1 86 2013-11-13 1899-12-31 17:00:00 105 Deep -26.33 -45.41 6366 1007
2 101 2013-11-18 1899-12-31 12:00:00 5 Surf -27.79 -46.96 500 366
24 / 36

Manipulate columns

Unselect columns - select

samples_select <- dplyr::select (samples, -nitrates, -phosphates)
25 / 36

Manipulate columns

Unselect columns - select

samples_select <- dplyr::select (samples, -nitrates, -phosphates)
sample number transect station date time depth level latitude longitude picoeuks nanoeuks temperature salinity
10 1 81 2013-11-13 1899-12-31 01:00:00 140 Deep -27.42 -44.72 3278 1232 17.3 35.9
11 1 85 2013-11-13 1899-12-31 13:30:00 110 Deep -26.80 -45.30 16312 1615 21.3 36.5
120 2 96 2013-11-18 1899-12-31 23:50:00 5 Surf -27.39 -47.82 1150 75 23.1 33.5
121 2 96 2013-11-18 1899-12-31 23:50:00 30 Deep -27.39 -47.82 1737 218 22.6 33.7
122 2 96 2013-11-18 1899-12-31 23:50:00 50 Deep -27.39 -47.82 853 234 20.3 35.9
125 2 98 2013-11-18 1899-12-31 05:00:00 5 Surf -27.59 -47.39 3086 1300 23.1 35.7
126 2 98 2013-11-18 1899-12-31 05:00:00 50 Deep -27.59 -47.39 1217 782 23.7 37.2
127 2 98 2013-11-18 1899-12-31 05:00:00 85 Deep -27.59 -47.39 3420 226 22.9 37.0
13 1 86 2013-11-13 1899-12-31 17:00:00 105 Deep -26.33 -45.41 6366 1007 20.9 36.3
140 2 101 2013-11-18 1899-12-31 12:00:00 5 Surf -27.79 -46.96 500 366 23.5 36.5
25 / 36

Manipulate columns

Using the pipe operator - %>%

samples_select <- samples %>% dplyr::select(transect:nanoeuks)
26 / 36

Manipulate columns

Using the pipe operator - %>%

samples_select <- samples %>% dplyr::select(transect:nanoeuks)
transect station date time depth level latitude longitude picoeuks nanoeuks
1 81 2013-11-13 1899-12-31 01:00:00 140 Deep -27.42 -44.72 3278 1232
1 85 2013-11-13 1899-12-31 13:30:00 110 Deep -26.80 -45.30 16312 1615
2 96 2013-11-18 1899-12-31 23:50:00 5 Surf -27.39 -47.82 1150 75
2 96 2013-11-18 1899-12-31 23:50:00 30 Deep -27.39 -47.82 1737 218
2 96 2013-11-18 1899-12-31 23:50:00 50 Deep -27.39 -47.82 853 234
2 98 2013-11-18 1899-12-31 05:00:00 5 Surf -27.59 -47.39 3086 1300
2 98 2013-11-18 1899-12-31 05:00:00 50 Deep -27.59 -47.39 1217 782
2 98 2013-11-18 1899-12-31 05:00:00 85 Deep -27.59 -47.39 3420 226
1 86 2013-11-13 1899-12-31 17:00:00 105 Deep -26.33 -45.41 6366 1007
2 101 2013-11-18 1899-12-31 12:00:00 5 Surf -27.79 -46.96 500 366
  • It is cleaner to write on 2 lines
samples_select <- samples %>%
dplyr::select(transect:nanoeuks)
26 / 36

Manipulate columns

Renaming variables - rename

samples <- samples %>%
dplyr::rename(sample_number = `sample number`)
27 / 36

Manipulate columns

Renaming variables - rename

samples <- samples %>%
dplyr::rename(sample_number = `sample number`)
sample_number transect station date time depth level latitude longitude picoeuks nanoeuks phosphates nitrates temperature salinity
10 1 81 2013-11-13 1899-12-31 01:00:00 140 Deep -27.42 -44.72 3278 1232 0.20 0.26 17.3 35.9
11 1 85 2013-11-13 1899-12-31 13:30:00 110 Deep -26.80 -45.30 16312 1615 0.29 0.22 21.3 36.5
120 2 96 2013-11-18 1899-12-31 23:50:00 5 Surf -27.39 -47.82 1150 75 0.43 0.19 23.1 33.5
121 2 96 2013-11-18 1899-12-31 23:50:00 30 Deep -27.39 -47.82 1737 218 0.43 0.23 22.6 33.7
122 2 96 2013-11-18 1899-12-31 23:50:00 50 Deep -27.39 -47.82 853 234 0.56 0.21 20.3 35.9
125 2 98 2013-11-18 1899-12-31 05:00:00 5 Surf -27.59 -47.39 3086 1300 0.29 0.25 23.1 35.7
126 2 98 2013-11-18 1899-12-31 05:00:00 50 Deep -27.59 -47.39 1217 782 0.25 0.20 23.7 37.2
127 2 98 2013-11-18 1899-12-31 05:00:00 85 Deep -27.59 -47.39 3420 226 0.25 0.47 22.9 37.0
13 1 86 2013-11-13 1899-12-31 17:00:00 105 Deep -26.33 -45.41 6366 1007 0.34 0.15 20.9 36.3
140 2 101 2013-11-18 1899-12-31 12:00:00 5 Surf -27.79 -46.96 500 366 0.29 0.14 23.5 36.5
27 / 36

Manipulate columns

Creating new variables - mutate

samples <- samples %>%
dplyr::mutate(pico_pct = picoeuks/(picoeuks+nanoeuks)*100)
28 / 36

Manipulate columns

Creating new variables - mutate

samples <- samples %>%
dplyr::mutate(pico_pct = picoeuks/(picoeuks+nanoeuks)*100)
sample_number transect station date time depth level latitude longitude picoeuks nanoeuks phosphates nitrates temperature salinity pico_pct
10 1 81 2013-11-13 1899-12-31 01:00:00 140 Deep -27.42 -44.72 3278 1232 0.20 0.26 17.3 35.9 72.68293
11 1 85 2013-11-13 1899-12-31 13:30:00 110 Deep -26.80 -45.30 16312 1615 0.29 0.22 21.3 36.5 90.99124
120 2 96 2013-11-18 1899-12-31 23:50:00 5 Surf -27.39 -47.82 1150 75 0.43 0.19 23.1 33.5 93.87755
121 2 96 2013-11-18 1899-12-31 23:50:00 30 Deep -27.39 -47.82 1737 218 0.43 0.23 22.6 33.7 88.84910
122 2 96 2013-11-18 1899-12-31 23:50:00 50 Deep -27.39 -47.82 853 234 0.56 0.21 20.3 35.9 78.47286
125 2 98 2013-11-18 1899-12-31 05:00:00 5 Surf -27.59 -47.39 3086 1300 0.29 0.25 23.1 35.7 70.36024
126 2 98 2013-11-18 1899-12-31 05:00:00 50 Deep -27.59 -47.39 1217 782 0.25 0.20 23.7 37.2 60.88044
127 2 98 2013-11-18 1899-12-31 05:00:00 85 Deep -27.59 -47.39 3420 226 0.25 0.47 22.9 37.0 93.80143
13 1 86 2013-11-13 1899-12-31 17:00:00 105 Deep -26.33 -45.41 6366 1007 0.34 0.15 20.9 36.3 86.34206
140 2 101 2013-11-18 1899-12-31 12:00:00 5 Surf -27.79 -46.96 500 366 0.29 0.14 23.5 36.5 57.73672
  • You can also use transmute() but then it will drop all the other columns.
  • It is much much better to compute new variables in R than in Excel, because you can easily track and correct errors.
28 / 36

Manipulate columns

Using the pipe operator you can chain operations

samples_select <- samples %>%
dplyr::select(sample_number:nanoeuks, level) %>%
dplyr::mutate(pico_pct = picoeuks/(picoeuks+nanoeuks)*100)
29 / 36

Manipulate columns

Using the pipe operator you can chain operations

samples_select <- samples %>%
dplyr::select(sample_number:nanoeuks, level) %>%
dplyr::mutate(pico_pct = picoeuks/(picoeuks+nanoeuks)*100)
sample_number transect station date time depth level latitude longitude picoeuks nanoeuks pico_pct
10 1 81 2013-11-13 1899-12-31 01:00:00 140 Deep -27.42 -44.72 3278 1232 72.68293
11 1 85 2013-11-13 1899-12-31 13:30:00 110 Deep -26.80 -45.30 16312 1615 90.99124
120 2 96 2013-11-18 1899-12-31 23:50:00 5 Surf -27.39 -47.82 1150 75 93.87755
121 2 96 2013-11-18 1899-12-31 23:50:00 30 Deep -27.39 -47.82 1737 218 88.84910
122 2 96 2013-11-18 1899-12-31 23:50:00 50 Deep -27.39 -47.82 853 234 78.47286
125 2 98 2013-11-18 1899-12-31 05:00:00 5 Surf -27.59 -47.39 3086 1300 70.36024
126 2 98 2013-11-18 1899-12-31 05:00:00 50 Deep -27.59 -47.39 1217 782 60.88044
127 2 98 2013-11-18 1899-12-31 05:00:00 85 Deep -27.59 -47.39 3420 226 93.80143
13 1 86 2013-11-13 1899-12-31 17:00:00 105 Deep -26.33 -45.41 6366 1007 86.34206
140 2 101 2013-11-18 1899-12-31 12:00:00 5 Surf -27.79 -46.96 500 366 57.73672
29 / 36

Manipulate columns

Creating labels with mutate and stringr functions

samples <- samples %>%
dplyr::mutate(sample_label = str_c("TR",transect,"St",station, sep="_"))
30 / 36

Manipulate columns

Creating labels with mutate and stringr functions

samples <- samples %>%
dplyr::mutate(sample_label = str_c("TR",transect,"St",station, sep="_"))
sample_number transect station date time sample_label
10 1 81 2013-11-13 1899-12-31 01:00:00 TR_1_St_81
11 1 85 2013-11-13 1899-12-31 13:30:00 TR_1_St_85
120 2 96 2013-11-18 1899-12-31 23:50:00 TR_2_St_96
121 2 96 2013-11-18 1899-12-31 23:50:00 TR_2_St_96
122 2 96 2013-11-18 1899-12-31 23:50:00 TR_2_St_96
125 2 98 2013-11-18 1899-12-31 05:00:00 TR_2_St_98
126 2 98 2013-11-18 1899-12-31 05:00:00 TR_2_St_98
127 2 98 2013-11-18 1899-12-31 05:00:00 TR_2_St_98
13 1 86 2013-11-13 1899-12-31 17:00:00 TR_1_St_86
140 2 101 2013-11-18 1899-12-31 12:00:00 TR_2_St_101
30 / 36

Manipulating rows

Order rows - arrange

samples <- samples %>%
dplyr::arrange(transect, station)
31 / 36

Manipulating rows

Order rows - arrange

samples <- samples %>%
dplyr::arrange(transect, station)
sample_number transect station date time depth level latitude longitude picoeuks nanoeuks phosphates nitrates temperature salinity pico_pct
3 0 19 2013-11-02 13:30 5 Surf -25.79 -40.36 1005 898 0.29 0.48 22.7 36.9 52.81135
5 0 21 2013-11-02 0:0 5 Surf -26.23 -40.09 793 660 0.16 0.90 22.8 36.9 54.57674
7 0 26 2013-11-03 19:30 5 Surf -27.31 -39.38 907 856 0.20 0.50 21.2 36.4 51.44640
1 0 6 2013-10-31 5:20 45 Deep -23.58 -41.78 7651 4845 0.47 1.07 19.7 36.3 61.22759
2 0 6 2013-10-31 5:20 45 Deep -23.58 -41.78 7343 3258 0.47 1.07 19.7 36.3 69.26705
10 1 81 2013-11-13 1:0 140 Deep -27.42 -44.72 3278 1232 0.20 0.26 17.3 35.9 72.68293
9 1 81 2013-11-13 1:0 140 Deep -27.42 -44.72 3181 1235 0.20 0.26 17.3 35.9 72.03351
11 1 85 2013-11-13 13:30 110 Deep -26.80 -45.30 16312 1615 0.29 0.22 21.3 36.5 90.99124
13 1 86 2013-11-13 17:0 105 Deep -26.33 -45.41 6366 1007 0.34 0.15 20.9 36.3 86.34206
15 1 87 2013-11-13 19:30 105 Deep -26.22 -45.48 6189 622 0.47 1.51 19.5 36.1 90.86771
  • Station 6 is not ordered numerically. It is because station is a character column.
31 / 36

Manipulating rows

Summarize rows - count

  • Compute number of stations per transect
stations_count <- samples %>%
dplyr::count(transect)
32 / 36

Manipulating rows

Summarize rows - count

  • Compute number of stations per transect
stations_count <- samples %>%
dplyr::count(transect)
transect n
0 5
1 5
2 18
32 / 36

Manipulating rows

Summarize rows - group_by / summarize

  • Group by transect and station
  • Compute mean of the percent picoplankton
samples_mean <- samples %>%
dplyr::group_by(transect, station) %>%
dplyr::summarise(n_samples = n(),
mean_pico_percent = mean(pico_pct, na.rm=TRUE))
33 / 36

Manipulating rows

Summarize rows - group_by / summarize

  • Group by transect and station
  • Compute mean of the percent picoplankton
samples_mean <- samples %>%
dplyr::group_by(transect, station) %>%
dplyr::summarise(n_samples = n(),
mean_pico_percent = mean(pico_pct, na.rm=TRUE))
transect station n_samples mean_pico_percent
0 6 2 65.24732
0 19 1 52.81135
0 21 1 54.57674
0 26 1 51.44640
1 81 2 72.35822
1 85 1 90.99124
1 86 1 86.34206
1 87 1 90.86771
2 96 3 87.06651
2 98 3 75.01403
33 / 36

Manipulating rows

Filtering rows - filter

  • Get only the surface samples
samples_surf <- samples %>%
dplyr::filter(level == "Surf" )
34 / 36

Manipulating rows

Filtering rows - filter

  • Get only the surface samples
samples_surf <- samples %>%
dplyr::filter(level == "Surf" )
sample_number transect station date time depth level latitude longitude picoeuks nanoeuks phosphates nitrates temperature salinity pico_pct
3 0 19 2013-11-02 13:30 5 Surf -25.79 -40.36 1005 898 0.29 0.48 22.7 36.9 52.81135
5 0 21 2013-11-02 0:0 5 Surf -26.23 -40.09 793 660 0.16 0.90 22.8 36.9 54.57674
7 0 26 2013-11-03 19:30 5 Surf -27.31 -39.38 907 856 0.20 0.50 21.2 36.4 51.44640
120 2 96 2013-11-18 23:50 5 Surf -27.39 -47.82 1150 75 0.43 0.19 23.1 33.5 93.87755
125 2 98 2013-11-18 5:0 5 Surf -27.59 -47.39 3086 1300 0.29 0.25 23.1 35.7 70.36024
140 2 101 2013-11-18 12:0 5 Surf -27.79 -46.96 500 366 0.29 0.14 23.5 36.5 57.73672
155 2 106 2013-11-19 2:30 5 Surf -28.12 -46.17 355 18 0.25 0.37 23.0 36.9 95.17426
165 2 114 2013-11-19 21:40 5 Surf -28.65 -44.99 728 226 0.29 0.28 22.4 36.4 76.31027
Trichod.1 2 Surf -27.80 -47.10 1002 194 83.77926
Trichod.2 2 Surf -27.80 -47.10 744 206 78.31579
  • ! Use the logical operators == != > >= < <= is.na()
34 / 36

Recap

  • Import and Export data

  • Select and create columns

  • Summarize data

  • Joining

  • Long vs. Wide format

  • Displaying tables

35 / 36

Next time: Data visualization (ggplot2)

What you will learn :

  • Understand the "grammar" of graphics
  • Create exploratory graphics
  • Please install the following packages and their dependencies
    • ggplot2
  • Download data files (links on slack)

Reading list

36 / 36

Outline

  • Data frames
  • Concept of tidy data
  • Reading data
  • Manipulating data
    • Columns
    • Rows
2 / 36
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow