Exam Preparation

Author: Ana Bendiek Laranjo

Published: June 13, 2024

Module 01: Getting Started with R

Introduction

Data

Data can be imported from many different sources. In this exercise, we import data from:

  1. an R Package that is loaded via the library() function.
library(ggplot2)
library(dplyr)
library(gapminder)
library(gt)

Gapminder data

For this analysis we’ll use the Gapminder dataset from the gapminder R package.

head(gapminder)
# A tibble: 6 × 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.
4 Afghanistan Asia       1967    34.0 11537966      836.
5 Afghanistan Asia       1972    36.1 13079460      740.
6 Afghanistan Asia       1977    38.4 14880372      786.

Population

The table below shows the mean population in 2007, grouped by continent.

gapminder |>
  filter(year == 2007) |>
  group_by(continent) |>
  summarise(
    mean_pop = mean(pop)
    ) |>
  gt()
continent mean_pop
Africa 17875763
Americas 35954847
Asia 115513752
Europe 19536618
Oceania 12274974

Life expectancy

gapminder_2007 <- gapminder |> 
  filter(year == 2007)

ggplot(data = gapminder_2007, 
       mapping = aes(x = continent, 
                     y = lifeExp)) +
  geom_boxplot() +
  geom_jitter(width = 0.1, alpha = 1/4, size = 3) +
  labs(x = NULL,
       y = "life expectancy") +
  theme_minimal() 

Module 02a: Data visualization with ggplot2

Import

library(ggplot2)
library(ggthemes)
library(ggridges)
library(palmerpenguins)

Explore

head(penguins)
# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>
str(penguins)
tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
 $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
 $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
 $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
 $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
 $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
 $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

Visualize with ggplot2

Functions and arguments

  • functions: ggplot(), aes(), geom_point()
  • arguments: data, mapping, color
ggplot(data = penguins,
       mapping = aes(x = flipper_length_mm, 
                     y = body_mass_g)) +
  geom_point()

Aesthetic mappings

  • options: x, y, color, shape, size, alpha
ggplot(data = penguins,
       mapping = aes(x = flipper_length_mm, 
                     y = body_mass_g, 
                     color = species,
                     shape = species)) +
  geom_point()

Settings

ggplot(data = penguins,
       mapping = aes(x = flipper_length_mm, 
                     y = body_mass_g, 
                     color = species,
                     shape = species)) +
  geom_point(size = 5, alpha = 0.7)

Color scales

ggplot(data = penguins,
       mapping = aes(x = flipper_length_mm, 
                     y = body_mass_g, 
                     color = species,
                     shape = species)) +
  geom_point(size = 5, alpha = 0.7) +
  scale_color_colorblind() 

ggplot(data = penguins,
       mapping = aes(x = flipper_length_mm, 
                     y = body_mass_g, 
                     color = species,
                     shape = species)) +
  geom_point(size = 5, alpha = 0.7) +
  scale_color_manual(values = c("red", "blue", "green")) 

Facets

Keyboard shortcut for the tilde (~) varies by keyboard layout:

  • US keyboard Windows/Mac: Shift + ` (top left of your keyboard next to the 1)
  • UK keyboard Windows/Mac: Shift + # (bottom right of your keyboard, next to Enter)
  • CH keyboard Windows/Mac: Alt/Option + -
ggplot(data = penguins,
       mapping = aes(x = flipper_length_mm,
                     y = body_mass_g)) +
  geom_point() +
  facet_grid(species ~ island)

ggplot(data = penguins,
       mapping = aes(x = flipper_length_mm,
                     y = body_mass_g)) +
  geom_point() +
  facet_wrap(~species)

Themes

Some code in this section is already prepared; we will add more code together.

ggplot(data = penguins,
       mapping = aes(x = flipper_length_mm, 
                     y = body_mass_g, 
                     color = species,
                     shape = species)) +
  geom_point(size = 5, alpha = 0.7) +
  scale_color_colorblind() +
  theme_minimal()

Visualizing distributions

Categorical variables

ggplot(data = penguins,
       mapping = aes(x = species)) +
  geom_bar()

ggplot(data = penguins,
       mapping = aes(x = species,
                     fill = island)) +
  geom_bar()

Numerical continuous variables

The code in this section is already prepared; we will run through the code chunks together.

ggplot(data = penguins,
       mapping = aes(x = body_mass_g)) +
  geom_histogram()

ggplot(data = penguins,
       mapping = aes(x = body_mass_g,
                     fill = species)) +
  geom_histogram()

ggplot(data = penguins,
       mapping = aes(x = body_mass_g,
                     fill = species)) +
  geom_density()

ggplot(data = penguins,
       mapping = aes(x = body_mass_g,
                     y = species,
                     fill = species)) +
  geom_density_ridges()

Module 02b: Working with R

Import

library(ggplot2)
library(dplyr)
library(gapminder)

Explore

head(gapminder)
# A tibble: 6 × 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.
4 Afghanistan Asia       1967    34.0 11537966      836.
5 Afghanistan Asia       1972    36.1 13079460      740.
6 Afghanistan Asia       1977    38.4 14880372      786.
tail(gapminder)
# A tibble: 6 × 6
  country  continent  year lifeExp      pop gdpPercap
  <fct>    <fct>     <int>   <dbl>    <int>     <dbl>
1 Zimbabwe Africa     1982    60.4  7636524      789.
2 Zimbabwe Africa     1987    62.4  9216418      706.
3 Zimbabwe Africa     1992    60.4 10704340      693.
4 Zimbabwe Africa     1997    46.8 11404948      792.
5 Zimbabwe Africa     2002    40.0 11926563      672.
6 Zimbabwe Africa     2007    43.5 12311143      470.
glimpse(gapminder)
Rows: 1,704
Columns: 6
$ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
str(gapminder)
tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
 $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
 $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
 $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
nrow(gapminder)
[1] 1704
ncol(gapminder)
[1] 6

Transform - Narrow down

gapminder_2007 <- gapminder |> 
  filter(year == 2007)
  • Keyboard shortcut for pipe operator: Ctrl / Cmd + Shift + M
  • Keyboard shortcut for assignment operator: Alt + -
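The pipe simply passes the value on its left as the first argument of the function on its right, so a piped chain and the equivalent nested call compute the same thing. A minimal base-R illustration (values are made up for the example):

```r
# x |> f() is the same as f(x): the pipe passes the
# left-hand value as the first argument of the function.
x <- c(4, 9, 16)

piped  <- x |> sqrt() |> sum()   # sqrt() first, then sum()
nested <- sum(sqrt(x))           # the same computation, nested

piped == nested                  # TRUE
```

Reading a chain top to bottom ("take x, then square-root, then sum") is usually easier than reading a nested call inside out, which is why the dplyr examples in this module are written with the pipe.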

Visualize

ggplot(data = gapminder_2007,
       mapping = aes(x = continent,
                     y = lifeExp)) +
  geom_boxplot()

Module 02c: Make a plot

Task 1: Import

The required packages for this homework exercise have already been added.

  1. Run the code chunk with the label ‘load-packages’ to load the required packages. Tip: Click on the green play button in the top right corner of the code chunk.
library(gapminder)
library(ggplot2)
library(dplyr)

Task 2: Transform data for 2007

Below is a typical task description as you will find them in the homework assignments. For “Fill in the gaps” tasks, you should replace the underscores ___ with the described code and then change the value of the code block option from false to true. In other tasks, you will create your own code from scratch. Over time, the task descriptions will become less detailed.
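For orientation, a “Fill in the gaps” chunk typically looks something like this before completion (an illustrative sketch; the exact labels and gaps vary by assignment):

```r
#| eval: false
# change `false` to `true` above once the gaps are filled

gapminder_2007 <- ___ |>
  filter(___)
```

The `#| eval: false` line is a Quarto code chunk option: it keeps the incomplete chunk from running when the document is rendered.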

Fill in the gaps

  1. A code chunk has already been created below.

  2. Start with the gapminder object and add the pipe operator at the end of the line.

  3. On a new line use the filter() function to narrow down the data to observations from the year 2007.

  4. Use the assignment operator to assign the data to an object named gapminder_2007.

  5. Run the code contained in the code chunk and fix any errors.

  6. Next to the code chunk option #| eval: change the value from false to true.

  7. Render the document and fix any errors.

gapminder_2007 <- gapminder |> 
  filter(year == 2007) 

Task 3: Create a boxplot

This is a typical task without any starter code.

  1. Add a new code chunk below point 5.

  2. Use the ggplot() function and the gapminder_2007 object to create a boxplot with the following aesthetic mappings:

  • continent to the x-axis;
  • life expectancy to the y-axis;
  • continent to color using the fill = continent argument inside aes()

  3. Run the code contained in the code chunk and fix any errors.

  4. What are the data types of the three variables used for aesthetic mappings?

ggplot(data = gapminder_2007,
       mapping = aes(x = continent,
                     y = lifeExp,
                     fill = continent)) +
  geom_boxplot()

Assignment 02: Data Visualisation

Task 1: Import

The required packages for this homework exercise have already been added.

  1. Run the code chunk with the label ‘load-packages’ to load the required packages. Tip: Click on the green play button in the top right corner of the code chunk.
library(gapminder)
library(ggplot2)
library(dplyr)
library(readr)
library(sf)
library(rnaturalearth)

Task 2: Transform data for 2007

Fill in the gaps

  1. A code chunk has already been created below.

  2. Start with the gapminder object and add the pipe operator at the end of the line.

  3. On a new line use the filter() function to narrow down the data to observations from the year 2007.

  4. Use the assignment operator to assign the data to an object named gapminder_2007.

  5. Run the code contained in the code chunk and fix any errors.

  6. Next to the code chunk option #| eval: change the value from false to true.

  7. Render the document and fix any errors.

gapminder_2007 <- gapminder |> 
  filter(year == 2007) 

Task 3: Summarize data for life expectancy by continent

Fill in the gaps

  1. A code chunk has already been created.

  2. Start with the gapminder_2007 object and add the pipe operator at the end of the line.

  3. On a new line use the group_by() function to group the operations that follow by continent. Add the pipe operator at the end of the line.

  4. On a new line use the summarise() function to calculate the number of observations (count) and median life expectancy.

  5. Use the assignment operator to assign the data to an object named gapminder_summary_2007.

  6. Run the code contained in the code chunk and fix any errors.

  7. Next to the code chunk option #| eval: change the value from false to true.

  8. Render the document and fix any errors.

gapminder_summary_2007 <- gapminder_2007 |> 
  group_by(continent) |> 
  summarise(
    count = n(),
    lifeExp = median(lifeExp)
  )

Task 4: Summarize data for life expectancy by continent and year

Fill in the gaps

  1. A code chunk has already been created.

  2. Start with the gapminder object and add the pipe operator at the end of the line.

  3. On a new line use the group_by() function to group the operations that follow by continent and year. Add the pipe operator at the end of the line.

  4. On a new line use the summarise() function to calculate the median life expectancy.

  5. Use the assignment operator to assign the data to an object named gapminder_summary_continent_year

  6. Run the code contained in the code chunk and fix any errors.

  7. Next to the code chunk option #| eval: change the value from false to true.

  8. Render the document and fix any errors.

gapminder_summary_continent_year <- gapminder |> 
  group_by(continent, year) |> 
  summarise(lifeExp = median(lifeExp)) 

Task 5: Data visualization

Thank you for working through the previous tasks. We are convinced that you have done a great job, but because the task descriptions can be ambiguous, we have imported the data that we would have expected to be created and stored in the objects gapminder_2007, gapminder_summary_2007 and gapminder_summary_continent_year in the previous code chunks. This ensures that you can work through the following tasks.

  1. Run the code contained in the code chunk below to import the data.
gapminder_2007 <- read_rds(here::here("/cloud/project/data/gapminder-2007.rds"))

gapminder_summary_2007 <- read_rds(here::here("/cloud/project/data/gapminder-summary-2007.rds"))

gapminder_summary_continent_year <- read_rds(here::here("/cloud/project/data/gapminder-summary-continent-year.rds"))

Task 6: Create a boxplot

  1. A code chunk has already been created.

  2. Use the ggplot() function and the gapminder_2007 object to create a boxplot with the following aesthetic mappings:

  • continent to the x-axis;
  • life expectancy to the y-axis;
  • continent to color using the fill = continent argument inside aes()

  3. Do not display (ignore) the outliers in the plot. Note: Use a search engine or an AI tool to find the solution and add the link to the solution you have found.

  4. Run the code contained in the code chunk and fix any errors.

  5. What are the data types of the three variables used for aesthetic mappings?

ggplot(data = gapminder_2007,
       mapping = aes(x = continent,
                     y = lifeExp,
                     fill = continent)) +
  geom_boxplot(outlier.shape = NA) 

Task 7: Create a timeseries plot

  1. A code chunk has already been created.

  2. Use the ggplot() function and the gapminder_summary_continent_year object to create a connected scatterplot (also called a timeseries plot) using the geom_line() and geom_point() functions with the following aesthetic mappings:

  • year to the x-axis;
  • life expectancy to the y-axis;
  • continent to color using the color = continent argument inside aes()

  3. Run the code contained in the code chunk and fix any errors.
ggplot(data = gapminder_summary_continent_year,
       mapping = aes(x = year,
                     y = lifeExp,
                     color = continent)) +
  geom_line() +
  geom_point() 

Task 8: Create a barplot

with geom_col()

  1. A code chunk has already been created.

  2. Use the ggplot() function and the gapminder_summary_2007 object to create a barplot using the geom_col() function with the following aesthetic mappings:

  • continent to the x-axis;
  • count to the y-axis;

  3. Run the code contained in the code chunk and fix any errors.
ggplot(data = gapminder_summary_2007,
       mapping = aes(x = continent,
                     y = count)) +
  geom_col()

with geom_bar()

  1. A code chunk has already been created.

  2. Use the ggplot() function and the gapminder_2007 object to create a barplot using the geom_bar() function with the following aesthetic mapping:

  • continent to the x-axis;

  3. Run the code contained in the code chunk and fix any errors.

  4. The plot is identical to the plot created with geom_col(). Why? What does the geom_bar() function do? Write your text here:

ggplot(data = gapminder_2007,
       mapping = aes(x = continent)) +
  geom_bar()

Task 9: Create a histogram

  1. A code chunk has already been created.

  2. Use the ggplot() function and the gapminder_2007 object to create a histogram using the geom_histogram() function with the following aesthetic mappings:

  • life expectancy to the x-axis;
  • continent to color using the fill = continent argument inside aes()

  3. Run the code contained in the code chunk and fix any errors.

  4. Inside the geom_histogram() function, add the following arguments and values:

  • col = "grey30"
  • breaks = seq(40, 85, 2.5)

  5. Run the code contained in the code chunk and fix any errors.

  6. Describe how the geom_histogram() function is similar to the geom_bar() function.

  7. What happens when you add the ‘breaks’ argument? Play around with the numbers inside of seq() to see what changes. Describe here what you observe:

ggplot(data = gapminder_2007,
       mapping = aes(x = lifeExp, 
                     fill = continent)) +
  geom_histogram(col = "grey30", breaks = seq(40, 85, 2.5)) 
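The breaks argument hands geom_histogram() an explicit vector of bin edges, and seq() is what builds that vector. A quick base-R check of what seq(40, 85, 2.5) produces:

```r
# seq(from, to, by) generates the bin edges used by the histogram:
# 40.0, 42.5, 45.0, ..., 85.0
breaks <- seq(40, 85, 2.5)

length(breaks)        # 19 edges, i.e. 18 bins of width 2.5
diff(range(breaks))   # total span covered: 45
```

Changing the `by` value (here 2.5) changes the bin width, which is why the histogram's shape changes when you play with the numbers inside seq().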

Task 10: Scatterplot and faceting

  1. A code chunk has already been created.

  2. Use the ggplot() function and the gapminder_2007 object to create a scatterplot using the geom_point() function with the following aesthetic mappings:

  • gdpPercap to the x-axis;
  • lifeExp to the y-axis;
  • population to the size argument;
  • country to color using the color = country argument inside aes()

  3. Run the code contained in the code chunk and fix any errors.

  4. Use the variable continent to facet the plot by adding: facet_wrap(~continent).

  5. Run the code contained in the code chunk and fix any errors.

ggplot(data = gapminder_2007,
       mapping = aes(x = gdpPercap,
                     y = lifeExp,
                     size = pop,
                     color = country)) +
  geom_point(show.legend = FALSE) +
  facet_wrap(~continent) 

Task 11: Create a lineplot and use facets

  1. A code chunk with complete code has already been prepared.

  2. Run the code contained in the code chunk and fix any errors.

  3. Remove the ‘#’ sign on the line that starts with the scale_color_manual() function.

  4. What is stored in the country_colors object? Find out by executing the object in the Console (type it to the Console and hit enter). Do the same again, but with a question mark ?country_colors.

  5. Next to the code chunk option #| eval: change the value from false to true.

  6. Render the document and fix any errors.

ggplot(data = gapminder,
       mapping = aes(x = year, 
                     y = lifeExp, 
                     group = country, 
                     color = country)) +
  geom_line(lwd = 1, show.legend = FALSE) + 
  facet_wrap(~continent) +
  # scale_color_manual(values = country_colors) +
  theme_minimal() 

Task 12: Create a choropleth map

You can also prepare maps with ggplot2. It’s beyond the scope of the class to teach you the foundations of spatial data in R, but a popular package to work with spatial data is the sf (Simple Features) R Package. The rnaturalearth R Package facilitates world mapping by making Natural Earth map data more easily available to R users.

The code chunk below contains code for a world map that shows countries by income group. To view the map, do the following:

  1. Run the code contained in the code chunk and fix any errors.

  2. Next to the code chunk option #| eval: change the value from false to true.

  3. Render the document and fix any errors.

world <- ne_countries(scale = "small", returnclass = "sf")

world |> 
  mutate(income_grp = factor(income_grp, ordered = TRUE)) |> 
  ggplot(aes(fill = income_grp)) + 
  geom_sf() +
  theme_void() +
  theme(legend.position = "top") +
  labs(fill = "Income Group:") +
  guides(fill = guide_legend(nrow = 2, byrow = TRUE))
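The factor(..., ordered = TRUE) step turns the income groups into an ordered factor, so the levels carry a ranking and the legend sorts accordingly. A minimal base-R illustration with made-up levels:

```r
# An ordered factor records a ranking among its levels,
# so comparison operators work on its values.
income <- factor(c("low", "high", "mid"),
                 levels = c("low", "mid", "high"),
                 ordered = TRUE)

income[1] < income[2]   # low < high: TRUE
is.ordered(income)      # TRUE
```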

The code for this code chunk is adapted from: https://bookdown.org/alhdzsz/data_viz_ir/maps.html

Working with spatial data in R

If you are interested in working with spatial data in R, then we recommend the following resources for further study:

  • Geocomputation with R - Book: https://geocompr.robinlovelace.net/
  • Simple Features for R - Article: https://r-spatial.github.io/sf/articles/sf1.html
  • tmap: thematic maps in R - R Package: https://r-tmap.github.io/tmap/

Module 03a: Data transformation with dplyr

library(readr)
library(dplyr)

Import

In this exercise we use data of the UNICEF/WHO Joint Monitoring Programme (JMP) for Water Supply, Sanitation and Hygiene (WASH). The data is available at https://washdata.org/data and published as an R data package at https://github.com/WASHNote/jmpwashdata/.

The data set is available in the data folder as a CSV file named jmp_wld_sanitation_long.csv.

The data set contains the following variables:

  • name: country name
  • iso3: ISO3 country code
  • year: year of observation
  • region_sdg: SDG region
  • residence: residence type (national, rural, urban)
  • varname_short: short variable name (JMP naming convention)
  • varname_long: long variable name (JMP naming convention)
  • percent: percent of the population for the given variable and residence type

We use the read_csv() function to import the data set into R.

sanitation <- read_csv("/cloud/project/data/jmp_wld_sanitation_long.csv")

Explore

sanitation
# A tibble: 73,710 × 8
   name      iso3   year region_sdg varname_short varname_long residence percent
   <chr>     <chr> <dbl> <chr>      <chr>         <chr>        <chr>       <dbl>
 1 Afghanis… AFG    2000 Central a… san_bas       basic sanit… national    21.9 
 2 Afghanis… AFG    2000 Central a… san_bas       basic sanit… rural       19.3 
 3 Afghanis… AFG    2000 Central a… san_bas       basic sanit… urban       30.9 
 4 Afghanis… AFG    2000 Central a… san_lim       limited san… national     5.65
 5 Afghanis… AFG    2000 Central a… san_lim       limited san… rural        3.14
 6 Afghanis… AFG    2000 Central a… san_lim       limited san… urban       14.5 
 7 Afghanis… AFG    2000 Central a… san_unimp     unimproved … national    46.7 
 8 Afghanis… AFG    2000 Central a… san_unimp     unimproved … rural       46.3 
 9 Afghanis… AFG    2000 Central a… san_unimp     unimproved … urban       48.1 
10 Afghanis… AFG    2000 Central a… san_od        no sanitati… national    25.8 
# ℹ 73,700 more rows
glimpse(sanitation)
Rows: 73,710
Columns: 8
$ name          <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanista…
$ iso3          <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", …
$ year          <dbl> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 20…
$ region_sdg    <chr> "Central and Southern Asia", "Central and Southern Asia"…
$ varname_short <chr> "san_bas", "san_bas", "san_bas", "san_lim", "san_lim", "…
$ varname_long  <chr> "basic sanitation services", "basic sanitation services"…
$ residence     <chr> "national", "rural", "urban", "national", "rural", "urba…
$ percent       <dbl> 21.870802, 19.322798, 30.863719, 5.648528, 3.136148, 14.…

Transform with dplyr

The dplyr R Package aims to provide a function for each basic verb of data manipulation. These verbs can be organised into three categories based on the component of the dataset that they work with:

  • Rows
  • Columns
  • Groups of rows

filter()

The function filter() chooses rows based on column values. To use filtering effectively, you have to know how to select the observations that you want using the comparison operators.

R provides the standard suite: >, >=, <, <=, != (not equal), and == (equal).
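Comparison operators return logical vectors, and filter() keeps the rows where the result is TRUE. A quick base-R check on a small vector (illustrative values):

```r
# Each comparison is applied element-wise and returns a logical vector.
percent <- c(21.9, 5.65, 46.7)

percent > 20      # TRUE FALSE TRUE
percent <= 5.65   # FALSE TRUE FALSE
percent == 46.7   # FALSE FALSE TRUE
```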

sanitation |> 
  filter(residence == "national")
# A tibble: 24,570 × 8
   name      iso3   year region_sdg varname_short varname_long residence percent
   <chr>     <chr> <dbl> <chr>      <chr>         <chr>        <chr>       <dbl>
 1 Afghanis… AFG    2000 Central a… san_bas       basic sanit… national    21.9 
 2 Afghanis… AFG    2000 Central a… san_lim       limited san… national     5.65
 3 Afghanis… AFG    2000 Central a… san_unimp     unimproved … national    46.7 
 4 Afghanis… AFG    2000 Central a… san_od        no sanitati… national    25.8 
 5 Afghanis… AFG    2000 Central a… san_sm        safely mana… national    NA   
 6 Afghanis… AFG    2001 Central a… san_bas       basic sanit… national    21.9 
 7 Afghanis… AFG    2001 Central a… san_lim       limited san… national     5.66
 8 Afghanis… AFG    2001 Central a… san_unimp     unimproved … national    46.7 
 9 Afghanis… AFG    2001 Central a… san_od        no sanitati… national    25.8 
10 Afghanis… AFG    2001 Central a… san_sm        safely mana… national    NA   
# ℹ 24,560 more rows
sanitation |> 
  filter(residence != "national")
# A tibble: 49,140 × 8
   name      iso3   year region_sdg varname_short varname_long residence percent
   <chr>     <chr> <dbl> <chr>      <chr>         <chr>        <chr>       <dbl>
 1 Afghanis… AFG    2000 Central a… san_bas       basic sanit… rural       19.3 
 2 Afghanis… AFG    2000 Central a… san_bas       basic sanit… urban       30.9 
 3 Afghanis… AFG    2000 Central a… san_lim       limited san… rural        3.14
 4 Afghanis… AFG    2000 Central a… san_lim       limited san… urban       14.5 
 5 Afghanis… AFG    2000 Central a… san_unimp     unimproved … rural       46.3 
 6 Afghanis… AFG    2000 Central a… san_unimp     unimproved … urban       48.1 
 7 Afghanis… AFG    2000 Central a… san_od        no sanitati… rural       31.3 
 8 Afghanis… AFG    2000 Central a… san_od        no sanitati… urban        6.51
 9 Afghanis… AFG    2000 Central a… san_sm        safely mana… rural       NA   
10 Afghanis… AFG    2000 Central a… san_sm        safely mana… urban       NA   
# ℹ 49,130 more rows
sanitation |> 
  filter(residence == "national", iso3 == "SEN") 
# A tibble: 105 × 8
   name    iso3   year region_sdg   varname_short varname_long residence percent
   <chr>   <chr> <dbl> <chr>        <chr>         <chr>        <chr>       <dbl>
 1 Senegal SEN    2000 Sub-Saharan… san_bas       basic sanit… national     37.5
 2 Senegal SEN    2000 Sub-Saharan… san_lim       limited san… national     10.8
 3 Senegal SEN    2000 Sub-Saharan… san_unimp     unimproved … national     27.4
 4 Senegal SEN    2000 Sub-Saharan… san_od        no sanitati… national     24.4
 5 Senegal SEN    2000 Sub-Saharan… san_sm        safely mana… national     14.0
 6 Senegal SEN    2001 Sub-Saharan… san_bas       basic sanit… national     38.4
 7 Senegal SEN    2001 Sub-Saharan… san_lim       limited san… national     11.0
 8 Senegal SEN    2001 Sub-Saharan… san_unimp     unimproved … national     26.8
 9 Senegal SEN    2001 Sub-Saharan… san_od        no sanitati… national     23.7
10 Senegal SEN    2001 Sub-Saharan… san_sm        safely mana… national     14.3
# ℹ 95 more rows
sanitation |> 
  filter(iso3 == "UGA" | iso3 == "PER" | iso3 == "IND") 
# A tibble: 945 × 8
   name  iso3   year region_sdg     varname_short varname_long residence percent
   <chr> <chr> <dbl> <chr>          <chr>         <chr>        <chr>       <dbl>
 1 India IND    2000 Central and S… san_bas       basic sanit… national   15.0  
 2 India IND    2000 Central and S… san_bas       basic sanit… rural       2.25 
 3 India IND    2000 Central and S… san_bas       basic sanit… urban      48.4  
 4 India IND    2000 Central and S… san_lim       limited san… national    5.15 
 5 India IND    2000 Central and S… san_lim       limited san… rural       0.515
 6 India IND    2000 Central and S… san_lim       limited san… urban      17.3  
 7 India IND    2000 Central and S… san_unimp     unimproved … national    5.73 
 8 India IND    2000 Central and S… san_unimp     unimproved … rural       5.00 
 9 India IND    2000 Central and S… san_unimp     unimproved … urban       7.65 
10 India IND    2000 Central and S… san_od        no sanitati… national   74.1  
# ℹ 935 more rows
sanitation |> 
  filter(iso3 %in% c("UGA", "PER", "IND"))
# A tibble: 945 × 8
   name  iso3   year region_sdg     varname_short varname_long residence percent
   <chr> <chr> <dbl> <chr>          <chr>         <chr>        <chr>       <dbl>
 1 India IND    2000 Central and S… san_bas       basic sanit… national   15.0  
 2 India IND    2000 Central and S… san_bas       basic sanit… rural       2.25 
 3 India IND    2000 Central and S… san_bas       basic sanit… urban      48.4  
 4 India IND    2000 Central and S… san_lim       limited san… national    5.15 
 5 India IND    2000 Central and S… san_lim       limited san… rural       0.515
 6 India IND    2000 Central and S… san_lim       limited san… urban      17.3  
 7 India IND    2000 Central and S… san_unimp     unimproved … national    5.73 
 8 India IND    2000 Central and S… san_unimp     unimproved … rural       5.00 
 9 India IND    2000 Central and S… san_unimp     unimproved … urban       7.65 
10 India IND    2000 Central and S… san_od        no sanitati… national   74.1  
# ℹ 935 more rows
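The two filters above return the same rows: x %in% set is a compact way of writing a chain of == comparisons joined with |. A base-R check:

```r
# %in% tests membership in a set, element-wise.
iso3 <- c("UGA", "PER", "IND", "SEN", "AFG")

with_in <- iso3 %in% c("UGA", "PER", "IND")
with_or <- iso3 == "UGA" | iso3 == "PER" | iso3 == "IND"

identical(with_in, with_or)   # TRUE
```

%in% scales better: adding a fourth country means one more element in the set rather than another | clause.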
sanitation |> 
  filter(percent > 80)
# A tibble: 8,314 × 8
   name    iso3   year region_sdg   varname_short varname_long residence percent
   <chr>   <chr> <dbl> <chr>        <chr>         <chr>        <chr>       <dbl>
 1 Albania ALB    2000 Northern Am… san_bas       basic sanit… national     89.5
 2 Albania ALB    2000 Northern Am… san_bas       basic sanit… rural        84.2
 3 Albania ALB    2000 Northern Am… san_bas       basic sanit… urban        96.9
 4 Albania ALB    2001 Northern Am… san_bas       basic sanit… national     90.0
 5 Albania ALB    2001 Northern Am… san_bas       basic sanit… rural        84.9
 6 Albania ALB    2001 Northern Am… san_bas       basic sanit… urban        97.0
 7 Albania ALB    2002 Northern Am… san_bas       basic sanit… national     90.6
 8 Albania ALB    2002 Northern Am… san_bas       basic sanit… rural        85.7
 9 Albania ALB    2002 Northern Am… san_bas       basic sanit… urban        97.0
10 Albania ALB    2003 Northern Am… san_bas       basic sanit… national     91.2
# ℹ 8,304 more rows
sanitation |> 
  filter(percent <= 5)
# A tibble: 21,424 × 8
   name      iso3   year region_sdg varname_short varname_long residence percent
   <chr>     <chr> <dbl> <chr>      <chr>         <chr>        <chr>       <dbl>
 1 Afghanis… AFG    2000 Central a… san_lim       limited san… rural        3.14
 2 Afghanis… AFG    2001 Central a… san_lim       limited san… rural        3.14
 3 Afghanis… AFG    2002 Central a… san_lim       limited san… rural        3.35
 4 Afghanis… AFG    2003 Central a… san_lim       limited san… rural        3.57
 5 Afghanis… AFG    2004 Central a… san_lim       limited san… rural        3.79
 6 Afghanis… AFG    2005 Central a… san_lim       limited san… rural        4.01
 7 Afghanis… AFG    2005 Central a… san_od        no sanitati… urban        4.86
 8 Afghanis… AFG    2006 Central a… san_lim       limited san… rural        4.22
 9 Afghanis… AFG    2006 Central a… san_od        no sanitati… urban        4.45
10 Afghanis… AFG    2007 Central a… san_lim       limited san… rural        4.44
# ℹ 21,414 more rows
sanitation |> 
  filter(is.na(percent))
# A tibble: 19,743 × 8
   name      iso3   year region_sdg varname_short varname_long residence percent
   <chr>     <chr> <dbl> <chr>      <chr>         <chr>        <chr>       <dbl>
 1 Afghanis… AFG    2000 Central a… san_sm        safely mana… national       NA
 2 Afghanis… AFG    2000 Central a… san_sm        safely mana… rural          NA
 3 Afghanis… AFG    2000 Central a… san_sm        safely mana… urban          NA
 4 Afghanis… AFG    2001 Central a… san_sm        safely mana… national       NA
 5 Afghanis… AFG    2001 Central a… san_sm        safely mana… rural          NA
 6 Afghanis… AFG    2001 Central a… san_sm        safely mana… urban          NA
 7 Afghanis… AFG    2002 Central a… san_sm        safely mana… national       NA
 8 Afghanis… AFG    2002 Central a… san_sm        safely mana… rural          NA
 9 Afghanis… AFG    2002 Central a… san_sm        safely mana… urban          NA
10 Afghanis… AFG    2003 Central a… san_sm        safely mana… national       NA
# ℹ 19,733 more rows
sanitation |> 
  filter(!is.na(percent))
# A tibble: 53,967 × 8
   name      iso3   year region_sdg varname_short varname_long residence percent
   <chr>     <chr> <dbl> <chr>      <chr>         <chr>        <chr>       <dbl>
 1 Afghanis… AFG    2000 Central a… san_bas       basic sanit… national    21.9 
 2 Afghanis… AFG    2000 Central a… san_bas       basic sanit… rural       19.3 
 3 Afghanis… AFG    2000 Central a… san_bas       basic sanit… urban       30.9 
 4 Afghanis… AFG    2000 Central a… san_lim       limited san… national     5.65
 5 Afghanis… AFG    2000 Central a… san_lim       limited san… rural        3.14
 6 Afghanis… AFG    2000 Central a… san_lim       limited san… urban       14.5 
 7 Afghanis… AFG    2000 Central a… san_unimp     unimproved … national    46.7 
 8 Afghanis… AFG    2000 Central a… san_unimp     unimproved … rural       46.3 
 9 Afghanis… AFG    2000 Central a… san_unimp     unimproved … urban       48.1 
10 Afghanis… AFG    2000 Central a… san_od        no sanitati… national    25.8 
# ℹ 53,957 more rows
  • Keyboard shortcut for vertical bar | (OR) in US/CH is: Shift + / (Windows) and Option + / (Mac)
  • Keyboard shortcut for vertical bar | (OR) in UK: It’s complicated
  • Keyboard shortcut for pipe operator: Ctrl / Cmd + Shift + M
  • Keyboard shortcut for assignment operator: Alt + -
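
As a reminder of what the pipe shortcut inserts: `x |> f(y)` is evaluated as `f(x, y)`, so a pipeline reads top to bottom. A minimal sketch:

```r
# The native pipe passes the left-hand side as the first argument, so this:
sanitation |>
  filter(residence == "national")

# is equivalent to:
filter(sanitation, residence == "national")
```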

Storing a resulting data frame as a new object

sanitation_national_2020_sm <- sanitation |> 
  filter(residence == "national",
         year == 2020,
         varname_short == "san_sm")

arrange()

The function arrange() changes the order of the rows.

sanitation_national_2020_sm |> 
  arrange(percent)
# A tibble: 234 × 8
   name      iso3   year region_sdg varname_short varname_long residence percent
   <chr>     <chr> <dbl> <chr>      <chr>         <chr>        <chr>       <dbl>
 1 Ethiopia  ETH    2020 Sub-Sahar… san_sm        safely mana… national     6.68
 2 Togo      TGO    2020 Sub-Sahar… san_sm        safely mana… national     9.13
 3 Chad      TCD    2020 Sub-Sahar… san_sm        safely mana… national    10.1 
 4 Madagasc… MDG    2020 Sub-Sahar… san_sm        safely mana… national    10.4 
 5 Guinea-B… GNB    2020 Sub-Sahar… san_sm        safely mana… national    12.2 
 6 North Ma… MKD    2020 Northern … san_sm        safely mana… national    12.2 
 7 Democrat… COD    2020 Sub-Sahar… san_sm        safely mana… national    12.7 
 8 Ghana     GHA    2020 Sub-Sahar… san_sm        safely mana… national    13.3 
 9 Central … CAF    2020 Sub-Sahar… san_sm        safely mana… national    13.6 
10 Sierra L… SLE    2020 Sub-Sahar… san_sm        safely mana… national    14.0 
# ℹ 224 more rows
sanitation_national_2020_sm |> 
  arrange(desc(percent))
# A tibble: 234 × 8
   name      iso3   year region_sdg varname_short varname_long residence percent
   <chr>     <chr> <dbl> <chr>      <chr>         <chr>        <chr>       <dbl>
 1 Andorra   AND    2020 Northern … san_sm        safely mana… national    100. 
 2 Kuwait    KWT    2020 Western A… san_sm        safely mana… national    100  
 3 Monaco    MCO    2020 Northern … san_sm        safely mana… national    100  
 4 Singapore SGP    2020 Eastern a… san_sm        safely mana… national    100  
 5 Republic… KOR    2020 Eastern a… san_sm        safely mana… national     99.9
 6 Switzerl… CHE    2020 Northern … san_sm        safely mana… national     99.7
 7 Austria   AUT    2020 Northern … san_sm        safely mana… national     99.6
 8 United A… ARE    2020 Western A… san_sm        safely mana… national     99.2
 9 Liechten… LIE    2020 Northern … san_sm        safely mana… national     98.8
10 United S… USA    2020 Northern … san_sm        safely mana… national     98.3
# ℹ 224 more rows
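
arrange() also accepts several columns: rows are sorted by the first column, and later columns break ties. A small sketch (not evaluated here):

```r
# Sort by SDG region first, then by descending percent within each region
sanitation_national_2020_sm |>
  arrange(region_sdg, desc(percent))
```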

select()

The select() function chooses columns based on their names.

sanitation_national_2020_sm |> 
  select(name, percent)
# A tibble: 234 × 2
   name                percent
   <chr>                 <dbl>
 1 Afghanistan            NA  
 2 Albania                47.7
 3 Algeria                17.6
 4 American Samoa         NA  
 5 Andorra               100. 
 6 Angola                 NA  
 7 Anguilla               NA  
 8 Antigua and Barbuda    NA  
 9 Argentina              NA  
10 Armenia                69.3
# ℹ 224 more rows
sanitation_national_2020_sm |> 
  select(-varname_short)
# A tibble: 234 × 7
   name                iso3   year region_sdg     varname_long residence percent
   <chr>               <chr> <dbl> <chr>          <chr>        <chr>       <dbl>
 1 Afghanistan         AFG    2020 Central and S… safely mana… national     NA  
 2 Albania             ALB    2020 Northern Amer… safely mana… national     47.7
 3 Algeria             DZA    2020 Western Asia … safely mana… national     17.6
 4 American Samoa      ASM    2020 Oceania        safely mana… national     NA  
 5 Andorra             AND    2020 Northern Amer… safely mana… national    100. 
 6 Angola              AGO    2020 Sub-Saharan A… safely mana… national     NA  
 7 Anguilla            AIA    2020 Latin America… safely mana… national     NA  
 8 Antigua and Barbuda ATG    2020 Latin America… safely mana… national     NA  
 9 Argentina           ARG    2020 Latin America… safely mana… national     NA  
10 Armenia             ARM    2020 Western Asia … safely mana… national     69.3
# ℹ 224 more rows
sanitation_national_2020_sm |> 
  select(name:region_sdg, percent)
# A tibble: 234 × 5
   name                iso3   year region_sdg                       percent
   <chr>               <chr> <dbl> <chr>                              <dbl>
 1 Afghanistan         AFG    2020 Central and Southern Asia           NA  
 2 Albania             ALB    2020 Northern America and Europe         47.7
 3 Algeria             DZA    2020 Western Asia and Northern Africa    17.6
 4 American Samoa      ASM    2020 Oceania                             NA  
 5 Andorra             AND    2020 Northern America and Europe        100. 
 6 Angola              AGO    2020 Sub-Saharan Africa                  NA  
 7 Anguilla            AIA    2020 Latin America and the Caribbean     NA  
 8 Antigua and Barbuda ATG    2020 Latin America and the Caribbean     NA  
 9 Argentina           ARG    2020 Latin America and the Caribbean     NA  
10 Armenia             ARM    2020 Western Asia and Northern Africa    69.3
# ℹ 224 more rows
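
Besides individual names and `name:region_sdg`-style ranges, select() works with tidyselect helpers such as starts_with() and contains(). A sketch:

```r
# Keep the country name plus every column whose name starts with "varname"
sanitation |>
  select(name, starts_with("varname"))
```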

rename()

The rename() function changes the names of individual variables, using the syntax new_name = old_name.

sanitation |> 
  rename(country = name)
# A tibble: 73,710 × 8
   country   iso3   year region_sdg varname_short varname_long residence percent
   <chr>     <chr> <dbl> <chr>      <chr>         <chr>        <chr>       <dbl>
 1 Afghanis… AFG    2000 Central a… san_bas       basic sanit… national    21.9 
 2 Afghanis… AFG    2000 Central a… san_bas       basic sanit… rural       19.3 
 3 Afghanis… AFG    2000 Central a… san_bas       basic sanit… urban       30.9 
 4 Afghanis… AFG    2000 Central a… san_lim       limited san… national     5.65
 5 Afghanis… AFG    2000 Central a… san_lim       limited san… rural        3.14
 6 Afghanis… AFG    2000 Central a… san_lim       limited san… urban       14.5 
 7 Afghanis… AFG    2000 Central a… san_unimp     unimproved … national    46.7 
 8 Afghanis… AFG    2000 Central a… san_unimp     unimproved … rural       46.3 
 9 Afghanis… AFG    2000 Central a… san_unimp     unimproved … urban       48.1 
10 Afghanis… AFG    2000 Central a… san_od        no sanitati… national    25.8 
# ℹ 73,700 more rows

mutate()

The mutate() function adds new variables, computed from existing variables or supplied from external data.

sanitation |> 
  mutate(prop = percent / 100)
# A tibble: 73,710 × 9
   name      iso3   year region_sdg varname_short varname_long residence percent
   <chr>     <chr> <dbl> <chr>      <chr>         <chr>        <chr>       <dbl>
 1 Afghanis… AFG    2000 Central a… san_bas       basic sanit… national    21.9 
 2 Afghanis… AFG    2000 Central a… san_bas       basic sanit… rural       19.3 
 3 Afghanis… AFG    2000 Central a… san_bas       basic sanit… urban       30.9 
 4 Afghanis… AFG    2000 Central a… san_lim       limited san… national     5.65
 5 Afghanis… AFG    2000 Central a… san_lim       limited san… rural        3.14
 6 Afghanis… AFG    2000 Central a… san_lim       limited san… urban       14.5 
 7 Afghanis… AFG    2000 Central a… san_unimp     unimproved … national    46.7 
 8 Afghanis… AFG    2000 Central a… san_unimp     unimproved … rural       46.3 
 9 Afghanis… AFG    2000 Central a… san_unimp     unimproved … urban       48.1 
10 Afghanis… AFG    2000 Central a… san_od        no sanitati… national    25.8 
# ℹ 73,700 more rows
# ℹ 1 more variable: prop <dbl>
sanitation |> 
  mutate(id = 1:n())
# A tibble: 73,710 × 9
   name      iso3   year region_sdg varname_short varname_long residence percent
   <chr>     <chr> <dbl> <chr>      <chr>         <chr>        <chr>       <dbl>
 1 Afghanis… AFG    2000 Central a… san_bas       basic sanit… national    21.9 
 2 Afghanis… AFG    2000 Central a… san_bas       basic sanit… rural       19.3 
 3 Afghanis… AFG    2000 Central a… san_bas       basic sanit… urban       30.9 
 4 Afghanis… AFG    2000 Central a… san_lim       limited san… national     5.65
 5 Afghanis… AFG    2000 Central a… san_lim       limited san… rural        3.14
 6 Afghanis… AFG    2000 Central a… san_lim       limited san… urban       14.5 
 7 Afghanis… AFG    2000 Central a… san_unimp     unimproved … national    46.7 
 8 Afghanis… AFG    2000 Central a… san_unimp     unimproved … rural       46.3 
 9 Afghanis… AFG    2000 Central a… san_unimp     unimproved … urban       48.1 
10 Afghanis… AFG    2000 Central a… san_od        no sanitati… national    25.8 
# ℹ 73,700 more rows
# ℹ 1 more variable: id <int>

relocate()

The relocate() function changes the positions of columns. By default, it moves the selected columns to the front; the .before and .after arguments place them elsewhere.

sanitation |> 
  mutate(id = 1:n()) |> 
  relocate(id)
# A tibble: 73,710 × 9
      id name        iso3   year region_sdg varname_short varname_long residence
   <int> <chr>       <chr> <dbl> <chr>      <chr>         <chr>        <chr>    
 1     1 Afghanistan AFG    2000 Central a… san_bas       basic sanit… national 
 2     2 Afghanistan AFG    2000 Central a… san_bas       basic sanit… rural    
 3     3 Afghanistan AFG    2000 Central a… san_bas       basic sanit… urban    
 4     4 Afghanistan AFG    2000 Central a… san_lim       limited san… national 
 5     5 Afghanistan AFG    2000 Central a… san_lim       limited san… rural    
 6     6 Afghanistan AFG    2000 Central a… san_lim       limited san… urban    
 7     7 Afghanistan AFG    2000 Central a… san_unimp     unimproved … national 
 8     8 Afghanistan AFG    2000 Central a… san_unimp     unimproved … rural    
 9     9 Afghanistan AFG    2000 Central a… san_unimp     unimproved … urban    
10    10 Afghanistan AFG    2000 Central a… san_od        no sanitati… national 
# ℹ 73,700 more rows
# ℹ 1 more variable: percent <dbl>
sanitation |> 
  mutate(id = 1:n()) |> 
  relocate(id, .before = name)
# A tibble: 73,710 × 9
      id name        iso3   year region_sdg varname_short varname_long residence
   <int> <chr>       <chr> <dbl> <chr>      <chr>         <chr>        <chr>    
 1     1 Afghanistan AFG    2000 Central a… san_bas       basic sanit… national 
 2     2 Afghanistan AFG    2000 Central a… san_bas       basic sanit… rural    
 3     3 Afghanistan AFG    2000 Central a… san_bas       basic sanit… urban    
 4     4 Afghanistan AFG    2000 Central a… san_lim       limited san… national 
 5     5 Afghanistan AFG    2000 Central a… san_lim       limited san… rural    
 6     6 Afghanistan AFG    2000 Central a… san_lim       limited san… urban    
 7     7 Afghanistan AFG    2000 Central a… san_unimp     unimproved … national 
 8     8 Afghanistan AFG    2000 Central a… san_unimp     unimproved … rural    
 9     9 Afghanistan AFG    2000 Central a… san_unimp     unimproved … urban    
10    10 Afghanistan AFG    2000 Central a… san_od        no sanitati… national 
# ℹ 73,700 more rows
# ℹ 1 more variable: percent <dbl>

summarise()

The summarise() function reduces multiple values down to a single summary.

sanitation_national_2020_sm |> 
  summarise()
# A tibble: 1 × 0
sanitation_national_2020_sm |> 
  summarise(mean_percent = mean(percent))
# A tibble: 1 × 1
  mean_percent
         <dbl>
1           NA
sanitation_national_2020_sm |> 
  summarise(mean_percent = mean(percent, na.rm = TRUE))
# A tibble: 1 × 1
  mean_percent
         <dbl>
1         60.3
sanitation_national_2020_sm |> 
  summarise(n = n(),
            mean_percent = mean(percent, na.rm = TRUE))
# A tibble: 1 × 2
      n mean_percent
  <int>        <dbl>
1   234         60.3
sanitation_national_2020_sm |> 
  filter(!is.na(percent)) |> 
  summarise(n = n(),
            mean_percent = mean(percent),
            sd_percent = sd(percent))
# A tibble: 1 × 3
      n mean_percent sd_percent
  <int>        <dbl>      <dbl>
1   120         60.3       29.9

group_by()

The group_by() function is used to group the data by one or more variables.

sanitation_national_2020_sm |> 
  group_by(region_sdg)
# A tibble: 234 × 8
# Groups:   region_sdg [8]
   name      iso3   year region_sdg varname_short varname_long residence percent
   <chr>     <chr> <dbl> <chr>      <chr>         <chr>        <chr>       <dbl>
 1 Afghanis… AFG    2020 Central a… san_sm        safely mana… national     NA  
 2 Albania   ALB    2020 Northern … san_sm        safely mana… national     47.7
 3 Algeria   DZA    2020 Western A… san_sm        safely mana… national     17.6
 4 American… ASM    2020 Oceania    san_sm        safely mana… national     NA  
 5 Andorra   AND    2020 Northern … san_sm        safely mana… national    100. 
 6 Angola    AGO    2020 Sub-Sahar… san_sm        safely mana… national     NA  
 7 Anguilla  AIA    2020 Latin Ame… san_sm        safely mana… national     NA  
 8 Antigua … ATG    2020 Latin Ame… san_sm        safely mana… national     NA  
 9 Argentina ARG    2020 Latin Ame… san_sm        safely mana… national     NA  
10 Armenia   ARM    2020 Western A… san_sm        safely mana… national     69.3
# ℹ 224 more rows
sanitation_national_2020_sm |> 
  group_by(region_sdg) |> 
  summarise(n = n(),
            mean_percent = mean(percent),
            sd_percent = sd(percent))
# A tibble: 8 × 4
  region_sdg                           n mean_percent sd_percent
  <chr>                            <int>        <dbl>      <dbl>
1 Australia and New Zealand            2         78.2       5.61
2 Central and Southern Asia           14         NA        NA   
3 Eastern and South-Eastern Asia      18         NA        NA   
4 Latin America and the Caribbean     50         NA        NA   
5 Northern America and Europe         53         NA        NA   
6 Oceania                             21         NA        NA   
7 Sub-Saharan Africa                  51         NA        NA   
8 Western Asia and Northern Africa    25         NA        NA   
sanitation_national_2020_sm |> 
  filter(!is.na(percent)) |> 
  group_by(region_sdg) |> 
  summarise(n = n(),
            mean_percent = mean(percent),
            sd_percent = sd(percent))
# A tibble: 8 × 4
  region_sdg                           n mean_percent sd_percent
  <chr>                            <int>        <dbl>      <dbl>
1 Australia and New Zealand            2         78.2       5.61
2 Central and Southern Asia            5         58.2      21.5 
3 Eastern and South-Eastern Asia      11         69.8      21.4 
4 Latin America and the Caribbean     14         43.4      16.8 
5 Northern America and Europe         44         81.9      19.9 
6 Oceania                              3         36.1      10.7 
7 Sub-Saharan Africa                  21         21.4      10.9 
8 Western Asia and Northern Africa    20         62.7      29.5 
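
group_by() also accepts several variables at once, producing one summary row per combination of their values. A sketch (the .groups = "drop" argument returns an ungrouped result):

```r
# Mean coverage per region and residence type
sanitation |>
  group_by(region_sdg, residence) |>
  summarise(mean_percent = mean(percent, na.rm = TRUE),
            .groups = "drop")
```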

count()

The count() function is a convenient wrapper for group_by() and summarise(n = n()). You can prepare frequency tables with count().

sanitation |> 
  count(region_sdg)
# A tibble: 8 × 2
  region_sdg                           n
  <chr>                            <int>
1 Australia and New Zealand          630
2 Central and Southern Asia         4410
3 Eastern and South-Eastern Asia    5670
4 Latin America and the Caribbean  15750
5 Northern America and Europe      16695
6 Oceania                           6615
7 Sub-Saharan Africa               16065
8 Western Asia and Northern Africa  7875
sanitation |> 
  count(varname_short)
# A tibble: 5 × 2
  varname_short     n
  <chr>         <int>
1 san_bas       14742
2 san_lim       14742
3 san_od        14742
4 san_sm        14742
5 san_unimp     14742
sanitation |> 
  count(varname_long)
# A tibble: 5 × 2
  varname_long                           n
  <chr>                              <int>
1 basic sanitation services          14742
2 limited sanitation services        14742
3 no sanitation facilities           14742
4 safely managed sanitation services 14742
5 unimproved sanitation facilities   14742
sanitation |> 
  count(varname_short, varname_long)
# A tibble: 5 × 3
  varname_short varname_long                           n
  <chr>         <chr>                              <int>
1 san_bas       basic sanitation services          14742
2 san_lim       limited sanitation services        14742
3 san_od        no sanitation facilities           14742
4 san_sm        safely managed sanitation services 14742
5 san_unimp     unimproved sanitation facilities   14742
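
count() additionally takes a sort argument that orders the result by frequency, which is convenient for quick frequency tables:

```r
# Largest groups first
sanitation |>
  count(region_sdg, sort = TRUE)
```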

Module 03b: Filter function

library(readr)
library(dplyr)
library(ggplot2)
library(ggthemes)

Import

In this exercise we use data of the UNICEF/WHO Joint Monitoring Programme (JMP) for Water Supply, Sanitation and Hygiene (WASH). The data is available at https://washdata.org/data and published as an R data package at https://github.com/WASHNote/jmpwashdata/.

The data set jmp_wld_sanitation_long is available in the data folder of this repository. The data set is in long format and contains the following variables:

  • name: country name
  • iso3: ISO3 country code
  • year: year of observation
  • region_sdg: SDG region
  • residence: residence type (national, rural, urban)
  • varname_short: short variable name (JMP naming convention)
  • varname_long: long variable name (JMP naming convention)
  • percent: estimate for the indicator as a percentage of the population

We use the read_csv() function to import the data set into R.

sanitation <- read_csv("/cloud/project/data/jmp_wld_sanitation_long.csv")

Transform

Task 1.1

  1. Run all code chunks above.
  2. Use the filter() function to create a subset from the sanitation data containing national estimates for the year 2020.
  3. Store the result as a new object in your environment with the name sanitation_national_2020
sanitation_national_2020 <- sanitation |> 
  filter(residence == "national", year == 2020)

Task 1.2

  1. Use the filter() function to create a subset from the sanitation data containing urban and rural estimates for Nigeria.
  2. Store the result as a new object in your environment with the name sanitation_nigeria_urban_rural
sanitation_nigeria_urban_rural <- sanitation |> 
  filter(name == "Nigeria", residence != "national")

Task 1.3 (stretch goal)

  1. Use the ggplot() function to create a connected scatterplot with geom_point() and geom_line() for the data you created in Task 1.2.

  2. Use the aes() function to map the year variable to the x-axis, the percent variable to the y-axis, and the varname_short variable to color and group aesthetic.

  3. Use facet_wrap() to create a separate plot for the urban and rural populations.

  4. Change the colors using scale_color_colorblind().

ggplot(data = sanitation_nigeria_urban_rural,
       mapping = aes(x = year, 
                     y = percent, 
                     group = varname_short, 
                     color = varname_short)) +
  geom_point() +
  geom_line() +
  facet_wrap(~residence) +
  scale_color_colorblind() 

Module 03c: Summary data transformation

library(readr)
library(dplyr)

Import

In this exercise we use data of the UNICEF/WHO Joint Monitoring Programme (JMP) for Water Supply, Sanitation and Hygiene (WASH). The data is available at https://washdata.org/data and published as an R data package at https://github.com/WASHNote/jmpwashdata/.

The data set jmp_wld_sanitation_long is available in the data folder of this repository. The data set is in long format and contains the following variables:

  • name: country name
  • iso3: ISO3 country code
  • year: year of observation
  • region_sdg: SDG region
  • residence: residence type (national, rural, urban)
  • varname_short: short variable name (JMP naming convention)
  • varname_long: long variable name (JMP naming convention)
  • percent: estimate for the indicator as a percentage of the population

We use the read_csv() function to import the data set into R.

sanitation <- read_csv("/cloud/project/data/jmp_wld_sanitation_long.csv")

Task 1.1

  1. Run all code chunks above.
  2. Use the glimpse() function to get an overview of the data set.
  3. How many variables are in the data set?
glimpse(sanitation)
Rows: 73,710
Columns: 8
$ name          <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanista…
$ iso3          <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", …
$ year          <dbl> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 20…
$ region_sdg    <chr> "Central and Southern Asia", "Central and Southern Asia"…
$ varname_short <chr> "san_bas", "san_bas", "san_bas", "san_lim", "san_lim", "…
$ varname_long  <chr> "basic sanitation services", "basic sanitation services"…
$ residence     <chr> "national", "rural", "urban", "national", "rural", "urba…
$ percent       <dbl> 21.870802, 19.322798, 30.863719, 5.648528, 3.136148, 14.…

Transform

Task 2.1

  1. Use the count() function to identify how many SDG regions are included in the data set.
  2. How many SDG regions are in the data set?
sanitation |> 
  count(region_sdg)
# A tibble: 8 × 2
  region_sdg                           n
  <chr>                            <int>
1 Australia and New Zealand          630
2 Central and Southern Asia         4410
3 Eastern and South-Eastern Asia    5670
4 Latin America and the Caribbean  15750
5 Northern America and Europe      16695
6 Oceania                           6615
7 Sub-Saharan Africa               16065
8 Western Asia and Northern Africa  7875

Task 2.2

  1. Use the count() function to identify the levels in the varname_short and varname_long variables.
  2. Which indicator in varname_long does san_od refer to?
sanitation |> 
  count(varname_short, varname_long)
# A tibble: 5 × 3
  varname_short varname_long                           n
  <chr>         <chr>                              <int>
1 san_bas       basic sanitation services          14742
2 san_lim       limited sanitation services        14742
3 san_od        no sanitation facilities           14742
4 san_sm        safely managed sanitation services 14742
5 san_unimp     unimproved sanitation facilities   14742

Task 2.3

  1. Use the filter() function to create a subset from the sanitation data containing national estimates for people with “no sanitation facilities” for the year 2020.

  2. Store the result as a new object in your environment with the name sanitation_national_2020_od.

sanitation_national_2020_od <- sanitation |> 
  filter(residence == "national", 
         year == 2020,
         varname_short == "san_od")

Task 2.4

  1. Use the sanitation_national_2020_od data and the count() function to identify the number of countries with 0% for the indicator “no sanitation facilities” in 2020.
sanitation_national_2020_od |> 
  count(percent)
# A tibble: 104 × 2
   percent     n
     <dbl> <int>
 1 0          96
 2 0.00670     1
 3 0.0107      1
 4 0.0169      1
 5 0.0317      1
 6 0.0418      1
 7 0.0965      1
 8 0.100       1
 9 0.105       1
10 0.127       1
# ℹ 94 more rows

Task 2.5

  1. How many countries in the sanitation_national_2020_od data had no estimate for “no sanitation facilities” in 2020? Tip: A country without an estimate has NA for the percent variable.
sanitation_national_2020_od |> 
  filter(is.na(percent))
# A tibble: 36 × 8
   name      iso3   year region_sdg varname_short varname_long residence percent
   <chr>     <chr> <dbl> <chr>      <chr>         <chr>        <chr>       <dbl>
 1 Anguilla  AIA    2020 Latin Ame… san_od        no sanitati… national       NA
 2 Antigua … ATG    2020 Latin Ame… san_od        no sanitati… national       NA
 3 Argentina ARG    2020 Latin Ame… san_od        no sanitati… national       NA
 4 Aruba     ABW    2020 Latin Ame… san_od        no sanitati… national       NA
 5 Azerbaij… AZE    2020 Western A… san_od        no sanitati… national       NA
 6 Bahamas   BHS    2020 Latin Ame… san_od        no sanitati… national       NA
 7 Barbados  BRB    2020 Latin Ame… san_od        no sanitati… national       NA
 8 Bosnia a… BIH    2020 Northern … san_od        no sanitati… national       NA
 9 British … VGB    2020 Latin Ame… san_od        no sanitati… national       NA
10 Brunei D… BRN    2020 Eastern a… san_od        no sanitati… national       NA
# ℹ 26 more rows

Task 2.6

  1. Use the sanitation_national_2020_od data in combination with the group_by() and summarise() functions to calculate the mean, standard deviation, and number of countries for each SDG region for the indicator “no sanitation facilities” in 2020.

  2. How did you treat the missing values for the percent variable in the calculation?

sanitation_national_2020_od |> 
  filter(!is.na(percent)) |>
  group_by(region_sdg) |> 
  summarise(
    mean = mean(percent, na.rm = TRUE),
    sd = sd(percent, na.rm = TRUE),
    n = n()
  )
# A tibble: 8 × 4
  region_sdg                           mean      sd     n
  <chr>                               <dbl>   <dbl> <int>
1 Australia and New Zealand         0        0          2
2 Central and Southern Asia         3.29     5.38      13
3 Eastern and South-Eastern Asia    5.16     6.97      16
4 Latin America and the Caribbean   2.04     3.70      34
5 Northern America and Europe       0.00698  0.0271    48
6 Oceania                           6.80    13.1       16
7 Sub-Saharan Africa               19.4     18.4       47
8 Western Asia and Northern Africa  1.67     5.41      22

Assignment 03: Data transformation with dplyr

library(readr)
library(dplyr)
library(ggplot2)
library(ggthemes)

Import

In this exercise we use data of the UNICEF/WHO Joint Monitoring Programme (JMP) for Water Supply, Sanitation and Hygiene (WASH). The data is available at https://washdata.org/data and published as an R data package at https://github.com/WASHNote/jmpwashdata/.

The data set jmp_wld_sanitation_long is available in the data folder of this repository. The data set is in long format and contains the following variables:

  • name: country name
  • iso3: ISO3 country code
  • year: year of observation
  • region_sdg: SDG region
  • residence: residence type (national, rural, urban)
  • varname_short: short variable name (JMP naming convention)
  • varname_long: long variable name (JMP naming convention)
  • percent: estimate for the indicator as a percentage of the population

We use the read_csv() function to import the data set into R.

sanitation <- read_csv("/cloud/project/data/jmp_wld_sanitation_long.csv")

Task 1

  1. Run all code chunks above.
  2. Use the glimpse() function to get an overview of the data set.
  3. How many variables are in the data set?
glimpse(sanitation)
Rows: 73,710
Columns: 8
$ name          <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanista…
$ iso3          <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", …
$ year          <dbl> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 20…
$ region_sdg    <chr> "Central and Southern Asia", "Central and Southern Asia"…
$ varname_short <chr> "san_bas", "san_bas", "san_bas", "san_lim", "san_lim", "…
$ varname_long  <chr> "basic sanitation services", "basic sanitation services"…
$ residence     <chr> "national", "rural", "urban", "national", "rural", "urba…
$ percent       <dbl> 21.870802, 19.322798, 30.863719, 5.648528, 3.136148, 14.…

Task 2

  1. Use the count() function with varname_short and varname_long to identify the definitions of the levels in these two variables.
sanitation |> 
  count(varname_short, varname_long)
# A tibble: 5 × 3
  varname_short varname_long                           n
  <chr>         <chr>                              <int>
1 san_bas       basic sanitation services          14742
2 san_lim       limited sanitation services        14742
3 san_od        no sanitation facilities           14742
4 san_sm        safely managed sanitation services 14742
5 san_unimp     unimproved sanitation facilities   14742

Task 3

  1. Use the filter() function to create a subset of the data set that only contains observations:
  • for a country of your choice,
  • for the years 2000 and 2020,
  • for all variables that are not “safely managed sanitation services”.
  2. Store the result as a new object in your environment with a name of your choice.
sanitation_uga <- sanitation |> 
  filter(iso3 == "UGA",
         year %in% c(2000, 2020), 
         varname_short != "san_sm")

Task 4

  1. Use the count() function with the data you created in Task 3 to verify that only the years 2000 and 2020 remain in the year variable.
sanitation_uga |> 
  count(year)
# A tibble: 2 × 2
   year     n
  <dbl> <int>
1  2000    12
2  2020    12

Task 5

  1. Use the ggplot() function to create a bar plot with geom_col() for the data you created in Task 3.

  2. Use the aes() function to map the residence variable to the x-axis, the percent variable to the y-axis, and the varname_long variable to the fill aesthetic.

  3. Use facet_wrap() to create a separate plot for each year.

  4. Change the fill colors using scale_fill_colorblind().

  5. Add labels to the bars by copying the code below this bullet point and adding it to your code for the plot.

geom_text(aes(label = round(percent, 1)), 
          position = position_stack(vjust = 0.5),
          size = 3,
          color = "white") 
ggplot(data = sanitation_uga,
       mapping = aes(x = residence, 
                     y = percent, 
                     fill = varname_long)) +
  geom_col() +
  facet_wrap(~year) +
  scale_fill_colorblind() +
  geom_text(aes(label = round(percent, 1)), 
            position = position_stack(vjust = 0.5),
            size = 3,
            color = "white") 

Task 6

If you haven’t worked with JMP indicators before, the following questions will be challenging to answer.

  1. Look at the plot that you created. What do you notice about the order of the bars / order of the legend?
  2. What would you want to change?
  3. Why did we remove “safely managed sanitation services” from the data set in Task 3?
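
One way to control the order of the bars and the legend is to convert varname_long to a factor with explicitly ordered levels before plotting, the same approach Module 04a below uses for income_id. A sketch, assuming you want the service levels ordered from basic to none:

```r
# Hypothetical level order; adjust to the order you prefer
levels_sanitation <- c("basic sanitation services",
                       "limited sanitation services",
                       "unimproved sanitation facilities",
                       "no sanitation facilities")

sanitation_uga |>
  mutate(varname_long = factor(varname_long, levels = levels_sanitation))
```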

Task 7

  1. Run the code in the code chunk below.
  2. What do you observe when you look at the code and plot?
sanitation_2020 <- sanitation |> 
  filter(year == 2020)

ggplot(data = sanitation_2020,
       mapping = aes(x = percent, fill = varname_short)) +
  geom_histogram() +
  facet_grid(varname_short ~ residence, scales = "free_y") +
  scale_fill_colorblind() +
  theme(legend.position = "none") 

Module 04a: Factors

library(ggplot2)
library(dplyr)
library(readr)
library(ggthemes)

Import

waste <- read_csv("/cloud/project/data/processed/waste-city-level-sml.csv") 

Explore

  1. Run all code chunks above.

  2. Use the glimpse() function to inspect the waste object.

  3. What does the data cover? Briefly discuss with your room partner.

glimpse(waste)
Rows: 367
Columns: 6
$ country              <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afg…
$ city                 <chr> "Jalalabad", "Kandahar", "Mazar-E-Sharif", "Kabul…
$ iso3c                <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AGO", "ALB", …
$ income_id            <chr> "LIC", "LIC", "LIC", "LIC", "LIC", "LMC", "UMC", …
$ generation_tons_year <dbl> 58914.45, 120971.00, 52368.40, 1989250.00, 91644.…
$ population           <dbl> 326585, 429000, 635250, 3700000, 337000, 4508000,…
  1. Use the count() function for the waste object to count the number of rows for each value of the income_id variable.

  2. What do the four values of the income_id variable represent?

waste |> 
  count(income_id)
# A tibble: 4 × 2
  income_id     n
  <chr>     <int>
1 HIC          88
2 LIC          74
3 LMC         124
4 UMC          81

Transform

  1. Use the c() function to create a vector with the following values: “HIC”, “UMC”, “LMC”, “LIC”.
  2. Use the assignment operator (<-) to store the resulting vector as a new object called levels_income.
levels_income <- c("HIC", "UMC", "LMC", "LIC")
  1. Use the mutate() function to convert the income_id variable to a factor variable with the levels specified in the levels_income object.

  2. Use the assignment operator (<-) to store the resulting data as a new object called waste_lvl.

waste_lvl <- waste |> 
  mutate(income_id = factor(income_id, levels = levels_income))
  1. Use the count() function to verify that the income_id variable is now a factor variable with the correct levels.
waste_lvl |> 
  count(income_id)
# A tibble: 4 × 2
  income_id     n
  <fct>     <int>
1 HIC          88
2 UMC          81
3 LMC         124
4 LIC          74
  1. Starting with waste_lvl, use the mutate() function to create a new variable called generation_kg_capita that contains the generation_tons_year variable divided by the population variable and multiplied by 1000.

  2. Use the assignment operator (<-) to store the resulting data as a new object called waste_capita.

waste_capita <- waste_lvl |> 
  mutate(generation_kg_capita = generation_tons_year / population * 1000) 

Visualize

  1. In the code chunk below, change the value of the #| eval: option from false to true.

  2. Run the code in the code chunk below to create a boxplot of the generation_kg_capita variable by income_id.

  3. What do you observe? Discuss with your room partner.

ggplot(data = waste_capita,
       mapping = aes(x = income_id, 
                     y = generation_kg_capita, 
                     color = income_id)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.1, alpha = 0.3) +
  scale_color_colorblind() +
  labs(x = "Income group",
       y = "Waste generation (tons per capita per year)")

Module 04b: Data import

library(readr)
library(readxl)
library(dplyr)

Import

Task 1: Import waste data as CSV

  1. Run all code chunks above.
  2. Use the read_csv() function to import the waste-city-level.csv file from the data/raw folder.
  3. Assign the resulting data to an object called waste.
waste <- read_csv("/cloud/project/data/raw/waste-city-level.csv")

Task 2: Import JMP data as CSV

  1. Use the read_csv() function to import the jmp_wld_sanitation_long.csv file from the data/processed folder.
  2. Assign the resulting data to an object called san_csv.
san_csv <- read_csv("/cloud/project/data/processed/jmp_wld_sanitation_long.csv")

Task 3: Import JMP data as RDS

  1. Use the read_rds() function to import the jmp_wld_sanitation_long.rds file from the data/processed folder.
  2. Assign the resulting data to an object called san_rds.
san_rds <- read_rds("/cloud/project/data/processed/jmp_wld_sanitation_long.rds")

Task 4: Compare CSV and RDS

  1. Use the glimpse() function to inspect the san_csv and san_rds objects.
  2. What is the difference between the two objects? Discuss with your room partner.
glimpse(san_csv)
Rows: 73,710
Columns: 8
$ name          <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanista…
$ iso3          <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", …
$ year          <dbl> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 20…
$ region_sdg    <chr> "Central and Southern Asia", "Central and Southern Asia"…
$ varname_short <chr> "san_bas", "san_bas", "san_bas", "san_lim", "san_lim", "…
$ varname_long  <chr> "basic sanitation services", "basic sanitation services"…
$ residence     <chr> "national", "rural", "urban", "national", "rural", "urba…
$ percent       <dbl> 21.870802, 19.322798, 30.863719, 5.648528, 3.136148, 14.…
glimpse(san_rds)
Rows: 73,710
Columns: 8
$ name          <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanista…
$ iso3          <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", …
$ year          <dbl> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 20…
$ region_sdg    <chr> "Central and Southern Asia", "Central and Southern Asia"…
$ varname_short <fct> san_bas, san_bas, san_bas, san_lim, san_lim, san_lim, sa…
$ varname_long  <fct> basic sanitation services, basic sanitation services, ba…
$ residence     <fct> national, rural, urban, national, rural, urban, national…
$ percent       <dbl> 21.870802, 19.322798, 30.863719, 5.648528, 3.136148, 14.…

Task 5: Use LLM for an explanation

  1. Open https://www.perplexity.ai/ in your browser and enter the following prompt:

You are an experienced educator in teaching R to novice users without prior knowledge. Explain what the .rds format is and how it differs from the .csv file format. Avoid technical language.

  1. Read the answer and ask the tool questions for clarification if something is unclear.

  2. Share a link to your conversation here (see screenshot below):

Screenshot


Module 05: Conditions

library(tidyverse)
library(ggthemes)

Import

We continue to work with a subset of the “What a Waste” database.

waste <- read_csv("/cloud/project/data/processed/waste-city-level-sml.csv")

We will also use an example spreadsheet that was created by one of the course participants.

solids <- readxl::read_excel("/cloud/project/data/raw/TS_poo_2022.xlsx")

Explore

glimpse(waste)
Rows: 367
Columns: 6
$ country              <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afg…
$ city                 <chr> "Jalalabad", "Kandahar", "Mazar-E-Sharif", "Kabul…
$ iso3c                <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AGO", "ALB", …
$ income_id            <chr> "LIC", "LIC", "LIC", "LIC", "LIC", "LMC", "UMC", …
$ generation_tons_year <dbl> 58914.45, 120971.00, 52368.40, 1989250.00, 91644.…
$ population           <dbl> 326585, 429000, 635250, 3700000, 337000, 4508000,…
waste |> 
    count(income_id)
# A tibble: 4 × 2
  income_id     n
  <chr>     <int>
1 HIC          88
2 LIC          74
3 LMC         124
4 UMC          81

Transform

Conditional statements with mutate() & case_when() of dplyr R package

waste data

waste_cat <- waste |> 
    mutate(generation_kg_capita = generation_tons_year / population * 1000) |> 
    mutate(income_cat = case_when(
        income_id == "HIC" ~ "high income",
        income_id == "UMC" ~ "upper-middle income",
        income_id == "LMC" ~ "lower-middle income",
        income_id == "LIC" ~ "low income"
    ))


levels_income <- c("HIC", "UMC", "LMC", "LIC")

levels_income_cat <- c("high income", 
                       "upper-middle income",
                       "lower-middle income",
                       "low income")

waste_fct <- waste_cat |>
    mutate(income_id = factor(income_id, levels = levels_income)) |> 
    mutate(income_cat = factor(income_cat, levels = levels_income_cat)) |> 
    relocate(income_cat, .after = income_id)
write_rds(x = waste_fct, file = "/cloud/project/data/processed/waste-city-level-sml.rds")

Faecal sludge solids data

solids |> 
    mutate(total_solids_gL = case_when(
        source_type == "septic tank" ~ total_solids_gL * 100,
        .default = total_solids_gL
    ))
# A tibble: 20 × 5
   source_location source_type Sample_Date         n_daily_users total_solids_gL
   <chr>           <chr>       <dttm>                      <dbl>           <dbl>
 1 household       pit latrine 2022-11-01 00:00:00             5            20.5
 2 household       pit latrine 2022-11-01 00:00:00             7            25.8
 3 household       pit latrine 2022-11-01 00:00:00             7            22.6
 4 household       pit latrine 2022-11-01 00:00:00             6            30.9
 5 household       pit latrine 2022-11-01 00:00:00             8            48.3
 6 household       septic tank 2022-11-02 00:00:00             9             8  
 7 household       septic tank 2022-11-02 00:00:00             6            11  
 8 household       septic tank 2022-11-02 00:00:00             7             5  
 9 household       septic tank 2022-11-02 00:00:00             7            13  
10 household       septic tank 2022-11-02 00:00:00             5             9  
11 public toilet   pit latrine 2022-11-03 00:00:00            35            35.0
12 public toilet   pit latrine 2022-11-03 00:00:00            28            29.3
13 public toilet   pit latrine 2022-11-03 00:00:00            52            19.9
14 public toilet   pit latrine 2022-11-03 00:00:00            19            42.4
15 public toilet   pit latrine 2022-11-03 00:00:00            39            28.0
16 public toilet   septic tank 2022-11-04 00:00:00            75             7  
17 public toilet   septic tank 2022-11-04 00:00:00            53            14  
18 public toilet   septic tank 2022-11-04 00:00:00            47            19  
19 public toilet   septic tank 2022-11-04 00:00:00            39             9  
20 public toilet   septic tank 2022-11-04 00:00:00            62            11  

Visualize

Categories as character

ggplot(data = waste_cat,
       mapping = aes(x = income_cat, 
                     y = generation_kg_capita, 
                     color = income_cat)) +
    geom_boxplot(outlier.shape = NA) +
    geom_jitter(size = 3, width = 0.1, alpha = 0.3) +
    scale_color_colorblind() +
    labs(x = "Income group",
         y = "Waste generation (tons per capita per year)") 

Categories as factor

ggplot(data = waste_fct,
       mapping = aes(x = income_cat, 
                     y = generation_kg_capita, 
                     color = income_cat)) +
    geom_boxplot(outlier.shape = NA) +
    geom_jitter(size = 3, width = 0.1, alpha = 0.3) +
    scale_color_colorblind() +
    labs(x = "Income group",
         y = "Waste generation (tons per capita per year)") 

Module 05a: case_when()

library(tidyverse)
library(readxl)

Import

We are using another faecal sludge solids example dataset.

sludge <- read_xlsx("/cloud/project/data/raw/faecal-sludge-analysis.xlsx")

Task 1

  1. A mistake happened during data entry for sample id 16. Use mutate() and case_when() to change the ts value of 0.72 to 8.72.
sludge |> 
    mutate(ts = case_when(
        ts == 0.72 ~ 8.72,
        .default = ts
    ))
# A tibble: 20 × 6
      id date_sample         system      location      users     ts
   <dbl> <dttm>              <chr>       <chr>         <dbl>  <dbl>
 1     1 2023-11-01 00:00:00 pit latrine household         5 136.  
 2     2 2023-11-01 00:00:00 pit latrine household         7 102.  
 3     3 2023-11-01 00:00:00 pit latrine household        NA  57.0 
 4     4 2023-11-01 00:00:00 pit latrine household         6  27.0 
 5     5 2023-11-01 00:00:00 pit latrine household        12  97.3 
 6     6 2023-11-02 00:00:00 pit latrine household         7  78.2 
 7     7 2023-11-02 00:00:00 septic tank household        14  15.2 
 8     8 2023-11-02 00:00:00 septic tank household         4  29.4 
 9     9 2023-11-02 00:00:00 septic tank household        10  64.2 
10    10 2023-11-02 00:00:00 septic tank household        12   8.01
11    11 2023-11-03 00:00:00 pit latrine public toilet    50  11.2 
12    12 2023-11-03 00:00:00 pit latrine public toilet    32  84.0 
13    13 2023-11-03 00:00:00 pit latrine public toilet    41  55.9 
14    14 2023-11-03 00:00:00 pit latrine public toilet   160  15.3 
15    15 2023-11-03 00:00:00 pit latrine public toilet    20  22.6 
16    16 2023-11-04 00:00:00 septic tank public toilet    26   8.72
17    17 2023-11-04 00:00:00 septic tank public toilet    91  43.9 
18    18 2023-11-04 00:00:00 septic tank public toilet    68  10.4 
19    19 2023-11-04 00:00:00 septic tank public toilet   112  23.2 
20    20 2023-11-04 00:00:00 septic tank public toilet    59  15.6 
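Note that matching on ts == 0.72 relies on exact floating-point equality, which can fail when the stored value is not exactly 0.72. Matching the row by its id is more robust. A sketch with a toy tibble (hypothetical values, not the real sludge data):

```r
library(dplyr)

# Toy stand-in for the sludge data (hypothetical values)
toy <- tibble(id = c(15, 16), ts = c(22.6, 0.72))

toy |>
  mutate(ts = case_when(
    id == 16 ~ 8.72,   # match the row by id, not by ts == 0.72
    .default = ts
  ))
```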

Task 2

  1. Another mistake happened during data entry for sample id 6. Use mutate() and case_when() to change the system value of id 6 from “pit latrine” to “septic tank”.
sludge |> 
    mutate(system = case_when(
        id ==  6 ~ "septic tank",
        .default = system
    ))
# A tibble: 20 × 6
      id date_sample         system      location      users     ts
   <dbl> <dttm>              <chr>       <chr>         <dbl>  <dbl>
 1     1 2023-11-01 00:00:00 pit latrine household         5 136.  
 2     2 2023-11-01 00:00:00 pit latrine household         7 102.  
 3     3 2023-11-01 00:00:00 pit latrine household        NA  57.0 
 4     4 2023-11-01 00:00:00 pit latrine household         6  27.0 
 5     5 2023-11-01 00:00:00 pit latrine household        12  97.3 
 6     6 2023-11-02 00:00:00 septic tank household         7  78.2 
 7     7 2023-11-02 00:00:00 septic tank household        14  15.2 
 8     8 2023-11-02 00:00:00 septic tank household         4  29.4 
 9     9 2023-11-02 00:00:00 septic tank household        10  64.2 
10    10 2023-11-02 00:00:00 septic tank household        12   8.01
11    11 2023-11-03 00:00:00 pit latrine public toilet    50  11.2 
12    12 2023-11-03 00:00:00 pit latrine public toilet    32  84.0 
13    13 2023-11-03 00:00:00 pit latrine public toilet    41  55.9 
14    14 2023-11-03 00:00:00 pit latrine public toilet   160  15.3 
15    15 2023-11-03 00:00:00 pit latrine public toilet    20  22.6 
16    16 2023-11-04 00:00:00 septic tank public toilet    26   0.72
17    17 2023-11-04 00:00:00 septic tank public toilet    91  43.9 
18    18 2023-11-04 00:00:00 septic tank public toilet    68  10.4 
19    19 2023-11-04 00:00:00 septic tank public toilet   112  23.2 
20    20 2023-11-04 00:00:00 septic tank public toilet    59  15.6 

Task 3 (stretch goal)

  1. Add a new variable with the name ts_cat to the data frame that categorizes sludge samples into low, medium and high solids content. Use mutate() and case_when() to create the new variable.
  • samples with less than 15 g/L are categorized as low
  • samples with 15 g/L to 50 g/L are categorized as medium
  • samples with more than 50 g/L are categorized as high
sludge |> 
    mutate(ts_cat = case_when(
        ts < 15 ~ "low",
        ts >= 15 & ts <= 50 ~ "medium",
        ts > 50 ~ "high"
    ))
# A tibble: 20 × 7
      id date_sample         system      location      users     ts ts_cat
   <dbl> <dttm>              <chr>       <chr>         <dbl>  <dbl> <chr> 
 1     1 2023-11-01 00:00:00 pit latrine household         5 136.   high  
 2     2 2023-11-01 00:00:00 pit latrine household         7 102.   high  
 3     3 2023-11-01 00:00:00 pit latrine household        NA  57.0  high  
 4     4 2023-11-01 00:00:00 pit latrine household         6  27.0  medium
 5     5 2023-11-01 00:00:00 pit latrine household        12  97.3  high  
 6     6 2023-11-02 00:00:00 pit latrine household         7  78.2  high  
 7     7 2023-11-02 00:00:00 septic tank household        14  15.2  medium
 8     8 2023-11-02 00:00:00 septic tank household         4  29.4  medium
 9     9 2023-11-02 00:00:00 septic tank household        10  64.2  high  
10    10 2023-11-02 00:00:00 septic tank household        12   8.01 low   
11    11 2023-11-03 00:00:00 pit latrine public toilet    50  11.2  low   
12    12 2023-11-03 00:00:00 pit latrine public toilet    32  84.0  high  
13    13 2023-11-03 00:00:00 pit latrine public toilet    41  55.9  high  
14    14 2023-11-03 00:00:00 pit latrine public toilet   160  15.3  medium
15    15 2023-11-03 00:00:00 pit latrine public toilet    20  22.6  medium
16    16 2023-11-04 00:00:00 septic tank public toilet    26   0.72 low   
17    17 2023-11-04 00:00:00 septic tank public toilet    91  43.9  medium
18    18 2023-11-04 00:00:00 septic tank public toilet    68  10.4  low   
19    19 2023-11-04 00:00:00 septic tank public toilet   112  23.2  medium
20    20 2023-11-04 00:00:00 septic tank public toilet    59  15.6  medium

Module 05b: Dates

library(tidyverse)
library(readxl)

Transform to ISO

dates <- read_excel("/cloud/project/data/raw/date-formats.xlsx")

In R and other programming languages, dates are stored as numbers: R counts the number of days since the origin 1970-01-01 (the Unix epoch) and displays dates in the ISO 8601 format (YYYY-MM-DD).

In Excel, dates are stored as the number of days since 1900-01-01: the date number 1 corresponds to "1900-01-01". However, this system incorrectly treats 1900 as a leap year, which it is not. As a result, to correctly interpret date numbers that originate from Excel, the origin "1899-12-30" is used to account for this discrepancy.
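These origins can be checked directly with lubridate's as_date() (the Excel serial number 44927 below is just an illustrative value):

```r
library(lubridate)

# R's origin (Unix epoch): day 0 is 1970-01-01
as_date(0)
#> [1] "1970-01-01"

# Excel serial numbers need the adjusted origin "1899-12-30";
# the illustrative serial 44927 maps to 2023-01-01
as_date(44927, origin = "1899-12-30")
#> [1] "2023-01-01"
```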

dates_class <- dates |> 
    mutate(date_iso = as_date(date_iso)) |> 
    mutate(date_us = mdy(date_us)) |> 
    mutate(date_eu = dmy(date_eu)) |> 
    mutate(date_num = as_date(date_num, origin = "1899-12-30")) |> 
    mutate(date = as_date(date_time)) |> 
    mutate(date_time_tz = with_tz(date_time, tzone = "Africa/Kampala")) |>
    mutate(today = today())

OlsonNames()
  [1] "Africa/Abidjan"                   "Africa/Accra"                    
  [3] "Africa/Addis_Ababa"               "Africa/Algiers"                  
  [5] "Africa/Asmara"                    "Africa/Asmera"                   
  [7] "Africa/Bamako"                    "Africa/Bangui"                   
  [9] "Africa/Banjul"                    "Africa/Bissau"                   
 [11] "Africa/Blantyre"                  "Africa/Brazzaville"              
 [13] "Africa/Bujumbura"                 "Africa/Cairo"                    
 [15] "Africa/Casablanca"                "Africa/Ceuta"                    
 [17] "Africa/Conakry"                   "Africa/Dakar"                    
 [19] "Africa/Dar_es_Salaam"             "Africa/Djibouti"                 
 [21] "Africa/Douala"                    "Africa/El_Aaiun"                 
 [23] "Africa/Freetown"                  "Africa/Gaborone"                 
 [25] "Africa/Harare"                    "Africa/Johannesburg"             
 [27] "Africa/Juba"                      "Africa/Kampala"                  
 [29] "Africa/Khartoum"                  "Africa/Kigali"                   
 [31] "Africa/Kinshasa"                  "Africa/Lagos"                    
 [33] "Africa/Libreville"                "Africa/Lome"                     
 [35] "Africa/Luanda"                    "Africa/Lubumbashi"               
 [37] "Africa/Lusaka"                    "Africa/Malabo"                   
 [39] "Africa/Maputo"                    "Africa/Maseru"                   
 [41] "Africa/Mbabane"                   "Africa/Mogadishu"                
 [43] "Africa/Monrovia"                  "Africa/Nairobi"                  
 [45] "Africa/Ndjamena"                  "Africa/Niamey"                   
 [47] "Africa/Nouakchott"                "Africa/Ouagadougou"              
 [49] "Africa/Porto-Novo"                "Africa/Sao_Tome"                 
 [51] "Africa/Timbuktu"                  "Africa/Tripoli"                  
 [53] "Africa/Tunis"                     "Africa/Windhoek"                 
 [55] "America/Adak"                     "America/Anchorage"               
 [57] "America/Anguilla"                 "America/Antigua"                 
 [59] "America/Araguaina"                "America/Argentina/Buenos_Aires"  
 [61] "America/Argentina/Catamarca"      "America/Argentina/ComodRivadavia"
 [63] "America/Argentina/Cordoba"        "America/Argentina/Jujuy"         
 [65] "America/Argentina/La_Rioja"       "America/Argentina/Mendoza"       
 [67] "America/Argentina/Rio_Gallegos"   "America/Argentina/Salta"         
 [69] "America/Argentina/San_Juan"       "America/Argentina/San_Luis"      
 [71] "America/Argentina/Tucuman"        "America/Argentina/Ushuaia"       
 [73] "America/Aruba"                    "America/Asuncion"                
 [75] "America/Atikokan"                 "America/Atka"                    
 [77] "America/Bahia"                    "America/Bahia_Banderas"          
 [79] "America/Barbados"                 "America/Belem"                   
 [81] "America/Belize"                   "America/Blanc-Sablon"            
 [83] "America/Boa_Vista"                "America/Bogota"                  
 [85] "America/Boise"                    "America/Buenos_Aires"            
 [87] "America/Cambridge_Bay"            "America/Campo_Grande"            
 [89] "America/Cancun"                   "America/Caracas"                 
 [91] "America/Catamarca"                "America/Cayenne"                 
 [93] "America/Cayman"                   "America/Chicago"                 
 [95] "America/Chihuahua"                "America/Ciudad_Juarez"           
 [97] "America/Coral_Harbour"            "America/Cordoba"                 
 [99] "America/Costa_Rica"               "America/Creston"                 
[101] "America/Cuiaba"                   "America/Curacao"                 
[103] "America/Danmarkshavn"             "America/Dawson"                  
[105] "America/Dawson_Creek"             "America/Denver"                  
[107] "America/Detroit"                  "America/Dominica"                
[109] "America/Edmonton"                 "America/Eirunepe"                
[111] "America/El_Salvador"              "America/Ensenada"                
[113] "America/Fort_Nelson"              "America/Fort_Wayne"              
[115] "America/Fortaleza"                "America/Glace_Bay"               
[117] "America/Godthab"                  "America/Goose_Bay"               
[119] "America/Grand_Turk"               "America/Grenada"                 
[121] "America/Guadeloupe"               "America/Guatemala"               
[123] "America/Guayaquil"                "America/Guyana"                  
[125] "America/Halifax"                  "America/Havana"                  
[127] "America/Hermosillo"               "America/Indiana/Indianapolis"    
[129] "America/Indiana/Knox"             "America/Indiana/Marengo"         
[131] "America/Indiana/Petersburg"       "America/Indiana/Tell_City"       
[133] "America/Indiana/Vevay"            "America/Indiana/Vincennes"       
[135] "America/Indiana/Winamac"          "America/Indianapolis"            
[137] "America/Inuvik"                   "America/Iqaluit"                 
[139] "America/Jamaica"                  "America/Jujuy"                   
[141] "America/Juneau"                   "America/Kentucky/Louisville"     
[143] "America/Kentucky/Monticello"      "America/Knox_IN"                 
[145] "America/Kralendijk"               "America/La_Paz"                  
[147] "America/Lima"                     "America/Los_Angeles"             
[149] "America/Louisville"               "America/Lower_Princes"           
[151] "America/Maceio"                   "America/Managua"                 
[153] "America/Manaus"                   "America/Marigot"                 
[155] "America/Martinique"               "America/Matamoros"               
[157] "America/Mazatlan"                 "America/Mendoza"                 
[159] "America/Menominee"                "America/Merida"                  
[161] "America/Metlakatla"               "America/Mexico_City"             
[163] "America/Miquelon"                 "America/Moncton"                 
[165] "America/Monterrey"                "America/Montevideo"              
[167] "America/Montreal"                 "America/Montserrat"              
[169] "America/Nassau"                   "America/New_York"                
[171] "America/Nipigon"                  "America/Nome"                    
[173] "America/Noronha"                  "America/North_Dakota/Beulah"     
[175] "America/North_Dakota/Center"      "America/North_Dakota/New_Salem"  
[177] "America/Nuuk"                     "America/Ojinaga"                 
[179] "America/Panama"                   "America/Pangnirtung"             
[181] "America/Paramaribo"               "America/Phoenix"                 
[183] "America/Port_of_Spain"            "America/Port-au-Prince"          
[185] "America/Porto_Acre"               "America/Porto_Velho"             
[187] "America/Puerto_Rico"              "America/Punta_Arenas"            
[189] "America/Rainy_River"              "America/Rankin_Inlet"            
[191] "America/Recife"                   "America/Regina"                  
[193] "America/Resolute"                 "America/Rio_Branco"              
[195] "America/Rosario"                  "America/Santa_Isabel"            
[197] "America/Santarem"                 "America/Santiago"                
[199] "America/Santo_Domingo"            "America/Sao_Paulo"               
[201] "America/Scoresbysund"             "America/Shiprock"                
[203] "America/Sitka"                    "America/St_Barthelemy"           
[205] "America/St_Johns"                 "America/St_Kitts"                
[207] "America/St_Lucia"                 "America/St_Thomas"               
[209] "America/St_Vincent"               "America/Swift_Current"           
[211] "America/Tegucigalpa"              "America/Thule"                   
[213] "America/Thunder_Bay"              "America/Tijuana"                 
[215] "America/Toronto"                  "America/Tortola"                 
[217] "America/Vancouver"                "America/Virgin"                  
[219] "America/Whitehorse"               "America/Winnipeg"                
[221] "America/Yakutat"                  "America/Yellowknife"             
[223] "Antarctica/Casey"                 "Antarctica/Davis"                
[225] "Antarctica/DumontDUrville"        "Antarctica/Macquarie"            
[227] "Antarctica/Mawson"                "Antarctica/McMurdo"              
[229] "Antarctica/Palmer"                "Antarctica/Rothera"              
[231] "Antarctica/South_Pole"            "Antarctica/Syowa"                
[233] "Antarctica/Troll"                 "Antarctica/Vostok"               
[235] "Arctic/Longyearbyen"              "Asia/Aden"                       
[237] "Asia/Almaty"                      "Asia/Amman"                      
[239] "Asia/Anadyr"                      "Asia/Aqtau"                      
[241] "Asia/Aqtobe"                      "Asia/Ashgabat"                   
[243] "Asia/Ashkhabad"                   "Asia/Atyrau"                     
[245] "Asia/Baghdad"                     "Asia/Bahrain"                    
[247] "Asia/Baku"                        "Asia/Bangkok"                    
[249] "Asia/Barnaul"                     "Asia/Beirut"                     
[251] "Asia/Bishkek"                     "Asia/Brunei"                     
[253] "Asia/Calcutta"                    "Asia/Chita"                      
[255] "Asia/Choibalsan"                  "Asia/Chongqing"                  
[257] "Asia/Chungking"                   "Asia/Colombo"                    
[259] "Asia/Dacca"                       "Asia/Damascus"                   
[261] "Asia/Dhaka"                       "Asia/Dili"                       
[263] "Asia/Dubai"                       "Asia/Dushanbe"                   
[265] "Asia/Famagusta"                   "Asia/Gaza"                       
[267] "Asia/Harbin"                      "Asia/Hebron"                     
[269] "Asia/Ho_Chi_Minh"                 "Asia/Hong_Kong"                  
[271] "Asia/Hovd"                        "Asia/Irkutsk"                    
[273] "Asia/Istanbul"                    "Asia/Jakarta"                    
[275] "Asia/Jayapura"                    "Asia/Jerusalem"                  
[277] "Asia/Kabul"                       "Asia/Kamchatka"                  
[279] "Asia/Karachi"                     "Asia/Kashgar"                    
[281] "Asia/Kathmandu"                   "Asia/Katmandu"                   
[283] "Asia/Khandyga"                    "Asia/Kolkata"                    
[285] "Asia/Krasnoyarsk"                 "Asia/Kuala_Lumpur"               
[287] "Asia/Kuching"                     "Asia/Kuwait"                     
[289] "Asia/Macao"                       "Asia/Macau"                      
[291] "Asia/Magadan"                     "Asia/Makassar"                   
[293] "Asia/Manila"                      "Asia/Muscat"                     
[295] "Asia/Nicosia"                     "Asia/Novokuznetsk"               
[297] "Asia/Novosibirsk"                 "Asia/Omsk"                       
[299] "Asia/Oral"                        "Asia/Phnom_Penh"                 
[301] "Asia/Pontianak"                   "Asia/Pyongyang"                  
[303] "Asia/Qatar"                       "Asia/Qostanay"                   
[305] "Asia/Qyzylorda"                   "Asia/Rangoon"                    
[307] "Asia/Riyadh"                      "Asia/Saigon"                     
[309] "Asia/Sakhalin"                    "Asia/Samarkand"                  
[311] "Asia/Seoul"                       "Asia/Shanghai"                   
[313] "Asia/Singapore"                   "Asia/Srednekolymsk"              
[315] "Asia/Taipei"                      "Asia/Tashkent"                   
[317] "Asia/Tbilisi"                     "Asia/Tehran"                     
[319] "Asia/Tel_Aviv"                    "Asia/Thimbu"                     
[321] "Asia/Thimphu"                     "Asia/Tokyo"                      
[323] "Asia/Tomsk"                       "Asia/Ujung_Pandang"              
[325] "Asia/Ulaanbaatar"                 "Asia/Ulan_Bator"                 
[327] "Asia/Urumqi"                      "Asia/Ust-Nera"                   
[329] "Asia/Vientiane"                   "Asia/Vladivostok"                
[331] "Asia/Yakutsk"                     "Asia/Yangon"                     
[333] "Asia/Yekaterinburg"               "Asia/Yerevan"                    
[335] "Atlantic/Azores"                  "Atlantic/Bermuda"                
[337] "Atlantic/Canary"                  "Atlantic/Cape_Verde"             
[339] "Atlantic/Faeroe"                  "Atlantic/Faroe"                  
[341] "Atlantic/Jan_Mayen"               "Atlantic/Madeira"                
[343] "Atlantic/Reykjavik"               "Atlantic/South_Georgia"          
[345] "Atlantic/St_Helena"               "Atlantic/Stanley"                
[347] "Australia/ACT"                    "Australia/Adelaide"              
[349] "Australia/Brisbane"               "Australia/Broken_Hill"           
[351] "Australia/Canberra"               "Australia/Currie"                
[353] "Australia/Darwin"                 "Australia/Eucla"                 
[355] "Australia/Hobart"                 "Australia/LHI"                   
[357] "Australia/Lindeman"               "Australia/Lord_Howe"             
[359] "Australia/Melbourne"              "Australia/North"                 
[361] "Australia/NSW"                    "Australia/Perth"                 
[363] "Australia/Queensland"             "Australia/South"                 
[365] "Australia/Sydney"                 "Australia/Tasmania"              
[367] "Australia/Victoria"               "Australia/West"                  
[369] "Australia/Yancowinna"             "Brazil/Acre"                     
[371] "Brazil/DeNoronha"                 "Brazil/East"                     
[373] "Brazil/West"                      "Canada/Atlantic"                 
[375] "Canada/Central"                   "Canada/Eastern"                  
[377] "Canada/Mountain"                  "Canada/Newfoundland"             
[379] "Canada/Pacific"                   "Canada/Saskatchewan"             
[381] "Canada/Yukon"                     "CET"                             
[383] "Chile/Continental"                "Chile/EasterIsland"              
[385] "CST6CDT"                          "Cuba"                            
[387] "EET"                              "Egypt"                           
[389] "Eire"                             "EST"                             
[391] "EST5EDT"                          "Etc/GMT"                         
[393] "Etc/GMT-0"                        "Etc/GMT-1"                       
[395] "Etc/GMT-10"                       "Etc/GMT-11"                      
[397] "Etc/GMT-12"                       "Etc/GMT-13"                      
[399] "Etc/GMT-14"                       "Etc/GMT-2"                       
[401] "Etc/GMT-3"                        "Etc/GMT-4"                       
[403] "Etc/GMT-5"                        "Etc/GMT-6"                       
[405] "Etc/GMT-7"                        "Etc/GMT-8"                       
[407] "Etc/GMT-9"                        "Etc/GMT+0"                       
[409] "Etc/GMT+1"                        "Etc/GMT+10"                      
[411] "Etc/GMT+11"                       "Etc/GMT+12"                      
[413] "Etc/GMT+2"                        "Etc/GMT+3"                       
[415] "Etc/GMT+4"                        "Etc/GMT+5"                       
[417] "Etc/GMT+6"                        "Etc/GMT+7"                       
[419] "Etc/GMT+8"                        "Etc/GMT+9"                       
[421] "Etc/GMT0"                         "Etc/Greenwich"                   
[423] "Etc/UCT"                          "Etc/Universal"                   
[425] "Etc/UTC"                          "Etc/Zulu"                        
[427] "Europe/Amsterdam"                 "Europe/Andorra"                  
[429] "Europe/Astrakhan"                 "Europe/Athens"                   
[431] "Europe/Belfast"                   "Europe/Belgrade"                 
[433] "Europe/Berlin"                    "Europe/Bratislava"               
[435] "Europe/Brussels"                  "Europe/Bucharest"                
[437] "Europe/Budapest"                  "Europe/Busingen"                 
[439] "Europe/Chisinau"                  "Europe/Copenhagen"               
[441] "Europe/Dublin"                    "Europe/Gibraltar"                
[443] "Europe/Guernsey"                  "Europe/Helsinki"                 
[445] "Europe/Isle_of_Man"               "Europe/Istanbul"                 
[447] "Europe/Jersey"                    "Europe/Kaliningrad"              
[449] "Europe/Kiev"                      "Europe/Kirov"                    
[451] "Europe/Kyiv"                      "Europe/Lisbon"                   
[453] "Europe/Ljubljana"                 "Europe/London"                   
[455] "Europe/Luxembourg"                "Europe/Madrid"                   
[457] "Europe/Malta"                     "Europe/Mariehamn"                
[459] "Europe/Minsk"                     "Europe/Monaco"                   
[461] "Europe/Moscow"                    "Europe/Nicosia"                  
[463] "Europe/Oslo"                      "Europe/Paris"                    
[465] "Europe/Podgorica"                 "Europe/Prague"                   
[467] "Europe/Riga"                      "Europe/Rome"                     
[469] "Europe/Samara"                    "Europe/San_Marino"               
[471] "Europe/Sarajevo"                  "Europe/Saratov"                  
[473] "Europe/Simferopol"                "Europe/Skopje"                   
[475] "Europe/Sofia"                     "Europe/Stockholm"                
[477] "Europe/Tallinn"                   "Europe/Tirane"                   
[479] "Europe/Tiraspol"                  "Europe/Ulyanovsk"                
[481] "Europe/Uzhgorod"                  "Europe/Vaduz"                    
[483] "Europe/Vatican"                   "Europe/Vienna"                   
[485] "Europe/Vilnius"                   "Europe/Volgograd"                
[487] "Europe/Warsaw"                    "Europe/Zagreb"                   
[489] "Europe/Zaporozhye"                "Europe/Zurich"                   
[491] "Factory"                          "GB"                              
[493] "GB-Eire"                          "GMT"                             
[495] "GMT-0"                            "GMT+0"                           
[497] "GMT0"                             "Greenwich"                       
[499] "Hongkong"                         "HST"                             
[501] "Iceland"                          "Indian/Antananarivo"             
[503] "Indian/Chagos"                    "Indian/Christmas"                
[505] "Indian/Cocos"                     "Indian/Comoro"                   
[507] "Indian/Kerguelen"                 "Indian/Mahe"                     
[509] "Indian/Maldives"                  "Indian/Mauritius"                
[511] "Indian/Mayotte"                   "Indian/Reunion"                  
[513] "Iran"                             "Israel"                          
[515] "Jamaica"                          "Japan"                           
[517] "Kwajalein"                        "Libya"                           
[519] "MET"                              "Mexico/BajaNorte"                
[521] "Mexico/BajaSur"                   "Mexico/General"                  
[523] "MST"                              "MST7MDT"                         
[525] "Navajo"                           "NZ"                              
[527] "NZ-CHAT"                          "Pacific/Apia"                    
[529] "Pacific/Auckland"                 "Pacific/Bougainville"            
[531] "Pacific/Chatham"                  "Pacific/Chuuk"                   
[533] "Pacific/Easter"                   "Pacific/Efate"                   
[535] "Pacific/Enderbury"                "Pacific/Fakaofo"                 
[537] "Pacific/Fiji"                     "Pacific/Funafuti"                
[539] "Pacific/Galapagos"                "Pacific/Gambier"                 
[541] "Pacific/Guadalcanal"              "Pacific/Guam"                    
[543] "Pacific/Honolulu"                 "Pacific/Johnston"                
[545] "Pacific/Kanton"                   "Pacific/Kiritimati"              
[547] "Pacific/Kosrae"                   "Pacific/Kwajalein"               
[549] "Pacific/Majuro"                   "Pacific/Marquesas"               
[551] "Pacific/Midway"                   "Pacific/Nauru"                   
[553] "Pacific/Niue"                     "Pacific/Norfolk"                 
[555] "Pacific/Noumea"                   "Pacific/Pago_Pago"               
[557] "Pacific/Palau"                    "Pacific/Pitcairn"                
[559] "Pacific/Pohnpei"                  "Pacific/Ponape"                  
[561] "Pacific/Port_Moresby"             "Pacific/Rarotonga"               
[563] "Pacific/Saipan"                   "Pacific/Samoa"                   
[565] "Pacific/Tahiti"                   "Pacific/Tarawa"                  
[567] "Pacific/Tongatapu"                "Pacific/Truk"                    
[569] "Pacific/Wake"                     "Pacific/Wallis"                  
[571] "Pacific/Yap"                      "Poland"                          
[573] "Portugal"                         "PRC"                             
[575] "PST8PDT"                          "ROC"                             
[577] "ROK"                              "Singapore"                       
[579] "SystemV/AST4"                     "SystemV/AST4ADT"                 
[581] "SystemV/CST6"                     "SystemV/CST6CDT"                 
[583] "SystemV/EST5"                     "SystemV/EST5EDT"                 
[585] "SystemV/HST10"                    "SystemV/MST7"                    
[587] "SystemV/MST7MDT"                  "SystemV/PST8"                    
[589] "SystemV/PST8PDT"                  "SystemV/YST9"                    
[591] "SystemV/YST9YDT"                  "Turkey"                          
[593] "UCT"                              "Universal"                       
[595] "US/Alaska"                        "US/Aleutian"                     
[597] "US/Arizona"                       "US/Central"                      
[599] "US/East-Indiana"                  "US/Eastern"                      
[601] "US/Hawaii"                        "US/Indiana-Starke"               
[603] "US/Michigan"                      "US/Mountain"                     
[605] "US/Pacific"                       "US/Samoa"                        
[607] "UTC"                              "W-SU"                            
[609] "WET"                              "Zulu"                            
attr(,"Version")
[1] "2024a"
as.numeric(today())
[1] 19887
as_date(1)
[1] "1970-01-02"
dates_class |> 
    select(today) |> 
    mutate(year = year(today)) |>
    mutate(month = month(today, label = TRUE, abbr = FALSE, locale = "fr_FR")) |> 
    mutate(quarter = quarter(today)) |>
    mutate(week = week(today)) |>
    mutate(day = day(today)) |>
    mutate(day_of_week = wday(today, label = TRUE, abbr = FALSE, locale = "fr_FR")) |>
    mutate(day_of_year = yday(today)) |>
    mutate(week_of_year = week(today)) 
# A tibble: 1 × 9
  today       year month quarter  week   day day_of_week day_of_year
  <date>     <dbl> <ord>   <int> <dbl> <int> <ord>             <dbl>
1 2024-06-13  2024 June        2    24    13 Thursday            165
# ℹ 1 more variable: week_of_year <dbl>
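The accessor functions above work on any date object, not only `today()`. A minimal sketch with a fixed date, so the results are reproducible (the date itself is arbitrary):

```r
library(lubridate)

d <- as_date("2024-06-13")

year(d)     # 2024
quarter(d)  # 2
month(d)    # 6
day(d)      # 13
yday(d)     # 165, the day of the year
```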

Module 05c: Tables

library(tidyverse)
library(gt)
library(gtsummary)
library(knitr)
library(DT)

Import

We continue to work with a subset of the “What a Waste” database.

waste_gt <- read_rds("/cloud/project/data/processed/waste-city-level-sml.rds")

Transform

waste_tbl_income <- waste_gt |> 
    filter(!is.na(generation_kg_capita))  |> 
    group_by(income_cat) |> 
    summarise(
        count = n(),
        mean = mean(generation_kg_capita),
        sd = sd(generation_kg_capita),
        median = median(generation_kg_capita),
        min = min(generation_kg_capita),
        max = max(generation_kg_capita)
    )

Table

waste_tbl_income
# A tibble: 4 × 7
  income_cat          count  mean    sd median    min   max
  <fct>               <int> <dbl> <dbl>  <dbl>  <dbl> <dbl>
1 high income            71  477.  214.   421. 116.   1142.
2 upper-middle income    72  381.  133.   378. 130.    828.
3 lower-middle income   116  275.  179.   219.  62.1  1109.
4 low income             67  215.  130.   182.   6.86  694.
waste_tbl_income |> 
    gt() |> 
    tab_header(title = "Waste generation per capita (kg/year) by income group",
               subtitle = "Data from 326 cities") |>
    fmt_number(columns = count:max, decimals = 0) |> 
    cols_label(income_cat = "income category")
Waste generation per capita (kg/year) by income group
Data from 326 cities
income category count mean sd median min max
high income 71 477 214 421 116 1,142
upper-middle income 72 381 133 378 130 828
lower-middle income 116 275 179 219 62 1,109
low income 67 215 130 182 7 694

Table 1 highlights that cities in countries classified as high income generate more waste per capita than cities in lower-income countries.

waste_tbl_income |> 
    rename(`income category` = income_cat) |>
    kable(digits = 0)
Table 1: Waste generation per capita (kg/year) by income group. Data from 326 cities.
income category count mean sd median min max
high income 71 477 214 421 116 1142
upper-middle income 72 381 133 378 130 828
lower-middle income 116 275 179 219 62 1109
low income 67 215 130 182 7 694
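The DT package is loaded at the top of this module but otherwise unused; it offers an interactive HTML alternative to gt() and kable(), with sorting and searching in the rendered document. A minimal sketch (the argument choices are illustrative):

```r
library(DT)

# Interactive table widget; renders in HTML output
waste_tbl_income |>
  datatable(rownames = FALSE) |>
  formatRound(columns = c("mean", "sd", "median", "min", "max"), digits = 0)
```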

Module 06a: Cross-references

Tables and Figures

library(tidyverse)
library(ggthemes)
library(palmerpenguins)
library(gt)

Task 1: Tables

  1. Render the document and identify if the cross-reference to the table generated from the code below works.

  2. Fix the label in the code-chunk below so that the cross-reference works.

  3. Render the document to check if the cross-reference to the table generated from the code below works
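For reference, a cross-referenceable table in Quarto needs a chunk label with the `tbl-` prefix and a table caption set via chunk options at the top of the code chunk (the label text below is illustrative); the table is then referenced in the text as `@tbl-penguins-bill`:

```r
#| label: tbl-penguins-bill
#| tbl-cap: "Bill depth of penguins by island and species."
```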

See Table 2 for data on a few penguins.

penguins |> 
  filter(!is.na(bill_depth_mm)) |> 
  group_by(island, species) |>
  summarise(n = n(),
            mean_bill_depth = mean(bill_depth_mm),
            sd_bill_depth = sd(bill_depth_mm)) |>
  ungroup() |> 
  gt() |> 
  fmt_number(columns = c(mean_bill_depth, sd_bill_depth),
             decimals = 1)
Table 2: Bill depth of penguins by island and species.
island species n mean_bill_depth sd_bill_depth
Biscoe Adelie 44 18.4 1.2
Biscoe Gentoo 123 15.0 1.0
Dream Adelie 56 18.3 1.1
Dream Chinstrap 68 18.4 1.1
Torgersen Adelie 51 18.4 1.3

Task 2: Figures

  1. Add a caption and a label for a figure to the code chunk options below.
  2. Add a cross-reference to the figure generated from the code below.
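Analogously to tables, a cross-referenceable figure needs a chunk label with the `fig-` prefix and a `fig-cap` chunk option (label text illustrative); the figure is then referenced as `@fig-bill-scatter`:

```r
#| label: fig-bill-scatter
#| fig-cap: "Bill length and depth of penguins."
```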

In Figure 1, we see that …

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, 
                     y = bill_depth_mm,
                     color = species,
                     shape = species)) +
  geom_point() +
  scale_color_colorblind() +
  labs(x = "Bill length (mm)", y = "Bill depth (mm)") +
  theme_minimal()
Figure 1: Bill length and depth of penguins

Module 06b: Vector types

library(tidyverse)
library(gapminder)

Part 1: (Atomic) Vectors

There are six types of atomic vectors: logical, integer, double, character, complex, and raw.

Integer and double vectors are collectively known as numeric vectors.

  • lgl: logical
  • int: integer
  • dbl: double
  • chr: character
glimpse(gapminder)
Rows: 1,704
Columns: 6
$ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …

Types of atomic vectors

vector_lgl <- c(TRUE, TRUE, FALSE)
typeof(vector_lgl)
[1] "logical"
sum(vector_lgl)
[1] 2
as.numeric(vector_lgl)
[1] 1 1 0
vector_int <- c(1L, 3L, 6L)
typeof(vector_int)
[1] "integer"
vector_dbl <- c(1293, 5.1, 90.5)
typeof(vector_dbl)
[1] "double"
vector_chr <- c("large", "small", "medium")
typeof(vector_chr)
[1] "character"

Logical vectors

vector_dbl > 150
[1]  TRUE FALSE FALSE
"large" == vector_chr
[1]  TRUE FALSE FALSE
str_detect(vector_chr, "lar")
[1]  TRUE FALSE FALSE

Explicit vector coercion & augmented vectors

Vectors can also carry arbitrary additional metadata in the form of attributes. These attributes are used to create augmented vectors, which add behavior on top of the basic types. For example, factors are built on top of integer vectors.

vector_fct <- factor(vector_chr, levels = c("small", "medium", "large"))

typeof(vector_fct)
[1] "integer"
attributes(vector_fct)
$levels
[1] "small"  "medium" "large" 

$class
[1] "factor"
as.integer(vector_fct)
[1] 3 1 2
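Explicit coercion uses the as.*() family. One pitfall worth noting, shown in a small self-contained sketch: as.integer() on a factor returns the internal level codes, not the labels, so numbers stored as a factor need a two-step conversion.

```r
as.numeric(c("1.5", "2.5"))      # character -> double: 1.5 2.5

# Pitfall: coercing a factor directly yields the level codes
f <- factor(c("10", "30", "20"))
as.integer(f)                    # 1 3 2 (codes, levels sorted alphabetically)
as.integer(as.character(f))      # 10 30 20 (the labels themselves)
```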

Tibbles / Dataframes

Tibbles / data frames are built from vectors of equal length: each vector forms a column, and that common length is the number of rows.

tib_data <- tibble(
  vector_lgl,
  vector_int,
  vector_dbl,
  vector_chr,
  vector_fct,
  date = Sys.Date()
)

Accessing a vector from a dataframe

tib_data |> 
  pull(vector_fct)
[1] large  small  medium
Levels: small medium large
tib_data$vector_fct
[1] large  small  medium
Levels: small medium large
tib_data[5]
# A tibble: 3 × 1
  vector_fct
  <fct>     
1 large     
2 small     
3 medium    
tib_data[[5]]
[1] large  small  medium
Levels: small medium large

Part 2: Programming with R

For loops

Iterate code for each element in a vector.

size <- tib_data$vector_fct

for (s in size) {
  msg <- paste(
    "------", s, "------"
  )
  print(msg) 
}
[1] "------ large ------"
[1] "------ small ------"
[1] "------ medium ------"

If statement

pet <- c("bat", "cat", "dog", "bird", "horse")

for(p in pet) {
  if(p == "dog") {
    msg <- paste("A", p, "is the best!")
  } else {
    msg <- paste("A", p, "is okay I guess.")
  }
  print(msg) 
}
[1] "A bat is okay I guess."
[1] "A cat is okay I guess."
[1] "A dog is the best!"
[1] "A bird is okay I guess."
[1] "A horse is okay I guess."
sounds <- c(NA, "meow", "woof", "chirp", "neigh")

for (i in seq_along(pet)) {
  if (pet[i] == "dog") {
    message <- paste("The", pet[i], "goes", sounds[i])
  } else {
    message <- paste("The", pet[i], "says", sounds[i])
  }
  print(message)
}
[1] "The bat says NA"
[1] "The cat says meow"
[1] "The dog goes woof"
[1] "The bird says chirp"
[1] "The horse says neigh"

Module 06c: Exercises

library(tidyverse)
library(nycflights13)

Task 1: Numeric vector

  1. Create a numeric vector using c() with the numbers from 1 to 10. Run the code.

  2. Create a numeric vector using seq(1, 10) and run the code.

  3. What’s the difference between the two vectors?

c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
 [1]  1  2  3  4  5  6  7  8  9 10
seq(1, 10)
 [1]  1  2  3  4  5  6  7  8  9 10

Task 2: Character vector

  1. Create a character vector using c() with the letters from “a” to “f”. Run the code.

  2. On a new line, write letters and run the code. What’s stored in the letters object?

  3. On a new line, write ?letters and run the code. What did you learn?

c("a", "b", "c", "d", "e", "f")
[1] "a" "b" "c" "d" "e" "f"
letters
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"

Task 3: Numeric sequences

  1. Create a numeric vector using seq(1, 100, 1) and run the code. What does the code do?

  2. Create a numeric vector using runif(100, 1, 100) and run the code. What does the code do?

  3. Create a numeric vector using sample(1:100, 100, replace = FALSE) and run the code. What does the code do?

seq(1, 100, 1)
  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
 [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
 [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
 [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
 [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
 [91]  91  92  93  94  95  96  97  98  99 100
runif(100, 1, 100)
  [1] 38.778210 58.330600 27.899647  7.796088 81.541475 71.885980 84.836257
  [8] 29.604210 66.357969 17.926685 20.439570 50.686240 44.896382 20.140063
 [15] 16.415615 35.256813 15.242959 75.188393 24.823551 97.258622 60.565691
 [22] 41.860249 25.692112 10.665276 94.760276  1.062801 44.621982 73.434445
 [29] 95.337043 94.564929 76.204060 44.218349 84.533288 91.212240 19.745296
 [36] 80.452509 69.095412 80.497132 58.705038 74.817285 16.170162 50.574742
 [43] 43.679575 22.219881 19.881512 33.371115 30.016625 90.680238 98.509094
 [50] 14.478689  6.739723 26.671297 92.227470 89.322627  2.762168 63.323627
 [57] 81.134110 68.450249 94.252972 48.517861  1.506401 34.999222 23.508857
 [64]  8.259930 97.435193 67.243346 58.083158 42.099956 94.861612  3.545431
 [71] 20.845850 39.500589 92.115329 96.127512 56.378400  4.923146 72.407167
 [78] 80.214474  8.567918 45.176046  4.057912  7.897774 79.921334 72.203024
 [85] 75.633223 99.979251 64.457892 48.543277 70.607990 17.109271 54.824040
 [92] 79.411269 47.832787 16.182316 43.531321 51.209082 40.532696 91.270367
 [99] 86.211758 21.596911
sample(1:100, 100, replace = FALSE)
  [1]  74  27  26  58  54  64  23  50  73  29  69  22   5  25  85  66  68  28
 [19]  59  88  87  72  44   4  75  57  80   7   2  63  37  65  34  43  67  92
 [37]  15  97  31  55  49  39  62  83  79  36  24  19  12  89  93  13  32   6
 [55]  14  98  82   3  35  47  21  16  71  20  99  81  30  91  42 100  52   1
 [73]  77  60  33  17  56  10  61  76   8  78  18  48  40  46  96  41   9  95
 [91]  86  90  84  51  53  11  45  70  38  94
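Because runif() and sample() draw random numbers, the outputs above change on every run; set.seed() fixes the state of the random number generator and makes results reproducible. A minimal sketch:

```r
set.seed(42)            # fix the random number generator state
x <- sample(1:100, 5)

set.seed(42)            # same seed ...
y <- sample(1:100, 5)

identical(x, y)         # TRUE: same seed gives the same draws
```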

Task 4: Numeric sequences along a character vector

  1. Create a numeric vector using seq_along(letters) and run the code. What does the code do?

  2. Create a character vector using month.name and run the code. What does the code do?

  3. Create a numeric vector using seq_along(month.name) and run the code. What does the code do?

seq_along(letters)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26
month.name
 [1] "January"   "February"  "March"     "April"     "May"       "June"     
 [7] "July"      "August"    "September" "October"   "November"  "December" 
seq_along(month.name)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12

Task 5: Distribution of random numbers

  1. Create a numeric vector using runif(n = 1000, min = 1, max = 100) |> hist() and run the code. What does the code do? Remove |> hist() and run the code again. What does the code do?

  2. Create a numeric vector using rnorm(n = 1000, mean = 500, sd = 150) |> hist() and run the code. What does the code do? Remove |> hist() and run the code again. What does the code do?

runif(n = 1000, min = 1, max = 100) |> hist()

rnorm(n = 1000, mean = 500, sd = 150) |> hist()

Task 6: Logical vectors

  1. Create a numeric vector using rnorm(n = 1000, mean = 50, sd = 5) and use the assignment operator to store it in an object called norm100. Run the code.

  2. Write:

  • mean(norm100) and run the code. What does the code do?
  • norm100 >= 50 and run the code. What does the code do?
  • sum(norm100 >= 50) and run the code. What does the code do?
  • mean(norm100 >= 50) and run the code. What does the code do?
norm100 <- rnorm(n = 1000, mean = 50, sd = 5)

mean(norm100)
[1] 50.14718
norm100 >= 50
   [1] FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
  [13]  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE
(output truncated: 1,000 logical values in total)
sum(norm100 >= 50)
[1] 523
mean(norm100 >= 50)
[1] 0.523

Task 7 (stretch goal)

In this task, we will use the flights data object of the nycflights13 package. The flights data object contains information about all flights that departed from the NYC airports (i.e., EWR, JFK, and LGA) in 2013. It contains 336,776 rows and 19 columns.

  1. Use the flights data object with mutate() to create delayed, a variable that displays whether a flight was delayed (arr_delay > 0).

  2. Use relocate() to move delayed to the front of the data frame. Run the code. What vector type is the delayed variable?

  3. Then, remove all rows that contain an NA in delayed.

  4. Finally, create a summary table with summarise() that shows

  • How many flights were delayed
  • What proportion of flights were delayed
flights |> 
  mutate(delayed = arr_delay > 0) |> 
  relocate(delayed) |> 
  filter(!is.na(delayed)) |> 
  summarise(sum = sum(delayed),
            prop = mean(delayed))
# A tibble: 1 × 2
     sum  prop
   <int> <dbl>
1 133004 0.406

Assignment 06: Data formats

Part 1: Data preparation

Task 1: Load packages

The required packages for this homework exercise have already been added.

  1. Run the code chunk below to load the required packages. Tip: Click on the green play button in the top right corner of the code chunk.
  2. What’s the tidyverse package? Describe it in at most two sentences below.
library(tidyverse)

Task 2: Import data

  1. Use the read_csv() (Note: Watch out for the _ and don’t use the . as in read.csv()) function to import the “msw-generation-and-composition-by-income.csv” data from the data directory and assign it to an object with the name waste_data.
waste_data <- read_csv(file = "/cloud/project/data/msw-generation-and-composition-by-income.csv")

Task 3: Vector coercion

  1. Use waste_data and count() to create a frequency table for the income_cat variable.

  2. Use the c() function to create a vector with a sensible order for the values in income_cat. Use the assignment operator <- to assign the vector to an object with the name levels_income_cat.

  3. Starting with the waste_data object, use the pipe operator and the mutate() function to convert the income_cat variable from a variable of type character to a variable of type factor. Use the levels you defined in the previous step, matching the spelling of the categories shown in the frequency table for income_cat.

  4. Assign the created data frame to an object with the name waste_data_fct.

  5. Render and fix any errors

# Create a frequency table for the income_cat variable
waste_data %>% 
  count(income_cat)
# A tibble: 4 × 2
  income_cat              n
  <chr>               <int>
1 high income            81
2 low income             33
3 lower-middle income    47
4 upper-middle income    56
# Create vector with a sensible order for the values in income_cat;
# the levels must match the spellings in the frequency table above,
# otherwise factor() silently converts unmatched values to NA
levels_income_cat <- c(
  "low income",
  "lower-middle income",
  "upper-middle income",
  "high income"
)

# Convert income_cat from character to factor with the defined levels
waste_data_fct <- waste_data %>%
  mutate(income_cat = factor(income_cat, levels = levels_income_cat))

Task 4: From wide to long

  1. Starting with the waste_data_fct object, use the pivot_longer() function to convert the data frame from a wide to a long format. Apply the following:
  • bring all columns from food_organic_waste to yard_garden_green_waste into a long format
  • send the variable names to a column named “waste_category”
  • send the values of the variables to a column named “percent”
  2. Remove all NAs from the percent variable.

  3. Assign the created data frame to an object with the name waste_data_long.

  4. Render and fix any errors.

waste_data_long <- waste_data_fct %>% 
  pivot_longer(cols = food_organic_waste:yard_garden_green_waste,
               names_to = "waste_category", 
               values_to = "percent") %>% 
  filter(!is.na(percent))  # Remove rows where percent column is NA
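To see what pivot_longer() is doing under the hood, here is the same reshaping written out by hand in base R on a made-up two-column example (all column names and values below are invented):

```r
# A tiny made-up wide table: one column per waste category
wide <- data.frame(country = c("A", "B"),
                   glass   = c(6.1, 2.0),
                   metal   = c(5.1, 2.6))

# The long format stacks the value columns and records each column's
# name in waste_category -- this is what pivot_longer() automates
long <- data.frame(
  country        = rep(wide$country, times = 2),
  waste_category = rep(c("glass", "metal"), each = 2),
  percent        = c(wide$glass, wide$metal)
)
long  # 4 rows: every country/waste_category pair gets its own row
```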

Part 2: Data summary

Task 1: Import data

I have stored the data that I would have expected at the end of the previous task and import it here.

  1. Run the code in the code chunk below.
waste_data_long <- read_rds("/cloud/project/data/msw-generation-and-composition-by-income-long.rds")

Task 2: Summarise data

  1. Starting with waste_data_long, group the data by income_cat and waste_category, then create a summary table containing the mean of percentages (call this mean_percent) for each group.

    • could this be done with a for-loop?
  2. Assign the created data frame to an object with the name waste_data_long_mean.

waste_data_long_mean <- waste_data_long %>%
  group_by(income_cat, waste_category) %>%
  summarise(mean_percent = mean(percent))
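On the side question above: yes, the same grouped summary could be computed with a for-loop, although group_by() + summarise() is shorter and less error-prone. A base-R sketch on a made-up stand-in for waste_data_long (all values invented):

```r
# Toy data standing in for waste_data_long
waste <- data.frame(
  income_cat     = c("low income", "low income", "high income", "high income"),
  waste_category = c("glass", "glass", "glass", "glass"),
  percent        = c(2.0, 1.0, 6.0, 7.0)
)

# One row per unique income_cat/waste_category combination
groups <- unique(waste[, c("income_cat", "waste_category")])
groups$mean_percent <- NA_real_

# Loop over the combinations and average `percent` within each group
for (i in seq_len(nrow(groups))) {
  rows <- waste$income_cat == groups$income_cat[i] &
          waste$waste_category == groups$waste_category[i]
  groups$mean_percent[i] <- mean(waste$percent[rows])
}
groups  # same result as group_by() + summarise(), with more code
```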

Task 3: Table display

  1. Starting with the waste_data_long_mean object, execute the code and observe the output in the Console. Would you publish this table in this format in a report?
    • No. While it is in the long format, it still contains NA values. Before publishing I would also properly recode the variable “income_cat”.
waste_data_long_mean
# A tibble: 36 × 3
# Groups:   income_cat [4]
   income_cat          waste_category          mean_percent
   <fct>               <chr>                          <dbl>
 1 high income         food_organic_waste             32.8 
 2 high income         glass                           6.12
 3 high income         metal                           5.13
 4 high income         other                          16.8 
 5 high income         paper_cardboard                21.3 
 6 high income         plastic                        12.4 
 7 high income         rubber_leather                  2.98
 8 high income         wood                            5.54
 9 high income         yard_garden_green_waste         9.31
10 upper-middle income food_organic_waste             45.4 
# ℹ 26 more rows

Task 4: From long to wide

  1. Starting with the waste_data_long_mean object, use the pipe operator to add another line of code which uses the pivot_wider() function to bring the data from a long format into a wide format using names for variables from waste_category and corresponding values from mean_percent

  2. Execute the code and observe the output in the Console. Would you publish this table in a report in this format?

    • For a report I find the wide format more intelligible; however, the data can no longer be plotted directly (for example as histograms), so I would not publish it in that way.
  3. Render and fix any errors

waste_data_long_mean %>% 
  pivot_wider(names_from = waste_category,
              values_from = mean_percent)
# A tibble: 4 × 10
# Groups:   income_cat [4]
  income_cat        food_organic_waste glass metal other paper_cardboard plastic
  <fct>                          <dbl> <dbl> <dbl> <dbl>           <dbl>   <dbl>
1 high income                     32.8  6.12  5.13  16.8           21.3    12.4 
2 upper-middle inc…               45.4  4.42  3.88  18.2           12.1    12.3 
3 lower-middle inc…               50.4  3.68  3.92  16.5           10.6    11.1 
4 low income                      50.9  1.94  2.58  28.5            8.35    7.97
# ℹ 3 more variables: rubber_leather <dbl>, wood <dbl>,
#   yard_garden_green_waste <dbl>
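The inverse step can also be written out by hand to show what pivot_wider() automates: each waste_category value becomes its own column. A base-R sketch on made-up values (it assumes the long rows are consistently sorted within each category):

```r
# Toy long table: one row per income_cat/waste_category pair
long <- data.frame(income_cat     = c("low", "low", "high", "high"),
                   waste_category = c("glass", "metal", "glass", "metal"),
                   mean_percent   = c(2.0, 2.5, 6.0, 5.0))

# Spread waste_category back into columns
wide <- data.frame(income_cat = unique(long$income_cat))
for (cat in unique(long$waste_category)) {
  wide[[cat]] <- long$mean_percent[long$waste_category == cat]
}
wide  # one row per income_cat, one column per waste_category
```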

Part 3: Data visualization

Task 1: Import data

I have stored the data that I would have expected at the end of the previous task and import it here.

  1. Run the code in the code chunk below.
waste_data_long_mean <- read_rds("/cloud/project/data/msw-generation-and-composition-by-income-long-mean.rds")

Task 2: Reproduce a plot

  1. Render and fix any errors.

  2. Reproduce the plot that you see as an image below when you render the report and view the output in your Viewer tab in the bottom right window.

Hint: To get those bars displayed next to each other, use the geom_col() function and apply the position = position_dodge() argument and value. The colors don’t have to match exactly, just avoid the default color scale.

Note: The size of the plot will be different. That is alright and does not need to match.

ggplot(data = waste_data_long_mean,
       aes(x = mean_percent,
           y = waste_category,
           fill = income_cat)) +
  geom_col(position = position_dodge(0.9), na.rm = TRUE) +
  labs(x = "Mean Percent", y = "Waste Category", fill = "Income Category") +
  theme_minimal()

Module 07: Writing scholarly articles

Scholarly writing

Scholarly articles require much more detail in their front matter than simply a title and an author. Quarto provides a rich set of YAML metadata keys to describe these details. You can copy & paste from this example to your own report.

Task 1: Front Matter

  • Replace the values under author for name, orcid, email, and affiliation with your own
  • Render the document to see the changes
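A minimal sketch of such front matter in Quarto’s scholarly YAML (every value below is a placeholder to replace with your own):

```yaml
---
title: "Exam Preparation"
author:
  - name: Your Name                # placeholder
    orcid: 0000-0000-0000-0000     # placeholder ORCID
    email: your.name@example.org   # placeholder
    affiliations:
      - name: Your University      # placeholder
bibliography: references.bib
format: html
---
```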

Task 2: Citations

  1. Add the citation key for the paper “‘My flight arrives at 5 am, can you pick me up?’: The gatekeeping burden of the african academic” as an in-text reference to the sentence below

In @tilley2021my, the authors describe how visitors still expect a personal pick-up, despite the availability of taxi services.

  2. Add the citation key for the paper “‘The rich will always be able to dispose of their waste’: a view from the frontlines of municipal failure in Makhanda, South Africa” as a citation at the end of the sentence below.

Inequality underpins waste management systems, structuring who can or cannot access services [@kalina2023rich].

Bibliographies

Your folder already contains a references.bib file. One way of creating and adding to this file is to use the RStudio Visual Editor mode. Another is to export a collection from the Zotero reference management tool. Part of your homework will be to set up Zotero. For your literature research, you will then use Zotero, and in your final report you will cite references from an exported .bib file. https://rbtl-fs24.github.io/website/project
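For orientation, a .bib entry has the shape below. The citation key and title are the ones used above; the remaining fields are left as placeholders to be filled in from Zotero rather than guessed:

```bibtex
@article{tilley2021my,
  title   = {'My flight arrives at 5 am, can you pick me up?':
             The gatekeeping burden of the African academic},
  author  = {...},
  journal = {...},
  year    = {...}
}
```

Citing `[@tilley2021my]` in the Quarto document then renders the full reference from this entry.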