library(ggplot2)
library(dplyr)
library(gapminder)
library(gt)
Exam Preparation
Module 01: Getting Started with R
Introduction
Data
Data can be imported from many different sources. In this exercise, we import data from:
- an R package that is loaded via the library() function.
Gapminder data
For this analysis we’ll use the Gapminder dataset from the gapminder R package.
head(gapminder)
# A tibble: 6 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
Population
The table below shows the mean population in 2007, grouped by continent.
gapminder |>
  filter(year == 2007) |>
  group_by(continent) |>
  summarise(
    mean_pop = mean(pop)
  ) |>
  gt()

| continent | mean_pop |
|---|---|
| Africa | 17875763 |
| Americas | 35954847 |
| Asia | 115513752 |
| Europe | 19536618 |
| Oceania | 12274974 |
Life expectancy
gapminder_2007 <- gapminder |>
  filter(year == 2007)
ggplot(data = gapminder_2007,
mapping = aes(x = continent,
y = lifeExp)) +
geom_boxplot() +
geom_jitter(width = 0.1, alpha = 1/4, size = 3) +
labs(x = NULL,
y = "life expectancy") +
theme_minimal()
Module 02a: Data visualization with ggplot2
Import
library(ggplot2)
library(ggthemes)
library(ggridges)
library(palmerpenguins)
Explore
head(penguins)
# A tibble: 6 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
# ℹ 2 more variables: sex <fct>, year <int>
str(penguins)
tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
$ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
$ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
$ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
$ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
$ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
$ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
$ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
$ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
Visualize with ggplot2
Functions and arguments
- functions: ggplot(), aes(), geom_point()
- arguments: data, mapping, color
ggplot(data = penguins,
mapping = aes(x = flipper_length_mm,
y = body_mass_g)) +
geom_point()
Aesthetic mappings
- options: x, y, color, shape, size, alpha
ggplot(data = penguins,
mapping = aes(x = flipper_length_mm,
y = body_mass_g,
color = species,
shape = species)) +
geom_point()
Settings
ggplot(data = penguins,
mapping = aes(x = flipper_length_mm,
y = body_mass_g,
color = species,
shape = species)) +
geom_point(size = 5, alpha = 0.7)
Color scales
ggplot(data = penguins,
mapping = aes(x = flipper_length_mm,
y = body_mass_g,
color = species,
shape = species)) +
geom_point(size = 5, alpha = 0.7) +
scale_color_colorblind()
ggplot(data = penguins,
mapping = aes(x = flipper_length_mm,
y = body_mass_g,
color = species,
shape = species)) +
geom_point(size = 5, alpha = 0.7) +
scale_color_manual(values = c("red", "blue", "green"))
Facets
Keyboard shortcut for the tilde (~) varies by keyboard layout:
- US keyboard Windows/Mac: Shift + ` (top left of your keyboard next to the 1)
- UK keyboard Windows/Mac: Shift + # (bottom right of your keyboard, next to Enter)
- CH keyboard Windows/Mac: Alt/Option + -
ggplot(data = penguins,
mapping = aes(x = flipper_length_mm,
y = body_mass_g)) +
geom_point() +
facet_grid(species ~ island)
ggplot(data = penguins,
mapping = aes(x = flipper_length_mm,
y = body_mass_g)) +
geom_point() +
facet_wrap(~species)
Themes
Some code in this section is already prepared; we will add more code together.
ggplot(data = penguins,
mapping = aes(x = flipper_length_mm,
y = body_mass_g,
color = species,
shape = species)) +
geom_point(size = 5, alpha = 0.7) +
scale_color_colorblind() +
theme_minimal()
Visualizing distributions
Categorical variables
ggplot(data = penguins,
mapping = aes(x = species)) +
geom_bar()
ggplot(data = penguins,
mapping = aes(x = species,
fill = island)) +
geom_bar()
Numerical continuous variables
The code in this section is already prepared; we will run through the code chunks together.
ggplot(data = penguins,
mapping = aes(x = body_mass_g)) +
geom_histogram()
ggplot(data = penguins,
mapping = aes(x = body_mass_g,
fill = species)) +
geom_histogram()
ggplot(data = penguins,
mapping = aes(x = body_mass_g,
fill = species)) +
geom_density()
ggplot(data = penguins,
mapping = aes(x = body_mass_g,
y = species,
fill = species)) +
geom_density_ridges()
Module 02b: Working with R
Import
library(ggplot2)
library(dplyr)
library(gapminder)
Explore
head(gapminder)
# A tibble: 6 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
tail(gapminder)
# A tibble: 6 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Zimbabwe Africa 1982 60.4 7636524 789.
2 Zimbabwe Africa 1987 62.4 9216418 706.
3 Zimbabwe Africa 1992 60.4 10704340 693.
4 Zimbabwe Africa 1997 46.8 11404948 792.
5 Zimbabwe Africa 2002 40.0 11926563 672.
6 Zimbabwe Africa 2007 43.5 12311143 470.
glimpse(gapminder)
Rows: 1,704
Columns: 6
$ country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
$ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
str(gapminder)
tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
$ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
$ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
$ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
$ pop : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
$ gdpPercap: num [1:1704] 779 821 853 836 740 ...
nrow(gapminder)
[1] 1704
ncol(gapminder)
[1] 6
Transform - Narrow down
gapminder_2007 <- gapminder |>
  filter(year == 2007)
- Keyboard shortcut for pipe operator: Ctrl / Cmd + Shift + M
- Keyboard shortcut for assignment operator: Alt + -
Visualize
ggplot(data = gapminder_2007,
mapping = aes(x = continent,
y = lifeExp)) +
geom_boxplot()
Module 02c: Make a plot
Task 1: Import
The required packages for this homework exercise have already been added.
- Run the code chunk with the label ‘load-packages’ to load the required packages. Tip: Click on the green play button in the top right corner of the code chunk.
library(gapminder)
library(ggplot2)
library(dplyr)
Task 2: Transform for data in 2007
Below is a typical task description, as you will find them in the homework assignments. For “Fill in the gaps” tasks, you replace the underscores ___ with the described code and then change the value of the code chunk option from false to true. In other tasks, you will create your own code from scratch. Over time, the task descriptions will become less detailed.
Fill in the gaps
A code chunk has already been created below.
1. Start with the gapminder object and add the pipe operator at the end of the line.
2. On a new line, use the filter() function to narrow down the data to observations from the year 2007.
3. Use the assignment operator to assign the data to an object named gapminder_2007.
4. Run the code contained in the code chunk and fix any errors.
5. Next to the code chunk option #| eval:, change the value from false to true.
6. Render the document and fix any errors.
gapminder_2007 <- gapminder |>
  filter(year == 2007)
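For orientation, chunk options in Quarto sit at the top of the code chunk and each starts with #|. A minimal sketch of what such a chunk could look like once #| eval: has been switched to true (the label name is illustrative, not from the assignment):

```r
#| label: transform-2007
#| eval: true
gapminder_2007 <- gapminder |>
  filter(year == 2007)
```

With eval: false the chunk is shown but not executed when the document is rendered; switching it to true runs the code during rendering.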
Task 3: Create a boxplot
This is a typical task without any starter code.
1. Add a new code chunk below point 5.
2. Use the ggplot() function and the gapminder_2007 object to create a boxplot with the following aesthetic mappings:
   - continent to the x-axis;
   - life expectancy to the y-axis;
   - continent to color using the fill = continent argument inside aes().
3. Run the code contained in the code chunk and fix any errors.
4. What are the data types of the three variables used for aesthetic mappings?
ggplot(data = gapminder_2007,
mapping = aes(x = continent,
y = lifeExp,
fill = continent)) +
geom_boxplot()
Assignment 02: Data Visualisation
Task 1: Import
The required packages for this homework exercise have already been added.
- Run the code chunk with the label ‘load-packages’ to load the required packages. Tip: Click on the green play button in the top right corner of the code chunk.
library(gapminder)
library(ggplot2)
library(dplyr)
library(readr)
library(sf)
library(rnaturalearth)
Task 2: Transform for data in 2007
Fill in the gaps
A code chunk has already been created below.
1. Start with the gapminder object and add the pipe operator at the end of the line.
2. On a new line, use the filter() function to narrow down the data to observations from the year 2007.
3. Use the assignment operator to assign the data to an object named gapminder_2007.
4. Run the code contained in the code chunk and fix any errors.
5. Next to the code chunk option #| eval:, change the value from false to true.
6. Render the document and fix any errors.
gapminder_2007 <- gapminder |>
  filter(year == 2007)
Task 3: Summarize data for life expectancy by continent
Fill in the gaps
A code chunk has already been created.
1. Start with the gapminder_2007 object and add the pipe operator at the end of the line.
2. On a new line, use the group_by() function to group the operations that follow by continent. Add the pipe operator at the end of the line.
3. On a new line, use the summarise() function to calculate the number of observations (count) and the median life expectancy.
4. Use the assignment operator to assign the data to an object named gapminder_summary_2007.
5. Run the code contained in the code chunk and fix any errors.
6. Next to the code chunk option #| eval:, change the value from false to true.
7. Render the document and fix any errors.
gapminder_summary_2007 <- gapminder_2007 |>
  group_by(continent) |>
  summarise(
    count = n(),
    lifeExp = median(lifeExp)
  )
Task 4: Summarize data for life expectancy by continent and year
Fill in the gaps
A code chunk has already been created.
1. Start with the gapminder object and add the pipe operator at the end of the line.
2. On a new line, use the group_by() function to group the operations that follow by continent and year. Add the pipe operator at the end of the line.
3. On a new line, use the summarise() function to calculate the median life expectancy.
4. Use the assignment operator to assign the data to an object named gapminder_summary_continent_year.
5. Run the code contained in the code chunk and fix any errors.
6. Next to the code chunk option #| eval:, change the value from false to true.
7. Render the document and fix any errors.
gapminder_summary_continent_year <- gapminder |>
  group_by(continent, year) |>
  summarise(lifeExp = median(lifeExp))
Task 5: Data visualization
Thank you for working through the previous tasks. We are convinced that you have done a great job, but because the task descriptions aren’t always unambiguous, we have imported the data that we would have expected to be created and stored in the objects gapminder_2007, gapminder_summary_2007 and gapminder_summary_continent_year in the previous code chunks. This ensures that you can work through the following tasks.
- Run the code contained in the code chunk below to import the data.
gapminder_2007 <- read_rds(here::here("/cloud/project/data/gapminder-2007.rds"))
gapminder_summary_2007 <- read_rds(here::here("/cloud/project/data/gapminder-summary-2007.rds"))
gapminder_summary_continent_year <- read_rds(here::here("/cloud/project/data/gapminder-summary-continent-year.rds"))
Task 6: Create a boxplot
A code chunk has already been created.
1. Use the ggplot() function and the gapminder_2007 object to create a boxplot with the following aesthetic mappings:
   - continent to the x-axis;
   - life expectancy to the y-axis;
   - continent to color using the fill = continent argument inside aes().
2. Do not display (ignore) the outliers in the plot. Note: Use a search engine or an AI tool to find the solution and add the link to the solution you have found.
3. Run the code contained in the code chunk and fix any errors.
4. What are the data types of the three variables used for aesthetic mappings?
ggplot(data = gapminder_2007,
mapping = aes(x = continent,
y = lifeExp,
fill = continent)) +
geom_boxplot(outlier.shape = NA)
Task 7: Create a timeseries plot
A code chunk has already been created.
1. Use the ggplot() function and the gapminder_summary_continent_year object to create a connected scatterplot (also called a timeseries plot) using the geom_line() and geom_point() functions with the following aesthetic mappings:
   - year to the x-axis;
   - life expectancy to the y-axis;
   - continent to color using the color = continent argument inside aes().
2. Run the code contained in the code chunk and fix any errors.
ggplot(data = gapminder_summary_continent_year,
mapping = aes(x = year,
y = lifeExp,
color = continent)) +
geom_line() +
geom_point()
Task 8: Create a barplot
with geom_col()
A code chunk has already been created.
1. Use the ggplot() function and the gapminder_summary_2007 object to create a barplot using the geom_col() function with the following aesthetic mappings:
   - continent to the x-axis;
   - count to the y-axis.
2. Run the code contained in the code chunk and fix any errors.
ggplot(data = gapminder_summary_2007,
mapping = aes(x = continent,
y = count)) +
geom_col()
with geom_bar()
A code chunk has already been created.
1. Use the ggplot() function and the gapminder_2007 object to create a barplot using the geom_bar() function with the following aesthetic mapping:
   - continent to the x-axis.
2. Run the code contained in the code chunk and fix any errors.
3. The plot is identical to the plot created with geom_col(). Why? What does the geom_bar() function do? Write your text here:
ggplot(data = gapminder_2007,
mapping = aes(x = continent)) +
geom_bar()
Task 9: Create a histogram
A code chunk has already been created.
1. Use the ggplot() function and the gapminder_2007 object to create a histogram using the geom_histogram() function with the following aesthetic mappings:
   - life expectancy to the x-axis;
   - continent to color using the fill = continent argument inside aes().
2. Run the code contained in the code chunk and fix any errors.
3. Inside the geom_histogram() function, add the following arguments and values:
   - col = "grey30"
   - breaks = seq(40, 85, 2.5)
4. Run the code contained in the code chunk and fix any errors.
5. Describe how the geom_histogram() function is similar to the geom_bar() function.
6. What happens by adding the ‘breaks’ argument? Play around with the numbers inside of seq() to see what changes. Describe here what you observe:
ggplot(data = gapminder_2007,
mapping = aes(x = lifeExp,
fill = continent)) +
geom_histogram(col = "grey30", breaks = seq(40, 85, 2.5))
Task 10: Scatterplot and faceting
A code chunk has already been created.
1. Use the ggplot() function and the gapminder_2007 object to create a scatterplot using the geom_point() function with the following aesthetic mappings:
   - gdpPercap to the x-axis;
   - lifeExp to the y-axis;
   - population to the size argument;
   - country to color using the color = country argument inside aes().
2. Run the code contained in the code chunk and fix any errors.
3. Use the variable continent to facet the plot by adding: facet_wrap(~continent).
4. Run the code contained in the code chunk and fix any errors.
ggplot(data = gapminder_2007,
mapping = aes(x = gdpPercap,
y = lifeExp,
size = pop,
color = country)) +
geom_point(show.legend = FALSE) +
facet_wrap(~continent)
Task 11: Create a lineplot and use facets
A code chunk with complete code has already been prepared.
1. Run the code contained in the code chunk and fix any errors.
2. Remove the ‘#’ sign at the line that starts with the scale_color_manual() function.
3. What is stored in the country_colors object? Find out by executing the object in the Console (type it into the Console and hit Enter). Do the same again, but with a question mark: ?country_colors.
4. Next to the code chunk option #| eval:, change the value from false to true.
5. Render the document and fix any errors.
ggplot(data = gapminder,
mapping = aes(x = year,
y = lifeExp,
group = country,
color = country)) +
geom_line(lwd = 1, show.legend = FALSE) +
facet_wrap(~continent) +
# scale_color_manual(values = country_colors) +
theme_minimal()
Task 12: Create a choropleth map
You can also prepare maps with ggplot2. It is beyond the scope of the class to teach you the foundations of spatial data in R, but a popular package for working with spatial data is the sf (Simple Features) R package. The rnaturalearth R package facilitates world mapping by making Natural Earth map data more easily available to R users.
The code chunk below contains code for a world map that shows countries by income group. To view the map, do the following:
1. Run the code contained in the code chunk and fix any errors.
2. Next to the code chunk option #| eval:, change the value from false to true.
3. Render the document and fix any errors.
world <- ne_countries(scale = "small", returnclass = "sf")

world |>
  mutate(income_grp = factor(income_grp, ordered = TRUE)) |>
  ggplot(aes(fill = income_grp)) +
  geom_sf() +
  theme_void() +
  theme(legend.position = "top") +
  labs(fill = "Income Group:") +
  guides(fill = guide_legend(nrow = 2, byrow = TRUE))
The code for this code chunk is adapted from: https://bookdown.org/alhdzsz/data_viz_ir/maps.html
Working with spatial data in R
If you are interested in working with spatial data in R, then we recommend the following resources for further study:
- Geocomputation with R - Book: https://geocompr.robinlovelace.net/
- Simple Features for R - Article: https://r-spatial.github.io/sf/articles/sf1.html
- tmap: thematic maps in R - R Package: https://r-tmap.github.io/tmap/
Module 03a: Data transformation with dplyr
library(readr)
library(dplyr)
Import
In this exercise we use data of the UNICEF/WHO Joint Monitoring Programme (JMP) for Water Supply, Sanitation and Hygiene (WASH). The data is available at https://washdata.org/data and published as an R data package at https://github.com/WASHNote/jmpwashdata/.
The data set is available in the data folder as a CSV file named jmp_wld_sanitation_long.csv.
The data set contains the following variables:
- name: country name
- iso3: ISO3 country code
- year: year of observation
- region_sdg: SDG region
- residence: residence type (national, rural, urban)
- varname_short: short variable name (JMP naming convention)
- varname_long: long variable name (JMP naming convention)
- percent: percentage of the population at the respective sanitation service level
We use the read_csv() function to import the data set into R.
sanitation <- read_csv("/cloud/project/data/jmp_wld_sanitation_long.csv")
Explore
sanitation
# A tibble: 73,710 × 8
name iso3 year region_sdg varname_short varname_long residence percent
<chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
1 Afghanis… AFG 2000 Central a… san_bas basic sanit… national 21.9
2 Afghanis… AFG 2000 Central a… san_bas basic sanit… rural 19.3
3 Afghanis… AFG 2000 Central a… san_bas basic sanit… urban 30.9
4 Afghanis… AFG 2000 Central a… san_lim limited san… national 5.65
5 Afghanis… AFG 2000 Central a… san_lim limited san… rural 3.14
6 Afghanis… AFG 2000 Central a… san_lim limited san… urban 14.5
7 Afghanis… AFG 2000 Central a… san_unimp unimproved … national 46.7
8 Afghanis… AFG 2000 Central a… san_unimp unimproved … rural 46.3
9 Afghanis… AFG 2000 Central a… san_unimp unimproved … urban 48.1
10 Afghanis… AFG 2000 Central a… san_od no sanitati… national 25.8
# ℹ 73,700 more rows
glimpse(sanitation)
Rows: 73,710
Columns: 8
$ name <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanista…
$ iso3 <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", …
$ year <dbl> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 20…
$ region_sdg <chr> "Central and Southern Asia", "Central and Southern Asia"…
$ varname_short <chr> "san_bas", "san_bas", "san_bas", "san_lim", "san_lim", "…
$ varname_long <chr> "basic sanitation services", "basic sanitation services"…
$ residence <chr> "national", "rural", "urban", "national", "rural", "urba…
$ percent <dbl> 21.870802, 19.322798, 30.863719, 5.648528, 3.136148, 14.…
Transform with dplyr
The dplyr R package aims to provide a function for each basic verb of data manipulation. These verbs can be organised into three categories based on the component of the dataset that they work with:
- Rows
- Columns
- Groups of rows
filter()
The function filter() chooses rows based on column values. To use filtering effectively, you have to know how to select the observations that you want using the comparison operators.
R provides the standard suite: >, >=, <, <=, != (not equal), and == (equal).
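To see how these operators behave on their own, here is a small base-R illustration (the values are chosen for demonstration only). Each comparison returns a logical vector, and filter() keeps the rows where that vector is TRUE:

```r
years <- c(1952, 1997, 2002, 2007)

years == 2007             # FALSE FALSE FALSE  TRUE
years != 2007             # TRUE  TRUE  TRUE FALSE
years >= 2002             # FALSE FALSE  TRUE  TRUE
years %in% c(1952, 2007)  # TRUE FALSE FALSE  TRUE
```

The %in% operator, used further below, is a compact alternative to chaining several == comparisons with |.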
sanitation |>
  filter(residence == "national")
# A tibble: 24,570 × 8
name iso3 year region_sdg varname_short varname_long residence percent
<chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
1 Afghanis… AFG 2000 Central a… san_bas basic sanit… national 21.9
2 Afghanis… AFG 2000 Central a… san_lim limited san… national 5.65
3 Afghanis… AFG 2000 Central a… san_unimp unimproved … national 46.7
4 Afghanis… AFG 2000 Central a… san_od no sanitati… national 25.8
5 Afghanis… AFG 2000 Central a… san_sm safely mana… national NA
6 Afghanis… AFG 2001 Central a… san_bas basic sanit… national 21.9
7 Afghanis… AFG 2001 Central a… san_lim limited san… national 5.66
8 Afghanis… AFG 2001 Central a… san_unimp unimproved … national 46.7
9 Afghanis… AFG 2001 Central a… san_od no sanitati… national 25.8
10 Afghanis… AFG 2001 Central a… san_sm safely mana… national NA
# ℹ 24,560 more rows
sanitation |>
  filter(residence != "national")
# A tibble: 49,140 × 8
name iso3 year region_sdg varname_short varname_long residence percent
<chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
1 Afghanis… AFG 2000 Central a… san_bas basic sanit… rural 19.3
2 Afghanis… AFG 2000 Central a… san_bas basic sanit… urban 30.9
3 Afghanis… AFG 2000 Central a… san_lim limited san… rural 3.14
4 Afghanis… AFG 2000 Central a… san_lim limited san… urban 14.5
5 Afghanis… AFG 2000 Central a… san_unimp unimproved … rural 46.3
6 Afghanis… AFG 2000 Central a… san_unimp unimproved … urban 48.1
7 Afghanis… AFG 2000 Central a… san_od no sanitati… rural 31.3
8 Afghanis… AFG 2000 Central a… san_od no sanitati… urban 6.51
9 Afghanis… AFG 2000 Central a… san_sm safely mana… rural NA
10 Afghanis… AFG 2000 Central a… san_sm safely mana… urban NA
# ℹ 49,130 more rows
sanitation |>
  filter(residence == "national", iso3 == "SEN")
# A tibble: 105 × 8
name iso3 year region_sdg varname_short varname_long residence percent
<chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
1 Senegal SEN 2000 Sub-Saharan… san_bas basic sanit… national 37.5
2 Senegal SEN 2000 Sub-Saharan… san_lim limited san… national 10.8
3 Senegal SEN 2000 Sub-Saharan… san_unimp unimproved … national 27.4
4 Senegal SEN 2000 Sub-Saharan… san_od no sanitati… national 24.4
5 Senegal SEN 2000 Sub-Saharan… san_sm safely mana… national 14.0
6 Senegal SEN 2001 Sub-Saharan… san_bas basic sanit… national 38.4
7 Senegal SEN 2001 Sub-Saharan… san_lim limited san… national 11.0
8 Senegal SEN 2001 Sub-Saharan… san_unimp unimproved … national 26.8
9 Senegal SEN 2001 Sub-Saharan… san_od no sanitati… national 23.7
10 Senegal SEN 2001 Sub-Saharan… san_sm safely mana… national 14.3
# ℹ 95 more rows
sanitation |>
  filter(iso3 == "UGA" | iso3 == "PER" | iso3 == "IND")
# A tibble: 945 × 8
name iso3 year region_sdg varname_short varname_long residence percent
<chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
1 India IND 2000 Central and S… san_bas basic sanit… national 15.0
2 India IND 2000 Central and S… san_bas basic sanit… rural 2.25
3 India IND 2000 Central and S… san_bas basic sanit… urban 48.4
4 India IND 2000 Central and S… san_lim limited san… national 5.15
5 India IND 2000 Central and S… san_lim limited san… rural 0.515
6 India IND 2000 Central and S… san_lim limited san… urban 17.3
7 India IND 2000 Central and S… san_unimp unimproved … national 5.73
8 India IND 2000 Central and S… san_unimp unimproved … rural 5.00
9 India IND 2000 Central and S… san_unimp unimproved … urban 7.65
10 India IND 2000 Central and S… san_od no sanitati… national 74.1
# ℹ 935 more rows
sanitation |>
  filter(iso3 %in% c("UGA", "PER", "IND"))
# A tibble: 945 × 8
name iso3 year region_sdg varname_short varname_long residence percent
<chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
1 India IND 2000 Central and S… san_bas basic sanit… national 15.0
2 India IND 2000 Central and S… san_bas basic sanit… rural 2.25
3 India IND 2000 Central and S… san_bas basic sanit… urban 48.4
4 India IND 2000 Central and S… san_lim limited san… national 5.15
5 India IND 2000 Central and S… san_lim limited san… rural 0.515
6 India IND 2000 Central and S… san_lim limited san… urban 17.3
7 India IND 2000 Central and S… san_unimp unimproved … national 5.73
8 India IND 2000 Central and S… san_unimp unimproved … rural 5.00
9 India IND 2000 Central and S… san_unimp unimproved … urban 7.65
10 India IND 2000 Central and S… san_od no sanitati… national 74.1
# ℹ 935 more rows
sanitation |>
  filter(percent > 80)
# A tibble: 8,314 × 8
name iso3 year region_sdg varname_short varname_long residence percent
<chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
1 Albania ALB 2000 Northern Am… san_bas basic sanit… national 89.5
2 Albania ALB 2000 Northern Am… san_bas basic sanit… rural 84.2
3 Albania ALB 2000 Northern Am… san_bas basic sanit… urban 96.9
4 Albania ALB 2001 Northern Am… san_bas basic sanit… national 90.0
5 Albania ALB 2001 Northern Am… san_bas basic sanit… rural 84.9
6 Albania ALB 2001 Northern Am… san_bas basic sanit… urban 97.0
7 Albania ALB 2002 Northern Am… san_bas basic sanit… national 90.6
8 Albania ALB 2002 Northern Am… san_bas basic sanit… rural 85.7
9 Albania ALB 2002 Northern Am… san_bas basic sanit… urban 97.0
10 Albania ALB 2003 Northern Am… san_bas basic sanit… national 91.2
# ℹ 8,304 more rows
sanitation |>
  filter(percent <= 5)
# A tibble: 21,424 × 8
name iso3 year region_sdg varname_short varname_long residence percent
<chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
1 Afghanis… AFG 2000 Central a… san_lim limited san… rural 3.14
2 Afghanis… AFG 2001 Central a… san_lim limited san… rural 3.14
3 Afghanis… AFG 2002 Central a… san_lim limited san… rural 3.35
4 Afghanis… AFG 2003 Central a… san_lim limited san… rural 3.57
5 Afghanis… AFG 2004 Central a… san_lim limited san… rural 3.79
6 Afghanis… AFG 2005 Central a… san_lim limited san… rural 4.01
7 Afghanis… AFG 2005 Central a… san_od no sanitati… urban 4.86
8 Afghanis… AFG 2006 Central a… san_lim limited san… rural 4.22
9 Afghanis… AFG 2006 Central a… san_od no sanitati… urban 4.45
10 Afghanis… AFG 2007 Central a… san_lim limited san… rural 4.44
# ℹ 21,414 more rows
sanitation |>
  filter(is.na(percent))
# A tibble: 19,743 × 8
name iso3 year region_sdg varname_short varname_long residence percent
<chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
1 Afghanis… AFG 2000 Central a… san_sm safely mana… national NA
2 Afghanis… AFG 2000 Central a… san_sm safely mana… rural NA
3 Afghanis… AFG 2000 Central a… san_sm safely mana… urban NA
4 Afghanis… AFG 2001 Central a… san_sm safely mana… national NA
5 Afghanis… AFG 2001 Central a… san_sm safely mana… rural NA
6 Afghanis… AFG 2001 Central a… san_sm safely mana… urban NA
7 Afghanis… AFG 2002 Central a… san_sm safely mana… national NA
8 Afghanis… AFG 2002 Central a… san_sm safely mana… rural NA
9 Afghanis… AFG 2002 Central a… san_sm safely mana… urban NA
10 Afghanis… AFG 2003 Central a… san_sm safely mana… national NA
# ℹ 19,733 more rows
sanitation |>
  filter(!is.na(percent))
# A tibble: 53,967 × 8
name iso3 year region_sdg varname_short varname_long residence percent
<chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
1 Afghanis… AFG 2000 Central a… san_bas basic sanit… national 21.9
2 Afghanis… AFG 2000 Central a… san_bas basic sanit… rural 19.3
3 Afghanis… AFG 2000 Central a… san_bas basic sanit… urban 30.9
4 Afghanis… AFG 2000 Central a… san_lim limited san… national 5.65
5 Afghanis… AFG 2000 Central a… san_lim limited san… rural 3.14
6 Afghanis… AFG 2000 Central a… san_lim limited san… urban 14.5
7 Afghanis… AFG 2000 Central a… san_unimp unimproved … national 46.7
8 Afghanis… AFG 2000 Central a… san_unimp unimproved … rural 46.3
9 Afghanis… AFG 2000 Central a… san_unimp unimproved … urban 48.1
10 Afghanis… AFG 2000 Central a… san_od no sanitati… national 25.8
# ℹ 53,957 more rows
- Keyboard shortcut for vertical bar | (OR) in US/CH is: Shift + / (Windows) and Option + / (Mac)
- Keyboard shortcut for vertical bar | (OR) in UK: It’s complicated
- Keyboard shortcut for pipe operator: Ctrl / Cmd + Shift + M
- Keyboard shortcut for assignment operator: Alt + -
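As a reminder of what the pipe operator does: x |> f() is just another way of writing f(x), so a pipeline reads top to bottom instead of inside out. A small base-R illustration (toy numbers, not the sanitation data):

```r
# Nested call: read from the inside out
sqrt(sum(c(1, 4, 9, 11)))
#> [1] 5

# Same computation with the native pipe: read left to right
c(1, 4, 9, 11) |> sum() |> sqrt()
#> [1] 5
```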
Storing a resulting data frame as a new object
sanitation_national_2020_sm <- sanitation |>
  filter(residence == "national",
         year == 2020,
         varname_short == "san_sm")
arrange()
The function arrange() changes the order of the rows.
sanitation_national_2020_sm |>
  arrange(percent)
# A tibble: 234 × 8
name iso3 year region_sdg varname_short varname_long residence percent
<chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
1 Ethiopia ETH 2020 Sub-Sahar… san_sm safely mana… national 6.68
2 Togo TGO 2020 Sub-Sahar… san_sm safely mana… national 9.13
3 Chad TCD 2020 Sub-Sahar… san_sm safely mana… national 10.1
4 Madagasc… MDG 2020 Sub-Sahar… san_sm safely mana… national 10.4
5 Guinea-B… GNB 2020 Sub-Sahar… san_sm safely mana… national 12.2
6 North Ma… MKD 2020 Northern … san_sm safely mana… national 12.2
7 Democrat… COD 2020 Sub-Sahar… san_sm safely mana… national 12.7
8 Ghana GHA 2020 Sub-Sahar… san_sm safely mana… national 13.3
9 Central … CAF 2020 Sub-Sahar… san_sm safely mana… national 13.6
10 Sierra L… SLE 2020 Sub-Sahar… san_sm safely mana… national 14.0
# ℹ 224 more rows
sanitation_national_2020_sm |>
  arrange(desc(percent))
# A tibble: 234 × 8
name iso3 year region_sdg varname_short varname_long residence percent
<chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
1 Andorra AND 2020 Northern … san_sm safely mana… national 100.
2 Kuwait KWT 2020 Western A… san_sm safely mana… national 100
3 Monaco MCO 2020 Northern … san_sm safely mana… national 100
4 Singapore SGP 2020 Eastern a… san_sm safely mana… national 100
5 Republic… KOR 2020 Eastern a… san_sm safely mana… national 99.9
6 Switzerl… CHE 2020 Northern … san_sm safely mana… national 99.7
7 Austria AUT 2020 Northern … san_sm safely mana… national 99.6
8 United A… ARE 2020 Western A… san_sm safely mana… national 99.2
9 Liechten… LIE 2020 Northern … san_sm safely mana… national 98.8
10 United S… USA 2020 Northern … san_sm safely mana… national 98.3
# ℹ 224 more rows
select()
The select() function chooses columns based on their names.
sanitation_national_2020_sm |>
  select(name, percent)
# A tibble: 234 × 2
name percent
<chr> <dbl>
1 Afghanistan NA
2 Albania 47.7
3 Algeria 17.6
4 American Samoa NA
5 Andorra 100.
6 Angola NA
7 Anguilla NA
8 Antigua and Barbuda NA
9 Argentina NA
10 Armenia 69.3
# ℹ 224 more rows
sanitation_national_2020_sm |>
  select(-varname_short)
# A tibble: 234 × 7
name iso3 year region_sdg varname_long residence percent
<chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
1 Afghanistan AFG 2020 Central and S… safely mana… national NA
2 Albania ALB 2020 Northern Amer… safely mana… national 47.7
3 Algeria DZA 2020 Western Asia … safely mana… national 17.6
4 American Samoa ASM 2020 Oceania safely mana… national NA
5 Andorra AND 2020 Northern Amer… safely mana… national 100.
6 Angola AGO 2020 Sub-Saharan A… safely mana… national NA
7 Anguilla AIA 2020 Latin America… safely mana… national NA
8 Antigua and Barbuda ATG 2020 Latin America… safely mana… national NA
9 Argentina ARG 2020 Latin America… safely mana… national NA
10 Armenia ARM 2020 Western Asia … safely mana… national 69.3
# ℹ 224 more rows
sanitation_national_2020_sm |>
  select(name:region_sdg, percent)
# A tibble: 234 × 5
name iso3 year region_sdg percent
<chr> <chr> <dbl> <chr> <dbl>
1 Afghanistan AFG 2020 Central and Southern Asia NA
2 Albania ALB 2020 Northern America and Europe 47.7
3 Algeria DZA 2020 Western Asia and Northern Africa 17.6
4 American Samoa ASM 2020 Oceania NA
5 Andorra AND 2020 Northern America and Europe 100.
6 Angola AGO 2020 Sub-Saharan Africa NA
7 Anguilla AIA 2020 Latin America and the Caribbean NA
8 Antigua and Barbuda ATG 2020 Latin America and the Caribbean NA
9 Argentina ARG 2020 Latin America and the Caribbean NA
10 Armenia ARM 2020 Western Asia and Northern Africa 69.3
# ℹ 224 more rows
rename()
The rename() function changes the names of variables.
sanitation |>
  rename(country = name)
# A tibble: 73,710 × 8
country iso3 year region_sdg varname_short varname_long residence percent
<chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
1 Afghanis… AFG 2000 Central a… san_bas basic sanit… national 21.9
2 Afghanis… AFG 2000 Central a… san_bas basic sanit… rural 19.3
3 Afghanis… AFG 2000 Central a… san_bas basic sanit… urban 30.9
4 Afghanis… AFG 2000 Central a… san_lim limited san… national 5.65
5 Afghanis… AFG 2000 Central a… san_lim limited san… rural 3.14
6 Afghanis… AFG 2000 Central a… san_lim limited san… urban 14.5
7 Afghanis… AFG 2000 Central a… san_unimp unimproved … national 46.7
8 Afghanis… AFG 2000 Central a… san_unimp unimproved … rural 46.3
9 Afghanis… AFG 2000 Central a… san_unimp unimproved … urban 48.1
10 Afghanis… AFG 2000 Central a… san_od no sanitati… national 25.8
# ℹ 73,700 more rows
mutate()
The mutate() function adds new variables based on existing variables or external data.
sanitation |>
  mutate(prop = percent / 100)
# A tibble: 73,710 × 9
name iso3 year region_sdg varname_short varname_long residence percent
<chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
1 Afghanis… AFG 2000 Central a… san_bas basic sanit… national 21.9
2 Afghanis… AFG 2000 Central a… san_bas basic sanit… rural 19.3
3 Afghanis… AFG 2000 Central a… san_bas basic sanit… urban 30.9
4 Afghanis… AFG 2000 Central a… san_lim limited san… national 5.65
5 Afghanis… AFG 2000 Central a… san_lim limited san… rural 3.14
6 Afghanis… AFG 2000 Central a… san_lim limited san… urban 14.5
7 Afghanis… AFG 2000 Central a… san_unimp unimproved … national 46.7
8 Afghanis… AFG 2000 Central a… san_unimp unimproved … rural 46.3
9 Afghanis… AFG 2000 Central a… san_unimp unimproved … urban 48.1
10 Afghanis… AFG 2000 Central a… san_od no sanitati… national 25.8
# ℹ 73,700 more rows
# ℹ 1 more variable: prop <dbl>
sanitation |>
  mutate(id = 1:n())
# A tibble: 73,710 × 9
name iso3 year region_sdg varname_short varname_long residence percent
<chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
1 Afghanis… AFG 2000 Central a… san_bas basic sanit… national 21.9
2 Afghanis… AFG 2000 Central a… san_bas basic sanit… rural 19.3
3 Afghanis… AFG 2000 Central a… san_bas basic sanit… urban 30.9
4 Afghanis… AFG 2000 Central a… san_lim limited san… national 5.65
5 Afghanis… AFG 2000 Central a… san_lim limited san… rural 3.14
6 Afghanis… AFG 2000 Central a… san_lim limited san… urban 14.5
7 Afghanis… AFG 2000 Central a… san_unimp unimproved … national 46.7
8 Afghanis… AFG 2000 Central a… san_unimp unimproved … rural 46.3
9 Afghanis… AFG 2000 Central a… san_unimp unimproved … urban 48.1
10 Afghanis… AFG 2000 Central a… san_od no sanitati… national 25.8
# ℹ 73,700 more rows
# ℹ 1 more variable: id <int>
relocate()
The relocate() function changes the positions of columns.
sanitation |>
  mutate(id = 1:n()) |>
  relocate(id)
# A tibble: 73,710 × 9
id name iso3 year region_sdg varname_short varname_long residence
<int> <chr> <chr> <dbl> <chr> <chr> <chr> <chr>
1 1 Afghanistan AFG 2000 Central a… san_bas basic sanit… national
2 2 Afghanistan AFG 2000 Central a… san_bas basic sanit… rural
3 3 Afghanistan AFG 2000 Central a… san_bas basic sanit… urban
4 4 Afghanistan AFG 2000 Central a… san_lim limited san… national
5 5 Afghanistan AFG 2000 Central a… san_lim limited san… rural
6 6 Afghanistan AFG 2000 Central a… san_lim limited san… urban
7 7 Afghanistan AFG 2000 Central a… san_unimp unimproved … national
8 8 Afghanistan AFG 2000 Central a… san_unimp unimproved … rural
9 9 Afghanistan AFG 2000 Central a… san_unimp unimproved … urban
10 10 Afghanistan AFG 2000 Central a… san_od no sanitati… national
# ℹ 73,700 more rows
# ℹ 1 more variable: percent <dbl>
sanitation |>
  mutate(id = 1:n()) |>
  relocate(id, .before = name)
# A tibble: 73,710 × 9
id name iso3 year region_sdg varname_short varname_long residence
<int> <chr> <chr> <dbl> <chr> <chr> <chr> <chr>
1 1 Afghanistan AFG 2000 Central a… san_bas basic sanit… national
2 2 Afghanistan AFG 2000 Central a… san_bas basic sanit… rural
3 3 Afghanistan AFG 2000 Central a… san_bas basic sanit… urban
4 4 Afghanistan AFG 2000 Central a… san_lim limited san… national
5 5 Afghanistan AFG 2000 Central a… san_lim limited san… rural
6 6 Afghanistan AFG 2000 Central a… san_lim limited san… urban
7 7 Afghanistan AFG 2000 Central a… san_unimp unimproved … national
8 8 Afghanistan AFG 2000 Central a… san_unimp unimproved … rural
9 9 Afghanistan AFG 2000 Central a… san_unimp unimproved … urban
10 10 Afghanistan AFG 2000 Central a… san_od no sanitati… national
# ℹ 73,700 more rows
# ℹ 1 more variable: percent <dbl>
summarise()
The summarise() function reduces multiple values down to a single summary.
sanitation_national_2020_sm |>
  summarise()
# A tibble: 1 × 0
sanitation_national_2020_sm |>
  summarise(mean_percent = mean(percent))
# A tibble: 1 × 1
mean_percent
<dbl>
1 NA
sanitation_national_2020_sm |>
  summarise(mean_percent = mean(percent, na.rm = TRUE))
# A tibble: 1 × 1
mean_percent
<dbl>
1 60.3
sanitation_national_2020_sm |>
  summarise(n = n(),
            mean_percent = mean(percent, na.rm = TRUE))
# A tibble: 1 × 2
n mean_percent
<int> <dbl>
1 234 60.3
sanitation_national_2020_sm |>
  filter(!is.na(percent)) |>
  summarise(n = n(),
            mean_percent = mean(percent),
            sd_percent = sd(percent))
# A tibble: 1 × 3
n mean_percent sd_percent
<int> <dbl> <dbl>
1 120 60.3 29.9
group_by()
The group_by() function is used to group the data by one or more variables.
sanitation_national_2020_sm |>
  group_by(region_sdg)
# A tibble: 234 × 8
# Groups: region_sdg [8]
name iso3 year region_sdg varname_short varname_long residence percent
<chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
1 Afghanis… AFG 2020 Central a… san_sm safely mana… national NA
2 Albania ALB 2020 Northern … san_sm safely mana… national 47.7
3 Algeria DZA 2020 Western A… san_sm safely mana… national 17.6
4 American… ASM 2020 Oceania san_sm safely mana… national NA
5 Andorra AND 2020 Northern … san_sm safely mana… national 100.
6 Angola AGO 2020 Sub-Sahar… san_sm safely mana… national NA
7 Anguilla AIA 2020 Latin Ame… san_sm safely mana… national NA
8 Antigua … ATG 2020 Latin Ame… san_sm safely mana… national NA
9 Argentina ARG 2020 Latin Ame… san_sm safely mana… national NA
10 Armenia ARM 2020 Western A… san_sm safely mana… national 69.3
# ℹ 224 more rows
sanitation_national_2020_sm |>
  group_by(region_sdg) |>
  summarise(n = n(),
            mean_percent = mean(percent),
            sd_percent = sd(percent))
# A tibble: 8 × 4
region_sdg n mean_percent sd_percent
<chr> <int> <dbl> <dbl>
1 Australia and New Zealand 2 78.2 5.61
2 Central and Southern Asia 14 NA NA
3 Eastern and South-Eastern Asia 18 NA NA
4 Latin America and the Caribbean 50 NA NA
5 Northern America and Europe 53 NA NA
6 Oceania 21 NA NA
7 Sub-Saharan Africa 51 NA NA
8 Western Asia and Northern Africa 25 NA NA
sanitation_national_2020_sm |>
  filter(!is.na(percent)) |>
  group_by(region_sdg) |>
  summarise(n = n(),
            mean_percent = mean(percent),
            sd_percent = sd(percent))
# A tibble: 8 × 4
region_sdg n mean_percent sd_percent
<chr> <int> <dbl> <dbl>
1 Australia and New Zealand 2 78.2 5.61
2 Central and Southern Asia 5 58.2 21.5
3 Eastern and South-Eastern Asia 11 69.8 21.4
4 Latin America and the Caribbean 14 43.4 16.8
5 Northern America and Europe 44 81.9 19.9
6 Oceania 3 36.1 10.7
7 Sub-Saharan Africa 21 21.4 10.9
8 Western Asia and Northern Africa 20 62.7 29.5
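Since dplyr 1.1.0, the same grouped summary can be written without an explicit group_by() step, using the per-operation .by argument. A sketch, assuming a recent dplyr version; note that .by returns groups in order of first appearance, so the row order may differ:

```r
# Per-operation grouping (dplyr >= 1.1.0): the grouping applies only
# to this summarise() call and the result comes back ungrouped
sanitation_national_2020_sm |>
  filter(!is.na(percent)) |>
  summarise(n = n(),
            mean_percent = mean(percent),
            sd_percent = sd(percent),
            .by = region_sdg)
```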
count()
The count() function is a convenient wrapper for group_by() and summarise(n = n()). You can prepare frequency tables with count().
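The wrapper relationship can be made explicit: the two pipelines below return the same frequency table (a sketch using the sanitation data from this module):

```r
# count() ...
sanitation |>
  count(region_sdg)

# ... is shorthand for group_by() followed by summarise(n = n())
sanitation |>
  group_by(region_sdg) |>
  summarise(n = n())
```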
sanitation |>
  count(region_sdg)
# A tibble: 8 × 2
region_sdg n
<chr> <int>
1 Australia and New Zealand 630
2 Central and Southern Asia 4410
3 Eastern and South-Eastern Asia 5670
4 Latin America and the Caribbean 15750
5 Northern America and Europe 16695
6 Oceania 6615
7 Sub-Saharan Africa 16065
8 Western Asia and Northern Africa 7875
sanitation |>
  count(varname_short)
# A tibble: 5 × 2
varname_short n
<chr> <int>
1 san_bas 14742
2 san_lim 14742
3 san_od 14742
4 san_sm 14742
5 san_unimp 14742
sanitation |>
  count(varname_long)
# A tibble: 5 × 2
varname_long n
<chr> <int>
1 basic sanitation services 14742
2 limited sanitation services 14742
3 no sanitation facilities 14742
4 safely managed sanitation services 14742
5 unimproved sanitation facilities 14742
sanitation |>
  count(varname_short, varname_long)
# A tibble: 5 × 3
varname_short varname_long n
<chr> <chr> <int>
1 san_bas basic sanitation services 14742
2 san_lim limited sanitation services 14742
3 san_od no sanitation facilities 14742
4 san_sm safely managed sanitation services 14742
5 san_unimp unimproved sanitation facilities 14742
Module 03b: Filter function
library(readr)
library(dplyr)
library(ggplot2)
library(ggthemes)
Import
In this exercise we use data from the UNICEF/WHO Joint Monitoring Programme (JMP) for Water Supply, Sanitation and Hygiene (WASH). The data is available at https://washdata.org/data and published as an R data package at https://github.com/WASHNote/jmpwashdata/.
The data set jmp_wld_sanitation_long is available in the data folder of this repository. The data set is in long format and contains the following variables:
- name: country name
- iso3: ISO3 country code
- year: year of observation
- region_sdg: SDG region
- residence: residence type (national, rural, urban)
- varname_short: short variable name (JMP naming convention)
- varname_long: long variable name (JMP naming convention)
We use the read_csv() function to import the data set into R.
sanitation <- read_csv("/cloud/project/data/jmp_wld_sanitation_long.csv")
Transform
Task 1.1
- Run all code chunks above.
- Use the filter() function to create a subset from the sanitation data containing national estimates for the year 2020.
- Store the result as a new object in your environment with the name sanitation_national_2020.
sanitation_national_2020 <- sanitation |>
  filter(residence == "national", year == 2020)
Task 1.2
- Use the filter() function to create a subset from the sanitation data containing urban and rural estimates for Nigeria.
- Store the result as a new object in your environment with the name sanitation_nigeria_urban_rural.
sanitation_nigeria_urban_rural <- sanitation |>
  filter(name == "Nigeria", residence != "national")
Task 1.3 (stretch goal)
- Use the ggplot() function to create a connected scatterplot with geom_point() and geom_line() for the data you created in Task 1.2.
- Use the aes() function to map the year variable to the x-axis, the percent variable to the y-axis, and the varname_short variable to the color and group aesthetics.
- Use facet_wrap() to create a separate plot for the urban and rural populations.
- Change the colors using scale_color_colorblind().
ggplot(data = sanitation_nigeria_urban_rural,
mapping = aes(x = year,
y = percent,
group = varname_short,
color = varname_short)) +
geom_point() +
geom_line() +
facet_wrap(~residence) +
scale_color_colorblind()
Module 03c: Summary data transformation
library(readr)
library(dplyr)
Import
In this exercise we use data from the UNICEF/WHO Joint Monitoring Programme (JMP) for Water Supply, Sanitation and Hygiene (WASH). The data is available at https://washdata.org/data and published as an R data package at https://github.com/WASHNote/jmpwashdata/.
The data set jmp_wld_sanitation_long is available in the data folder of this repository. The data set is in long format and contains the following variables:
- name: country name
- iso3: ISO3 country code
- year: year of observation
- region_sdg: SDG region
- residence: residence type (national, rural, urban)
- varname_short: short variable name (JMP naming convention)
- varname_long: long variable name (JMP naming convention)
We use the read_csv() function to import the data set into R.
sanitation <- read_csv("/cloud/project/data/jmp_wld_sanitation_long.csv")
Task 1.1
- Run all code chunks above.
- Use the glimpse() function to get an overview of the data set.
- How many variables are in the data set?
glimpse(sanitation)
Rows: 73,710
Columns: 8
$ name <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanista…
$ iso3 <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", …
$ year <dbl> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 20…
$ region_sdg <chr> "Central and Southern Asia", "Central and Southern Asia"…
$ varname_short <chr> "san_bas", "san_bas", "san_bas", "san_lim", "san_lim", "…
$ varname_long <chr> "basic sanitation services", "basic sanitation services"…
$ residence <chr> "national", "rural", "urban", "national", "rural", "urba…
$ percent <dbl> 21.870802, 19.322798, 30.863719, 5.648528, 3.136148, 14.…
Transform
Task 2.1
- Use the count() function to identify how many SDG regions are included in the data set.
- How many SDG regions are in the data set?
sanitation |>
  count(region_sdg)
# A tibble: 8 × 2
region_sdg n
<chr> <int>
1 Australia and New Zealand 630
2 Central and Southern Asia 4410
3 Eastern and South-Eastern Asia 5670
4 Latin America and the Caribbean 15750
5 Northern America and Europe 16695
6 Oceania 6615
7 Sub-Saharan Africa 16065
8 Western Asia and Northern Africa 7875
Task 2.2
- Use the count() function to identify the levels in the varname_short and varname_long variables.
- Which indicator in varname_long does san_od refer to?
sanitation |>
  count(varname_short, varname_long)
# A tibble: 5 × 3
varname_short varname_long n
<chr> <chr> <int>
1 san_bas basic sanitation services 14742
2 san_lim limited sanitation services 14742
3 san_od no sanitation facilities 14742
4 san_sm safely managed sanitation services 14742
5 san_unimp unimproved sanitation facilities 14742
Task 2.3
- Use the filter() function to create a subset from the sanitation data containing national estimates for people with “no sanitation facilities” for the year 2020.
- Store the result as a new object in your environment with the name sanitation_national_2020_od.
sanitation_national_2020_od <- sanitation |>
  filter(residence == "national",
         year == 2020,
         varname_short == "san_od")
Task 2.4
- Use the sanitation_national_2020_od data and the count() function to identify the number of countries with 0% for the indicator “no sanitation facilities” in 2020.
sanitation_national_2020_od |>
  count(percent)
# A tibble: 104 × 2
percent n
<dbl> <int>
1 0 96
2 0.00670 1
3 0.0107 1
4 0.0169 1
5 0.0317 1
6 0.0418 1
7 0.0965 1
8 0.100 1
9 0.105 1
10 0.127 1
# ℹ 94 more rows
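Instead of scanning the frequency table, the zero-percent countries can also be counted directly. A minimal sketch:

```r
# Keep only countries reporting exactly 0% and count them;
# this matches the first row of the count() output above (n = 96)
sanitation_national_2020_od |>
  filter(percent == 0) |>
  summarise(n_zero = n())
```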
Task 2.5
- How many countries in the sanitation_national_2020_od data had no estimate for “no sanitation facilities” in 2020? Tip: A country without an estimate has NA for the percent variable.
sanitation_national_2020_od |>
  filter(is.na(percent))
# A tibble: 36 × 8
name iso3 year region_sdg varname_short varname_long residence percent
<chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
1 Anguilla AIA 2020 Latin Ame… san_od no sanitati… national NA
2 Antigua … ATG 2020 Latin Ame… san_od no sanitati… national NA
3 Argentina ARG 2020 Latin Ame… san_od no sanitati… national NA
4 Aruba ABW 2020 Latin Ame… san_od no sanitati… national NA
5 Azerbaij… AZE 2020 Western A… san_od no sanitati… national NA
6 Bahamas BHS 2020 Latin Ame… san_od no sanitati… national NA
7 Barbados BRB 2020 Latin Ame… san_od no sanitati… national NA
8 Bosnia a… BIH 2020 Northern … san_od no sanitati… national NA
9 British … VGB 2020 Latin Ame… san_od no sanitati… national NA
10 Brunei D… BRN 2020 Eastern a… san_od no sanitati… national NA
# ℹ 26 more rows
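A variant that returns the number of missing estimates directly, rather than printing the rows (a sketch):

```r
# is.na() gives TRUE/FALSE per row; sum() counts the TRUEs,
# giving 36, consistent with the 36-row tibble above
sanitation_national_2020_od |>
  summarise(n_missing = sum(is.na(percent)))
```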
Task 2.6
- Use the sanitation_national_2020_od data in combination with the group_by() and summarise() functions to calculate the mean, standard deviation, and number of countries for the indicator “no sanitation facilities” in 2020.
- How did you treat the missing values for the percent variable in the calculation?
sanitation_national_2020_od |>
  filter(!is.na(percent)) |>
  group_by(region_sdg) |>
  summarise(
    mean = mean(percent, na.rm = TRUE),
    sd = sd(percent, na.rm = TRUE),
    n = n()
  )
# A tibble: 8 × 4
region_sdg mean sd n
<chr> <dbl> <dbl> <int>
1 Australia and New Zealand 0 0 2
2 Central and Southern Asia 3.29 5.38 13
3 Eastern and South-Eastern Asia 5.16 6.97 16
4 Latin America and the Caribbean 2.04 3.70 34
5 Northern America and Europe 0.00698 0.0271 48
6 Oceania 6.80 13.1 16
7 Sub-Saharan Africa 19.4 18.4 47
8 Western Asia and Northern Africa 1.67 5.41 22
Assignment 03: Data transformation with dplyr
library(readr)
library(dplyr)
library(ggplot2)
library(ggthemes)
Import
In this exercise we use data from the UNICEF/WHO Joint Monitoring Programme (JMP) for Water Supply, Sanitation and Hygiene (WASH). The data is available at https://washdata.org/data and published as an R data package at https://github.com/WASHNote/jmpwashdata/.
The data set jmp_wld_sanitation_long is available in the data folder of this repository. The data set is in long format and contains the following variables:
- name: country name
- iso3: ISO3 country code
- year: year of observation
- region_sdg: SDG region
- residence: residence type (national, rural, urban)
- varname_short: short variable name (JMP naming convention)
- varname_long: long variable name (JMP naming convention)
We use the read_csv() function to import the data set into R.
sanitation <- read_csv("/cloud/project/data/jmp_wld_sanitation_long.csv")
Task 1
- Run all code chunks above.
- Use the glimpse() function to get an overview of the data set.
- How many variables are in the data set?
glimpse(sanitation)
Rows: 73,710
Columns: 8
$ name <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanista…
$ iso3 <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", …
$ year <dbl> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 20…
$ region_sdg <chr> "Central and Southern Asia", "Central and Southern Asia"…
$ varname_short <chr> "san_bas", "san_bas", "san_bas", "san_lim", "san_lim", "…
$ varname_long <chr> "basic sanitation services", "basic sanitation services"…
$ residence <chr> "national", "rural", "urban", "national", "rural", "urba…
$ percent <dbl> 21.870802, 19.322798, 30.863719, 5.648528, 3.136148, 14.…
Task 2
- Use the count() function with varname_short and varname_long to identify the definitions of the levels in these two variables.
sanitation |>
  count(varname_short, varname_long)
# A tibble: 5 × 3
varname_short varname_long n
<chr> <chr> <int>
1 san_bas basic sanitation services 14742
2 san_lim limited sanitation services 14742
3 san_od no sanitation facilities 14742
4 san_sm safely managed sanitation services 14742
5 san_unimp unimproved sanitation facilities 14742
Task 3
- Use the filter() function to create a subset of the data set that only contains observations:
  - for a country of your choice,
  - for the years 2000 and 2020,
  - for all variables that are not “safely managed sanitation services”.
- Store the result as a new object in your environment with a name of your choice.
sanitation_uga <- sanitation |>
  filter(iso3 == "UGA",
         year %in% c(2000, 2020),
         varname_short != "san_sm")
Task 4
- Use the count() function with the data you created in Task 3 to verify that only the years 2000 and 2020 remain in the year variable.
sanitation_uga |>
  count(year)
# A tibble: 2 × 2
year n
<dbl> <int>
1 2000 12
2 2020 12
Task 5
- Use the ggplot() function to create a bar plot with geom_col() for the data you created in Task 3.
- Use the aes() function to map the residence variable to the x-axis, the percent variable to the y-axis, and the varname_long variable to the fill aesthetic.
- Use facet_wrap() to create a separate plot for each year.
- Change the fill colors using scale_fill_colorblind().
- Add labels to the bars by copying the code below this bullet point and adding it to your code for the plot.
geom_text(aes(label = round(percent, 1)),
position = position_stack(vjust = 0.5),
size = 3,
color = "white")
ggplot(data = sanitation_uga,
mapping = aes(x = residence,
y = percent,
fill = varname_long)) +
geom_col() +
facet_wrap(~year) +
scale_fill_colorblind() +
geom_text(aes(label = round(percent, 1)),
position = position_stack(vjust = 0.5),
size = 3,
color = "white")
Task 6
If you haven’t worked with JMP indicators before, the following questions will be challenging to answer.
- Look at the plot that you created. What do you notice about the order of the bars / order of the legend?
- What would you want to change?
- Why did we remove “safely managed sanitation services” from the data set in Task 3?
Task 7
- Run the code in the code chunk below.
- What do you observe when you look at the code and plot?
sanitation_2020 <- sanitation |>
  filter(year == 2020)
ggplot(data = sanitation_2020,
mapping = aes(x = percent, fill = varname_short)) +
geom_histogram() +
facet_grid(varname_short ~ residence, scales = "free_y") +
scale_fill_colorblind() +
theme(legend.position = "none")
Module 04a: Factors
library(ggplot2)
library(dplyr)
library(readr)
library(ggthemes)
Import
waste <- read_csv("/cloud/project/data/processed/waste-city-level-sml.csv")
Explore
- Run all code chunks above.
- Use the glimpse() function to inspect the waste object.
- What does the data cover? Briefly discuss with your room partner.
glimpse(waste)
Rows: 367
Columns: 6
$ country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afg…
$ city <chr> "Jalalabad", "Kandahar", "Mazar-E-Sharif", "Kabul…
$ iso3c <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AGO", "ALB", …
$ income_id <chr> "LIC", "LIC", "LIC", "LIC", "LIC", "LMC", "UMC", …
$ generation_tons_year <dbl> 58914.45, 120971.00, 52368.40, 1989250.00, 91644.…
$ population <dbl> 326585, 429000, 635250, 3700000, 337000, 4508000,…
- Use the count() function for the waste object to count the number of rows for each value of the income_id variable.
- What do the four values of the income_id variable represent?
waste |>
  count(income_id)
# A tibble: 4 × 2
income_id n
<chr> <int>
1 HIC 88
2 LIC 74
3 LMC 124
4 UMC 81
Transform
- Use the c() function to create a vector with the following values: “HIC”, “UMC”, “LMC”, “LIC”.
- Use the assignment operator (<-) to store the resulting vector as a new object called levels_income.
levels_income <- c("HIC", "UMC", "LMC", "LIC")
- Use the mutate() function to convert the income_id variable to a factor variable with the levels specified in the levels_income object.
- Use the assignment operator (<-) to store the resulting data as a new object called waste_lvl.
waste_lvl <- waste |>
  mutate(income_id = factor(income_id, levels = levels_income))
- Use the count() function to verify that the income_id variable is now a factor variable with the correct levels.
|>
waste_lvl count(income_id)
# A tibble: 4 × 2
income_id n
<fct> <int>
1 HIC 88
2 UMC 81
3 LMC 124
4 LIC 74
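Another way to check the conversion is to inspect the factor's levels attribute directly (a short sketch):

```r
# levels() returns the ordered set of levels of a factor;
# for waste_lvl$income_id this is the order set in levels_income:
# "HIC" "UMC" "LMC" "LIC"
levels(waste_lvl$income_id)
```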
- Starting with waste_lvl, use the mutate() function to create a new variable called generation_kg_capita that contains the generation_tons_year variable divided by the population variable and multiplied by 1000.
- Use the assignment operator (<-) to store the resulting data as a new object called waste_capita.
waste_capita <- waste_lvl |>
  mutate(generation_kg_capita = generation_tons_year / population * 1000)
Visualize
- Next to the code chunk option #| eval:, change the value from false to true.
- Run the code in the code chunk below to create a boxplot of the generation_kg_capita variable by income_id.
- What do you observe? Discuss with your room partner.
ggplot(data = waste_capita,
mapping = aes(x = income_id,
y = generation_kg_capita,
color = income_id)) +
geom_boxplot(outlier.fill = NA) +
geom_jitter(width = 0.1, alpha = 0.3) +
scale_color_colorblind() +
labs(x = "Income group",
     y = "Waste generation (kg per capita per year)")
Module 04b: Data import
library(readr)
library(readxl)
library(dplyr)
Import
Task 1: Import waste data as CSV
- Run all code chunks above.
- Use the read_csv() function to import the waste-city-level.csv file from the data/raw folder.
- Assign the resulting data to an object called waste.
waste <- read_csv("/cloud/project/data/raw/waste-city-level.csv")
Task 2: Import JMP data as CSV
- Use the read_csv() function to import the jmp_wld_sanitation_long.csv file from the data/processed folder.
- Assign the resulting data to an object called san_csv.
san_csv <- read_csv("/cloud/project/data/processed/jmp_wld_sanitation_long.csv")
Task 3: Import JMP data as RDS
- Use the read_rds() function to import the jmp_wld_sanitation_long.rds file from the data/processed folder.
- Assign the resulting data to an object called san_rds.
san_rds <- read_rds("/cloud/project/data/processed/jmp_wld_sanitation_long.rds")
Task 4: Compare CSV and RDS
- Use the glimpse() function to inspect the san_csv and san_rds objects.
- What is the difference between the two objects? Discuss with your room partner.
glimpse(san_csv)
Rows: 73,710
Columns: 8
$ name <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanista…
$ iso3 <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", …
$ year <dbl> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 20…
$ region_sdg <chr> "Central and Southern Asia", "Central and Southern Asia"…
$ varname_short <chr> "san_bas", "san_bas", "san_bas", "san_lim", "san_lim", "…
$ varname_long <chr> "basic sanitation services", "basic sanitation services"…
$ residence <chr> "national", "rural", "urban", "national", "rural", "urba…
$ percent <dbl> 21.870802, 19.322798, 30.863719, 5.648528, 3.136148, 14.…
glimpse(san_rds)
Rows: 73,710
Columns: 8
$ name <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanista…
$ iso3 <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", …
$ year <dbl> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 20…
$ region_sdg <chr> "Central and Southern Asia", "Central and Southern Asia"…
$ varname_short <fct> san_bas, san_bas, san_bas, san_lim, san_lim, san_lim, sa…
$ varname_long <fct> basic sanitation services, basic sanitation services, ba…
$ residence <fct> national, rural, urban, national, rural, urban, national…
$ percent <dbl> 21.870802, 19.322798, 30.863719, 5.648528, 3.136148, 14.…
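One way to surface the difference directly is to compare the classes of the same column in both objects; a minimal sketch:

```r
# read_csv() imports text columns as character; read_rds() restores the
# R-specific types (here: factors) that were saved in the .rds file
class(san_csv$residence)  # "character"
class(san_rds$residence)  # "factor"
```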
Task 5: Use LLM for an explanation
- Open https://www.perplexity.ai/ in your browser and enter the following prompt:
You are an experienced educator in teaching R to novice users without prior knowledge. Explain what the .rds format is and how it differs from the .csv file format. Avoid technical language.
- Read the answer and ask the tool questions for clarification if something is unclear.
- Share a link to your conversation here (see screenshot below):
Screenshot
Module 05: Conditions
library(tidyverse)
library(ggthemes)
Import
We continue to work with a subset of the “What a Waste” database.
waste <- read_csv("/cloud/project/data/processed/waste-city-level-sml.csv")
We will also use an example spreadsheet that was created by one of the course participants.
solids <- readxl::read_excel("/cloud/project/data/raw/TS_poo_2022.xlsx")
Explore
glimpse(waste)
Rows: 367
Columns: 6
$ country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afg…
$ city <chr> "Jalalabad", "Kandahar", "Mazar-E-Sharif", "Kabul…
$ iso3c <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AGO", "ALB", …
$ income_id <chr> "LIC", "LIC", "LIC", "LIC", "LIC", "LMC", "UMC", …
$ generation_tons_year <dbl> 58914.45, 120971.00, 52368.40, 1989250.00, 91644.…
$ population <dbl> 326585, 429000, 635250, 3700000, 337000, 4508000,…
waste |>
  count(income_id)
# A tibble: 4 × 2
income_id n
<chr> <int>
1 HIC 88
2 LIC 74
3 LMC 124
4 UMC 81
Transform
Conditional statements with mutate() and case_when() from the dplyr R package
Waste data
waste_cat <- waste |>
  mutate(generation_kg_capita = generation_tons_year / population * 1000) |>
  mutate(income_cat = case_when(
    income_id == "HIC" ~ "high income",
    income_id == "UMC" ~ "upper-middle income",
    income_id == "LMC" ~ "lower-middle income",
    income_id == "LIC" ~ "low income"
  ))
levels_income <- c("HIC", "UMC", "LMC", "LIC")

levels_income_cat <- c("high income",
                       "upper-middle income",
                       "lower-middle income",
                       "low income")
waste_fct <- waste_cat |>
  mutate(income_id = factor(income_id, levels = levels_income)) |>
  mutate(income_cat = factor(income_cat, levels = levels_income_cat)) |>
  relocate(income_cat, .after = income_id)
write_rds(x = waste_fct, file = "/cloud/project/data/processed/waste-city-level-sml.rds")
Faecal sludge solids data
solids |>
  mutate(total_solids_gL = case_when(
    source_type == "septic tank" ~ total_solids_gL * 100,
    .default = total_solids_gL
  ))
# A tibble: 20 × 5
source_location source_type Sample_Date n_daily_users total_solids_gL
<chr> <chr> <dttm> <dbl> <dbl>
1 household pit latrine 2022-11-01 00:00:00 5 20.5
2 household pit latrine 2022-11-01 00:00:00 7 25.8
3 household pit latrine 2022-11-01 00:00:00 7 22.6
4 household pit latrine 2022-11-01 00:00:00 6 30.9
5 household pit latrine 2022-11-01 00:00:00 8 48.3
6 household septic tank 2022-11-02 00:00:00 9 8
7 household septic tank 2022-11-02 00:00:00 6 11
8 household septic tank 2022-11-02 00:00:00 7 5
9 household septic tank 2022-11-02 00:00:00 7 13
10 household septic tank 2022-11-02 00:00:00 5 9
11 public toilet pit latrine 2022-11-03 00:00:00 35 35.0
12 public toilet pit latrine 2022-11-03 00:00:00 28 29.3
13 public toilet pit latrine 2022-11-03 00:00:00 52 19.9
14 public toilet pit latrine 2022-11-03 00:00:00 19 42.4
15 public toilet pit latrine 2022-11-03 00:00:00 39 28.0
16 public toilet septic tank 2022-11-04 00:00:00 75 7
17 public toilet septic tank 2022-11-04 00:00:00 53 14
18 public toilet septic tank 2022-11-04 00:00:00 47 19
19 public toilet septic tank 2022-11-04 00:00:00 39 9
20 public toilet septic tank 2022-11-04 00:00:00 62 11
Visualize
Categories as character
ggplot(data = waste_cat,
mapping = aes(x = income_cat,
y = generation_kg_capita,
color = income_cat)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(size = 3, width = 0.1, alpha = 0.3) +
scale_color_colorblind() +
labs(x = "Income group",
     y = "Waste generation (kg per capita per year)")
Categories as factor
#| eval: true
ggplot(data = waste_fct,
mapping = aes(x = income_cat,
y = generation_kg_capita,
color = income_cat)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(size = 3, width = 0.1, alpha = 0.3) +
scale_color_colorblind() +
labs(x = "Income group",
     y = "Waste generation (kg per capita per year)")
Module 05a: case_when()
library(tidyverse)
library(readxl)
Import
We are using another faecal sludge solids example dataset.
sludge <- read_xlsx("/cloud/project/data/raw/faecal-sludge-analysis.xlsx")
Task 1
- A mistake happened during data entry for sample id 16. Use mutate() and case_when() to change the ts value of 0.72 to 8.72.
sludge |>
  mutate(ts = case_when(
    ts == 0.72 ~ 8.72,
    .default = ts
  ))
# A tibble: 20 × 6
id date_sample system location users ts
<dbl> <dttm> <chr> <chr> <dbl> <dbl>
1 1 2023-11-01 00:00:00 pit latrine household 5 136.
2 2 2023-11-01 00:00:00 pit latrine household 7 102.
3 3 2023-11-01 00:00:00 pit latrine household NA 57.0
4 4 2023-11-01 00:00:00 pit latrine household 6 27.0
5 5 2023-11-01 00:00:00 pit latrine household 12 97.3
6 6 2023-11-02 00:00:00 pit latrine household 7 78.2
7 7 2023-11-02 00:00:00 septic tank household 14 15.2
8 8 2023-11-02 00:00:00 septic tank household 4 29.4
9 9 2023-11-02 00:00:00 septic tank household 10 64.2
10 10 2023-11-02 00:00:00 septic tank household 12 8.01
11 11 2023-11-03 00:00:00 pit latrine public toilet 50 11.2
12 12 2023-11-03 00:00:00 pit latrine public toilet 32 84.0
13 13 2023-11-03 00:00:00 pit latrine public toilet 41 55.9
14 14 2023-11-03 00:00:00 pit latrine public toilet 160 15.3
15 15 2023-11-03 00:00:00 pit latrine public toilet 20 22.6
16 16 2023-11-04 00:00:00 septic tank public toilet 26 8.72
17 17 2023-11-04 00:00:00 septic tank public toilet 91 43.9
18 18 2023-11-04 00:00:00 septic tank public toilet 68 10.4
19 19 2023-11-04 00:00:00 septic tank public toilet 112 23.2
20 20 2023-11-04 00:00:00 septic tank public toilet 59 15.6
Task 2
- Another mistake happened during data entry for sample id 6. Use mutate() and case_when() to change the system value of id 6 from “pit latrine” to “septic tank”.
sludge |>
  mutate(system = case_when(
    id == 6 ~ "septic tank",
    .default = system
  ))
# A tibble: 20 × 6
id date_sample system location users ts
<dbl> <dttm> <chr> <chr> <dbl> <dbl>
1 1 2023-11-01 00:00:00 pit latrine household 5 136.
2 2 2023-11-01 00:00:00 pit latrine household 7 102.
3 3 2023-11-01 00:00:00 pit latrine household NA 57.0
4 4 2023-11-01 00:00:00 pit latrine household 6 27.0
5 5 2023-11-01 00:00:00 pit latrine household 12 97.3
6 6 2023-11-02 00:00:00 septic tank household 7 78.2
7 7 2023-11-02 00:00:00 septic tank household 14 15.2
8 8 2023-11-02 00:00:00 septic tank household 4 29.4
9 9 2023-11-02 00:00:00 septic tank household 10 64.2
10 10 2023-11-02 00:00:00 septic tank household 12 8.01
11 11 2023-11-03 00:00:00 pit latrine public toilet 50 11.2
12 12 2023-11-03 00:00:00 pit latrine public toilet 32 84.0
13 13 2023-11-03 00:00:00 pit latrine public toilet 41 55.9
14 14 2023-11-03 00:00:00 pit latrine public toilet 160 15.3
15 15 2023-11-03 00:00:00 pit latrine public toilet 20 22.6
16 16 2023-11-04 00:00:00 septic tank public toilet 26 0.72
17 17 2023-11-04 00:00:00 septic tank public toilet 91 43.9
18 18 2023-11-04 00:00:00 septic tank public toilet 68 10.4
19 19 2023-11-04 00:00:00 septic tank public toilet 112 23.2
20 20 2023-11-04 00:00:00 septic tank public toilet 59 15.6
Task 3 (stretch goal)
- Add a new variable with the name ts_cat to the data frame that categorizes sludge samples into low, medium and high solids content. Use mutate() and case_when() to create the new variable.
- samples with less than 15 g/L are categorized as low
- samples with 15 g/L to 50 g/L are categorized as medium
- samples with more than 50 g/L are categorized as high
sludge |>
  mutate(ts_cat = case_when(
    ts < 15 ~ "low",
    ts >= 15 & ts <= 50 ~ "medium",
    ts > 50 ~ "high"
  ))
# A tibble: 20 × 7
id date_sample system location users ts ts_cat
<dbl> <dttm> <chr> <chr> <dbl> <dbl> <chr>
1 1 2023-11-01 00:00:00 pit latrine household 5 136. high
2 2 2023-11-01 00:00:00 pit latrine household 7 102. high
3 3 2023-11-01 00:00:00 pit latrine household NA 57.0 high
4 4 2023-11-01 00:00:00 pit latrine household 6 27.0 medium
5 5 2023-11-01 00:00:00 pit latrine household 12 97.3 high
6 6 2023-11-02 00:00:00 pit latrine household 7 78.2 high
7 7 2023-11-02 00:00:00 septic tank household 14 15.2 medium
8 8 2023-11-02 00:00:00 septic tank household 4 29.4 medium
9 9 2023-11-02 00:00:00 septic tank household 10 64.2 high
10 10 2023-11-02 00:00:00 septic tank household 12 8.01 low
11 11 2023-11-03 00:00:00 pit latrine public toilet 50 11.2 low
12 12 2023-11-03 00:00:00 pit latrine public toilet 32 84.0 high
13 13 2023-11-03 00:00:00 pit latrine public toilet 41 55.9 high
14 14 2023-11-03 00:00:00 pit latrine public toilet 160 15.3 medium
15 15 2023-11-03 00:00:00 pit latrine public toilet 20 22.6 medium
16 16 2023-11-04 00:00:00 septic tank public toilet 26 0.72 low
17 17 2023-11-04 00:00:00 septic tank public toilet 91 43.9 medium
18 18 2023-11-04 00:00:00 septic tank public toilet 68 10.4 low
19 19 2023-11-04 00:00:00 septic tank public toilet 112 23.2 medium
20 20 2023-11-04 00:00:00 septic tank public toilet 59 15.6 medium
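A detail worth remembering here: case_when() evaluates its conditions from top to bottom, and the first matching condition wins. The medium condition above could therefore drop its lower bound. A minimal sketch on a plain vector (the values are made up for illustration; dplyr is assumed to be installed):

```r
library(dplyr)

ts <- c(8, 22.6, 97.3)

# Conditions are checked in order: values below 15 are caught by the
# first condition, so the second only needs the upper bound.
case_when(
  ts < 15 ~ "low",
  ts <= 50 ~ "medium",
  .default = "high"
)
#> [1] "low"    "medium" "high"
```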
Module 05b: Dates
library(tidyverse)
library(readxl)
Transform to ISO
dates <- read_excel("/cloud/project/data/raw/date-formats.xlsx")
In R and other programming languages, dates are stored as numbers: the number of days since the origin 1970-01-01 (the Unix epoch). Written out, dates follow the ISO 8601 format YYYY-MM-DD.
In Excel, dates are stored as the number of days since 1900-01-01: the date number 1 corresponds to “1900-01-01.” However, this system incorrectly treats 1900 as a leap year, which it was not. As a result, to correctly interpret date numbers that originate from Excel, the origin “1899-12-30” is used to account for this discrepancy.
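The offset can be checked with base R's as.Date(), which also accepts an origin. The serial number 43831 is used here as an assumed example; in Excel's 1900 date system it displays as 2020-01-01:

```r
# Excel serial 43831 displays as 2020-01-01 in Excel's 1900 date system
as.Date(43831, origin = "1899-12-30")
#> [1] "2020-01-01"

# Using Excel's nominal origin of 1900-01-01 would be two days off:
# one day for the fictitious 1900-02-29, one for Excel counting from 1.
as.Date(43831, origin = "1900-01-01")
#> [1] "2020-01-03"
```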
dates_class <- dates |>
  mutate(date_iso = as_date(date_iso)) |>
  mutate(date_us = mdy(date_us)) |>
  mutate(date_eu = dmy(date_eu)) |>
  mutate(date_num = as_date(date_num, origin = "1899-12-30")) |>
  mutate(date = as_date(date_time)) |>
  mutate(date_time_tz = with_tz(date_time, tzone = "Africa/Kampala")) |>
  mutate(today = today())
OlsonNames()
[1] "Africa/Abidjan" "Africa/Accra"
[3] "Africa/Addis_Ababa" "Africa/Algiers"
[5] "Africa/Asmara" "Africa/Asmera"
[7] "Africa/Bamako" "Africa/Bangui"
[9] "Africa/Banjul" "Africa/Bissau"
[11] "Africa/Blantyre" "Africa/Brazzaville"
[13] "Africa/Bujumbura" "Africa/Cairo"
[15] "Africa/Casablanca" "Africa/Ceuta"
[17] "Africa/Conakry" "Africa/Dakar"
[19] "Africa/Dar_es_Salaam" "Africa/Djibouti"
[21] "Africa/Douala" "Africa/El_Aaiun"
[23] "Africa/Freetown" "Africa/Gaborone"
[25] "Africa/Harare" "Africa/Johannesburg"
[27] "Africa/Juba" "Africa/Kampala"
[29] "Africa/Khartoum" "Africa/Kigali"
[31] "Africa/Kinshasa" "Africa/Lagos"
[33] "Africa/Libreville" "Africa/Lome"
[35] "Africa/Luanda" "Africa/Lubumbashi"
[37] "Africa/Lusaka" "Africa/Malabo"
[39] "Africa/Maputo" "Africa/Maseru"
[41] "Africa/Mbabane" "Africa/Mogadishu"
[43] "Africa/Monrovia" "Africa/Nairobi"
[45] "Africa/Ndjamena" "Africa/Niamey"
[47] "Africa/Nouakchott" "Africa/Ouagadougou"
[49] "Africa/Porto-Novo" "Africa/Sao_Tome"
[51] "Africa/Timbuktu" "Africa/Tripoli"
[53] "Africa/Tunis" "Africa/Windhoek"
[55] "America/Adak" "America/Anchorage"
[57] "America/Anguilla" "America/Antigua"
[59] "America/Araguaina" "America/Argentina/Buenos_Aires"
[61] "America/Argentina/Catamarca" "America/Argentina/ComodRivadavia"
[63] "America/Argentina/Cordoba" "America/Argentina/Jujuy"
[65] "America/Argentina/La_Rioja" "America/Argentina/Mendoza"
[67] "America/Argentina/Rio_Gallegos" "America/Argentina/Salta"
[69] "America/Argentina/San_Juan" "America/Argentina/San_Luis"
[71] "America/Argentina/Tucuman" "America/Argentina/Ushuaia"
[73] "America/Aruba" "America/Asuncion"
[75] "America/Atikokan" "America/Atka"
[77] "America/Bahia" "America/Bahia_Banderas"
[79] "America/Barbados" "America/Belem"
[81] "America/Belize" "America/Blanc-Sablon"
[83] "America/Boa_Vista" "America/Bogota"
[85] "America/Boise" "America/Buenos_Aires"
[87] "America/Cambridge_Bay" "America/Campo_Grande"
[89] "America/Cancun" "America/Caracas"
[91] "America/Catamarca" "America/Cayenne"
[93] "America/Cayman" "America/Chicago"
[95] "America/Chihuahua" "America/Ciudad_Juarez"
[97] "America/Coral_Harbour" "America/Cordoba"
[99] "America/Costa_Rica" "America/Creston"
[101] "America/Cuiaba" "America/Curacao"
[103] "America/Danmarkshavn" "America/Dawson"
[105] "America/Dawson_Creek" "America/Denver"
[107] "America/Detroit" "America/Dominica"
[109] "America/Edmonton" "America/Eirunepe"
[111] "America/El_Salvador" "America/Ensenada"
[113] "America/Fort_Nelson" "America/Fort_Wayne"
[115] "America/Fortaleza" "America/Glace_Bay"
[117] "America/Godthab" "America/Goose_Bay"
[119] "America/Grand_Turk" "America/Grenada"
[121] "America/Guadeloupe" "America/Guatemala"
[123] "America/Guayaquil" "America/Guyana"
[125] "America/Halifax" "America/Havana"
[127] "America/Hermosillo" "America/Indiana/Indianapolis"
[129] "America/Indiana/Knox" "America/Indiana/Marengo"
[131] "America/Indiana/Petersburg" "America/Indiana/Tell_City"
[133] "America/Indiana/Vevay" "America/Indiana/Vincennes"
[135] "America/Indiana/Winamac" "America/Indianapolis"
[137] "America/Inuvik" "America/Iqaluit"
[139] "America/Jamaica" "America/Jujuy"
[141] "America/Juneau" "America/Kentucky/Louisville"
[143] "America/Kentucky/Monticello" "America/Knox_IN"
[145] "America/Kralendijk" "America/La_Paz"
[147] "America/Lima" "America/Los_Angeles"
[149] "America/Louisville" "America/Lower_Princes"
[151] "America/Maceio" "America/Managua"
[153] "America/Manaus" "America/Marigot"
[155] "America/Martinique" "America/Matamoros"
[157] "America/Mazatlan" "America/Mendoza"
[159] "America/Menominee" "America/Merida"
[161] "America/Metlakatla" "America/Mexico_City"
[163] "America/Miquelon" "America/Moncton"
[165] "America/Monterrey" "America/Montevideo"
[167] "America/Montreal" "America/Montserrat"
[169] "America/Nassau" "America/New_York"
[171] "America/Nipigon" "America/Nome"
[173] "America/Noronha" "America/North_Dakota/Beulah"
[175] "America/North_Dakota/Center" "America/North_Dakota/New_Salem"
[177] "America/Nuuk" "America/Ojinaga"
[179] "America/Panama" "America/Pangnirtung"
[181] "America/Paramaribo" "America/Phoenix"
[183] "America/Port_of_Spain" "America/Port-au-Prince"
[185] "America/Porto_Acre" "America/Porto_Velho"
[187] "America/Puerto_Rico" "America/Punta_Arenas"
[189] "America/Rainy_River" "America/Rankin_Inlet"
[191] "America/Recife" "America/Regina"
[193] "America/Resolute" "America/Rio_Branco"
[195] "America/Rosario" "America/Santa_Isabel"
[197] "America/Santarem" "America/Santiago"
[199] "America/Santo_Domingo" "America/Sao_Paulo"
[201] "America/Scoresbysund" "America/Shiprock"
[203] "America/Sitka" "America/St_Barthelemy"
[205] "America/St_Johns" "America/St_Kitts"
[207] "America/St_Lucia" "America/St_Thomas"
[209] "America/St_Vincent" "America/Swift_Current"
[211] "America/Tegucigalpa" "America/Thule"
[213] "America/Thunder_Bay" "America/Tijuana"
[215] "America/Toronto" "America/Tortola"
[217] "America/Vancouver" "America/Virgin"
[219] "America/Whitehorse" "America/Winnipeg"
[221] "America/Yakutat" "America/Yellowknife"
[223] "Antarctica/Casey" "Antarctica/Davis"
[225] "Antarctica/DumontDUrville" "Antarctica/Macquarie"
[227] "Antarctica/Mawson" "Antarctica/McMurdo"
[229] "Antarctica/Palmer" "Antarctica/Rothera"
[231] "Antarctica/South_Pole" "Antarctica/Syowa"
[233] "Antarctica/Troll" "Antarctica/Vostok"
[235] "Arctic/Longyearbyen" "Asia/Aden"
[237] "Asia/Almaty" "Asia/Amman"
[239] "Asia/Anadyr" "Asia/Aqtau"
[241] "Asia/Aqtobe" "Asia/Ashgabat"
[243] "Asia/Ashkhabad" "Asia/Atyrau"
[245] "Asia/Baghdad" "Asia/Bahrain"
[247] "Asia/Baku" "Asia/Bangkok"
[249] "Asia/Barnaul" "Asia/Beirut"
[251] "Asia/Bishkek" "Asia/Brunei"
[253] "Asia/Calcutta" "Asia/Chita"
[255] "Asia/Choibalsan" "Asia/Chongqing"
[257] "Asia/Chungking" "Asia/Colombo"
[259] "Asia/Dacca" "Asia/Damascus"
[261] "Asia/Dhaka" "Asia/Dili"
[263] "Asia/Dubai" "Asia/Dushanbe"
[265] "Asia/Famagusta" "Asia/Gaza"
[267] "Asia/Harbin" "Asia/Hebron"
[269] "Asia/Ho_Chi_Minh" "Asia/Hong_Kong"
[271] "Asia/Hovd" "Asia/Irkutsk"
[273] "Asia/Istanbul" "Asia/Jakarta"
[275] "Asia/Jayapura" "Asia/Jerusalem"
[277] "Asia/Kabul" "Asia/Kamchatka"
[279] "Asia/Karachi" "Asia/Kashgar"
[281] "Asia/Kathmandu" "Asia/Katmandu"
[283] "Asia/Khandyga" "Asia/Kolkata"
[285] "Asia/Krasnoyarsk" "Asia/Kuala_Lumpur"
[287] "Asia/Kuching" "Asia/Kuwait"
[289] "Asia/Macao" "Asia/Macau"
[291] "Asia/Magadan" "Asia/Makassar"
[293] "Asia/Manila" "Asia/Muscat"
[295] "Asia/Nicosia" "Asia/Novokuznetsk"
[297] "Asia/Novosibirsk" "Asia/Omsk"
[299] "Asia/Oral" "Asia/Phnom_Penh"
[301] "Asia/Pontianak" "Asia/Pyongyang"
[303] "Asia/Qatar" "Asia/Qostanay"
[305] "Asia/Qyzylorda" "Asia/Rangoon"
[307] "Asia/Riyadh" "Asia/Saigon"
[309] "Asia/Sakhalin" "Asia/Samarkand"
[311] "Asia/Seoul" "Asia/Shanghai"
[313] "Asia/Singapore" "Asia/Srednekolymsk"
[315] "Asia/Taipei" "Asia/Tashkent"
[317] "Asia/Tbilisi" "Asia/Tehran"
[319] "Asia/Tel_Aviv" "Asia/Thimbu"
[321] "Asia/Thimphu" "Asia/Tokyo"
[323] "Asia/Tomsk" "Asia/Ujung_Pandang"
[325] "Asia/Ulaanbaatar" "Asia/Ulan_Bator"
[327] "Asia/Urumqi" "Asia/Ust-Nera"
[329] "Asia/Vientiane" "Asia/Vladivostok"
[331] "Asia/Yakutsk" "Asia/Yangon"
[333] "Asia/Yekaterinburg" "Asia/Yerevan"
[335] "Atlantic/Azores" "Atlantic/Bermuda"
[337] "Atlantic/Canary" "Atlantic/Cape_Verde"
[339] "Atlantic/Faeroe" "Atlantic/Faroe"
[341] "Atlantic/Jan_Mayen" "Atlantic/Madeira"
[343] "Atlantic/Reykjavik" "Atlantic/South_Georgia"
[345] "Atlantic/St_Helena" "Atlantic/Stanley"
[347] "Australia/ACT" "Australia/Adelaide"
[349] "Australia/Brisbane" "Australia/Broken_Hill"
[351] "Australia/Canberra" "Australia/Currie"
[353] "Australia/Darwin" "Australia/Eucla"
[355] "Australia/Hobart" "Australia/LHI"
[357] "Australia/Lindeman" "Australia/Lord_Howe"
[359] "Australia/Melbourne" "Australia/North"
[361] "Australia/NSW" "Australia/Perth"
[363] "Australia/Queensland" "Australia/South"
[365] "Australia/Sydney" "Australia/Tasmania"
[367] "Australia/Victoria" "Australia/West"
[369] "Australia/Yancowinna" "Brazil/Acre"
[371] "Brazil/DeNoronha" "Brazil/East"
[373] "Brazil/West" "Canada/Atlantic"
[375] "Canada/Central" "Canada/Eastern"
[377] "Canada/Mountain" "Canada/Newfoundland"
[379] "Canada/Pacific" "Canada/Saskatchewan"
[381] "Canada/Yukon" "CET"
[383] "Chile/Continental" "Chile/EasterIsland"
[385] "CST6CDT" "Cuba"
[387] "EET" "Egypt"
[389] "Eire" "EST"
[391] "EST5EDT" "Etc/GMT"
[393] "Etc/GMT-0" "Etc/GMT-1"
[395] "Etc/GMT-10" "Etc/GMT-11"
[397] "Etc/GMT-12" "Etc/GMT-13"
[399] "Etc/GMT-14" "Etc/GMT-2"
[401] "Etc/GMT-3" "Etc/GMT-4"
[403] "Etc/GMT-5" "Etc/GMT-6"
[405] "Etc/GMT-7" "Etc/GMT-8"
[407] "Etc/GMT-9" "Etc/GMT+0"
[409] "Etc/GMT+1" "Etc/GMT+10"
[411] "Etc/GMT+11" "Etc/GMT+12"
[413] "Etc/GMT+2" "Etc/GMT+3"
[415] "Etc/GMT+4" "Etc/GMT+5"
[417] "Etc/GMT+6" "Etc/GMT+7"
[419] "Etc/GMT+8" "Etc/GMT+9"
[421] "Etc/GMT0" "Etc/Greenwich"
[423] "Etc/UCT" "Etc/Universal"
[425] "Etc/UTC" "Etc/Zulu"
[427] "Europe/Amsterdam" "Europe/Andorra"
[429] "Europe/Astrakhan" "Europe/Athens"
[431] "Europe/Belfast" "Europe/Belgrade"
[433] "Europe/Berlin" "Europe/Bratislava"
[435] "Europe/Brussels" "Europe/Bucharest"
[437] "Europe/Budapest" "Europe/Busingen"
[439] "Europe/Chisinau" "Europe/Copenhagen"
[441] "Europe/Dublin" "Europe/Gibraltar"
[443] "Europe/Guernsey" "Europe/Helsinki"
[445] "Europe/Isle_of_Man" "Europe/Istanbul"
[447] "Europe/Jersey" "Europe/Kaliningrad"
[449] "Europe/Kiev" "Europe/Kirov"
[451] "Europe/Kyiv" "Europe/Lisbon"
[453] "Europe/Ljubljana" "Europe/London"
[455] "Europe/Luxembourg" "Europe/Madrid"
[457] "Europe/Malta" "Europe/Mariehamn"
[459] "Europe/Minsk" "Europe/Monaco"
[461] "Europe/Moscow" "Europe/Nicosia"
[463] "Europe/Oslo" "Europe/Paris"
[465] "Europe/Podgorica" "Europe/Prague"
[467] "Europe/Riga" "Europe/Rome"
[469] "Europe/Samara" "Europe/San_Marino"
[471] "Europe/Sarajevo" "Europe/Saratov"
[473] "Europe/Simferopol" "Europe/Skopje"
[475] "Europe/Sofia" "Europe/Stockholm"
[477] "Europe/Tallinn" "Europe/Tirane"
[479] "Europe/Tiraspol" "Europe/Ulyanovsk"
[481] "Europe/Uzhgorod" "Europe/Vaduz"
[483] "Europe/Vatican" "Europe/Vienna"
[485] "Europe/Vilnius" "Europe/Volgograd"
[487] "Europe/Warsaw" "Europe/Zagreb"
[489] "Europe/Zaporozhye" "Europe/Zurich"
[491] "Factory" "GB"
[493] "GB-Eire" "GMT"
[495] "GMT-0" "GMT+0"
[497] "GMT0" "Greenwich"
[499] "Hongkong" "HST"
[501] "Iceland" "Indian/Antananarivo"
[503] "Indian/Chagos" "Indian/Christmas"
[505] "Indian/Cocos" "Indian/Comoro"
[507] "Indian/Kerguelen" "Indian/Mahe"
[509] "Indian/Maldives" "Indian/Mauritius"
[511] "Indian/Mayotte" "Indian/Reunion"
[513] "Iran" "Israel"
[515] "Jamaica" "Japan"
[517] "Kwajalein" "Libya"
[519] "MET" "Mexico/BajaNorte"
[521] "Mexico/BajaSur" "Mexico/General"
[523] "MST" "MST7MDT"
[525] "Navajo" "NZ"
[527] "NZ-CHAT" "Pacific/Apia"
[529] "Pacific/Auckland" "Pacific/Bougainville"
[531] "Pacific/Chatham" "Pacific/Chuuk"
[533] "Pacific/Easter" "Pacific/Efate"
[535] "Pacific/Enderbury" "Pacific/Fakaofo"
[537] "Pacific/Fiji" "Pacific/Funafuti"
[539] "Pacific/Galapagos" "Pacific/Gambier"
[541] "Pacific/Guadalcanal" "Pacific/Guam"
[543] "Pacific/Honolulu" "Pacific/Johnston"
[545] "Pacific/Kanton" "Pacific/Kiritimati"
[547] "Pacific/Kosrae" "Pacific/Kwajalein"
[549] "Pacific/Majuro" "Pacific/Marquesas"
[551] "Pacific/Midway" "Pacific/Nauru"
[553] "Pacific/Niue" "Pacific/Norfolk"
[555] "Pacific/Noumea" "Pacific/Pago_Pago"
[557] "Pacific/Palau" "Pacific/Pitcairn"
[559] "Pacific/Pohnpei" "Pacific/Ponape"
[561] "Pacific/Port_Moresby" "Pacific/Rarotonga"
[563] "Pacific/Saipan" "Pacific/Samoa"
[565] "Pacific/Tahiti" "Pacific/Tarawa"
[567] "Pacific/Tongatapu" "Pacific/Truk"
[569] "Pacific/Wake" "Pacific/Wallis"
[571] "Pacific/Yap" "Poland"
[573] "Portugal" "PRC"
[575] "PST8PDT" "ROC"
[577] "ROK" "Singapore"
[579] "SystemV/AST4" "SystemV/AST4ADT"
[581] "SystemV/CST6" "SystemV/CST6CDT"
[583] "SystemV/EST5" "SystemV/EST5EDT"
[585] "SystemV/HST10" "SystemV/MST7"
[587] "SystemV/MST7MDT" "SystemV/PST8"
[589] "SystemV/PST8PDT" "SystemV/YST9"
[591] "SystemV/YST9YDT" "Turkey"
[593] "UCT" "Universal"
[595] "US/Alaska" "US/Aleutian"
[597] "US/Arizona" "US/Central"
[599] "US/East-Indiana" "US/Eastern"
[601] "US/Hawaii" "US/Indiana-Starke"
[603] "US/Michigan" "US/Mountain"
[605] "US/Pacific" "US/Samoa"
[607] "UTC" "W-SU"
[609] "WET" "Zulu"
attr(,"Version")
[1] "2024a"
as.numeric(today())
[1] 19887
as_date(1)
[1] "1970-01-02"
dates_class |>
  select(today) |>
  mutate(year = year(today)) |>
  mutate(month = month(today, label = TRUE, abbr = FALSE, locale = "fr_FR")) |>
  mutate(quarter = quarter(today)) |>
  mutate(week = week(today)) |>
  mutate(day = day(today)) |>
  mutate(day_of_week = wday(today, label = TRUE, abbr = FALSE, locale = "fr_FR")) |>
  mutate(day_of_year = yday(today)) |>
  mutate(week_of_year = week(today))
# A tibble: 1 × 9
today year month quarter week day day_of_week day_of_year
<date> <dbl> <ord> <int> <dbl> <int> <ord> <dbl>
1 2024-06-13 2024 June 2 24 13 Thursday 165
# ℹ 1 more variable: week_of_year <dbl>
Module 05c: Tables
library(tidyverse)
library(gt)
library(gtsummary)
library(knitr)
library(DT)
Import
We continue to work with a subset of the “What a Waste” database.
waste_gt <- read_rds("/cloud/project/data/processed/waste-city-level-sml.rds")
Transform
waste_tbl_income <- waste_gt |>
  filter(!is.na(generation_kg_capita)) |>
  group_by(income_cat) |>
  summarise(
    count = n(),
    mean = mean(generation_kg_capita),
    sd = sd(generation_kg_capita),
    median = median(generation_kg_capita),
    min = min(generation_kg_capita),
    max = max(generation_kg_capita)
  )
Table
waste_tbl_income
# A tibble: 4 × 7
income_cat count mean sd median min max
<fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 high income 71 477. 214. 421. 116. 1142.
2 upper-middle income 72 381. 133. 378. 130. 828.
3 lower-middle income 116 275. 179. 219. 62.1 1109.
4 low income 67 215. 130. 182. 6.86 694.
waste_tbl_income |>
  gt() |>
  tab_header(title = "Waste generation per capita (kg/year) by income group",
             subtitle = "Data from 326 cities") |>
  fmt_number(columns = count:max, decimals = 0) |>
  cols_label(income_cat = "income category")
Waste generation per capita (kg/year) by income group
Data from 326 cities

| income category | count | mean | sd | median | min | max |
|---|---|---|---|---|---|---|
| high income | 71 | 477 | 214 | 421 | 116 | 1,142 |
| upper-middle income | 72 | 381 | 133 | 378 | 130 | 828 |
| lower-middle income | 116 | 275 | 179 | 219 | 62 | 1,109 |
| low income | 67 | 215 | 130 | 182 | 7 | 694 |
Table 1 highlights that cities in countries classified as high income generate more waste per capita than cities in lower income countries.
waste_tbl_income |>
  rename(`income category` = income_cat) |>
  kable(digits = 0)
| income category | count | mean | sd | median | min | max |
|---|---|---|---|---|---|---|
| high income | 71 | 477 | 214 | 421 | 116 | 1142 |
| upper-middle income | 72 | 381 | 133 | 378 | 130 | 828 |
| lower-middle income | 116 | 275 | 179 | 219 | 62 | 1109 |
| low income | 67 | 215 | 130 | 182 | 7 | 694 |
Module 06a: Cross-references
Tables and Figures
library(tidyverse)
library(ggthemes)
library(palmerpenguins)
library(gt)
Task 1: Tables
- Render the document and check whether the cross-reference to the table generated from the code below works.
- Fix the label in the code chunk below so that the cross-reference works.
- Render the document again to confirm that the cross-reference now works.
See Table 2 for data on a few penguins.
penguins |>
  filter(!is.na(bill_depth_mm)) |>
  group_by(island, species) |>
  summarise(n = n(),
            mean_bill_depth = mean(bill_depth_mm),
            sd_bill_depth = sd(bill_depth_mm)) |>
  ungroup() |>
  gt() |>
  fmt_number(columns = c(mean_bill_depth, sd_bill_depth),
             decimals = 1)
| island | species | n | mean_bill_depth | sd_bill_depth |
|---|---|---|---|---|
| Biscoe | Adelie | 44 | 18.4 | 1.2 |
| Biscoe | Gentoo | 123 | 15.0 | 1.0 |
| Dream | Adelie | 56 | 18.3 | 1.1 |
| Dream | Chinstrap | 68 | 18.4 | 1.1 |
| Torgersen | Adelie | 51 | 18.4 | 1.3 |
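For a table cross-reference like “Table 2” to resolve in Quarto, the chunk producing the table needs a label with the tbl- prefix plus a table caption. A sketch of the chunk options (the label name and caption text here are assumptions, not the exercise's actual values):

```
#| label: tbl-bill-depth
#| tbl-cap: "Bill depth by island and species"
```

The in-text reference then uses the same label, e.g. @tbl-bill-depth.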
Task 2: Figures
- Add a caption and a label for a figure to the code chunk options below.
- Add a cross-reference to the figure generated from the code below.
In Figure 1, we see that …
ggplot(data = penguins,
mapping = aes(x = bill_length_mm,
y = bill_depth_mm,
color = species,
shape = species)) +
geom_point() +
scale_color_colorblind() +
labs(x = "Bill length (mm)", y = "Bill depth (mm)") +
theme_minimal()
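Analogously for Task 2, a figure cross-reference in Quarto needs a chunk label with the fig- prefix and a figure caption. A sketch of the chunk options (label name and caption are assumptions):

```
#| label: fig-bill-dims
#| fig-cap: "Bill length and bill depth of penguins by species."
```

In the text, @fig-bill-dims then renders as “Figure 1”.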
Module 06b: Vector types
library(tidyverse)
library(gapminder)
Part 1: (Atomic) Vectors
There are six types of atomic vectors: logical, integer, double, character, complex, and raw.
Integer and double vectors are collectively known as numeric vectors.
- lgl: logical
- int: integer
- dbl: double
- chr: character
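The difference between the two numeric types shows up in typeof(), while is.numeric() treats them the same. A quick base-R check:

```r
typeof(1L)       # "integer" — the L suffix creates an integer
typeof(1)        # "double"  — plain numbers are doubles
is.numeric(1L)   # TRUE
is.numeric(1)    # TRUE — both count as numeric
```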
glimpse(gapminder)
Rows: 1,704
Columns: 6
$ country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
$ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
Types of atomic vectors
vector_lgl <- c(TRUE, TRUE, FALSE)
typeof(vector_lgl)
[1] "logical"
sum(vector_lgl)
[1] 2
as.numeric(vector_lgl)
[1] 1 1 0
vector_int <- c(1L, 3L, 6L)
typeof(vector_int)
[1] "integer"
vector_dbl <- c(1293, 5.1, 90.5)
typeof(vector_dbl)
[1] "double"
vector_chr <- c("large", "small", "medium")
typeof(vector_chr)
[1] "character"
Logical vectors
vector_dbl > 150
[1] TRUE FALSE FALSE
"large" == vector_chr
[1] TRUE FALSE FALSE
str_detect(vector_chr, "lar")
[1] TRUE FALSE FALSE
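Because TRUE coerces to 1 and FALSE to 0, sum() counts how many elements satisfy a condition and mean() gives the proportion. A small base-R sketch reusing the vector from above:

```r
vector_dbl <- c(1293, 5.1, 90.5)

sum(vector_dbl > 150)    # 1: one element exceeds 150
mean(vector_dbl > 150)   # 1/3: the proportion of TRUE values
```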
Explicit vector coercion & augmented vectors
Vectors can also carry arbitrary additional metadata in the form of attributes. These attributes are used to create augmented vectors, which add extra behavior on top of a base type. For example, factors are built on top of integer vectors.
vector_fct <- factor(vector_chr, levels = c("small", "medium", "large"))
typeof(vector_fct)
[1] "integer"
attributes(vector_fct)
$levels
[1] "small" "medium" "large"
$class
[1] "factor"
as.integer(vector_fct)
[1] 3 1 2
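The level order is what makes factors useful: functions such as sort() (and, for example, ggplot2 axes) follow the levels rather than alphabetical order. A small base-R sketch:

```r
vector_fct <- factor(c("large", "small", "medium"),
                     levels = c("small", "medium", "large"))

# sorts by level order (small < medium < large), not alphabetically
sort(vector_fct)
#> [1] small  medium large
#> Levels: small medium large
```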
Tibbles / Dataframes
Tibbles / dataframes have vectors as columns, and all columns have the same length: each vector is a column, and its length is the number of rows.
tib_data <- tibble(
  vector_lgl,
  vector_int,
  vector_dbl,
  vector_chr,
  vector_fct,
  date = Sys.Date()
)
Accessing a vector from a dataframe
tib_data |>
  pull(vector_fct)
[1] large small medium
Levels: small medium large
tib_data$vector_fct
[1] large small medium
Levels: small medium large
tib_data[5]
# A tibble: 3 × 1
vector_fct
<fct>
1 large
2 small
3 medium
tib_data[[5]]
[1] large small medium
Levels: small medium large
Part 2: Programming with R
For loops
Iterate code for each element in a vector.
size <- tib_data$vector_fct

for (s in size) {
  msg <- paste("------", s, "------")
  print(msg)
}
[1] "------ large ------"
[1] "------ small ------"
[1] "------ medium ------"
If statement
pet <- c("bat", "cat", "dog", "bird", "horse")

for (p in pet) {
  if (p == "dog") {
    msg <- paste("A", p, "is the best!")
  } else {
    msg <- paste("A", p, "is okay I guess.")
  }
  print(msg)
}
[1] "A bat is okay I guess."
[1] "A cat is okay I guess."
[1] "A dog is the best!"
[1] "A bird is okay I guess."
[1] "A horse is okay I guess."
sounds <- c(NA, "meow", "woof", "chirp", "neigh")

for (i in seq_along(pet)) {
  if (pet[i] == "dog") {
    message <- paste("The", pet[i], "goes", sounds[i])
  } else {
    message <- paste("The", pet[i], "says", sounds[i])
  }
  print(message)
}
[1] "The bat says NA"
[1] "The cat says meow"
[1] "The dog goes woof"
[1] "The bird says chirp"
[1] "The horse says neigh"
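seq_along() is the safer way to build the index vector for a loop: unlike 1:length(x), it returns an empty sequence when the vector is empty, so the loop body simply never runs. A base-R sketch:

```r
empty <- character(0)

seq_along(empty)   # integer(0): a loop over this runs zero times
1:length(empty)    # 1 0: a loop over this runs twice by accident
```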
Module 06c: Exercises
library(tidyverse)
library(nycflights13)
Task 1: Numeric vector
- Create a numeric vector using c() with the numbers from 1 to 10. Run the code.
- Create a numeric vector using seq(1, 10) and run the code.
- What’s the difference between the two vectors?
c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
[1] 1 2 3 4 5 6 7 8 9 10
seq(1, 10)
[1] 1 2 3 4 5 6 7 8 9 10
Task 2: Character vector
Create a character vector using
c()
with the letters from “a” to “f”. Run the code.On a new line, write
letters
and run the code. What’s stored in theletters
object?On a new line, write
?letters
and run the code. What did you learn?
c("a", "b", "c", "d", "e", "f")
[1] "a" "b" "c" "d" "e" "f"
letters
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"
Task 3: Numeric sequences
Create a numeric vector using
seq(1, 100, 1)
and run the code. What does the code do?Create a numeric vector using
runif(100, 1, 100)
and run the code. What does the code do?Create a numeric vector using
sample(1:100, 100, replace = FALSE)
and run the code. What does the code do?
seq(1, 100, 1)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
[19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
[37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
[55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
[73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
[91] 91 92 93 94 95 96 97 98 99 100
runif(100, 1, 100)
[1] 38.778210 58.330600 27.899647 7.796088 81.541475 71.885980 84.836257
[8] 29.604210 66.357969 17.926685 20.439570 50.686240 44.896382 20.140063
[15] 16.415615 35.256813 15.242959 75.188393 24.823551 97.258622 60.565691
[22] 41.860249 25.692112 10.665276 94.760276 1.062801 44.621982 73.434445
[29] 95.337043 94.564929 76.204060 44.218349 84.533288 91.212240 19.745296
[36] 80.452509 69.095412 80.497132 58.705038 74.817285 16.170162 50.574742
[43] 43.679575 22.219881 19.881512 33.371115 30.016625 90.680238 98.509094
[50] 14.478689 6.739723 26.671297 92.227470 89.322627 2.762168 63.323627
[57] 81.134110 68.450249 94.252972 48.517861 1.506401 34.999222 23.508857
[64] 8.259930 97.435193 67.243346 58.083158 42.099956 94.861612 3.545431
[71] 20.845850 39.500589 92.115329 96.127512 56.378400 4.923146 72.407167
[78] 80.214474 8.567918 45.176046 4.057912 7.897774 79.921334 72.203024
[85] 75.633223 99.979251 64.457892 48.543277 70.607990 17.109271 54.824040
[92] 79.411269 47.832787 16.182316 43.531321 51.209082 40.532696 91.270367
[99] 86.211758 21.596911
sample(1:100, 100, replace = FALSE)
[1] 74 27 26 58 54 64 23 50 73 29 69 22 5 25 85 66 68 28
[19] 59 88 87 72 44 4 75 57 80 7 2 63 37 65 34 43 67 92
[37] 15 97 31 55 49 39 62 83 79 36 24 19 12 89 93 13 32 6
[55] 14 98 82 3 35 47 21 16 71 20 99 81 30 91 42 100 52 1
[73] 77 60 33 17 56 10 61 76 8 78 18 48 40 46 96 41 9 95
[91] 86 90 84 51 53 11 45 70 38 94
Task 4: Numeric sequences along a character vector
- Create a numeric vector using seq_along(letters) and run the code. What does the code do?
- Create a character vector using month.name and run the code. What does the code do?
- Create a numeric vector using seq_along(month.name) and run the code. What does the code do?
seq_along(letters)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26
month.name
[1] "January" "February" "March" "April" "May" "June"
[7] "July" "August" "September" "October" "November" "December"
seq_along(month.name)
[1] 1 2 3 4 5 6 7 8 9 10 11 12
Task 5: Distribution of random numbers
- Create a numeric vector using runif(n = 1000, min = 1, max = 100) |> hist() and run the code. What does the code do? Remove |> hist() and run the code again. What does the code do?
- Create a numeric vector using rnorm(n = 1000, mean = 500, sd = 150) |> hist() and run the code. What does the code do? Remove |> hist() and run the code again. What does the code do?
runif(n = 1000, min = 1, max = 100) |> hist()
rnorm(n = 1000, mean = 500, sd = 150) |> hist()
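Both runif() and rnorm() draw new random numbers on every call, so the histograms change with each render; set.seed() makes the draws reproducible. A base-R sketch:

```r
set.seed(42)
x <- runif(n = 5, min = 1, max = 100)

set.seed(42)
y <- runif(n = 5, min = 1, max = 100)

# the same seed gives the same draws
identical(x, y)
#> [1] TRUE
```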
Task 6: Logical vectors
- Create a numeric vector using rnorm(n = 1000, mean = 50, sd = 5) and use the assignment operator to store it in an object called norm_dist. Run the code.
- Write mean(norm_dist) and run the code. What does the code do?
- Write norm_dist >= 50 and run the code. What does the code do?
- Write sum(norm_dist >= 50) and run the code. What does the code do?
- Write mean(norm_dist >= 50) and run the code. What does the code do?
norm_dist <- rnorm(n = 1000, mean = 50, sd = 5)

mean(norm_dist)
[1] 50.14718
norm_dist >= 50
[1] FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE
[13] TRUE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE
[25] FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
[37] TRUE FALSE TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE
[49] TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE FALSE
[61] FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE FALSE TRUE
 [73]  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE
 (… output truncated: remaining elements of the 1000-element logical vector omitted …)
[997]  TRUE FALSE  TRUE FALSE
sum(norm_dist >= 50)
[1] 523
mean(norm_dist >= 50)
[1] 0.523
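The two calls above work because R coerces logical vectors to numeric: TRUE becomes 1 and FALSE becomes 0, so sum() counts the TRUEs and mean() gives their proportion. A minimal base-R sketch (the values here are made up for illustration, not taken from norm_dist):

```r
# TRUE is coerced to 1 and FALSE to 0, so sum() counts and mean() gives a proportion
x <- c(12, 55, 70, 43)   # hypothetical values, for illustration only
x >= 50                  # logical vector: FALSE TRUE TRUE FALSE
sum(x >= 50)             # number of values >= 50: 2
mean(x >= 50)            # proportion of values >= 50: 0.5
```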
Task 7 (stretch goal)
In this task, we will use the flights data object from the nycflights13 package. The flights data object contains information about all flights that departed from NYC (e.g., EWR, JFK and LGA) in 2013. The data object contains 336,776 rows and 19 columns.
- Use the flights data object with mutate() to create delayed, a variable that displays whether a flight was delayed (arr_delay > 0).
- Use relocate() to move delayed to the front of the data frame. Run the code. What vector type is the delayed variable?
- Then, remove all rows that contain an NA in delayed.
- Finally, create a summary table with summarise() that shows
  - how many flights were delayed
  - what proportion of flights were delayed
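A comparison such as arr_delay > 0 always returns a logical vector, which answers the vector-type question above. A quick base-R check (the delay values here are hypothetical):

```r
# A comparison returns a logical vector, so delayed is of type logical
arr_delay <- c(-5, 12, 0, 33)   # hypothetical delays in minutes
delayed <- arr_delay > 0
typeof(delayed)                 # "logical"
```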
flights |>
  mutate(delayed = arr_delay > 0) |>
  relocate(delayed) |>
  filter(!is.na(delayed)) |>
  summarise(sum = sum(delayed),
            prop = mean(delayed))
# A tibble: 1 × 2
sum prop
<int> <dbl>
1 133004 0.406
Assignment 06: Data formats
Part 1: Data preparation
Task 1: Load packages
The required packages for this homework exercise have already been added.
- Run the code chunk below to load the required packages. Tip: Click on the green play button in the top right corner of the code chunk.
- What's the tidyverse package? Describe it in a maximum of two sentences below.

The tidyverse is a collection of R packages designed for data science that share a common design philosophy, grammar, and data structures. It provides a powerful and coherent system for working with data, including tools for data import, tidying, manipulation, visualization, and programming.[1][3]
Citations: [1] https://jhudatascience.org/tidyversecourse/intro.html [2] https://www.tmwr.org/tidyverse [3] https://tidyverse.tidyverse.org/articles/paper.html [4] https://rafalab.dfci.harvard.edu/dsbook-part-1/R/tidyverse.html [5] https://www.datacamp.com/tutorial/tidyverse-tutorial-r
Source: https://www.perplexity.ai/search/describe-the-tidyverse-JEwTv1xJTvOFRgullFskTQ
library(tidyverse)
Task 2: Import data
- Use the read_csv() function (note: watch out for the _ and don't use the . as in read.csv()) to import the "msw-generation-and-composition-by-income.csv" data from the data directory and assign it to an object with the name waste_data.
waste_data <- read_csv(file = "/cloud/project/data/msw-generation-and-composition-by-income.csv")
Task 3: Vector coercion
- Use waste_data and count() to create a frequency table for the income_cat variable.
- Then use the c() function to create a vector with a sensible order for the values in income_cat. Use the assignment operator <- to assign the vector to an object with the name levels_income_cat. Use the frequency table to identify the correct spelling of the categories in the income_cat variable.
- Starting with the waste_data object, use the pipe operator and the mutate() function to convert the income_cat variable from a variable of type character to a variable of type factor. Use the levels you defined in the previous step.
- Assign the created data frame to an object with the name waste_data_fct.
- Render and fix any errors.
# Create a frequency table for the income_cat variable
waste_data %>%
  count(income_cat)
# A tibble: 4 × 2
income_cat n
<chr> <int>
1 high income 81
2 low income 33
3 lower-middle income 47
4 upper-middle income 56
# Create vector with a sensible order for the values in income_cat
# (the level strings must match the spelling in the data exactly,
# otherwise factor() turns those values into NA)
levels_income_cat <- c(
  "low income",
  "lower-middle income",
  "upper-middle income",
  "high income"
)

# Conversion of data type
waste_data_fct <- waste_data %>%
  mutate(income_cat = factor(income_cat, levels = levels_income_cat))
Task 4: From wide to long
- Starting with the waste_data_fct object, use the pivot_longer() function to convert the data frame from a wide to a long format. Apply the following:
  - bring all columns from food_organic_waste to yard_garden_green_waste into a long format
  - send the variable names to a column named "waste_category"
  - send the values of the variables to a column named "percent"
- Remove all NAs from the percent variable.
- Assign the created data frame to an object with the name waste_data_long.
- Render and fix any errors.
waste_data_long <- waste_data_fct %>%
  pivot_longer(cols = food_organic_waste:yard_garden_green_waste,
               names_to = "waste_category",
               values_to = "percent") %>%
  filter(!is.na(percent)) # Remove rows where percent column is NA
Part 2: Data summary
Task 1: Import data
I have stored the data that I would have expected at the end of the previous task and import it here.
- Run the code in the code chunk below.
waste_data_long <- read_rds("/cloud/project/data/msw-generation-and-composition-by-income-long.rds")
Task 2: Summarise data
- Starting with waste_data_long, group the data by income_cat and waste_category, then create a summary table containing the mean of percentages (call this mean_percent) for each group.
  - Could this be done with a for-loop?
- Assign the created data frame to an object with the name waste_data_long_mean.
waste_data_long_mean <- waste_data_long %>%
  group_by(income_cat, waste_category) %>%
  summarise(mean_percent = mean(percent))
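On the for-loop question above: the grouped mean could indeed be computed with an explicit loop over the group values, although the group_by()/summarise() pipeline is the idiomatic tidyverse approach. A base-R sketch on a small made-up data frame (df, grp and result are hypothetical names, not objects from this exercise):

```r
# Grouped mean via a for-loop (toy data; the dplyr pipeline is the idiomatic way)
df <- data.frame(
  grp     = c("a", "a", "b", "b"),
  percent = c(10, 20, 30, 50)
)
result <- data.frame(grp = unique(df$grp), mean_percent = NA_real_)
for (g in result$grp) {
  result$mean_percent[result$grp == g] <- mean(df$percent[df$grp == g])
}
result  # grp "a": 15, grp "b": 40
```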
Task 3: Table display
- Starting with the waste_data_long_mean object, execute the code and observe the output in the Console. Would you publish this table in this format in a report?
  - No. While it is in the long format, it still contains NA values. Before publishing I would try to properly code the income_cat variable.
waste_data_long_mean
# A tibble: 36 × 3
# Groups: income_cat [4]
income_cat waste_category mean_percent
<fct> <chr> <dbl>
1 high income food_organic_waste 32.8
2 high income glass 6.12
3 high income metal 5.13
4 high income other 16.8
5 high income paper_cardboard 21.3
6 high income plastic 12.4
7 high income rubber_leather 2.98
8 high income wood 5.54
9 high income yard_garden_green_waste 9.31
10 upper-middle income food_organic_waste 45.4
# ℹ 26 more rows
Task 4: From long to wide
- Starting with the waste_data_long_mean object, use the pipe operator to add another line of code which uses the pivot_wider() function to bring the data from a long format into a wide format, taking the variable names from waste_category and the corresponding values from mean_percent.
- Execute the code and observe the output in the Console. Would you publish this table in a report in this format?
  - For a report I find this format more intelligible; however, no histograms can be plotted from it, so I would not publish it in that way.
- Render and fix any errors.
waste_data_long_mean %>%
  pivot_wider(names_from = waste_category,
              values_from = mean_percent)
# A tibble: 4 × 10
# Groups: income_cat [4]
income_cat food_organic_waste glass metal other paper_cardboard plastic
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 high income 32.8 6.12 5.13 16.8 21.3 12.4
2 upper-middle inc… 45.4 4.42 3.88 18.2 12.1 12.3
3 lower-middle inc… 50.4 3.68 3.92 16.5 10.6 11.1
4 low income 50.9 1.94 2.58 28.5 8.35 7.97
# ℹ 3 more variables: rubber_leather <dbl>, wood <dbl>,
# yard_garden_green_waste <dbl>
Part 3: Data visualization
Task 1: Import data
I have stored the data that I would have expected at the end of the previous task and import it here.
- Run the code in the code chunk below.
waste_data_long_mean <- read_rds("/cloud/project/data/msw-generation-and-composition-by-income-long-mean.rds")
Task 2: Reproduce a plot
Render and fix any errors.
Reproduce the plot that you see as an image below when you render the report and view the output in your Viewer tab in the bottom right window.
Hint: To get the bars displayed next to each other, use the geom_col() function and apply the position = position_dodge() argument and value. The colors don't have to match exactly; just don't use the default color scale.
Note: The size of the plot will be different. That is alright and does not need to match.
ggplot(data = waste_data_long_mean,
       aes(x = mean_percent,
           y = waste_category,
           fill = income_cat)) +
  geom_col(position = position_dodge(0.9), na.rm = TRUE) +
  labs(x = "Mean Percent", y = "Waste Category", fill = "Income Category") +
  theme_minimal()
Module 07: Writing scholarly articles
Scholarly writing
Scholarly articles require much more detail in their front matter than simply a title and an author. Quarto provides a rich set of YAML metadata keys to describe these details. You can copy & paste from this example to your own report.
Task 1: Front Matter
- Replace the values under author for name, orcid, email, and affiliation with your own
- Render the document to see the changes
Task 2: Citations
- Add the citation key for the paper “‘My flight arrives at 5 am, can you pick me up?’: The gatekeeping burden of the african academic” as an in-text reference to the sentence below
In @tilley2021my, the authors describe how visitors still expect a personal pick-up, despite the availability of taxi services.
- Add the citation key for the paper “‘The rich will always be able to dispose of their waste’: a view from the frontlines of municipal failure in Makhanda, South Africa” as a citation at the end of the sentence below.
Inequality underpins waste management systems, structuring who can or cannot access services [@kalina2023rich].
Bibliographies
Your folder already contains a references.bib file. One way of creating and adding to this file is by using the RStudio Visual Editor mode. Another way is by exporting a collection from the Zotero reference management tool. Part of your homework will be to set up Zotero. For your literature research, you will then use Zotero, and in your final report you will use an exported .bib file to cite references: https://rbtl-fs24.github.io/website/project