Chapter 3

Chapter 3 of R4DS teaches data transformation using packages (such as dplyr) within the tidyverse metapackage. This page will work through a subset of the chapter’s prompts. I will use the nycflights13 dataset to go through some basic data cleaning procedures. I’ll set up by loading the tidyverse and nycflights13 packages.

pacman::p_load(tidyverse, nycflights13)

3.2.5 Exercises:

Exercise 3.2.5.1

In a single pipeline for each condition, find all flights that meet the condition:

Had an arrival delay of two or more hours

Answer: 10,200 flights had a delay of 2+ hours

flights %>% 
  filter(arr_delay >= 120) %>% 
  count()

# A tibble: 1 × 1
      n
  <int>
1 10200

Flew to Houston (IAH or HOU)

Answer: 9,313 flights flew to Houston

flights %>% 
  filter(dest %in% c("IAH", "HOU")) %>% 
  count()

# A tibble: 1 × 1
      n
  <int>
1  9313

Were operated by United, American, or Delta

Answer: 139,504 flights were operated by these airlines

flights %>% 
  filter(carrier %in% c("UA", "AA", "DL")) %>% 
  count()

# A tibble: 1 × 1
       n
   <int>
1 139504

Departed in summer (July, August, and September)

Answer: 86,326 flights left during the summer

flights %>% 
  filter(month %in% c(7, 8, 9)) %>% 
  count()

# A tibble: 1 × 1
      n
  <int>
1 86326

Arrived more than two hours late, but didn’t leave late

Answer: 29 flights left on time and arrived 2+ hours late

flights %>% 
  filter(dep_delay <= 0, arr_delay > 120) %>% 
  count()

# A tibble: 1 × 1
      n
  <int>
1    29

Were delayed by at least an hour, but made up over 30 minutes in flight

Answer: 1,844 flights left an hour late but made up 30 minutes in flight

flights %>% 
  filter(dep_delay >= 60 & dep_delay - arr_delay > 30) %>% 
  count()

# A tibble: 1 × 1
      n
  <int>
1  1844

Exercise 3.2.5.2

Sort flights to find the flights with longest departure delays. Find the flights that left earliest in the morning.

Answer: The longest delay was 1,301 minutes and many flights left at 12:01AM.

flights %>% 
  arrange(desc(dep_delay))

# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     9      641            900      1301     1242           1530
 2  2013     6    15     1432           1935      1137     1607           2120
 3  2013     1    10     1121           1635      1126     1239           1810
 4  2013     9    20     1139           1845      1014     1457           2210
 5  2013     7    22      845           1600      1005     1044           1815
 6  2013     4    10     1100           1900       960     1342           2211
 7  2013     3    17     2321            810       911      135           1020
 8  2013     6    27      959           1900       899     1236           2226
 9  2013     7    22     2257            759       898      121           1026
10  2013    12     5      756           1700       896     1058           2020
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

flights %>% 
  arrange(dep_time)

# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1    13        1           2249        72      108           2357
 2  2013     1    31        1           2100       181      124           2225
 3  2013    11    13        1           2359         2      442            440
 4  2013    12    16        1           2359         2      447            437
 5  2013    12    20        1           2359         2      430            440
 6  2013    12    26        1           2359         2      437            440
 7  2013    12    30        1           2359         2      441            437
 8  2013     2    11        1           2100       181      111           2225
 9  2013     2    24        1           2245        76      121           2354
10  2013     3     8        1           2355         6      431            440
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Exercise 3.2.5.3

Sort flights to find the fastest flights. (Hint: Try including a math calculation inside of your function.)

Answer: This is a trickier question than it might seem at first. With the coding skills developed in chapters 1-3, the easy solution would be to run the code below on tab 1, which subtracts dep_time from arr_time and then sorts the values in ascending order. I’ve retained the first 5 records for display.

The problem with the easy way is that these columns are formatted as numbers, not times, and it generates nonsensical numbers, and sometimes produces negative numbers for flights that land the day after they depart. Look at the values in the flight_time column in tab 1. This method gives us the wrong answer because there are flights with small positive values that are actually the shortest, but we aren’t finding them because we are focusing on flights that are at the top of the list because our simple formula produced negative values.

If we do it the hard way, we need to account for the date change. To do this, I created datetime variables for both the departure and the arrival hours and minutes. This requires some modulus math (“%/%” and “%%”) to get the times in a useful format, which is covered in R4DS Chapter 17. Once we have the departure and arrival times in dttm format, we can calculate the time difference (using difftime). We can then correct for instances where a flight lands the next day by revising any flights with negative difftimes, where we would then add one day to the value by adding 86,400 seconds (using if_else logic). Now, we can see the fastest flight in the dataset was 33 minutes.

Easy way
Hard way

flights %>% 
  mutate(flight_time = arr_time - dep_time) %>% 
  arrange(flight_time) %>% 
  slice_head(n = 5) %>% 
  select(year, month, day, dep_time, arr_time, flight_time)

# A tibble: 5 × 6
   year month   day dep_time arr_time flight_time
  <int> <int> <int>    <int>    <int>       <int>
1  2013     7    17     2400       54       -2346
2  2013    12     9     2400       59       -2341
3  2013     6    12     2338       17       -2321
4  2013    12    29     2332       14       -2318
5  2013    11     6     2335       18       -2317

flights %>%
  mutate(
    dep_hour = dep_time %/% 100, #modulus math to obtain hour
    dep_minute = dep_time %% 100,#modulus math to obtain minute
    arr_hour = arr_time %/% 100, #modulus math to obtain hour
    arr_minute = arr_time %% 100 #modulus math to obtain minute
  ) %>%
  mutate(departure_time_proper = as.POSIXct( #make dttm departure variable
    make_datetime(year,
                  month,
                  day,
                  dep_hour,
                  dep_minute)
    ),
    arrival_time_proper = as.POSIXct( #make dttm arrival variable
    make_datetime(year,
                  month,
                  day,
                  arr_hour,
                  arr_minute)
    )) %>% 
  #calculate difference between departure and arrival
  mutate(difftime_pre = difftime(arrival_time_proper, 
                                 departure_time_proper),
         # then correct negative flight times by adding one day to flights
         #that landed a day after they departed;
         #note one day is equal to 86,400 seconds
         arr_time_next_day_correction = if_else(difftime_pre < 0, 
                                                arrival_time_proper + 86400,
                                                arrival_time_proper),
         #create revised difftime that accounts for flights landing next day
         difftime_post = difftime(arr_time_next_day_correction, departure_time_proper)) %>% 
  arrange(difftime_post) %>% #sort by shortest flights
  slice_head(n = 5) %>% #keep 5 fastest flights for viewing purposes
  relocate(year, month, day, dep_time, arr_time, difftime_post, .before = 1)

# A tibble: 5 × 28
   year month   day dep_time arr_time difftime_post sched_dep_time dep_delay
  <int> <int> <int>    <int>    <int> <drtn>                 <int>     <dbl>
1  2013     8    14     1133     1206 33 mins                 1148       -15
2  2013     1     5     1323     1358 35 mins                 1325        -2
3  2013     2    22     1312     1347 35 mins                 1316        -4
4  2013     9     3     1203     1238 35 mins                 1153        10
5  2013     3    19     1455     1531 36 mins                 1329        86
# ℹ 20 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, dep_hour <dbl>,
#   dep_minute <dbl>, arr_hour <dbl>, arr_minute <dbl>,
#   departure_time_proper <dttm>, arrival_time_proper <dttm>,
#   difftime_pre <drtn>, arr_time_next_day_correction <dttm>

Exercise 3.2.5.4

Was there a flight on every day of 2013?

4 answer: yes, there was a flight everyday in 2013. You can take a variety of approaches to solve this question. Here are two:

Method 1
Method 2

Below we see a list of flights with unique dates each month. We can skim through each month and see that there are flights everyday of the year, because the number of unique flight dates matches the number of days in each calendar month.

Note 2013 is not a leap year, so having 28 unique flight dates in February is consistent with there being 28 calendar days in the month that year.

flights %>% 
  distinct(month, day) %>% 
  group_by(month) %>% 
  count()

# A tibble: 12 × 2
# Groups:   month [12]
   month     n
   <int> <int>
 1     1    31
 2     2    28
 3     3    31
 4     4    30
 5     5    31
 6     6    30
 7     7    31
 8     8    31
 9     9    30
10    10    31
11    11    30
12    12    31

We can create a vector of dates ranging from 1/1/2013 - 12/31/2013. Then we can group the flight data by dates, select the first flight on each date, and join that record to the corresponding date in the date vector. We can then create a logical variable that tells us if each date does or does not have a corresponding flight record. Then we can summarize the logical variable’s values. Since the code below returns n = 365 under “Flights Occurred Today”, and returns no records under “No Flights Today”, we can conclude that there was a flight everyday in 2013.

flights_with_dep_date <- flights %>%
  mutate(departure_date = as.POSIXct( #make dttm departure variable
    make_date(year,
              month,
              day))) %>% 
  group_by(departure_date) %>% 
  slice_head()
  
tibble(seq(as.Date("2013/01/01"), by = "day", length.out = 365)) %>% 
  rename(departure_date = 1) %>% 
  left_join(flights_with_dep_date, by = "departure_date") %>% 
  mutate(date_validation = if_else(is.na(day), "No Flights Today", "Flights Occurred Today")) %>% 
  group_by(date_validation) %>% 
  summarize(n = n())

# A tibble: 1 × 2
  date_validation            n
  <chr>                  <int>
1 Flights Occurred Today   365

Exercise 3.2.5.5

Which flights traveled the farthest distance? Which traveled the least distance?

5 answer: flights from JFK (New York) to HNL (Hawaii) had the longest distance, which totaled 4,983 miles. Flights from EWR (New Jersey) to LGA (New York) traveled the shortest distance, which was 17 miles.

flights %>% 
  arrange(desc(distance)) %>% 
  slice_head() %>% 
  select(distance, origin, dest)

# A tibble: 1 × 3
  distance origin dest 
     <dbl> <chr>  <chr>
1     4983 JFK    HNL

flights %>% 
  arrange(distance) %>% 
  slice_head() %>% 
  select(distance, origin, dest)

# A tibble: 1 × 3
  distance origin dest 
     <dbl> <chr>  <chr>
1       17 EWR    LGA

Exercise 3.2.5.6

Does it matter what order you used filter() and arrange() if you’re using both? Why/why not? Think about the results and how much work the functions would have to do.

6 answer: No it does not matter because filter changes the rows that are present while arrange changes the order of the rows. You can do them in either order and the results will be the same.

3.3.5 Exercises:

Exercise 3.3.5.1

Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?

1 answer: sched_dep_time is the time the flight was scheduled to leave while dep_time is when the flight actually departed. dep_delay is the difference between dep_time and sched_dep_time.

Exercise 3.3.5.2

Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

2 answer: here are 4 ways complete this task:

flights %>% 
  select(dep_time, dep_delay, arr_time, arr_delay)

# A tibble: 336,776 × 4
   dep_time dep_delay arr_time arr_delay
      <int>     <dbl>    <int>     <dbl>
 1      517         2      830        11
 2      533         4      850        20
 3      542         2      923        33
 4      544        -1     1004       -18
 5      554        -6      812       -25
 6      554        -4      740        12
 7      555        -5      913        19
 8      557        -3      709       -14
 9      557        -3      838        -8
10      558        -2      753         8
# ℹ 336,766 more rows

flights %>% select(starts_with("dep") | starts_with("arr"))

# A tibble: 336,776 × 4
   dep_time dep_delay arr_time arr_delay
      <int>     <dbl>    <int>     <dbl>
 1      517         2      830        11
 2      533         4      850        20
 3      542         2      923        33
 4      544        -1     1004       -18
 5      554        -6      812       -25
 6      554        -4      740        12
 7      555        -5      913        19
 8      557        -3      709       -14
 9      557        -3      838        -8
10      558        -2      753         8
# ℹ 336,766 more rows

flights %>% 
  select(ends_with("_time") | ends_with("delay")) %>% 
  select(!starts_with("sched") & !starts_with("air"))

# A tibble: 336,776 × 4
   dep_time arr_time dep_delay arr_delay
      <int>    <int>     <dbl>     <dbl>
 1      517      830         2        11
 2      533      850         4        20
 3      542      923         2        33
 4      544     1004        -1       -18
 5      554      812        -6       -25
 6      554      740        -4        12
 7      555      913        -5        19
 8      557      709        -3       -14
 9      557      838        -3        -8
10      558      753        -2         8
# ℹ 336,766 more rows

flights %>% 
  select(!year:day & !sched_dep_time & !sched_arr_time & !carrier:time_hour)

# A tibble: 336,776 × 4
   dep_time dep_delay arr_time arr_delay
      <int>     <dbl>    <int>     <dbl>
 1      517         2      830        11
 2      533         4      850        20
 3      542         2      923        33
 4      544        -1     1004       -18
 5      554        -6      812       -25
 6      554        -4      740        12
 7      555        -5      913        19
 8      557        -3      709       -14
 9      557        -3      838        -8
10      558        -2      753         8
# ℹ 336,766 more rows

Exercise 3.3.5.3

What happens if you specify the name of the same variable multiple times in a select() call?

3 answer: that variable only shows up in one column regardless, this does not produce an error or cause it to be selected twice.

flights %>% 
  select(year, year)

# A tibble: 336,776 × 1
    year
   <int>
 1  2013
 2  2013
 3  2013
 4  2013
 5  2013
 6  2013
 7  2013
 8  2013
 9  2013
10  2013
# ℹ 336,766 more rows

Exercise 3.3.5.4

What does the any_of() function do? Why might it be helpful in conjunction with this vector?

4 answer: it is a selection helper that allows us to select any variables with names matching those listed in the vector. It is similar to the “all_of” selection helper, but unlike “all_of”, it won’t generate an error if any element of the vector is missing from the dataset.

variables <- c("year", "month", "day", "dep_delay", "arr_delay")

flights %>% 
  select(all_of(variables))

# A tibble: 336,776 × 5
    year month   day dep_delay arr_delay
   <int> <int> <int>     <dbl>     <dbl>
 1  2013     1     1         2        11
 2  2013     1     1         4        20
 3  2013     1     1         2        33
 4  2013     1     1        -1       -18
 5  2013     1     1        -6       -25
 6  2013     1     1        -4        12
 7  2013     1     1        -5        19
 8  2013     1     1        -3       -14
 9  2013     1     1        -3        -8
10  2013     1     1        -2         8
# ℹ 336,766 more rows

Exercise 3.3.5.5

Does the result of running the following code surprise you? How do the select helpers deal with upper and lower case by default? How can you change that default?

5 answer: selection helpers in dplyr have ignore.case set to TRUE, meaning case will be ignored when searching for variable names that match the specified string. The default can be changed by setting ignore.case to FALSE.

flights %>% select(contains("TIME"))

# A tibble: 336,776 × 6
   dep_time sched_dep_time arr_time sched_arr_time air_time time_hour          
      <int>          <int>    <int>          <int>    <dbl> <dttm>             
 1      517            515      830            819      227 2013-01-01 05:00:00
 2      533            529      850            830      227 2013-01-01 05:00:00
 3      542            540      923            850      160 2013-01-01 05:00:00
 4      544            545     1004           1022      183 2013-01-01 05:00:00
 5      554            600      812            837      116 2013-01-01 06:00:00
 6      554            558      740            728      150 2013-01-01 05:00:00
 7      555            600      913            854      158 2013-01-01 06:00:00
 8      557            600      709            723       53 2013-01-01 06:00:00
 9      557            600      838            846      140 2013-01-01 06:00:00
10      558            600      753            745      138 2013-01-01 06:00:00
# ℹ 336,766 more rows

flights %>%  select(contains("TIME", ignore.case = FALSE))

# A tibble: 336,776 × 0

Exercise 3.3.5.6

Rename air_time to air_time_min to indicate units of measurement and move it to the beginning of the data frame.

flights %>% 
  rename(air_time_min = air_time) %>% 
  relocate(air_time_min)

# A tibble: 336,776 × 19
   air_time_min  year month   day dep_time sched_dep_time dep_delay arr_time
          <dbl> <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1          227  2013     1     1      517            515         2      830
 2          227  2013     1     1      533            529         4      850
 3          160  2013     1     1      542            540         2      923
 4          183  2013     1     1      544            545        -1     1004
 5          116  2013     1     1      554            600        -6      812
 6          150  2013     1     1      554            558        -4      740
 7          158  2013     1     1      555            600        -5      913
 8           53  2013     1     1      557            600        -3      709
 9          140  2013     1     1      557            600        -3      838
10          138  2013     1     1      558            600        -2      753
# ℹ 336,766 more rows
# ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Exercise 3.3.5.7

Why doesn’t the following work, and what does the error mean?

7 answer: the error means that arr_delay is not included in the dataset. This is because selecting tailnum removes every other variable, including arr_delay, from the dataset. This could be revised with the following code:

# flights %>% 
#   select(tailnum) %>% 
#   arrange(arr_delay)

flights %>% 
  select(tailnum, arr_delay) %>% 
  arrange(arr_delay)

# A tibble: 336,776 × 2
   tailnum arr_delay
   <chr>       <dbl>
 1 N843VA        -86
 2 N840VA        -79
 3 N851UA        -75
 4 N3KCAA        -75
 5 N551AS        -74
 6 N24212        -73
 7 N3760C        -71
 8 N806UA        -71
 9 N805JB        -71
10 N855VA        -70
# ℹ 336,766 more rows