pacman::p_load(tidyverse, nycflights13)Chapter 3
Chapter 3 of R4DS teaches data transformation using packages (such as dplyr) within the tidyverse metapackage. This page will work through a subset of the chapter’s prompts. I will use the nycflights13 dataset to go through some basic data cleaning procedures. I’ll set up by loading the tidyverse and nycflights13 packages.
3.2.5 Exercises:
Exercise 3.2.5.1
In a single pipeline for each condition, find all flights that meet the condition:
Had an arrival delay of two or more hours
Answer: 10,200 flights had a delay of 2+ hours
flights %>%
filter(arr_delay >= 120) %>%
count()# A tibble: 1 × 1
n
<int>
1 10200
Flew to Houston (IAH or HOU)
Answer: 9,313 flights flew to Houston
flights %>%
filter(dest %in% c("IAH", "HOU")) %>%
count()# A tibble: 1 × 1
n
<int>
1 9313
Were operated by United, American, or Delta
Answer: 139,504 flights were operated by these airlines
flights %>%
filter(carrier %in% c("UA", "AA", "DL")) %>%
count()# A tibble: 1 × 1
n
<int>
1 139504
Departed in summer (July, August, and September)
Answer: 86,326 flights left during the summer
flights %>%
filter(month %in% c(7, 8, 9)) %>%
count()# A tibble: 1 × 1
n
<int>
1 86326
Arrived more than two hours late, but didn’t leave late
Answer: 29 flights left on time and arrived 2+ hours late
flights %>%
filter(dep_delay <= 0, arr_delay > 120) %>%
count()# A tibble: 1 × 1
n
<int>
1 29
Were delayed by at least an hour, but made up over 30 minutes in flight
Answer: 1,844 flights left an hour late but made up 30 minutes in flight
flights %>%
filter(dep_delay >= 60 & dep_delay - arr_delay > 30) %>%
count()# A tibble: 1 × 1
n
<int>
1 1844
Exercise 3.2.5.2
Sort flights to find the flights with longest departure delays. Find the flights that left earliest in the morning.
Answer: The longest delay was 1,301 minutes and many flights left at 12:01AM.
flights %>%
arrange(desc(dep_delay))# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 9 641 900 1301 1242 1530
2 2013 6 15 1432 1935 1137 1607 2120
3 2013 1 10 1121 1635 1126 1239 1810
4 2013 9 20 1139 1845 1014 1457 2210
5 2013 7 22 845 1600 1005 1044 1815
6 2013 4 10 1100 1900 960 1342 2211
7 2013 3 17 2321 810 911 135 1020
8 2013 6 27 959 1900 899 1236 2226
9 2013 7 22 2257 759 898 121 1026
10 2013 12 5 756 1700 896 1058 2020
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>%
arrange(dep_time)# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 13 1 2249 72 108 2357
2 2013 1 31 1 2100 181 124 2225
3 2013 11 13 1 2359 2 442 440
4 2013 12 16 1 2359 2 447 437
5 2013 12 20 1 2359 2 430 440
6 2013 12 26 1 2359 2 437 440
7 2013 12 30 1 2359 2 441 437
8 2013 2 11 1 2100 181 111 2225
9 2013 2 24 1 2245 76 121 2354
10 2013 3 8 1 2355 6 431 440
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Exercise 3.2.5.3
Sort flights to find the fastest flights. (Hint: Try including a math calculation inside of your function.)
Answer: This is a trickier question than it might seem at first. With the coding skills developed in chapters 1-3, the easy solution would be to run the code below on tab 1, which subtracts dep_time from arr_time and then sorts the values in ascending order. I’ve retained the first 5 records for display.
The problem with the easy way is that these columns are formatted as numbers, not times, and it generates nonsensical numbers, and sometimes produces negative numbers for flights that land the day after they depart. Look at the values in the flight_time column in tab 1. This method gives us the wrong answer because there are flights with small positive values that are actually the shortest, but we aren’t finding them because we are focusing on flights that are at the top of the list because our simple formula produced negative values.
If we do it the hard way, we need to account for the date change. To do this, I created datetime variables for both the departure and the arrival hours and minutes. This requires some modulus math (“%/%” and “%%”) to get the times in a useful format, which is covered in R4DS Chapter 17. Once we have the departure and arrival times in dttm format, we can calculate the time difference (using difftime). We can then correct for instances where a flight lands the next day by revising any flights with negative difftimes, where we would then add one day to the value by adding 86,400 seconds (using if_else logic). Now, we can see the fastest flight in the dataset was 33 minutes.
flights %>%
mutate(flight_time = arr_time - dep_time) %>%
arrange(flight_time) %>%
slice_head(n = 5) %>%
select(year, month, day, dep_time, arr_time, flight_time)# A tibble: 5 × 6
year month day dep_time arr_time flight_time
<int> <int> <int> <int> <int> <int>
1 2013 7 17 2400 54 -2346
2 2013 12 9 2400 59 -2341
3 2013 6 12 2338 17 -2321
4 2013 12 29 2332 14 -2318
5 2013 11 6 2335 18 -2317
flights %>%
mutate(
dep_hour = dep_time %/% 100, #modulus math to obtain hour
dep_minute = dep_time %% 100,#modulus math to obtain minute
arr_hour = arr_time %/% 100, #modulus math to obtain hour
arr_minute = arr_time %% 100 #modulus math to obtain minute
) %>%
mutate(departure_time_proper = as.POSIXct( #make dttm departure variable
make_datetime(year,
month,
day,
dep_hour,
dep_minute)
),
arrival_time_proper = as.POSIXct( #make dttm arrival variable
make_datetime(year,
month,
day,
arr_hour,
arr_minute)
)) %>%
#calculate difference between departure and arrival
mutate(difftime_pre = difftime(arrival_time_proper,
departure_time_proper),
# then correct negative flight times by adding one day to flights
#that landed a day after they departed;
#note one day is equal to 86,400 seconds
arr_time_next_day_correction = if_else(difftime_pre < 0,
arrival_time_proper + 86400,
arrival_time_proper),
#create revised difftime that accounts for flights landing next day
difftime_post = difftime(arr_time_next_day_correction, departure_time_proper)) %>%
arrange(difftime_post) %>% #sort by shortest flights
slice_head(n = 5) %>% #keep 5 fastest flights for viewing purposes
relocate(year, month, day, dep_time, arr_time, difftime_post, .before = 1)# A tibble: 5 × 28
year month day dep_time arr_time difftime_post sched_dep_time dep_delay
<int> <int> <int> <int> <int> <drtn> <int> <dbl>
1 2013 8 14 1133 1206 33 mins 1148 -15
2 2013 1 5 1323 1358 35 mins 1325 -2
3 2013 2 22 1312 1347 35 mins 1316 -4
4 2013 9 3 1203 1238 35 mins 1153 10
5 2013 3 19 1455 1531 36 mins 1329 86
# ℹ 20 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
# distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, dep_hour <dbl>,
# dep_minute <dbl>, arr_hour <dbl>, arr_minute <dbl>,
# departure_time_proper <dttm>, arrival_time_proper <dttm>,
# difftime_pre <drtn>, arr_time_next_day_correction <dttm>
Exercise 3.2.5.4
Was there a flight on every day of 2013?
4 answer: yes, there was a flight everyday in 2013. You can take a variety of approaches to solve this question. Here are two:
Below we see a list of flights with unique dates each month. We can skim through each month and see that there are flights everyday of the year, because the number of unique flight dates matches the number of days in each calendar month.
Note 2013 is not a leap year, so having 28 unique flight dates in February is consistent with there being 28 calendar days in the month that year.
flights %>%
distinct(month, day) %>%
group_by(month) %>%
count()# A tibble: 12 × 2
# Groups: month [12]
month n
<int> <int>
1 1 31
2 2 28
3 3 31
4 4 30
5 5 31
6 6 30
7 7 31
8 8 31
9 9 30
10 10 31
11 11 30
12 12 31
We can create a vector of dates ranging from 1/1/2013 - 12/31/2013. Then we can group the flight data by dates, select the first flight on each date, and join that record to the corresponding date in the date vector. We can then create a logical variable that tells us if each date does or does not have a corresponding flight record. Then we can summarize the logical variable’s values. Since the code below returns n = 365 under “Flights Occurred Today”, and returns no records under “No Flights Today”, we can conclude that there was a flight everyday in 2013.
flights_with_dep_date <- flights %>%
mutate(departure_date = as.POSIXct( #make dttm departure variable
make_date(year,
month,
day))) %>%
group_by(departure_date) %>%
slice_head()
tibble(seq(as.Date("2013/01/01"), by = "day", length.out = 365)) %>%
rename(departure_date = 1) %>%
left_join(flights_with_dep_date, by = "departure_date") %>%
mutate(date_validation = if_else(is.na(day), "No Flights Today", "Flights Occurred Today")) %>%
group_by(date_validation) %>%
summarize(n = n())# A tibble: 1 × 2
date_validation n
<chr> <int>
1 Flights Occurred Today 365
Exercise 3.2.5.5
Which flights traveled the farthest distance? Which traveled the least distance?
5 answer: flights from JFK (New York) to HNL (Hawaii) had the longest distance, which totaled 4,983 miles. Flights from EWR (New Jersey) to LGA (New York) traveled the shortest distance, which was 17 miles.
flights %>%
arrange(desc(distance)) %>%
slice_head() %>%
select(distance, origin, dest)# A tibble: 1 × 3
distance origin dest
<dbl> <chr> <chr>
1 4983 JFK HNL
flights %>%
arrange(distance) %>%
slice_head() %>%
select(distance, origin, dest)# A tibble: 1 × 3
distance origin dest
<dbl> <chr> <chr>
1 17 EWR LGA
Exercise 3.2.5.6
Does it matter what order you used filter() and arrange() if you’re using both? Why/why not? Think about the results and how much work the functions would have to do.
6 answer: No it does not matter because filter changes the rows that are present while arrange changes the order of the rows. You can do them in either order and the results will be the same.
3.3.5 Exercises:
Exercise 3.3.5.1
Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?
1 answer: sched_dep_time is the time the flight was scheduled to leave while dep_time is when the flight actually departed. dep_delay is the difference between dep_time and sched_dep_time.
Exercise 3.3.5.2
Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.
2 answer: here are 4 ways complete this task:
flights %>%
select(dep_time, dep_delay, arr_time, arr_delay)# A tibble: 336,776 × 4
dep_time dep_delay arr_time arr_delay
<int> <dbl> <int> <dbl>
1 517 2 830 11
2 533 4 850 20
3 542 2 923 33
4 544 -1 1004 -18
5 554 -6 812 -25
6 554 -4 740 12
7 555 -5 913 19
8 557 -3 709 -14
9 557 -3 838 -8
10 558 -2 753 8
# ℹ 336,766 more rows
flights %>% select(starts_with("dep") | starts_with("arr"))# A tibble: 336,776 × 4
dep_time dep_delay arr_time arr_delay
<int> <dbl> <int> <dbl>
1 517 2 830 11
2 533 4 850 20
3 542 2 923 33
4 544 -1 1004 -18
5 554 -6 812 -25
6 554 -4 740 12
7 555 -5 913 19
8 557 -3 709 -14
9 557 -3 838 -8
10 558 -2 753 8
# ℹ 336,766 more rows
flights %>%
select(ends_with("_time") | ends_with("delay")) %>%
select(!starts_with("sched") & !starts_with("air"))# A tibble: 336,776 × 4
dep_time arr_time dep_delay arr_delay
<int> <int> <dbl> <dbl>
1 517 830 2 11
2 533 850 4 20
3 542 923 2 33
4 544 1004 -1 -18
5 554 812 -6 -25
6 554 740 -4 12
7 555 913 -5 19
8 557 709 -3 -14
9 557 838 -3 -8
10 558 753 -2 8
# ℹ 336,766 more rows
flights %>%
select(!year:day & !sched_dep_time & !sched_arr_time & !carrier:time_hour)# A tibble: 336,776 × 4
dep_time dep_delay arr_time arr_delay
<int> <dbl> <int> <dbl>
1 517 2 830 11
2 533 4 850 20
3 542 2 923 33
4 544 -1 1004 -18
5 554 -6 812 -25
6 554 -4 740 12
7 555 -5 913 19
8 557 -3 709 -14
9 557 -3 838 -8
10 558 -2 753 8
# ℹ 336,766 more rows
Exercise 3.3.5.3
What happens if you specify the name of the same variable multiple times in a select() call?
3 answer: that variable only shows up in one column regardless, this does not produce an error or cause it to be selected twice.
flights %>%
select(year, year)# A tibble: 336,776 × 1
year
<int>
1 2013
2 2013
3 2013
4 2013
5 2013
6 2013
7 2013
8 2013
9 2013
10 2013
# ℹ 336,766 more rows
Exercise 3.3.5.4
What does the any_of() function do? Why might it be helpful in conjunction with this vector?
4 answer: it is a selection helper that allows us to select any variables with names matching those listed in the vector. It is similar to the “all_of” selection helper, but unlike “all_of”, it won’t generate an error if any element of the vector is missing from the dataset.
variables <- c("year", "month", "day", "dep_delay", "arr_delay")
flights %>%
select(all_of(variables))# A tibble: 336,776 × 5
year month day dep_delay arr_delay
<int> <int> <int> <dbl> <dbl>
1 2013 1 1 2 11
2 2013 1 1 4 20
3 2013 1 1 2 33
4 2013 1 1 -1 -18
5 2013 1 1 -6 -25
6 2013 1 1 -4 12
7 2013 1 1 -5 19
8 2013 1 1 -3 -14
9 2013 1 1 -3 -8
10 2013 1 1 -2 8
# ℹ 336,766 more rows
Exercise 3.3.5.5
Does the result of running the following code surprise you? How do the select helpers deal with upper and lower case by default? How can you change that default?
5 answer: selection helpers in dplyr have ignore.case set to TRUE, meaning case will be ignored when searching for variable names that match the specified string. The default can be changed by setting ignore.case to FALSE.
flights %>% select(contains("TIME"))# A tibble: 336,776 × 6
dep_time sched_dep_time arr_time sched_arr_time air_time time_hour
<int> <int> <int> <int> <dbl> <dttm>
1 517 515 830 819 227 2013-01-01 05:00:00
2 533 529 850 830 227 2013-01-01 05:00:00
3 542 540 923 850 160 2013-01-01 05:00:00
4 544 545 1004 1022 183 2013-01-01 05:00:00
5 554 600 812 837 116 2013-01-01 06:00:00
6 554 558 740 728 150 2013-01-01 05:00:00
7 555 600 913 854 158 2013-01-01 06:00:00
8 557 600 709 723 53 2013-01-01 06:00:00
9 557 600 838 846 140 2013-01-01 06:00:00
10 558 600 753 745 138 2013-01-01 06:00:00
# ℹ 336,766 more rows
flights %>% select(contains("TIME", ignore.case = FALSE))# A tibble: 336,776 × 0
Exercise 3.3.5.6
Rename air_time to air_time_min to indicate units of measurement and move it to the beginning of the data frame.
flights %>%
rename(air_time_min = air_time) %>%
relocate(air_time_min)# A tibble: 336,776 × 19
air_time_min year month day dep_time sched_dep_time dep_delay arr_time
<dbl> <int> <int> <int> <int> <int> <dbl> <int>
1 227 2013 1 1 517 515 2 830
2 227 2013 1 1 533 529 4 850
3 160 2013 1 1 542 540 2 923
4 183 2013 1 1 544 545 -1 1004
5 116 2013 1 1 554 600 -6 812
6 150 2013 1 1 554 558 -4 740
7 158 2013 1 1 555 600 -5 913
8 53 2013 1 1 557 600 -3 709
9 140 2013 1 1 557 600 -3 838
10 138 2013 1 1 558 600 -2 753
# ℹ 336,766 more rows
# ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Exercise 3.3.5.7
Why doesn’t the following work, and what does the error mean?
7 answer: the error means that arr_delay is not included in the dataset. This is because selecting tailnum removes every other variable, including arr_delay, from the dataset. This could be revised with the following code:
# flights %>%
# select(tailnum) %>%
# arrange(arr_delay)
flights %>%
select(tailnum, arr_delay) %>%
arrange(arr_delay)# A tibble: 336,776 × 2
tailnum arr_delay
<chr> <dbl>
1 N843VA -86
2 N840VA -79
3 N851UA -75
4 N3KCAA -75
5 N551AS -74
6 N24212 -73
7 N3760C -71
8 N806UA -71
9 N805JB -71
10 N855VA -70
# ℹ 336,766 more rows