This is the exploratory analysis for school in session versus break and the impacts on mean road speeds. This purely focused on the change in speeds by road segment. Routing & traveltime will be investigated elsewhere.

1 Setup

1.1 Libraries

The following libraries are used for in this R markdown document.

library(sf)
library(tidyverse)
library(rgeoboundaries)
library(osmextract)
library(tmap)
library(kableExtra)
library(lubridate)
library(DT)

1.2 Data acquisition & preprocess

We use free and open speed traffic data from Uber for Nairobi, available here: https://movement.uber.com/cities/nairobi/downloads/speeds?lang=en-US&tp[y]=2019&tp[q]=1

Uber provides a toolkit software via npm to generate the respective road segments in a geo format. The package is available here: https://www.npmjs.com/package/movement-data-toolkit. We generated the road segments as geojson.

Important: Uber movement data covers the period 2018-2020. Via the attribute osm_way_id the speed traffic information can be linked to the OpenStreetMap (OSM) road network. It is not clear how well the ids from the covered period match with the current OSM data. OSM ids are stable when attributes or the geometry of a object is changed. Howevery newly added data and deleted objects are potentially ommitted when using uber traffic information from years ago but current OSM road network data.

#uber <- read_csv("movement-speeds-quarterly-by-hod-nairobi-2018-Q1.csv")
uber_jan <- read_csv("movement-speeds-hourly-nairobi-2019-1.csv")
uber_april <- read_csv("movement-speeds-hourly-nairobi-2019-4.csv")
nairobi_roads <- st_read("nairobi_2019.geojson")
## Reading layer `nairobi_2019' from data source 
##   `/home/mreinmuth/giscience/git/nairobi_uber/nairobi_2019.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 402236 features and 5 fields
## Geometry type: LINESTRING
## Dimension:     XY
## Bounding box:  xmin: 35.45218 ymin: -8785269 xmax: 17570570 ymax: 0.1982375
## Projected CRS: WGS 84 / Pseudo-Mercator
st_crs(nairobi_roads) <- 4326

We look at periods in February and April of 2019.

February period covers school in session
April period covers the holiday season

We further filter out weekends

holiday <- data.frame(
  date=seq(as.Date("2019-02-06"), as.Date("2019-02-26"), by = "day"),
  weekday=wday(seq(as.Date("2019-02-06"), as.Date("2019-02-26"), by = "day"), label=T),
  day = day(seq(as.Date("2019-02-06"), as.Date("2019-02-26"), by = "day"))) |> 
  filter(!(weekday %in% c("Sat", "Sun"))) 

school <- data.frame(date=seq(as.Date("2019-01-02"), as.Date("2019-01-22"), by = "day"), weekday=wday(seq(as.Date("2019-01-02"), as.Date("2019-01-22"), by = "day"), label=T),
  day = day(seq(as.Date("2019-02-06"), as.Date("2019-02-26"), by = "day")))|> 
  filter(!(weekday %in% c("Sat", "Sun")))


school_holiday <- uber_april |>  
  filter(day %in% holiday$day)

school_insession <- uber_jan |>  
  filter(day %in% school$day) 
  
uber_ed <- rbind(
  school_holiday, school_insession
)

1.3 Post-processing

Aggregate all speed values for every weekday by hour for both periods.

uber_ed <- uber_ed |>
  group_by(
    year,
    month,
    hour,
    segment_id,
    start_junction_id,
    end_junction_id,
    osm_way_id,
    osm_start_node_id,
    osm_end_node_id
  ) |>
  dplyr::summarise(mean_speed_kph = mean(speed_kph_mean, na.rm = TRUE),
            .groups = "drop")

uber_ed$month_fact <- factor(uber_ed$month, labels = c("School Semester", "Holiday"))
uber_ed$hour_fact <- as.factor(uber_ed$hour)

nairobi_roads$osmhighway |> table()
## 
##  living_street       motorway  motorway_link        primary   primary_link 
##             22            496            306           2998            198 
##    residential           road      secondary secondary_link        service 
##         269488           1186          11576            312          12920 
##       tertiary  tertiary_link          trunk     trunk_link   unclassified 
##          15164            146           4014            524          82886

2 Analysis

2.1 Speed by hour of day

Distribution of mean speed values for both periods school in session and holidays for every hour of the day as boxplot.

uber_ed |> 
  ggplot(aes(x = hour_fact, y = mean_speed_kph, fill = month_fact)) +
  geom_boxplot() +
  labs(x = "Hour of the Day",
       y = "Mean Speed kph",
       fill = "Period")

School semester morning and afternoon rush hours appear to impact overall congestion. The key takeaway from this boxplot is that we see a positive divergence in mean kph between the school semester and holiday periods primarily from 6 am to 7 am and from 3 pm to 4 pm. This suggests that the school semester may be negatively impact traffic congestion and that further investigation to evaluate these effects is necessary.

Next we take a closer look at the mornings at 6 and 7 am and the afternoons at 3 and 4 pm.

holiday_morn <- school_holiday |>  
  filter(hour %in% c(6,7)) |> 
  group_by(hour, segment_id, start_junction_id, end_junction_id, 
           osm_way_id, osm_start_node_id, osm_end_node_id) |>  
  dplyr::summarise(mean_speed_kph = mean(speed_kph_mean, na.rm = TRUE), .groups = "drop")

school_morn <- school_insession |> 
  filter(hour %in% c(6,7)) |> 
  group_by(hour, segment_id, start_junction_id, end_junction_id, 
           osm_way_id, osm_start_node_id, osm_end_node_id)  |>  
  dplyr::summarise(mean_speed_kph = mean(speed_kph_mean, na.rm = TRUE), .groups = "drop")

holiday_afternoon <- school_holiday |>  
  filter(hour %in% c(15,16)) |> 
  group_by(hour, segment_id, start_junction_id, end_junction_id, 
           osm_way_id, osm_start_node_id, osm_end_node_id) |>  
  dplyr::summarise(mean_speed_kph = mean(speed_kph_mean, na.rm = TRUE), .groups = "drop")

school_afternoon <- school_insession |> 
  filter(hour %in% c(15,16)) |> 
  group_by(hour, segment_id, start_junction_id, end_junction_id, 
           osm_way_id, osm_start_node_id, osm_end_node_id)  |>  
  dplyr::summarise(mean_speed_kph = mean(speed_kph_mean, na.rm = TRUE), .groups = "drop")

2.2 Mornings

Distribution of road segments according to their mean speed in the mornings

# Add an additional column to each dataframe for grouping
school_morn$group <- "School Semester"
holiday_morn$group <- "Holiday"


# Combine the dataframes
combined_df <- rbind(school_morn, 
                     holiday_morn)

mean_vals <- combined_df |>
  group_by(group) |>
  dplyr::summarise(value = mean(mean_speed_kph, na.rm = TRUE)) |>
  mutate(type = "mean")

median_vals <- combined_df |>
  group_by(group) |>
  dplyr::summarise(value = median(mean_speed_kph, na.rm = TRUE)) |>
  mutate(type = "median")

statistics_df <- bind_rows(mean_vals, median_vals)

# Plot the histogram
ggplot(combined_df, aes(x = mean_speed_kph, fill = group)) +
  geom_histogram(position = "identity", alpha = 0.5, bins = 30) +
  geom_vline(data = statistics_df, aes(xintercept = value, color = group, linetype = type),
             size = 1) +
  theme_minimal() +
  labs(x = "Mean Speed (kph)", 
       y = "Count", 
       fill = "Group",
       color = "Group",
       linetype = "Statistic",
       title = "Distribution of Mean Speed during School v Holiday Mornings") +
  scale_fill_manual(values = c("School Semester" = "dodgerblue3",
                               "Holiday" = "firebrick1")) + 
  scale_color_manual(values = c("School Semester" = "dodgerblue3",
                                "Holiday" = "firebrick1")) +
  scale_linetype_manual(values = c("mean" = "solid",
                                   "median" = "dashed")) +
  theme(legend.position = "top",
        legend.key.size = unit(2, "lines"),  # Increase legend key size
        legend.text = element_text(size = 12)  # Increase legend text size
  )

Concentration of slow roads in morning. A histogram of mean speeds for all road segments show overlapping but distinct distributions, means, and medians between the school semester and holiday period during the morning rush hour. The number of road segments measured for the school semester and holiday period were 16717 and 18356 segments respectively. The school semester had higher numbers of low speed roads despite the lower total road segments measured.

2.3 Afternoon

Distribution of road segments according to their mean speed in the afternoon

# Add an additional column to each dataframe for grouping
school_afternoon$group <- "School Semester"
holiday_afternoon$group <- "Holiday"


# Combine the dataframes
combined_df <- rbind(school_afternoon, 
                     holiday_afternoon)

mean_vals <- combined_df |>
  group_by(group) |>
  dplyr::summarise(value = mean(mean_speed_kph, na.rm = TRUE)) |>
  mutate(type = "mean")

median_vals <- combined_df |>
  group_by(group) |>
  dplyr::summarise(value = median(mean_speed_kph, na.rm = TRUE)) |>
  mutate(type = "median")

statistics_df <- bind_rows(mean_vals, median_vals)

# Plot the histogram
ggplot(combined_df, aes(x = mean_speed_kph, fill = group)) +
  geom_histogram(position = "identity", alpha = 0.5, bins = 30) +
  geom_vline(data = statistics_df, aes(xintercept = value, color = group, linetype = type),
             size = 1) +
  theme_minimal() +
  labs(x = "Mean Speed (kph)", 
       y = "Count", 
       fill = "Group",
       color = "Group",
       linetype = "Statistic",
       title = "Distribution of Mean Speed during School v Holiday Afternoon") +
  scale_fill_manual(values = c("School Semester" = "dodgerblue3",
                               "Holiday" = "firebrick1")) + 
  scale_color_manual(values = c("School Semester" = "dodgerblue3",
                                "Holiday" = "firebrick1")) +
  scale_linetype_manual(values = c("mean" = "solid",
                                   "median" = "dashed")) +
  theme(legend.position = "top",
        legend.key.size = unit(2, "lines"),  # Increase legend key size
        legend.text = element_text(size = 12)  # Increase legend text size
  )

Differences less notable in afternoon. A histogram of mean speeds for all road segments for the afternoon shows less difference between the two distributions and their summary statistics. The number of road segments measured for the school semester and holiday period were 24099 and 27175 segments respectively. The holiday period had higher numbers of high speed road segments as compared to the school semester part of this may be do the higher number of roads measured, but the difference in total road segments measured doesn’t account for the difference in distribution of low and high speed road segments for the holiday period.

morn_diff <- holiday_morn |> 
  left_join(school_morn, by = c("osm_start_node_id", "osm_end_node_id")) |> 
  mutate(mean_speed_kph = mean_speed_kph.x - mean_speed_kph.y) |> 
  filter(is.na(mean_speed_kph) == FALSE)

afternoon_diff <- holiday_afternoon |> 
  left_join(school_afternoon, by = c("osm_start_node_id", "osm_end_node_id")) |> 
  mutate(mean_speed_kph = mean_speed_kph.x - mean_speed_kph.y) |> 
  filter(is.na(mean_speed_kph) == FALSE)

2.4 Mean speed difference distribution mornings and afternoon

morn_diff$period <- "Morning"
afternoon_diff$period <- "Afternoon"


# Combine the dataframes
combined_diff <- rbind(morn_diff, 
                     afternoon_diff)

mean_vals_diff <- combined_diff |>
  group_by(period) |>
  dplyr::summarise(value = mean(mean_speed_kph, na.rm = TRUE)) |>
  mutate(type = "mean")

median_vals_diff <- combined_diff |>
  group_by(period) |>
  dplyr::summarise(value = median(mean_speed_kph, na.rm = TRUE)) |>
  mutate(type = "median")

statistics_diff_df <- bind_rows(mean_vals_diff, median_vals_diff)

# Plot the histogram
ggplot(combined_diff, aes(x = mean_speed_kph, fill = period)) +
  geom_histogram(position = "identity", alpha = 0.5, bins = 50) +
  geom_vline(data = statistics_diff_df, aes(xintercept = value, color = period, linetype = type),
             size = 1) +
  theme_minimal() +
  labs(x = "Difference in Mean Speed (kph)", 
       y = "Count", 
       fill = "Period",
       color = "Period",
       linetype = "Statistic",
       title = "Distribution of Mean kph Difference Holiday v School Semester") +
  scale_fill_manual(values = c("Morning" = "dodgerblue3",
                               "Afternoon" = "firebrick1")) + 
  scale_color_manual(values = c("Morning" = "dodgerblue3",
                                "Afternoon" = "firebrick1")) +
  scale_linetype_manual(values = c("mean" = "solid",
                                   "median" = "dashed")) +
  theme(legend.position = "top",
        legend.key.size = unit(2, "lines"),  # Increase legend key size
        legend.text = element_text(size = 12)  # Increase legend text size
  )

Majority of road segments slower during school semester Road segments during the school semester versus the holiday were both slower on average and slower for the majority of roads during the school semester as compared to the holiday period. The difference in speeds was more pronounced in the morning than it was in the afternoon.

summary_fun <- function(df, name) {
  df |> 
    dplyr::summarise(
      mean = mean(mean_speed_kph, na.rm = TRUE), 
      median = median(mean_speed_kph, na.rm = TRUE),
      se = sd(mean_speed_kph, na.rm = TRUE) / sqrt(n()),
      .groups = "drop"
    ) |> 
    mutate(data_frame = name)
}

summary_df1 <- summary_fun(school_morn, "School Semester Morning")
summary_df2 <- summary_fun(holiday_morn, "Holiday Morning")
summary_df3 <- summary_fun(school_afternoon, "School Semester Afternoon")
summary_df4 <- summary_fun(holiday_afternoon, "Holiday Afternoon")
summary_df5 <- summary_fun(morn_diff, "School v Holiday Morning")
summary_df6 <- summary_fun(afternoon_diff, "School v Holiday Afternoon")


combined_summary <- bind_rows(summary_df1, summary_df2, summary_df3, summary_df4, summary_df5, summary_df6)

pivoted_summary <- combined_summary |>
  pivot_longer(-data_frame, names_to = "statistic", values_to = "value") |>
  pivot_wider(names_from = "data_frame", values_from = "value")

print(pivoted_summary)
## # A tibble: 3 × 7
##   statistic `School Semester Morning` `Holiday Morning` School Semester Aftern…¹
##   <chr>                         <dbl>             <dbl>                    <dbl>
## 1 mean                         34.9              36.7                    31.2   
## 2 median                       32.1              34.1                    28.0   
## 3 se                            0.111             0.104                   0.0893
## # ℹ abbreviated name: ¹`School Semester Afternoon`
## # ℹ 3 more variables: `Holiday Afternoon` <dbl>,
## #   `School v Holiday Morning` <dbl>, `School v Holiday Afternoon` <dbl>

Differences in speed were signficant across time periods Based on a quick eye test despite relatively small differences in speed magnitudally (only a few kph difference) all differences appear significant due to the high road segment sample sizes and small standard errors relative to mean differences

morn_diff <- morn_diff |> 
  left_join(nairobi_roads, by = c("osm_start_node_id" = "osmstartnodeid", "osm_end_node_id" = "osmendnodeid")) |> 
  st_as_sf() |> 
  st_drop_geometry() |> 
  na.omit()

afternoon_diff <- afternoon_diff |> 
  left_join(nairobi_roads, by = c("osm_start_node_id" = "osmstartnodeid", "osm_end_node_id" = "osmendnodeid")) |> 
  st_as_sf() |> 
  st_drop_geometry() |> 
  na.omit()

morn_cutoff_10 <- quantile(morn_diff$mean_speed_kph, 0.90)
morn_cutoff_20 <- quantile(morn_diff$mean_speed_kph, 0.80)

afternoon_cutoff_10 <- quantile(afternoon_diff$mean_speed_kph, 0.90)
afternoon_cutoff_20 <- quantile(afternoon_diff$mean_speed_kph, 0.80)

# Filter rows where 'value' is greater than the cutoff
morn_top_10 <- morn_diff |>
  filter(mean_speed_kph > morn_cutoff_10)

morn_top_20 <- morn_diff |>
  filter(mean_speed_kph > morn_cutoff_20)

morn_top10highway <- morn_top_10 |>
  group_by(osmhighway) |>
  dplyr::summarise(top10_n = n(), .groups = "drop") 

morn_top20highway <- morn_top_20 |>
  group_by(osmhighway) |>
  dplyr::summarise(top20_n = n(), .groups = "drop") 

afternoon_top_10 <- afternoon_diff |>
  filter(mean_speed_kph > afternoon_cutoff_10)

afternoon_top_20 <- afternoon_diff |>
  filter(mean_speed_kph > afternoon_cutoff_20)

afternoon_top10highway <- afternoon_top_10 |>
  group_by(osmhighway) |>
  dplyr::summarise(top10_n = n(), .groups = "drop") 

afternoon_top20highway <- afternoon_top_20  |> 
  group_by(osmhighway)  |> 
  dplyr::summarise(top20_n = n(), .groups = "drop")

binom_test_func <- function(k, n, p) {
  if(k == 0){
    return(1)
  } else {
    p_val <- binom.test(x = k, n = n, p = p, alternative = "two.sided")$p.value
    return(p_val)
  }
}


morn_roadcounts <- morn_diff  |> 
  group_by(osmhighway) |> 
  dplyr::summarise(n = n(), .groups = "drop") 
  

morn_roadcounts <- morn_roadcounts |> 
  left_join(morn_top10highway, by = "osmhighway") |>
  left_join(morn_top20highway, by = "osmhighway") |> 
  mutate(expected_n10 = n * 0.1,
         percent_top10 = top10_n / n,
         expected_n20 = n * 0.2,
         percent_top20 = top20_n / n,
         p10 = expected_n10 / n,
         p20 = expected_n10 / n) |> 
 mutate_at(c(3,4,6), ~replace_na(.,0))

morn_roadcounts <- morn_roadcounts  |> 
  mutate(
    binom_test_pval_top10 = mapply(binom_test_func, top10_n, n, p10),
    binom_test_pval_top20 = mapply(binom_test_func, top20_n, n, p20),
    rep_top10_morn = if_else(top10_n < expected_n10, "underrepresented", "overrepresented"),
    rep_top20_morn = if_else(top20_n < expected_n20, "underrepresented", "overrepresented"))


afternoon_roadcounts <- afternoon_diff  |> 
  group_by(osmhighway) |> 
  dplyr::summarise(n = n(), .groups = "drop") 

afternoon_roadcounts <- afternoon_roadcounts |> 
  left_join(afternoon_top10highway, by = "osmhighway") |>
  left_join(afternoon_top20highway, by = "osmhighway") |> 
  mutate(expected_n10 = n * 0.1,
         percent_top10 = top10_n / n,
         expected_n20 = n * 0.2,
         percent_top20 = top20_n / n,
         p10 = expected_n10 / n,
         p20 = expected_n10 / n) |> 
 mutate_at(c(3,4,6), ~replace_na(.,0))

afternoon_roadcounts <- afternoon_roadcounts  |> 
  mutate(
    binom_test_pval_top10 = mapply(binom_test_func, top10_n, n, p10),
    binom_test_pval_top20 = mapply(binom_test_func, top20_n, n, p20),
    rep_top10_afternoon = if_else(top10_n < expected_n10, "underrepresented", "overrepresented"),
    rep_top20_afternoon = if_else(top20_n < expected_n20, "underrepresented", "overrepresented")
  )

2.5 Binominal test


combined_tables <- left_join(
  select(morn_roadcounts, osmhighway, top10_morn = top10_n, expected10_morn = expected_n10, rep_top10_morn, binom_test_pval_top10_morn = binom_test_pval_top10, top20_morn = top20_n, expected20_morn = expected_n20, rep_top20_morn,
         binom_test_pval_top20_morn = binom_test_pval_top20),
  select(afternoon_roadcounts, osmhighway, top10_afternoon = top10_n, expected10_afternoon = expected_n10, rep_top10_afternoon, binom_test_pval_top10_afternoon = binom_test_pval_top10, top20_afternoon = top20_n, expected20_afternoon = expected_n20, rep_top20_afternoon,
         binom_test_pval_top20_afternoon = binom_test_pval_top20),
  by = "osmhighway"
)


combined_tables |> datatable(width = "100%", extensions = "FixedColumns",
    options = list(
      paging = TRUE, searching = TRUE, info = FALSE,
      sort = TRUE, scrollX = TRUE, fixedColumns = list(leftColumns = 2)
    ))

Binomial test shows that or many road types they are significantly over and underepresented in the top 10 and 20 percentile of road segments with the highest speed differences between the school semester and holiday period To clarify the percentiles here are taking the top 10% and 20% of road segments with the greatest positive mean difference in speed (experienced faster mean speeds kph) between the school semester and the holiday period. These are the roads that saw the greatest reduction in traffic congestion during the holiday period as compared to the school semester. Secondary and primary roads were significantly overrepresented and tertiary and motorways significant underrepresented in the top 10 percentile of road segments with the highest speed differences between the school semester and holiday period in the morning. In the top 20 percentile for the morning, trunk roads and residential became significantly overrepresented while secondary and primary roads remain significantly over represented. Tertiary and motorways continued to be significantly under represented in the morning In the afternoon across both the top 10 and 20 percentiles, motorways, primary and tertiary roads were significantly underrepresented, while residential, secondary and trunk roads were significantly overrepresented.

In conclusion we see varying traffic congestion patterns across the city in the morning and afternoon, but across time periods, secondary roads tend to be highly overrepresented while motorways and tertiary roads tend to be underrepresented.

3 Other

Other analysis/viz that was conducted so far

3.1 Isochrone visualization

Isochrones of 300, 600 and 1200 seconds for Nairobi for every hour in the first quater of 2018. The speed values are aggregated by hour of weekday for the first quarter of 2018.

Nairobi Uber

Charles Hatfield & Marcel Reinmuth

2023-07-28