Installing the Required Packages

You might need to install the following packages if you don’t already have them:

install.packages("xtable")
install.packages("ggplot2")
install.packages("gridExtra")
install.packages("grid")
install.packages("evaluate")

Just uncomment the packages you need and run this chunk before you run the remaining ones in this notebook.

Importing the Required Packages

Once the libraries are installed, they need to be loaded as follows:

suppressMessages(library(xtable))  # Pretty printing dataframes
suppressMessages(library(ggplot2)) # Plotting
suppressMessages(library(gridExtra))
suppressMessages(library(grid))

Overview

The overall goal of this notebook is to explore the National Emissions Inventory database and see what it say about fine particulate matter pollution in the United states over the 10-year period 1999–2008.

Loading Data

load_data <- function(url, file_name) {
  temp_file <- tempfile()
  download.file(url, temp_file)
  unzip(zipfile = temp_file, files = c(file_name), exdir = "./")
  data <- readRDS(file_name)
  unlink(temp_file)
  unlink(file_name)
  data
}

summary_scc_pm25 <- load_data("https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2FNEI_data.zip",
                              "summarySCC_PM25.rds")

source_classification_code <- load_data("https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2FNEI_data.zip",
                                        "Source_Classification_Code.rds")

Data Types

PM2.5 Emissions Data:

str(summary_scc_pm25)
## 'data.frame':    6497651 obs. of  6 variables:
##  $ fips     : chr  "09001" "09001" "09001" "09001" ...
##  $ SCC      : chr  "10100401" "10100404" "10100501" "10200401" ...
##  $ Pollutant: chr  "PM25-PRI" "PM25-PRI" "PM25-PRI" "PM25-PRI" ...
##  $ Emissions: num  15.714 234.178 0.128 2.036 0.388 ...
##  $ type     : chr  "POINT" "POINT" "POINT" "POINT" ...
##  $ year     : int  1999 1999 1999 1999 1999 1999 1999 1999 1999 1999 ...
column description
fips A five-digit number (represented as a string) indicating the U.S. county
SCC The name of the source as indicated by a digit string (see source code classification table)
Pollutant A string indicating the pollutant
Emissions Amount of PM2.5 emitted, in tons
type The type of source (point, non-point, on-road, or non-road)
year The year of emissions recorded

PM2.5 Source Types:

Type Description
POINT The U.S. Environmental Protection Agency (EPA) defines point source pollution as “any single identifiable source of pollution from which pollutants are discharged, such as a pipe, ditch, ship or factory smokestack”. Factories and sewage treatment plants are two common types of point sources.
NON-POINT Caused by rainfall or snowmelt moving over and through the ground. As the runoff moves, it picks up and carries away natural and human-made pollutants, finally depositing them into lakes, rivers, wetlands, coastal waters and ground waters.
ON-ROAD Includes any air pollution emitted by road mobile sources, such as cars, light duty and heavy duty trucks and buses.
NON-ROAD Includes any air pollution emmited by non-road mobile sources, such as aircraft, motorboats (diesel and gasoline), locomotives and construction equipment.


Source Classification Code Table:

str(source_classification_code)
## 'data.frame':    11717 obs. of  15 variables:
##  $ SCC                : Factor w/ 11717 levels "10100101","10100102",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Data.Category      : Factor w/ 6 levels "Biogenic","Event",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ Short.Name         : Factor w/ 11238 levels "","2,4-D Salts and Esters Prod /Process Vents, 2,4-D Recovery: Filtration",..: 3283 3284 3293 3291 3290 3294 3295 3296 3292 3289 ...
##  $ EI.Sector          : Factor w/ 59 levels "Agriculture - Crops & Livestock Dust",..: 18 18 18 18 18 18 18 18 18 18 ...
##  $ Option.Group       : Factor w/ 25 levels "","C/I Kerosene",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Option.Set         : Factor w/ 18 levels "","A","B","B1A",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ SCC.Level.One      : Factor w/ 17 levels "Brick Kilns",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ SCC.Level.Two      : Factor w/ 146 levels "","Agricultural Chemicals Production",..: 32 32 32 32 32 32 32 32 32 32 ...
##  $ SCC.Level.Three    : Factor w/ 1061 levels "","100% Biosolids (e.g., sewage sludge, manure, mixtures of these matls)",..: 88 88 156 156 156 156 156 156 156 156 ...
##  $ SCC.Level.Four     : Factor w/ 6084 levels "","(NH4)2 SO4 Acid Bath System and Evaporator",..: 4455 5583 4466 4458 1341 5246 5584 5983 4461 776 ...
##  $ Map.To             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Last.Inventory.Year: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Created_Date       : Factor w/ 57 levels "","1/27/2000 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Revised_Date       : Factor w/ 44 levels "","1/27/2000 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Usage.Notes        : Factor w/ 21 levels ""," ","includes bleaching towers, washer hoods, filtrate tanks, vacuum pump exhausts",..: 1 1 1 1 1 1 1 1 1 1 ...


The only columns that mapper to us are the following:

column description
SCC The name of the source as indicated by a digit string
Short.Name Actual name of the PM2.5 source

Sampling Data

The following auxiliary code samples from a data frame and render a data frame into HTML:

render_table <- function(data, digits = 2) {
  print(xtable(data, digits = digits), type = "html")
}

sample_data_frame <- function(data, size) {
  sample_index <- sample(1:nrow(data), size)
  return(data[sample_index, ])
}

PM2.5 Emissions Data:

render_table(sample_data_frame(summary_scc_pm25, 10))
fips SCC Pollutant Emissions type year
44231021 08059 2230071230 PM25-PRI 0.03 ON-ROAD 2008
13688492 01121 2270003040 PM25-PRI 0.53 NON-ROAD 2005
3221239 55047 2265001010 PM25-PRI 0.00 NON-ROAD 1999
5383956 05059 223000115T PM25-PRI 0.00 ON-ROAD 2002
5748477 17089 223007215X PM25-PRI 0.08 ON-ROAD 2002
425781 13153 223007533X PM25-PRI 0.27 ON-ROAD 1999
5435134 06043 220100121T PM25-PRI 0.01 ON-ROAD 2002
13986785 13195 2265003040 PM25-PRI 0.00 NON-ROAD 2005
16088403 72141 2270002060 PM25-PRI 0.01 NON-ROAD 2005
3735860 48453 223007419B PM25-PRI 0.02 ON-ROAD 1999


Source Classification Code Table:

render_table(sample_data_frame(source_classification_code, 10))
SCC Data.Category Short.Name EI.Sector Option.Group Option.Set SCC.Level.One SCC.Level.Two SCC.Level.Three SCC.Level.Four Map.To Last.Inventory.Year Created_Date Revised_Date Usage.Notes
7471 30501111 Point Concrete Batching /Loading of Dry-batch Truck Industrial Processes - Storage and Transfer Industrial Processes Mineral Products Concrete Batching Loading of Dry-batch Truck
8360 30700750 Point Plywood Operations /Direct Natural Gas-Fired Dryer: Non-specified Pine Species Veneer Industrial Processes - Pulp & Paper Industrial Processes Pulp and Paper and Wood Products Plywood Operations Direct Natural Gas-Fired Dryer: Non-specified Pine Species Veneer
6067 30200767 Point Grain Millings /Fiber Drying: Drying Corn Hulls after Separation from Starch & Gluten Industrial Processes - NEC Industrial Processes Food and Agriculture Grain Millings Fiber Drying: Drying Corn Hulls after Separation from Starch & Gluten
4285 2701472000 Biogenic Biogenic - Barren Land/Beaches (Anderson Land Use Code 72) - Total Miscellaneous Non-Industrial NEC Natural Sources Biogenic Barren Land/Beaches (Anderson Land Use Code 72) Total 11/29/2002 0:00:00
7963 30510407 Point Mineral Prods /Bulk Materials Unloading Op /Scrap Metal Industrial Processes - Storage and Transfer Industrial Processes Mineral Products Bulk Materials Unloading Operation Scrap Metal
10153 40701605 Point Organic Chem Storage - Fixed Roof Tanks - N-Heptane: Breathing Loss Industrial Processes - Storage and Transfer Petroleum and Solvent Evaporation Organic Chemical Storage Fixed Roof Tanks - Alkanes (Paraffins) N-Heptane: Breathing Loss
10376 40714605 Point Organic Chem Storage - Fixed Roof Tanks - Tetrahydrofuran: Breathing Loss Industrial Processes - Storage and Transfer Petroleum and Solvent Evaporation Organic Chemical Storage Fixed Roof Tanks - Miscellaneous Tetrahydrofuran: Breathing Loss
10544 40729605 Point Organic Chem Storage - Floating Roof Tanks - Tetrahydrofuran: Standing Loss Industrial Processes - Storage and Transfer Petroleum and Solvent Evaporation Organic Chemical Storage Floating Roof Tanks - Miscellaneous Tetrahydrofuran: Standing Loss
4678 30100607 Point Chem Manuf /Charcoal Manuf /Crushing Industrial Processes - Chemical Manuf Industrial Processes Chemical Manufacturing Charcoal Manufacturing Crushing
4497 2810001002 Event Forest Wildfires - Wildland Fire Use - ANTHROPOG (Restoration),unsp fire stage, ownership Fires - Wildfires Miscellaneous Area Sources Other Combustion Forest Wildfires Wildland Fire Use - ANTHROPOG (Restoration),unsp fire stage, ownership 2005 4/10/2007 0:00:00 7/27/2008 0:00:00


Exploratory Data Analysis

Total Emissions in the U.S.

  1. Have total emissions from PM2.5 decreased in the United States from 1999 to 2008? Using the base plotting system, make a plot showing the total PM2.5 emission from all sources for each of the years 1999, 2002, 2005, and 2008.
total_emissions_per_year <- aggregate(Emissions ~ year, summary_scc_pm25, sum)
render_table(total_emissions_per_year)
year Emissions
1 1999 7332966.74
2 2002 5635780.50
3 2005 5454703.08
4 2008 3464205.84


format_label <- function(emission) {
  paste(format(round(emission), big.mark = ",", scientific = FALSE), 'Tons')
}

par(mar = c(4, 1, 4, 1))
total_barplot <- barplot(total_emissions_per_year$Emissions,
                         names.arg = total_emissions_per_year$year,
                         main = "Total Emissions from PM2.5 in the U.S.",
                         xlab ="Year",
                         yaxt = "n",
                         ylim = range(0, 8000000),
                         font.lab = 2,
                         col = "burlywood1",
                         border = NA)

text(x = total_barplot, y = total_emissions_per_year$Emissions,
     label = sapply(total_emissions_per_year$Emissions, format_label),
     pos = 3, cex = 1, col = "black")


The total emissions have gradually decreased from 1999 to 2008.

Total Emissions in Baltimore City

  1. Have total emissions from PM2.5 decreased in the Baltimore City, Maryland (fips == “24510”) from 1999 to 2008? Use the base plotting system to make a plot answering this question.
summary_scc_pm25_baltimore <- summary_scc_pm25[summary_scc_pm25$fips == "24510", ]
total_emissions_per_year_baltimore <- aggregate(Emissions ~ year, summary_scc_pm25_baltimore, sum)
render_table(total_emissions_per_year_baltimore)
year Emissions
1 1999 3274.18
2 2002 2453.92
3 2005 3091.35
4 2008 1862.28


par(mfrow = c(1, 2), mar = c(4, 4, 4, 1), oma = c(2, 2, 2, 0))

barplot(total_emissions_per_year_baltimore$Emissions,
        names.arg = total_emissions_per_year_baltimore$year,
        xlab = NA,
        ylab = NA,
        ylim = range(0, 3500),
        las = 1,
        font.lab = 2,
        col = "burlywood1",
        border = NA)

with(total_emissions_per_year_baltimore,
     plot(year, Emissions,
          xlab = NA,
          ylab = NA,
          xlim = range(1998, 2009),
          ylim = range(0, 3500),
          pch = 20,
          cex = 3,
          col = "burlywood2",
          font.lab = 2,
          las = 1))
with(total_emissions_per_year_baltimore,
     abline(lm(Emissions ~ year),
            lwd = 4,
            col = "burlywood4",
            lty = "dotted"))

mtext("Total Emissions from PM2.5 in Baltimore City, Maryland", outer = TRUE, cex = 1.5, font = 2)

mtext(side = 1, text = "Year", outer = TRUE, cex = 1.2, font = 2)
mtext(side = 2, text = "Tons", outer = TRUE, cex = 1.2, font = 2)


Emissons gradually decreased from 1999 to 2008, with 2005 defying the trend (increased when compared to 2002), but sharply decreased in 2008 to smaller levels than 2002. Overall, emissions are in a decreasing trend, as we can verify by the linear regression plot.

Emissions by Source

  1. Of the four types of sources indicated by the type (point, nonpoint, onroad, nonroad) variable, which of these four sources have seen decreases in emissions from 1999–2008 for Baltimore City? Which have seen increases in emissions from 1999–2008?

Use the ggplot2 plotting system to make a plot answer this question.

total_emissions_per_year_and_type_baltimore <- aggregate(Emissions ~ year + type, summary_scc_pm25_baltimore, sum)
render_table(total_emissions_per_year_and_type_baltimore)
year type Emissions
1 1999 NONPOINT 2107.62
2 2002 NONPOINT 1509.50
3 2005 NONPOINT 1509.50
4 2008 NONPOINT 1373.21
5 1999 NON-ROAD 522.94
6 2002 NON-ROAD 240.85
7 2005 NON-ROAD 248.93
8 2008 NON-ROAD 55.82
9 1999 ON-ROAD 346.82
10 2002 ON-ROAD 134.31
11 2005 ON-ROAD 130.43
12 2008 ON-ROAD 88.28
13 1999 POINT 296.80
14 2002 POINT 569.26
15 2005 POINT 1202.49
16 2008 POINT 344.98


ggplot(total_emissions_per_year_and_type_baltimore, aes(year, Emissions)) +
  geom_point(col = "burlywood4") +
  geom_smooth(method="lm", col = "burlywood4", fill = "burlywood2") +
  facet_wrap(~ type, scales = "free") + 
  ggtitle("Total Emissions from PM2.5 in Baltimore City, Maryland per Source Type") +
  xlab("Year") +
  ylab("Tons") +
  theme(plot.title = element_text(size = 16, face = "bold", hjust = 0.5, margin = margin(b = 30, unit = "pt"))) +
  theme(plot.margin = unit(c(1, 1, 1, 1), "cm"))


All but point sources of emissions have decreased over the period from 1999 to 2008.

Motor Vehicle Sources in Baltimore City

  1. How have emissions from motor vehicle sources changed from 1999–2008 in Baltimore City?
motor_vehicles_scc_codes <- source_classification_code$SCC[grepl("[Vv]ehicle",
                                                                 source_classification_code$EI.Sector)]
head(motor_vehicles_scc_codes)
## [1] 2201001000 2201001110 2201001111 2201001112 2201001113 2201001114
## 11717 Levels: 10100101 10100102 10100201 10100202 10100203 ... 79900101
summary_motor_vehicles_pm25 <- subset(summary_scc_pm25, SCC %in% motor_vehicles_scc_codes)
summary_motor_vehicles_pm25_baltimore <- subset(summary_motor_vehicles_pm25, fips == "24510")
total_motor_vehicles_emissions_per_year_baltimore <- aggregate(Emissions ~ year, summary_motor_vehicles_pm25_baltimore, sum)
render_table(total_motor_vehicles_emissions_per_year_baltimore)
year Emissions
1 1999 346.82
2 2002 134.31
3 2005 130.43
4 2008 88.28


par(mar = c(4, 4, 4, 1))

with(total_motor_vehicles_emissions_per_year_baltimore,
     plot(year, Emissions,
          main = "Total Emissions from Motor Vehicles PM2.5 in Baltimore City, Maryland",
          xlab ="Year",
          ylab = "Tons",
          xlim = range(1998, 2009),
          ylim = range(80, 400),
          pch = 20,
          cex = 4,
          col = "burlywood2",
          font.lab = 2,
          las = 1))

with(total_motor_vehicles_emissions_per_year_baltimore,
     lines(year, y = Emissions,
           lwd = 4,
           col = "burlywood2",
           type = "l"))

with(total_motor_vehicles_emissions_per_year_baltimore,
     abline(lm(Emissions ~ year),
            lwd = 3,
            col = "burlywood3",
            lty = "dotted"))

with(total_motor_vehicles_emissions_per_year_baltimore,
     text(x = year + 0.5, y = Emissions + 30,
          label = sapply(Emissions, format_label)))


It’s been consistenly decreasing from 1999 to 2008.

Motor Vehicle Sources in Los Angeles

  1. Compare emissions from motor vehicle sources in Baltimore City with emissions from motor vehicle sources in Los Angeles County, California (fips == “06037”). Which city has seen greater changes over time in motor vehicle emissions?
summary_motor_vehicles_pm25_baltimore_and_los_angeles <- subset(summary_motor_vehicles_pm25, fips == "24510" | fips == "06037")

summary_motor_vehicles_pm25_baltimore_and_los_angeles$city <-
  ifelse(summary_motor_vehicles_pm25_baltimore_and_los_angeles$fips == "24510",
     "Baltimore City, Maryland", "Los Angeles, California")

total_motor_vehicles_emissions_per_year_baltimore_and_los_angeles <-
  aggregate(Emissions ~ year + city, summary_motor_vehicles_pm25_baltimore_and_los_angeles, sum)

render_table(total_motor_vehicles_emissions_per_year_baltimore_and_los_angeles)
year city Emissions
1 1999 Baltimore City, Maryland 346.82
2 2002 Baltimore City, Maryland 134.31
3 2005 Baltimore City, Maryland 130.43
4 2008 Baltimore City, Maryland 88.28
5 1999 Los Angeles, California 3931.12
6 2002 Los Angeles, California 4274.03
7 2005 Los Angeles, California 4601.41
8 2008 Los Angeles, California 4101.32


fit_baltimore <- lm(Emissions ~ year,
                    data = subset(total_motor_vehicles_emissions_per_year_baltimore_and_los_angeles,
                                  city == "Baltimore City, Maryland"))

fit_los_angeles <- lm(Emissions ~ year,
                    data = subset(total_motor_vehicles_emissions_per_year_baltimore_and_los_angeles,
                                  city == "Los Angeles, California"))
change_baltimore <- coef(fit_baltimore)[2]
change_los_angeles <- coef(fit_los_angeles)[2]

format_change_label <- function(change) {
  paste(format(round(as.numeric(change)), big.mark = ",", scientific = FALSE), "Tons / Year")
}

change_labels <- data.frame(year = 2004,
                            Emissions = Inf,
                            change = c(change_baltimore, change_los_angeles),
                            city = c("Baltimore City, Maryland", "Los Angeles, California"))
change_labels$change <- sapply(change_labels$change, format_change_label)

render_table(change_labels)
year Emissions change city
1 2004.00 Inf -26 Tons / Year Baltimore City, Maryland
2 2004.00 Inf 28 Tons / Year Los Angeles, California


The barplot will contrast the difference in emission levels between Baltimore and Los Angeles:

motor_vehicles_emissions_per_year_baltimore_and_los_angeles_barplot <-
  ggplot(total_motor_vehicles_emissions_per_year_baltimore_and_los_angeles,
         aes(x = as.factor(year), y = Emissions, fill = city)) + 
  geom_bar(stat = "identity", position=position_dodge(), alpha = 0.5) +
  ylim(0, 5000) +
  xlab("Year") +
  ylab("Tons") +
  theme(plot.margin = unit(c(1, 1, 1, 1), "cm")) +
  theme(legend.position = "bottom") +
  scale_fill_manual("legend", values = c("Baltimore City, Maryland" = "burlywood2", "Los Angeles, California" = "burlywood4"))


The scatterplots with regression lines will show how the levels change over the years:

motor_vehicles_emissions_per_year_baltimore_and_los_angeles_scatterplot <-
  ggplot(total_motor_vehicles_emissions_per_year_baltimore_and_los_angeles, aes(year, Emissions)) +
  geom_point(col = "burlywood4") +
  geom_smooth(method="lm", col = "burlywood4", fill = "burlywood2") +
  facet_wrap(~ city, scales = "free") + 
  geom_text(aes(year, Emissions, label = change), data = change_labels, vjust = 1) +
  xlab("Year") +
  ylab("Tons") +
  theme(plot.margin = unit(c(1, 1, 2, 1), "cm"))


Here are both plots sharing the same grid:

grid.arrange(motor_vehicles_emissions_per_year_baltimore_and_los_angeles_barplot,
             motor_vehicles_emissions_per_year_baltimore_and_los_angeles_scatterplot, ncol = 1,
             top = textGrob("Total Emissions from Motor Vehicle PM2.5",
                            gp = gpar(fontsize = 16, fontface = "bold")))