You might need to install the following packages if you don’t already have them:
install.packages("xtable")
install.packages("ggplot2")
install.packages("gridExtra")
install.packages("grid")
install.packages("evaluate")
Just uncomment the packages you need and run this chunk before you run the remaining ones in this notebook.
Once the libraries are installed, they need to be loaded as follows:
suppressMessages(library(xtable)) # Pretty printing dataframes
suppressMessages(library(ggplot2)) # Plotting
suppressMessages(library(gridExtra))
suppressMessages(library(grid))
The overall goal of this notebook is to explore the National Emissions Inventory database and see what it say about fine particulate matter pollution in the United states over the 10-year period 1999–2008.
load_data <- function(url, file_name) {
temp_file <- tempfile()
download.file(url, temp_file)
unzip(zipfile = temp_file, files = c(file_name), exdir = "./")
data <- readRDS(file_name)
unlink(temp_file)
unlink(file_name)
data
}
summary_scc_pm25 <- load_data("https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2FNEI_data.zip",
"summarySCC_PM25.rds")
source_classification_code <- load_data("https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2FNEI_data.zip",
"Source_Classification_Code.rds")
PM2.5 Emissions Data:
str(summary_scc_pm25)
## 'data.frame': 6497651 obs. of 6 variables:
## $ fips : chr "09001" "09001" "09001" "09001" ...
## $ SCC : chr "10100401" "10100404" "10100501" "10200401" ...
## $ Pollutant: chr "PM25-PRI" "PM25-PRI" "PM25-PRI" "PM25-PRI" ...
## $ Emissions: num 15.714 234.178 0.128 2.036 0.388 ...
## $ type : chr "POINT" "POINT" "POINT" "POINT" ...
## $ year : int 1999 1999 1999 1999 1999 1999 1999 1999 1999 1999 ...
| column | description |
|---|---|
| fips | A five-digit number (represented as a string) indicating the U.S. county |
| SCC | The name of the source as indicated by a digit string (see source code classification table) |
| Pollutant | A string indicating the pollutant |
| Emissions | Amount of PM2.5 emitted, in tons |
| type | The type of source (point, non-point, on-road, or non-road) |
| year | The year of emissions recorded |
PM2.5 Source Types:
| Type | Description |
|---|---|
| POINT | The U.S. Environmental Protection Agency (EPA) defines point source pollution as “any single identifiable source of pollution from which pollutants are discharged, such as a pipe, ditch, ship or factory smokestack”. Factories and sewage treatment plants are two common types of point sources. |
| NON-POINT | Caused by rainfall or snowmelt moving over and through the ground. As the runoff moves, it picks up and carries away natural and human-made pollutants, finally depositing them into lakes, rivers, wetlands, coastal waters and ground waters. |
| ON-ROAD | Includes any air pollution emitted by road mobile sources, such as cars, light duty and heavy duty trucks and buses. |
| NON-ROAD | Includes any air pollution emmited by non-road mobile sources, such as aircraft, motorboats (diesel and gasoline), locomotives and construction equipment. |
Source Classification Code Table:
str(source_classification_code)
## 'data.frame': 11717 obs. of 15 variables:
## $ SCC : Factor w/ 11717 levels "10100101","10100102",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Data.Category : Factor w/ 6 levels "Biogenic","Event",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ Short.Name : Factor w/ 11238 levels "","2,4-D Salts and Esters Prod /Process Vents, 2,4-D Recovery: Filtration",..: 3283 3284 3293 3291 3290 3294 3295 3296 3292 3289 ...
## $ EI.Sector : Factor w/ 59 levels "Agriculture - Crops & Livestock Dust",..: 18 18 18 18 18 18 18 18 18 18 ...
## $ Option.Group : Factor w/ 25 levels "","C/I Kerosene",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Option.Set : Factor w/ 18 levels "","A","B","B1A",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ SCC.Level.One : Factor w/ 17 levels "Brick Kilns",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ SCC.Level.Two : Factor w/ 146 levels "","Agricultural Chemicals Production",..: 32 32 32 32 32 32 32 32 32 32 ...
## $ SCC.Level.Three : Factor w/ 1061 levels "","100% Biosolids (e.g., sewage sludge, manure, mixtures of these matls)",..: 88 88 156 156 156 156 156 156 156 156 ...
## $ SCC.Level.Four : Factor w/ 6084 levels "","(NH4)2 SO4 Acid Bath System and Evaporator",..: 4455 5583 4466 4458 1341 5246 5584 5983 4461 776 ...
## $ Map.To : num NA NA NA NA NA NA NA NA NA NA ...
## $ Last.Inventory.Year: int NA NA NA NA NA NA NA NA NA NA ...
## $ Created_Date : Factor w/ 57 levels "","1/27/2000 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Revised_Date : Factor w/ 44 levels "","1/27/2000 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Usage.Notes : Factor w/ 21 levels ""," ","includes bleaching towers, washer hoods, filtrate tanks, vacuum pump exhausts",..: 1 1 1 1 1 1 1 1 1 1 ...
The only columns that mapper to us are the following:
| column | description |
|---|---|
| SCC | The name of the source as indicated by a digit string |
| Short.Name | Actual name of the PM2.5 source |
The following auxiliary code samples from a data frame and render a data frame into HTML:
render_table <- function(data, digits = 2) {
print(xtable(data, digits = digits), type = "html")
}
sample_data_frame <- function(data, size) {
sample_index <- sample(1:nrow(data), size)
return(data[sample_index, ])
}
PM2.5 Emissions Data:
render_table(sample_data_frame(summary_scc_pm25, 10))
| fips | SCC | Pollutant | Emissions | type | year | |
|---|---|---|---|---|---|---|
| 44231021 | 08059 | 2230071230 | PM25-PRI | 0.03 | ON-ROAD | 2008 |
| 13688492 | 01121 | 2270003040 | PM25-PRI | 0.53 | NON-ROAD | 2005 |
| 3221239 | 55047 | 2265001010 | PM25-PRI | 0.00 | NON-ROAD | 1999 |
| 5383956 | 05059 | 223000115T | PM25-PRI | 0.00 | ON-ROAD | 2002 |
| 5748477 | 17089 | 223007215X | PM25-PRI | 0.08 | ON-ROAD | 2002 |
| 425781 | 13153 | 223007533X | PM25-PRI | 0.27 | ON-ROAD | 1999 |
| 5435134 | 06043 | 220100121T | PM25-PRI | 0.01 | ON-ROAD | 2002 |
| 13986785 | 13195 | 2265003040 | PM25-PRI | 0.00 | NON-ROAD | 2005 |
| 16088403 | 72141 | 2270002060 | PM25-PRI | 0.01 | NON-ROAD | 2005 |
| 3735860 | 48453 | 223007419B | PM25-PRI | 0.02 | ON-ROAD | 1999 |
Source Classification Code Table:
render_table(sample_data_frame(source_classification_code, 10))
| SCC | Data.Category | Short.Name | EI.Sector | Option.Group | Option.Set | SCC.Level.One | SCC.Level.Two | SCC.Level.Three | SCC.Level.Four | Map.To | Last.Inventory.Year | Created_Date | Revised_Date | Usage.Notes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7471 | 30501111 | Point | Concrete Batching /Loading of Dry-batch Truck | Industrial Processes - Storage and Transfer | Industrial Processes | Mineral Products | Concrete Batching | Loading of Dry-batch Truck | |||||||
| 8360 | 30700750 | Point | Plywood Operations /Direct Natural Gas-Fired Dryer: Non-specified Pine Species Veneer | Industrial Processes - Pulp & Paper | Industrial Processes | Pulp and Paper and Wood Products | Plywood Operations | Direct Natural Gas-Fired Dryer: Non-specified Pine Species Veneer | |||||||
| 6067 | 30200767 | Point | Grain Millings /Fiber Drying: Drying Corn Hulls after Separation from Starch & Gluten | Industrial Processes - NEC | Industrial Processes | Food and Agriculture | Grain Millings | Fiber Drying: Drying Corn Hulls after Separation from Starch & Gluten | |||||||
| 4285 | 2701472000 | Biogenic | Biogenic - Barren Land/Beaches (Anderson Land Use Code 72) - Total | Miscellaneous Non-Industrial NEC | Natural Sources | Biogenic | Barren Land/Beaches (Anderson Land Use Code 72) | Total | 11/29/2002 0:00:00 | ||||||
| 7963 | 30510407 | Point | Mineral Prods /Bulk Materials Unloading Op /Scrap Metal | Industrial Processes - Storage and Transfer | Industrial Processes | Mineral Products | Bulk Materials Unloading Operation | Scrap Metal | |||||||
| 10153 | 40701605 | Point | Organic Chem Storage - Fixed Roof Tanks - N-Heptane: Breathing Loss | Industrial Processes - Storage and Transfer | Petroleum and Solvent Evaporation | Organic Chemical Storage | Fixed Roof Tanks - Alkanes (Paraffins) | N-Heptane: Breathing Loss | |||||||
| 10376 | 40714605 | Point | Organic Chem Storage - Fixed Roof Tanks - Tetrahydrofuran: Breathing Loss | Industrial Processes - Storage and Transfer | Petroleum and Solvent Evaporation | Organic Chemical Storage | Fixed Roof Tanks - Miscellaneous | Tetrahydrofuran: Breathing Loss | |||||||
| 10544 | 40729605 | Point | Organic Chem Storage - Floating Roof Tanks - Tetrahydrofuran: Standing Loss | Industrial Processes - Storage and Transfer | Petroleum and Solvent Evaporation | Organic Chemical Storage | Floating Roof Tanks - Miscellaneous | Tetrahydrofuran: Standing Loss | |||||||
| 4678 | 30100607 | Point | Chem Manuf /Charcoal Manuf /Crushing | Industrial Processes - Chemical Manuf | Industrial Processes | Chemical Manufacturing | Charcoal Manufacturing | Crushing | |||||||
| 4497 | 2810001002 | Event | Forest Wildfires - Wildland Fire Use - ANTHROPOG (Restoration),unsp fire stage, ownership | Fires - Wildfires | Miscellaneous Area Sources | Other Combustion | Forest Wildfires | Wildland Fire Use - ANTHROPOG (Restoration),unsp fire stage, ownership | 2005 | 4/10/2007 0:00:00 | 7/27/2008 0:00:00 |
total_emissions_per_year <- aggregate(Emissions ~ year, summary_scc_pm25, sum)
render_table(total_emissions_per_year)
| year | Emissions | |
|---|---|---|
| 1 | 1999 | 7332966.74 |
| 2 | 2002 | 5635780.50 |
| 3 | 2005 | 5454703.08 |
| 4 | 2008 | 3464205.84 |
format_label <- function(emission) {
paste(format(round(emission), big.mark = ",", scientific = FALSE), 'Tons')
}
par(mar = c(4, 1, 4, 1))
total_barplot <- barplot(total_emissions_per_year$Emissions,
names.arg = total_emissions_per_year$year,
main = "Total Emissions from PM2.5 in the U.S.",
xlab ="Year",
yaxt = "n",
ylim = range(0, 8000000),
font.lab = 2,
col = "burlywood1",
border = NA)
text(x = total_barplot, y = total_emissions_per_year$Emissions,
label = sapply(total_emissions_per_year$Emissions, format_label),
pos = 3, cex = 1, col = "black")
The total emissions have gradually decreased from 1999 to 2008.
summary_scc_pm25_baltimore <- summary_scc_pm25[summary_scc_pm25$fips == "24510", ]
total_emissions_per_year_baltimore <- aggregate(Emissions ~ year, summary_scc_pm25_baltimore, sum)
render_table(total_emissions_per_year_baltimore)
| year | Emissions | |
|---|---|---|
| 1 | 1999 | 3274.18 |
| 2 | 2002 | 2453.92 |
| 3 | 2005 | 3091.35 |
| 4 | 2008 | 1862.28 |
par(mfrow = c(1, 2), mar = c(4, 4, 4, 1), oma = c(2, 2, 2, 0))
barplot(total_emissions_per_year_baltimore$Emissions,
names.arg = total_emissions_per_year_baltimore$year,
xlab = NA,
ylab = NA,
ylim = range(0, 3500),
las = 1,
font.lab = 2,
col = "burlywood1",
border = NA)
with(total_emissions_per_year_baltimore,
plot(year, Emissions,
xlab = NA,
ylab = NA,
xlim = range(1998, 2009),
ylim = range(0, 3500),
pch = 20,
cex = 3,
col = "burlywood2",
font.lab = 2,
las = 1))
with(total_emissions_per_year_baltimore,
abline(lm(Emissions ~ year),
lwd = 4,
col = "burlywood4",
lty = "dotted"))
mtext("Total Emissions from PM2.5 in Baltimore City, Maryland", outer = TRUE, cex = 1.5, font = 2)
mtext(side = 1, text = "Year", outer = TRUE, cex = 1.2, font = 2)
mtext(side = 2, text = "Tons", outer = TRUE, cex = 1.2, font = 2)
Emissons gradually decreased from 1999 to 2008, with 2005 defying the trend (increased when compared to 2002), but sharply decreased in 2008 to smaller levels than 2002. Overall, emissions are in a decreasing trend, as we can verify by the linear regression plot.
Use the ggplot2 plotting system to make a plot answer this question.
total_emissions_per_year_and_type_baltimore <- aggregate(Emissions ~ year + type, summary_scc_pm25_baltimore, sum)
render_table(total_emissions_per_year_and_type_baltimore)
| year | type | Emissions | |
|---|---|---|---|
| 1 | 1999 | NONPOINT | 2107.62 |
| 2 | 2002 | NONPOINT | 1509.50 |
| 3 | 2005 | NONPOINT | 1509.50 |
| 4 | 2008 | NONPOINT | 1373.21 |
| 5 | 1999 | NON-ROAD | 522.94 |
| 6 | 2002 | NON-ROAD | 240.85 |
| 7 | 2005 | NON-ROAD | 248.93 |
| 8 | 2008 | NON-ROAD | 55.82 |
| 9 | 1999 | ON-ROAD | 346.82 |
| 10 | 2002 | ON-ROAD | 134.31 |
| 11 | 2005 | ON-ROAD | 130.43 |
| 12 | 2008 | ON-ROAD | 88.28 |
| 13 | 1999 | POINT | 296.80 |
| 14 | 2002 | POINT | 569.26 |
| 15 | 2005 | POINT | 1202.49 |
| 16 | 2008 | POINT | 344.98 |
ggplot(total_emissions_per_year_and_type_baltimore, aes(year, Emissions)) +
geom_point(col = "burlywood4") +
geom_smooth(method="lm", col = "burlywood4", fill = "burlywood2") +
facet_wrap(~ type, scales = "free") +
ggtitle("Total Emissions from PM2.5 in Baltimore City, Maryland per Source Type") +
xlab("Year") +
ylab("Tons") +
theme(plot.title = element_text(size = 16, face = "bold", hjust = 0.5, margin = margin(b = 30, unit = "pt"))) +
theme(plot.margin = unit(c(1, 1, 1, 1), "cm"))
All but point sources of emissions have decreased over the period from 1999 to 2008.
motor_vehicles_scc_codes <- source_classification_code$SCC[grepl("[Vv]ehicle",
source_classification_code$EI.Sector)]
head(motor_vehicles_scc_codes)
## [1] 2201001000 2201001110 2201001111 2201001112 2201001113 2201001114
## 11717 Levels: 10100101 10100102 10100201 10100202 10100203 ... 79900101
summary_motor_vehicles_pm25 <- subset(summary_scc_pm25, SCC %in% motor_vehicles_scc_codes)
summary_motor_vehicles_pm25_baltimore <- subset(summary_motor_vehicles_pm25, fips == "24510")
total_motor_vehicles_emissions_per_year_baltimore <- aggregate(Emissions ~ year, summary_motor_vehicles_pm25_baltimore, sum)
render_table(total_motor_vehicles_emissions_per_year_baltimore)
| year | Emissions | |
|---|---|---|
| 1 | 1999 | 346.82 |
| 2 | 2002 | 134.31 |
| 3 | 2005 | 130.43 |
| 4 | 2008 | 88.28 |
par(mar = c(4, 4, 4, 1))
with(total_motor_vehicles_emissions_per_year_baltimore,
plot(year, Emissions,
main = "Total Emissions from Motor Vehicles PM2.5 in Baltimore City, Maryland",
xlab ="Year",
ylab = "Tons",
xlim = range(1998, 2009),
ylim = range(80, 400),
pch = 20,
cex = 4,
col = "burlywood2",
font.lab = 2,
las = 1))
with(total_motor_vehicles_emissions_per_year_baltimore,
lines(year, y = Emissions,
lwd = 4,
col = "burlywood2",
type = "l"))
with(total_motor_vehicles_emissions_per_year_baltimore,
abline(lm(Emissions ~ year),
lwd = 3,
col = "burlywood3",
lty = "dotted"))
with(total_motor_vehicles_emissions_per_year_baltimore,
text(x = year + 0.5, y = Emissions + 30,
label = sapply(Emissions, format_label)))
It’s been consistenly decreasing from 1999 to 2008.
summary_motor_vehicles_pm25_baltimore_and_los_angeles <- subset(summary_motor_vehicles_pm25, fips == "24510" | fips == "06037")
summary_motor_vehicles_pm25_baltimore_and_los_angeles$city <-
ifelse(summary_motor_vehicles_pm25_baltimore_and_los_angeles$fips == "24510",
"Baltimore City, Maryland", "Los Angeles, California")
total_motor_vehicles_emissions_per_year_baltimore_and_los_angeles <-
aggregate(Emissions ~ year + city, summary_motor_vehicles_pm25_baltimore_and_los_angeles, sum)
render_table(total_motor_vehicles_emissions_per_year_baltimore_and_los_angeles)
| year | city | Emissions | |
|---|---|---|---|
| 1 | 1999 | Baltimore City, Maryland | 346.82 |
| 2 | 2002 | Baltimore City, Maryland | 134.31 |
| 3 | 2005 | Baltimore City, Maryland | 130.43 |
| 4 | 2008 | Baltimore City, Maryland | 88.28 |
| 5 | 1999 | Los Angeles, California | 3931.12 |
| 6 | 2002 | Los Angeles, California | 4274.03 |
| 7 | 2005 | Los Angeles, California | 4601.41 |
| 8 | 2008 | Los Angeles, California | 4101.32 |
fit_baltimore <- lm(Emissions ~ year,
data = subset(total_motor_vehicles_emissions_per_year_baltimore_and_los_angeles,
city == "Baltimore City, Maryland"))
fit_los_angeles <- lm(Emissions ~ year,
data = subset(total_motor_vehicles_emissions_per_year_baltimore_and_los_angeles,
city == "Los Angeles, California"))
change_baltimore <- coef(fit_baltimore)[2]
change_los_angeles <- coef(fit_los_angeles)[2]
format_change_label <- function(change) {
paste(format(round(as.numeric(change)), big.mark = ",", scientific = FALSE), "Tons / Year")
}
change_labels <- data.frame(year = 2004,
Emissions = Inf,
change = c(change_baltimore, change_los_angeles),
city = c("Baltimore City, Maryland", "Los Angeles, California"))
change_labels$change <- sapply(change_labels$change, format_change_label)
render_table(change_labels)
| year | Emissions | change | city | |
|---|---|---|---|---|
| 1 | 2004.00 | Inf | -26 Tons / Year | Baltimore City, Maryland |
| 2 | 2004.00 | Inf | 28 Tons / Year | Los Angeles, California |
The barplot will contrast the difference in emission levels between Baltimore and Los Angeles:
motor_vehicles_emissions_per_year_baltimore_and_los_angeles_barplot <-
ggplot(total_motor_vehicles_emissions_per_year_baltimore_and_los_angeles,
aes(x = as.factor(year), y = Emissions, fill = city)) +
geom_bar(stat = "identity", position=position_dodge(), alpha = 0.5) +
ylim(0, 5000) +
xlab("Year") +
ylab("Tons") +
theme(plot.margin = unit(c(1, 1, 1, 1), "cm")) +
theme(legend.position = "bottom") +
scale_fill_manual("legend", values = c("Baltimore City, Maryland" = "burlywood2", "Los Angeles, California" = "burlywood4"))
The scatterplots with regression lines will show how the levels change over the years:
motor_vehicles_emissions_per_year_baltimore_and_los_angeles_scatterplot <-
ggplot(total_motor_vehicles_emissions_per_year_baltimore_and_los_angeles, aes(year, Emissions)) +
geom_point(col = "burlywood4") +
geom_smooth(method="lm", col = "burlywood4", fill = "burlywood2") +
facet_wrap(~ city, scales = "free") +
geom_text(aes(year, Emissions, label = change), data = change_labels, vjust = 1) +
xlab("Year") +
ylab("Tons") +
theme(plot.margin = unit(c(1, 1, 2, 1), "cm"))
Here are both plots sharing the same grid:
grid.arrange(motor_vehicles_emissions_per_year_baltimore_and_los_angeles_barplot,
motor_vehicles_emissions_per_year_baltimore_and_los_angeles_scatterplot, ncol = 1,
top = textGrob("Total Emissions from Motor Vehicle PM2.5",
gp = gpar(fontsize = 16, fontface = "bold")))