Main analysis (Exploratory Data Analysis)
Time series trend across states
Here we animate a bar chart of the three pollutant variables across states over the past 10 years.
PATTERN DISCOVERY
From the animation we find:
SO2 concentration decreased noticeably over the past 10 years.
NO3 and particle pollution kept a stable distribution over the past 10 years.
Stringent emission regulation can make a positive difference.
Natural factors explain the pollution fluctuation to some degree.
Time Series on chosen states for each pollutant
The Pattern Discovery animation shows that most states experienced a significant drop in SO2 concentration. We therefore chose the top state in 2008 and the top state in 2018 and plotted their time series to examine each state's pattern over the 10-year span.
library(tidyverse)
pollutionData <- read.csv("Dataset/main_pollutant.csv")
PAData <- subset(pollutionData, State == 'PA')
year.month.day.str <- format(as.Date(PAData$Time), "%y%m%d")
year.month.day <- as.Date(year.month.day.str, tryFormats = "%y%m%d")
ggplot(PAData, aes(x = year.month.day, y = SO2)) + geom_line() +
labs(x = "Year", y = "SO2") +
ggtitle("SO2 in Pennsylvania Time Series")
Pennsylvania had the highest average SO2 concentration of 6.51 in 2008, but the number dropped to 0.63 in 2018. From the time series we have above, there is a clear decreasing trend with some seasonal fluctuation.
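The ranking behind this comparison can be reproduced with a short aggregation. The sketch below is a minimal illustration rather than the code used for the report: `top_state_by_year` is a hypothetical helper, it assumes `State`, `Year`, and pollutant columns like those in `pollutionData`, and the toy values are made up for demonstration only.

```r
# Hypothetical helper: state with the highest yearly mean of a pollutant.
# Assumes the columns State, Year and the named pollutant column exist.
top_state_by_year <- function(data, pollutant, year) {
  one_year <- data[data$Year == year, ]
  # Mean concentration per state, ignoring missing measurements
  means <- tapply(one_year[[pollutant]], one_year$State, mean, na.rm = TRUE)
  best <- which.max(means)
  data.frame(State = names(means)[best], Mean = as.numeric(means[best]))
}

# Toy data (made-up values, for illustration only)
toy <- data.frame(
  State = c("PA", "PA", "ME", "ME", "PA", "PA", "ME", "ME"),
  Year  = c(2008, 2008, 2008, 2008, 2018, 2018, 2018, 2018),
  SO2   = c(6, 7, 1, 2, 0.5, 0.7, 1.0, 1.2)
)
top_2008 <- top_state_by_year(toy, "SO2", 2008)
top_2018 <- top_state_by_year(toy, "SO2", 2018)
```

On the real data, the same call would be `top_state_by_year(pollutionData, "SO2", 2008)`, assuming `Year` is stored as a number.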
library(ggplot2)
MEData <- subset(pollutionData, State == 'ME')
year.month.day.str <- format(as.Date(MEData$Time), "%y%m%d")
year.month.day <- as.Date(year.month.day.str, tryFormats = "%y%m%d")
ggplot(MEData, aes(x = year.month.day, y = SO2)) + geom_line() +
labs(x = "Year", y = "Concentration") +
ggtitle("SO2 in Maine Time Series")
In 2008, Maine was not even among the top 10 states, but it unexpectedly became the state with the highest SO2 concentration in 2018. Maine's time series shows that its SO2 concentration was actually quite stable throughout these years, with a single jump in 2018. Note that the absolute concentration in Maine is not very high, so its top ranking in 2018 indicates a large nationwide decrease in SO2 concentration.
As for NO3, the ranking and the concentration level for each state are relatively stable compared to that of SO2, so we decided to choose only Illinois, the state with the highest NO3 concentration in 2018, to study the trend.
ILData <- subset(pollutionData, State == 'IL')
year.month.day.str <- format(as.Date(ILData$Time), "%y%m%d")
year.month.day <- as.Date(year.month.day.str, tryFormats = "%y%m%d")
ggplot(ILData, aes(x = year.month.day, y = NO3)) + geom_line() +
labs(x = "Year", y = "Concentration") +
ggtitle("NO3 in Illinois Time Series")
We could see from the graph above that there is a decreasing trend for NO3 concentration in Illinois, and the concentration value becomes more stable.
For particle pollutants, Florida stands out as the state with the highest concentration throughout these years, so we chose Florida for study.
FLData <- subset(pollutionData, State == 'FL')
year.month.day.str <- format(as.Date(FLData$Time), "%y%m%d")
year.month.day <- as.Date(year.month.day.str, tryFormats = "%y%m%d")
ggplot(FLData, aes(x = year.month.day, y = Particles)) + geom_line() +
labs(x = "Year", y = "Concentration") +
ggtitle("Particle pollutants in Florida Time Series")
Unlike the earlier graphs, the time series of particle pollutants for Florida fluctuates more frequently, and there is no clear increasing or decreasing trend.
Seasonal patterns on concentration level
The time series above show fluctuations in concentration levels, which suggests a possible seasonal pattern. Therefore, for each of the 3 pollutants we choose the 3 states with the highest average concentration level and draw their time series by year. If a seasonal pattern exists, we expect the concentration levels to rise and fall at around the same time each year.
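Overlaying several years on one January-to-December axis requires mapping every observation onto a shared calendar year. The sketch below shows one way to do that in base R. `to_common_year` and its `ref_year` argument are hypothetical names introduced here, and using 2000 as the reference year is an assumption chosen because it is a leap year.

```r
# Hypothetical helper: map dates onto one reference year so that
# different years can be overlaid on a single January-December axis.
# 2000 is a leap year, so 29 February remains a valid date.
to_common_year <- function(dates, ref_year = 2000) {
  dates <- as.Date(dates)
  as.Date(paste0(ref_year, "-", format(dates, "%m-%d")), format = "%Y-%m-%d")
}

obs <- c("2008-01-15", "2012-02-29", "2018-07-01")
common <- to_common_year(obs)
format(common, "%b")  # month labels for the shared axis
```

The result could be supplied as the `x` aesthetic in place of `month.day`, with `Year` still mapped to color.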
Top 3 SO2 States Time Series by Year
# SO2 in OH
OHData <- subset(pollutionData, State == 'OH')
month.day.str <- format(as.Date(OHData$Time), "%m%d")
month.day <- as.Date(month.day.str, tryFormats = "%m%d")
OHData$Year <- as.factor(OHData$Year)
ggplot(OHData, aes(x = month.day, y = SO2)) +
geom_line(aes(color=Year)) +
labs(x = "Month", y = "SO2") +
scale_x_date(date_breaks = "months" , date_labels = "%b") +
ggtitle("SO2 in Ohio Time Series for each year") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
# SO2 in PA
PAData <- subset(pollutionData, State == 'PA')
month.day.str <- format(as.Date(PAData$Time), "%m%d")
month.day <- as.Date(month.day.str, tryFormats = "%m%d")
PAData$Year <- as.factor(PAData$Year)
ggplot(PAData, aes(x = month.day, y = SO2)) +
geom_line(aes(color=Year)) +
labs(x = "Month", y = "SO2") +
scale_x_date(date_breaks = "months" , date_labels = "%b") +
ggtitle("SO2 in Pennsylvania Time Series for each year") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
# SO2 in MD
MDData <- subset(pollutionData, State == 'MD')
month.day.str <- format(as.Date(MDData$Time), "%m%d")
month.day <- as.Date(month.day.str, tryFormats = "%m%d")
MDData$Year <- as.factor(MDData$Year)
ggplot(MDData, aes(x = month.day, y = SO2)) +
geom_line(aes(color=Year)) +
labs(x = "Month", y = "SO2") +
scale_x_date(date_breaks = "months" , date_labels = "%b") +
ggtitle("SO2 in Maryland Time Series for each year") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
For SO2, the curves for each year have a similar shape, and there is a distinct drop in concentration level for every one of the states. There is indeed a seasonal pattern: SO2 concentration tends to fall in summer and rise in winter.
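The visual impression of winter highs and summer lows can be checked numerically by averaging each calendar month across years. The following is a minimal sketch under assumptions: `monthly_profile` is a hypothetical helper, it assumes a numeric `Month` column (1 to 12) as in `pollutionData`, and the toy values are made up.

```r
# Hypothetical helper: average concentration per calendar month across years.
# Assumes a Month column (1-12) and the named pollutant column exist.
monthly_profile <- function(data, pollutant) {
  profile <- aggregate(data[[pollutant]],
                       by = list(Month = data$Month),
                       FUN = mean, na.rm = TRUE)
  names(profile)[2] <- pollutant
  profile[order(profile$Month), ]
}

# Toy data (made-up values): two years with winter highs and summer lows
toy <- data.frame(
  Year  = rep(c(2008, 2009), each = 4),
  Month = rep(c(1, 4, 7, 10), times = 2),
  SO2   = c(8, 5, 2, 5, 6, 3, 2, 3)
)
profile <- monthly_profile(toy, "SO2")
profile$Month[which.max(profile$SO2)]  # month with the highest mean
profile$Month[which.min(profile$SO2)]  # month with the lowest mean
```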
Top 3 NO3 States Time Series by Year
# NO3 in IL
ILData <- subset(pollutionData, State == 'IL')
month.day.str <- format(as.Date(ILData$Time), "%m%d")
month.day <- as.Date(month.day.str, tryFormats = "%m%d")
ILData$Year <- as.factor(ILData$Year)
ggplot(ILData, aes(x = month.day, y = NO3)) +
geom_line(aes(color=Year)) +
labs(x = "Month", y = "NO3") +
scale_x_date(date_breaks = "months" , date_labels = "%b") +
ggtitle("NO3 in Illinois Time Series for each year") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
# NO3 in NJ
NJData <- subset(pollutionData, State == 'NJ')
month.day.str <- format(as.Date(NJData$Time), "%m%d")
month.day <- as.Date(month.day.str, tryFormats = "%m%d")
NJData$Year <- as.factor(NJData$Year)
ggplot(NJData, aes(x = month.day, y = NO3)) +
geom_line(aes(color=Year)) +
labs(x = "Month", y = "NO3") +
scale_x_date(date_breaks = "months" , date_labels = "%b") +
ggtitle("NO3 in New Jersey Time Series for each year") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
# NO3 in IN
INData <- subset(pollutionData, State == 'IN')
month.day.str <- format(as.Date(INData$Time), "%m%d")
month.day <- as.Date(month.day.str, tryFormats = "%m%d")
INData$Year <- as.factor(INData$Year)
ggplot(INData, aes(x = month.day, y = NO3)) +
geom_line(aes(color=Year)) +
labs(x = "Month", y = "NO3") +
scale_x_date(date_breaks = "months" , date_labels = "%b") +
ggtitle("NO3 in Indiana Time Series for each year") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
For NO3, clear patterns can be seen in Illinois and Indiana. The pattern for New Jersey is less clear because some data points, such as April 2008 and November 2012, do not quite fit the pattern, but overall we still believe a seasonal pattern exists for NO3. Similar to SO2, NO3 concentration tends to be lower in summer and higher in winter.
Top 3 Particles Pollutants States Time Series by Year
# Particle in FL
FLData <- subset(pollutionData, State == 'FL')
month.day.str <- format(as.Date(FLData$Time), "%m%d")
month.day <- as.Date(month.day.str, tryFormats = "%m%d")
FLData$Year <- as.factor(FLData$Year)
ggplot(FLData, aes(x = month.day, y = Particles)) +
geom_line(aes(color=Year)) +
labs(x = "Month", y = "Particles") +
scale_x_date(date_breaks = "months" , date_labels = "%b") +
ggtitle("Particles in Florida Time Series for each year") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
# Particle in TX
TXData <- subset(pollutionData, State == 'TX')
month.day.str <- format(as.Date(TXData$Time), "%m%d")
month.day <- as.Date(month.day.str, tryFormats = "%m%d")
TXData$Year <- as.factor(TXData$Year)
ggplot(TXData, aes(x = month.day, y = Particles)) +
geom_line(aes(color=Year)) +
labs(x = "Month", y = "Particles") +
scale_x_date(date_breaks = "months" , date_labels = "%b") +
ggtitle("Particles in Texas Time Series for each year") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
# Particle in TN
TNData <- subset(pollutionData, State == 'TN')
month.day.str <- format(as.Date(TNData$Time), "%m%d")
month.day <- as.Date(month.day.str, tryFormats = "%m%d")
TNData$Year <- as.factor(TNData$Year)
ggplot(TNData, aes(x = month.day, y = Particles)) +
geom_line(aes(color=Year)) +
labs(x = "Month", y = "Particles") +
scale_x_date(date_breaks = "months" , date_labels = "%b") +
ggtitle("Particles in Tennessee Time Series for each year") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
For particle pollutants, the patterns are very different from those of SO2 and NO3. The concentration tends to fluctuate around a certain value throughout the year. Florida's graph shows a rather messy pattern, whereas Texas and Tennessee have comparatively clearer patterns, with highs and lows at around the same time of the year. However, the seasonal patterns are much weaker and less obvious than those of the other two pollutants.
Seasonal Patterns for different pollutants
After seeing seasonal patterns for each pollutant in particular states, we want to see whether the pollutants actually share the same seasonal pattern. Illinois has quite high concentrations of all three pollutants, so we draw a time series of all three pollutants in Illinois on the same graph.
library(dplyr)
library(tidyr)
ILData <- subset(pollutionData, State == 'IL')
tidyIL <- ILData %>% gather(key = "Pollutants", value = "Concentration", -X, -State, -Time, -Year, -Month)
year.month.day.str <- format(as.Date(tidyIL$Time), "%y%m%d")
year.month.day <- as.Date(year.month.day.str, tryFormats = "%y%m%d")
tidyIL$Year <- as.factor(tidyIL$Year)
ggplot(tidyIL, aes(x = year.month.day, y = Concentration)) +
geom_line(aes(color = Pollutants)) +
labs(x = "Time", y = "Concentration") +
ggtitle("All pollutants in Illinois Time Series")
From the graph above, NO3 and SO2 share a very similar seasonal pattern, with highs around winter and lows around summer. Particle pollutants behave differently, with highs around summer and lows around winter.
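A correlation matrix is one simple way to put numbers on how closely the pollutants move together. The sketch below uses simulated monthly series, not the report's data: SO2 and NO3 are constructed to peak in winter and Particles to peak in summer, so the signs of the correlations are a property of the toy data only.

```r
# Toy monthly series (made-up values): SO2 and NO3 peak in winter,
# Particles peaks in summer.
month <- 1:12
toy <- data.frame(
  SO2       = 5 + 3 * cos(2 * pi * (month - 1) / 12),
  NO3       = 2 + 1 * cos(2 * pi * (month - 1) / 12),
  Particles = 4 - 2 * cos(2 * pi * (month - 1) / 12)
)
# Pairwise Pearson correlations; complete.obs skips rows with missing values
cor_matrix <- cor(toy, use = "complete.obs")
round(cor_matrix, 2)
```

On the real data, the equivalent call would be `cor(ILData[, c("SO2", "NO3", "Particles")], use = "complete.obs")`, assuming those three columns are numeric.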
Several reasons could cause the above seasonality. For example, the usage of coal and petroleum is generally higher in the winter and lower in the summer. Thus, the variation in the consumption of combustion materials can directly relate to the variation of SO2 and NO3 concentration. In general, all three types of pollutant show a general correlation with temperature, relative humidity, precipitation, wind circulation, etc.
It is commonly believed that particulate matter is usually lower in summer and higher in winter, because cold winter temperatures directly cause particles to condense, which leads to a higher detection rate. However, our results show that the behavior differs between particle species. With the Shiny application, we found that Mg and K show significantly higher concentrations around summer than at other times of the year. Na behaves in the opposite way for IL but shows higher values around spring for AL. This irregularity is also confirmed by a Harvard research paper published in Atmospheric Environment (Amos P.K. Tai, 2010). In the paper, the author indicates that the correlations with temperature, relative humidity, wind direction, and precipitation differ for individual particulate matter components and their corresponding regions. Thus, at this point, the best way to further uncover a trend is a regression that considers these variables. This could be an interesting future discussion.
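As a rough idea of what such a regression might look like, the sketch below fits an ordinary least squares model with `lm`. Everything in it is an assumption made for illustration: `Temperature` and `Humidity` are hypothetical columns that are not part of `main_pollutant.csv`, and the data are simulated, so the coefficients say nothing about real pollution.

```r
set.seed(1)
# Toy data (simulated, for illustration only). Temperature and Humidity are
# hypothetical columns; they are not part of main_pollutant.csv.
n <- 120
toy <- data.frame(
  Temperature = runif(n, 0, 35),
  Humidity    = runif(n, 20, 90)
)
toy$Particles <- 1 + 0.10 * toy$Temperature - 0.02 * toy$Humidity +
  rnorm(n, sd = 0.5)

# Ordinary least squares with several meteorological predictors
fit <- lm(Particles ~ Temperature + Humidity, data = toy)
coef(fit)
```

With real measurements, the model could be fitted per state and per particle species, since the text above suggests that the relationships differ by component and region.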
Spatial distribution
To see the spatial distribution over states, we draw linked micromaps for the three pollutants and show only the plot for SO2. The maps can be used for further study of geographic clustering trends.
# preprocess Particles, SO2, NO3, calculate mean over all past years
Particles_mean <- aggregate(Particles~State, pollutionData, mean)
SO2_mean <- aggregate(SO2~State, pollutionData, mean)
NO3_mean <- aggregate(NO3~State, pollutionData, mean)
colnames(Particles_mean) <- c("state", "Particles")
colnames(SO2_mean) <- c("state", "SO2")
colnames(NO3_mean) <- c("state", "NO3")
pollutant_mean <- merge(Particles_mean, SO2_mean)
pollutant_mean <- merge(pollutant_mean, NO3_mean)
library(tools)
library(micromap)
library(RColorBrewer) # provides brewer.pal() used for the group colors
data("USstates")
statePolys <- create_map_table(USstates, IDcolumn = "ST")
lmplot(stat.data = pollutant_mean,
map.data = statePolys,
panel.types = c('dot_legend', 'labels', 'dot', 'map'),
panel.data = list(NA, 'state','SO2', NA),
ord.by = 'SO2',
rev.ord = TRUE,
grouping = 5,
colors = brewer.pal(5, "Set1"),
median.row = TRUE,
# how to merge the two data frames:
map.link = c('state', 'ID'),
map.color2 = "white",
# attributes of the panels:
panel.att = list(
list(1, point.type=20,
point.border=TRUE),
list(2, header = "States",
panel.width = 0.2,
panel.header.face = "bold",
panel.header.size = 1.2,
graph.grid.color = "grey99",
align = "left"),
list(3, header = "SO2 Pollution Amount",
panel.header.face = "bold",
panel.header.size = 1.2,
xaxis.title = 'ug/m^3'),
list(4, header = 'Geographical Location',
panel.header.face = "bold",
panel.header.size = 1.2)
))
Industrial and natural influence
SO2 pollutant
According to research by the Department of Environment and Energy, about 99% of the sulfur dioxide and related compounds in the air comes from human sources. For example, the main source of sulfur dioxide in the air can be electricity generation from coal, oil, or gas that contains sulfur. From the graph we can see that OH, PA, MD, NJ, and IN are the top 5 states for SO2-related pollutants. Thus, we predict that those 5 states should have correspondingly high fossil fuel consumption. We consider coal and petroleum as the two key sources that contribute to sulfur oxides.
The above maps show a similar trend across states. TX, CA, LA, NY, PA, IL, and FL are the states with the highest consumption of coal and petroleum. This is consistent with our hypothesis that sulfur oxide concentration has a positive correlation with coal and petroleum consumption.
According to the EPA, the main source of SO2 is the burning of fossil fuels by power plants and other industrial facilities. Referring to the coal consumption distribution, PA and OH have high coal consumption among American states.
High SO2 in MD can be ascribed to the Crane and Wagner coal-fired power plants in Baltimore. These two plants did not install modern emission controls for SO2. This acute problem gave Baltimore the highest SO2 emissions in MD.
High SO2 emission in OH comes from urbanized areas such as Cleveland and Cincinnati and from areas along Lake Erie and the Ohio River, where the predominant sources are electric generating units. According to the report Ohio's 2010 Revised Sulfur Dioxide National Ambient Air Quality Standard Recommended Designations and Nonattainment Boundaries, geographical or topographical barriers can significantly affect the transport of SO2 within its air shed. Emissions from Ohio are transported to the East Coast, driving up the amount of air pollution in states such as Maryland, Delaware, and New Jersey, which are also among the states with the highest emissions-related mortality rates.
What efforts were devoted to relieve air pollution effectively in these three states?
Interestingly, we found that these three states not only neighbor each other but also share the same pattern of SO2 concentration. From 2012 onward, all three states show a clear decrease in SO2 concentration. We asked what lies behind this. After some research, we found that EPA regulations and state environmental policy contributed to the improvement in air quality. Maryland adopted the Clean Cars program and introduced some of the strictest vehicle regulations in the country in an effort to combat emissions. The EPA approved Air Quality Implementation Plans in Delaware, the District of Columbia, Maryland, Pennsylvania, Virginia, and West Virginia. The federal Mercury and Air Toxics Standards (MATS) required coal- and oil-fired plants to install pollution control equipment by April 15, 2015, and this policy explains the clear SO2 drop around 2014. For more information, please refer to the link.
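The drop from 2012 onward can be summarized by comparing each state's mean concentration before and after that year. The sketch below is illustrative only: `compare_periods` is a hypothetical helper, the cutoff of 2012 is taken from the text above, and the toy values are made up rather than drawn from the dataset.

```r
# Hypothetical helper: mean concentration per state before a cutoff year
# and from the cutoff year onward.
# Assumes the columns State, Year and the named pollutant column exist.
compare_periods <- function(data, pollutant, cutoff_year) {
  period <- ifelse(data$Year < cutoff_year, "before", "after")
  means <- tapply(data[[pollutant]], list(data$State, period),
                  mean, na.rm = TRUE)
  data.frame(
    State  = rownames(means),
    Before = as.numeric(means[, "before"]),
    After  = as.numeric(means[, "after"])
  )
}

# Toy data (made-up values, for illustration only)
toy <- data.frame(
  State = rep(c("OH", "PA", "MD"), each = 4),
  Year  = rep(c(2010, 2011, 2013, 2014), times = 3),
  SO2   = c(8, 7, 4, 3, 6, 6, 3, 2, 5, 4, 2, 2)
)
change <- compare_periods(toy, "SO2", cutoff_year = 2012)
change$Drop <- change$Before - change$After
```

The helper expects observations on both sides of the cutoff; with only one period present, the column lookup would fail.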
As mentioned above, geographical and weather factors such as upwind sources affect the East Coast. Thus, when air pollution in OH was reduced, the SO2 concentration in East Coast states such as MD decreased as well.
Reflection about the research:
Stringent emission regulation can make a positive difference to air pollution because coal consumption and manufacturing emissions are the main sources of SO2.
Beyond that, we need to take weather and geographic features into consideration. Because wind flow carries pollutants, limiting pollution at the upwind source can be an effective way to relieve air pollution in downwind areas.
Particles pollutant
Observing the annual fluctuation of particle concentration, we notice that, unlike SO2, particle pollution has a predictable distribution, and the top 5 states with the highest particle pollution are relatively stable. In the past 10 years, the total amount of particle pollution has not decreased noticeably. Florida (FL) and Texas (TX) both have quite high particle pollution, so we take FL as an example to explore the fluctuation of particle pollution.
Why does FL have high particle pollution?
Referring to coal consumption in 2016, we found that FL has high coal consumption, which can be one of the reasons. Beyond that, we guess that the high metal concentration in the air may be due to the ocean around FL. The main metal elements in the ocean, Mg, K, Na, and Ca, may transfer into the air. The third reason for high particle concentration is temperature. Metal elements belong to PM2.5, and according to Wikipedia, PM2.5 increases as temperature increases. FL and TX are located in the south of the USA, border the ocean, and have high coal consumption, so both have a very high particle concentration in the air.
Why does particle pollution have a stable distribution?
According to Wikipedia, human-produced (fossil fuel) pollution is largely responsible for the areas of small aerosols over developed areas such as the eastern United States and Europe, especially in their summer. Besides the human factor, high temperature and geography, such as bordering the ocean, also contribute to high metal content in the air. These unavoidable natural factors explain the stable distribution.
Pollution and health
Asthma
https://www.epa.gov/clean-air-act-overview/air-pollution-current-and-future-challenges
Research has shown that air pollution worsens asthma symptoms. Short-term exposure to SO2 can harm the human respiratory system and make breathing difficult. Children, the elderly, and those who suffer from asthma are particularly sensitive to the effects of SO2. Hence we want to see whether the distribution of asthma matches our SO2 measurements across the country. We have found that OH, PA, KY, and MD have had high SO2 concentrations since 2008, and we see a similar geographical distribution of higher asthma prevalence on the map. However, the asthma distribution also depends on many other factors. For example, Oregon has a high asthma patient population but has not shown any significant pollution trend. Through research, we found that this is related to the age distribution across Oregon. Therefore, although we can conclude that pollution contributes to the probability of having asthma, we cannot conclude that it is the only cause.
AsthmaData <- read.csv("Dataset/Asthma.csv")
library(choroplethr)
state_choropleth(AsthmaData, num_colors = 1, title = "US Adult Asthma Prevalence in 2015", legend = "% of all adults")
Cardiology
Scientific research has shown that NO3 and particles have a significant influence on cardiovascular disease. The heart disease map shows that the Southeast, part of the Midwest, and part of the Northeast have the highest heart attack death rates. Our geographic patterns show that FL, TX, and TN have the highest particle concentrations, and IL, NJ, and IN have the highest NO3 pollution values. However, because many other factors can contribute to cardiovascular disease, we can only roughly conclude that NO3 and particles contribute to it; we cannot rule out other potential causes of heart disease.
Heart<- data.frame(
read_csv('Dataset/Heart_Disease_2016.csv'))
state_choropleth(Heart, title="Heart Disease Death Rate by state for 2016", legend="Heart Disease Death Rate")+ scale_fill_brewer(palette=3)