Team members responsible for this notebook:
Daniel Zezula: performed regression analysis
Biying Li: generated map
Tiffany Wong: helped with analysis and wrote markdown
Tim Yau: helped with analysis and wrote markdown
Tianyi Wu: helped with analysis and wrote markdown
Data Analysis¶
%load_ext rmagic
Subsetting the Data¶
Before we perform the data analysis and visualization, we must subset the data into 6 different categories, one for each combination of job industry (Hightech and Manufacturing, or Hightech and Retail) and year (1980, 1990, and 2000). This is so we may later run separate regressions on each specific combination of industry and year. We also create a new column that contains log values of the employment numbers so we may run log regressions if linear regression is inadequate.
%%R
FinalData <- readRDS(paste(getwd(),'/../','data/cleaned/FinalData.rda',sep=''))
Hightech1980 <- subset(FinalData, year == 1980 & industry == "Hightech")
Hightech1980["log_jobs"] <- NA
Hightech1980[,5] <- log(Hightech1980[,4])
Manufacturing1980 <- subset(FinalData, year == 1980 & industry == "Manufacturing")
Manufacturing1980["log_jobs"] <- NA
Manufacturing1980[,5] <- log(Manufacturing1980[,4])
Retail1980 <- subset(FinalData, year == 1980 & industry == "Retail")
Retail1980["log_jobs"] <- NA
Retail1980[,5] <- log(Retail1980[,4])
Hightech1990 <- subset(FinalData, year == 1990 & industry == "Hightech")
Hightech1990["log_jobs"] <- NA
Hightech1990[,5] <- log(Hightech1990[,4])
Manufacturing1990 <- subset(FinalData, year == 1990 & industry == "Manufacturing")
Manufacturing1990["log_jobs"] <- NA
Manufacturing1990[,5] <- log(Manufacturing1990[,4])
Retail1990 <- subset(FinalData, year == 1990 & industry == "Retail")
Retail1990["log_jobs"] <- NA
Retail1990[,5] <- log(Retail1990[,4])
Hightech2000 <- subset(FinalData, year == 2000 & industry == "Hightech")
Hightech2000["log_jobs"] <- NA
Hightech2000[,5] <- log(Hightech2000[,4])
Manufacturing2000 <- subset(FinalData, year == 2000 & industry == "Manufacturing")
Manufacturing2000["log_jobs"] <- NA
Manufacturing2000[,5] <- log(Manufacturing2000[,4])
Retail2000 <- subset(FinalData, year == 2000 & industry == "Retail")
Retail2000["log_jobs"] <- NA
Retail2000[,5] <- log(Retail2000[,4])
Then we run linear regressions on the 6 different subsets using lm. The result gives the slope/coefficient and the intercept of each linear regression.
%%R
RegM1980 <- lm(Hightech1980[,4] ~ Manufacturing1980[,4])
RegR1980 <- lm(Hightech1980[,4] ~ Retail1980[,4])
RegM1990 <- lm(Hightech1990[,4] ~ Manufacturing1990[,4])
RegR1990 <- lm(Hightech1990[,4] ~ Retail1990[,4])
RegM2000 <- lm(Hightech2000[,4] ~ Manufacturing2000[,4])
RegR2000 <- lm(Hightech2000[,4] ~ Retail2000[,4])
We use the default plot function to create scatter plots for each subset of the data, as well as run abline to create linear regression lines that run through the data points in the scatter plots. The slope and intercept of each regression line was given using the lm function. However, after making the scatter plots, we realize that outliers exist which create a considerable amount of spread in the data.
%%R
plot(Hightech1980[,4], Manufacturing1980[,4], xlab = "value of Hightech Employment", ylab = "value of Manufacturing Employment", main = "Regression of Manufacturing on Hightech Employment in 1980")
abline(RegM1980)
Therefore, we consider running a log regression in place of the linear regression in order to eliminate outliers and thus minimize the spread of the data. This is done by running lm on the log-transformed values of the employment numbers.
%%R
LogRegM1980 <- lm(Hightech1980[,5] ~ Manufacturing1980[,5])
LogRegR1980 <- lm(Hightech1980[,5] ~ Retail1980[,5])
LogRegM1990 <- lm(Hightech1990[,5] ~ Manufacturing1990[,5])
LogRegR1990 <- lm(Hightech1990[,5] ~ Retail1990[,5])
LogRegM2000 <- lm(Hightech2000[,5] ~ Manufacturing2000[,5])
LogRegR2000 <- lm(Hightech2000[,5] ~ Retail2000[,5])
We then run the scatter plot function on the log-transformed data which creates new scatter plots that contain less spread. We also run abline again to create log regression lines to accurately represent the log regression, the values of which were given using lm.
%%R
plot(Hightech1980[,5], Manufacturing1980[,5], xlab = "log value of Hightech Employment", ylab = "log value of Manufacturing Employment", main = "Regression of Manufacturing on Hightech Employment in 1980")
abline(LogRegM1980)
plot(Hightech1990[,5], Manufacturing1990[,5], xlab = "log value of Hightech Employment", ylab = "log value of Manufacturing Employment", main = "Regression of Manufacturing on Hightech Employment in 1990")
abline(LogRegM1990)
plot(Hightech2000[,5], Manufacturing2000[,5], xlab = "log value of Hightech Employment", ylab = "log value of Manufacturing Employment", main = "Regression of Manufacturing on Hightech Employment in 2000")
abline(LogRegM2000)
plot(Hightech1980[,5], Retail1980[,5], xlab = "log value of Hightech Employment", ylab = "log value of Retail Employment", main = "Regression of Retail on Hightech Employment in 1980")
abline(LogRegR1980)
plot(Hightech1990[,5], Retail1990[,5], xlab = "log value of Hightech Employment", ylab = "log value of Retail Employment", main = "Regression of Retail on Hightech Employment in 1990")
abline(LogRegR1990)
plot(Hightech2000[,5], Retail2000[,5], xlab = "log value of Hightech Employment", ylab = "log value of Retail Employment", main = "Regression of Retail on Hightech Employment in 2000")
abline(LogRegR2000)
The following blocks of code give the respective statistical summaries of each log regression. After performing the log regressions, we observed that most of the R-square values are consistently above 0.5, which means that most of the regressions are strongly log-linear. This implies that there is a strong correlation between high-tech employment and employment in the dependent industries. Because the R-square values are fairly close to 1, this indicates that the fitted regression line is a close approximation to the actual values.
%%R
print(summary(LogRegM1980))
%%R
print(summary(LogRegM1990))
%%R
print(summary(LogRegM2000))
%%R
print(summary(LogRegR1980))
%%R
print(summary(LogRegR1990))
%%R
print(summary(LogRegR2000))
After obtaining the regression summaries and the scatter plots, we tried to produce another form of visualization to depict the regressions in a more visually appealing way. Therefore, we decided to create a color-coded map of the US that displays the regression coefficients for each state. To start off, we first had to reorganize the "place" labels of original data so that the labels would read only the state, instead of both the city and state. Afterwards we created code that loops through the list of the states and returned all the regression coefficients for the states and saved them into a vector. We ran the regression twice for manufacturing and retail and saved the results as the variables coeff and coeff1.
%%R
dataclean = readRDS(paste(getwd(),'/../','data/cleaned/clean2.rda',sep=''))
data <- aggregate(cbind(dataclean[,4]) ~ dataclean[,1] + dataclean[,2] + dataclean[,3], FUN = sum)
colnames(data) <- c("year", "place", "industry", "jobs")
states = c("alabama", "alaska", "arizona", "arkansas", "california", "colorado", "connecticut", "delaware", "florida", "georgia", "hawaii", "idaho", "illinois", "indiana", "iowa", "kansas", "kentucky", "louisiana", "maine", "maryland", "massachusetts", "michigan", "minnesota", "missouri", "mississippi", "montana", "nebraska", "nevada", "new hampshire", "new jersey", "new mexico", "new york", "north carolina", "north dakota", "ohio", "oklahoma", "oregon", "pennsylvania", "rhode island", "south carolina", "south dakota", "tennessee", "texas", "utah", "vermont", "virginia", "washington", "west virginia", "wisconsin", "wyoming")
abrev = c("AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA", "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY")
lev = levels(data$place)
for (i in 1:length(lev)) {
tmp = as.character(lev[i])
lev[i] <- substr(tmp, nchar(tmp)-1, nchar(tmp))
}
for (i in 1:length(lev)) {
index = match(as.character(lev[i]), abrev)
lev[i] <- states[index]
}
levels(data$place) <- lev
coeff <- vector()
coeff1 <- vector()
for (i in 1:length(states)) {
tmpd = data[as.character(data$place) == states[i],]
tmpdh = tmpd[as.character(tmpd$industry) == "Hightech",]
tmpdm = tmpd[as.character(tmpd$industry) == "Manufacturing",]
result = tryCatch({
summ = summary(lm(tmpdh[,4] ~ tmpdm[,4]))
coef = summ$coefficients[2,1]
if (is.na(coef)) {
coef = 0
}
coeff[i] = coef
}, error = function(e){
coeff[i] = 0
})
}
coeff[is.na(coeff)] <- 0
for (i in 1:length(states)) {
tmpd = data[as.character(data$place) == states[i],]
tmpdh = tmpd[as.character(tmpd$industry) == "Hightech",]
tmpdr = tmpd[as.character(tmpd$industry) == "Retail",]
result = tryCatch({
summ = summary(lm(tmpdh[,4] ~ tmpdr[,4]))
coef1 = summ$coefficients[2,1]
if (is.na(coef1)) {
coef1 = 0
}
coeff1[i] = coef1
}, error = function(e){
coeff1[i] = 0
})
}
coeff1[is.na(coeff)] <- 0
%%R
print(coeff)
%%R
print(coeff1)
The following code installs the packages required to create the map for visualization purposes.
%%R
install.packages("maps")
install.packages("ggplot2")
install.packages("data.table")
require(maps)
require(ggplot2)
require(data.table)
We created a data frame with state names and regression coefficients for each state. Then we used the ggplot function to display the map and assign the values of the regression coefficients to each state on the plot, using color-coding.
%%R
us.state.map <- map_data('state')
head(us.state.map)
states <- levels(as.factor(us.state.map$region))
mapdata = data.frame(states, coeff)
df <- data.frame(region = states, value = mapdata[,2], stringsAsFactors = FALSE)
print(head(df))
map.data <- merge(us.state.map, df, by="region", all = T)
map.data <- map.data[order(map.data$order),]
print(ggplot(map.data, aes(x=long, y =lat, group=group, fill = value)) + geom_polygon(colour = "white"))
We repeated the step above for the retail industry so that we may generate a new map that displays the regression coefficients for each state based on retail data.
%%R
mapdata1 <- data.frame(states, coeff1)
df1 <- data.frame(region = states, value1 = mapdata1[,2], stringsAsFactors = FALSE)
print(head(df1))
map.data <- merge(us.state.map, df1, by="region", all = T)
map.data <- map.data[order(map.data$order),]
print(ggplot(map.data, aes(x=long, y =lat, group=group, fill = value1)) + geom_polygon(colour = "white"))
from IPython.core.display import Image
Image(filename=('../graphs/matlab_plots.jpg'))
from IPython.core.display import Image
Image(filename=('../graphs/8090H.jpg'))
Image(filename=('../graphs/8090M.jpg'))
Image(filename=('../graphs/8090R.jpg'))
Image(filename=('../graphs/9000H.jpg'))
Image(filename=('../graphs/9000M.jpg'))
Image(filename=('../graphs/9000R.jpg'))