Team members responsible for this notebook:

Daniel Zezula: performed regression analysis
Biying Li: generated map
Tiffany Wong: helped with analysis and wrote markdown
Tim Yau: helped with analysis and wrote markdown
Tianyi Wu: helped with analysis and wrote markdown

Data Analysis¶

In [1]:

%load_ext rmagic

Subsetting the Data¶

Before we perform the data analysis and visualization, we must subset the data into 6 different categories, one for each combination of job industry (Hightech and Manufacturing, or Hightech and Retail) and year (1980, 1990, and 2000). This is so we may later run separate regressions on each specific combination of industry and year. We also create a new column that contains log values of the employment numbers so we may run log regressions if linear regression is inadequate.

In [2]:

%%R
FinalData <- readRDS(paste(getwd(),'/../','data/cleaned/FinalData.rda',sep=''))

Hightech1980 <- subset(FinalData, year == 1980 & industry == "Hightech")
Hightech1980["log_jobs"] <- NA
Hightech1980[,5] <- log(Hightech1980[,4])

Manufacturing1980 <- subset(FinalData, year == 1980 & industry == "Manufacturing")
Manufacturing1980["log_jobs"] <- NA
Manufacturing1980[,5] <- log(Manufacturing1980[,4])

Retail1980 <- subset(FinalData, year == 1980 & industry == "Retail")
Retail1980["log_jobs"] <- NA
Retail1980[,5] <- log(Retail1980[,4])

Hightech1990 <- subset(FinalData, year == 1990 & industry == "Hightech")
Hightech1990["log_jobs"] <- NA
Hightech1990[,5] <- log(Hightech1990[,4])

Manufacturing1990 <- subset(FinalData, year == 1990 & industry == "Manufacturing")
Manufacturing1990["log_jobs"] <- NA
Manufacturing1990[,5] <- log(Manufacturing1990[,4])

Retail1990 <- subset(FinalData, year == 1990 & industry == "Retail")
Retail1990["log_jobs"] <- NA
Retail1990[,5] <- log(Retail1990[,4])

Hightech2000 <- subset(FinalData, year == 2000 & industry == "Hightech")
Hightech2000["log_jobs"] <- NA
Hightech2000[,5] <- log(Hightech2000[,4])

Manufacturing2000 <- subset(FinalData, year == 2000 & industry == "Manufacturing")
Manufacturing2000["log_jobs"] <- NA
Manufacturing2000[,5] <- log(Manufacturing2000[,4])

Retail2000 <- subset(FinalData, year == 2000 & industry == "Retail")
Retail2000["log_jobs"] <- NA
Retail2000[,5] <- log(Retail2000[,4])

Then we run linear regressions on the 6 different subsets using lm. The result gives the slope/coefficient and the intercept of each linear regression.

In [3]:

%%R
RegM1980 <- lm(Hightech1980[,4] ~ Manufacturing1980[,4])
RegR1980 <- lm(Hightech1980[,4] ~ Retail1980[,4])
RegM1990 <- lm(Hightech1990[,4] ~ Manufacturing1990[,4])
RegR1990 <- lm(Hightech1990[,4] ~ Retail1990[,4])
RegM2000 <- lm(Hightech2000[,4] ~ Manufacturing2000[,4])
RegR2000 <- lm(Hightech2000[,4] ~ Retail2000[,4])

We use the default plot function to create scatter plots for each subset of the data, as well as run abline to create linear regression lines that run through the data points in the scatter plots. The slope and intercept of each regression line was given using the lm function. However, after making the scatter plots, we realize that outliers exist which create a considerable amount of spread in the data.

In [4]:

%%R
plot(Hightech1980[,4], Manufacturing1980[,4], xlab = "value of Hightech Employment", ylab = "value of Manufacturing Employment", main = "Regression of Manufacturing on Hightech Employment in 1980")
abline(RegM1980)

Therefore, we consider running a log regression in place of the linear regression in order to eliminate outliers and thus minimize the spread of the data. This is done by running lm on the log-transformed values of the employment numbers.

In [5]:

%%R
LogRegM1980 <- lm(Hightech1980[,5] ~ Manufacturing1980[,5])
LogRegR1980 <- lm(Hightech1980[,5] ~ Retail1980[,5])

LogRegM1990 <- lm(Hightech1990[,5] ~ Manufacturing1990[,5])
LogRegR1990 <- lm(Hightech1990[,5] ~ Retail1990[,5])

LogRegM2000 <- lm(Hightech2000[,5] ~ Manufacturing2000[,5])
LogRegR2000 <- lm(Hightech2000[,5] ~ Retail2000[,5])

We then run the scatter plot function on the log-transformed data which creates new scatter plots that contain less spread. We also run abline again to create log regression lines to accurately represent the log regression, the values of which were given using lm.

In [6]:

%%R
plot(Hightech1980[,5], Manufacturing1980[,5], xlab = "log value of Hightech Employment", ylab = "log value of Manufacturing Employment", main = "Regression of Manufacturing on Hightech Employment in 1980")
abline(LogRegM1980)

plot(Hightech1990[,5], Manufacturing1990[,5], xlab = "log value of Hightech Employment", ylab = "log value of Manufacturing Employment", main = "Regression of Manufacturing on Hightech Employment in 1990")
abline(LogRegM1990)

plot(Hightech2000[,5], Manufacturing2000[,5], xlab = "log value of Hightech Employment", ylab = "log value of Manufacturing Employment", main = "Regression of Manufacturing on Hightech Employment in 2000")
abline(LogRegM2000)

plot(Hightech1980[,5], Retail1980[,5], xlab = "log value of Hightech Employment", ylab = "log value of Retail Employment", main = "Regression of Retail on Hightech Employment in 1980")
abline(LogRegR1980)

plot(Hightech1990[,5], Retail1990[,5], xlab = "log value of Hightech Employment", ylab = "log value of Retail Employment", main = "Regression of Retail on Hightech Employment in 1990")
abline(LogRegR1990)

plot(Hightech2000[,5], Retail2000[,5], xlab = "log value of Hightech Employment", ylab = "log value of Retail Employment", main = "Regression of Retail on Hightech Employment in 2000")
abline(LogRegR2000)

The following blocks of code give the respective statistical summaries of each log regression. After performing the log regressions, we observed that most of the R-square values are consistently above 0.5, which means that most of the regressions are strongly log-linear. This implies that there is a strong correlation between high-tech employment and employment in the dependent industries. Because the R-square values are fairly close to 1, this indicates that the fitted regression line is a close approximation to the actual values.

In [7]:

%%R
print(summary(LogRegM1980))


Call:
lm(formula = Hightech1980[, 5] ~ Manufacturing1980[, 5])

Residuals:
    Min      1Q  Median      3Q     Max 
-2.5169 -0.6736 -0.0113  0.6820  3.3689 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)             2.08568    0.41165   5.067 7.82e-07 ***
Manufacturing1980[, 5]  0.80422    0.05071  15.860  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9802 on 253 degrees of freedom
Multiple R-squared:  0.4985,	Adjusted R-squared:  0.4966 
F-statistic: 251.5 on 1 and 253 DF,  p-value: < 2.2e-16

In [8]:

%%R
print(summary(LogRegM1990))


Call:
lm(formula = Hightech1990[, 5] ~ Manufacturing1990[, 5])

Residuals:
     Min       1Q   Median       3Q      Max 
-2.16001 -0.61592  0.06086  0.60812  2.32615 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)             2.33828    0.36225   6.455 5.69e-10 ***
Manufacturing1990[, 5]  0.79718    0.04529  17.602  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8654 on 247 degrees of freedom
Multiple R-squared:  0.5564,	Adjusted R-squared:  0.5546 
F-statistic: 309.8 on 1 and 247 DF,  p-value: < 2.2e-16

In [9]:

%%R
print(summary(LogRegM2000))


Call:
lm(formula = Hightech2000[, 5] ~ Manufacturing2000[, 5])

Residuals:
     Min       1Q   Median       3Q      Max 
-2.00856 -0.51334  0.06724  0.54610  2.80840 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)             2.21772    0.31625   7.013 1.74e-11 ***
Manufacturing2000[, 5]  0.84628    0.03941  21.471  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7927 on 281 degrees of freedom
Multiple R-squared:  0.6213,	Adjusted R-squared:   0.62 
F-statistic:   461 on 1 and 281 DF,  p-value: < 2.2e-16

In [10]:

%%R
print(summary(LogRegR1980))


Call:
lm(formula = Hightech1980[, 5] ~ Retail1980[, 5])

Residuals:
     Min       1Q   Median       3Q      Max 
-1.71089 -0.58822 -0.07271  0.48806  2.25470 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -2.14847    0.43972  -4.886 1.83e-06 ***
Retail1980[, 5]  1.14459    0.04681  24.452  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7547 on 253 degrees of freedom
Multiple R-squared:  0.7027,	Adjusted R-squared:  0.7015 
F-statistic: 597.9 on 1 and 253 DF,  p-value: < 2.2e-16

In [11]:

%%R
print(summary(LogRegR1990))


Call:
lm(formula = Hightech1990[, 5] ~ Retail1990[, 5])

Residuals:
     Min       1Q   Median       3Q      Max 
-1.86898 -0.45283 -0.07724  0.38374  1.84674 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -2.08759    0.35769  -5.836 1.67e-08 ***
Retail1990[, 5]  1.12632    0.03734  30.166  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6003 on 247 degrees of freedom
Multiple R-squared:  0.7865,	Adjusted R-squared:  0.7857 
F-statistic:   910 on 1 and 247 DF,  p-value: < 2.2e-16

In [12]:

%%R
print(summary(LogRegR2000))


Call:
lm(formula = Hightech2000[, 5] ~ Retail2000[, 5])

Residuals:
     Min       1Q   Median       3Q      Max 
-1.47680 -0.37080 -0.03449  0.35968  2.17567 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -2.06591    0.33103  -6.241  1.6e-09 ***
Retail2000[, 5]  1.13998    0.03413  33.404  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5777 on 281 degrees of freedom
Multiple R-squared:  0.7988,	Adjusted R-squared:  0.7981 
F-statistic:  1116 on 1 and 281 DF,  p-value: < 2.2e-16

After obtaining the regression summaries and the scatter plots, we tried to produce another form of visualization to depict the regressions in a more visually appealing way. Therefore, we decided to create a color-coded map of the US that displays the regression coefficients for each state. To start off, we first had to reorganize the "place" labels of original data so that the labels would read only the state, instead of both the city and state. Afterwards we created code that loops through the list of the states and returned all the regression coefficients for the states and saved them into a vector. We ran the regression twice for manufacturing and retail and saved the results as the variables coeff and coeff1.

In [17]:

%%R
dataclean = readRDS(paste(getwd(),'/../','data/cleaned/clean2.rda',sep=''))

data <- aggregate(cbind(dataclean[,4]) ~ dataclean[,1] + dataclean[,2] + dataclean[,3], FUN = sum)

colnames(data) <- c("year", "place", "industry", "jobs")

states = c("alabama", "alaska", "arizona", "arkansas", "california", "colorado", "connecticut", "delaware", "florida", "georgia", "hawaii", "idaho", "illinois", "indiana", "iowa", "kansas", "kentucky", "louisiana", "maine", "maryland", "massachusetts", "michigan", "minnesota", "missouri", "mississippi", "montana", "nebraska", "nevada", "new hampshire", "new jersey", "new mexico", "new york", "north carolina", "north dakota", "ohio", "oklahoma", "oregon", "pennsylvania", "rhode island", "south carolina", "south dakota", "tennessee", "texas", "utah", "vermont", "virginia", "washington", "west virginia", "wisconsin", "wyoming")
abrev = c("AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA", "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY")

lev = levels(data$place)

for (i in 1:length(lev)) {
  tmp = as.character(lev[i])
  lev[i] <- substr(tmp, nchar(tmp)-1, nchar(tmp))
}

for (i in 1:length(lev)) {
  index = match(as.character(lev[i]), abrev)
  lev[i] <- states[index]
}

levels(data$place) <- lev
coeff <- vector()
coeff1 <- vector()

for (i in 1:length(states)) {
  tmpd = data[as.character(data$place) == states[i],]
  tmpdh = tmpd[as.character(tmpd$industry) == "Hightech",]
  tmpdm = tmpd[as.character(tmpd$industry) == "Manufacturing",]
  result = tryCatch({
    summ =  summary(lm(tmpdh[,4] ~ tmpdm[,4]))
    coef = summ$coefficients[2,1]
    if (is.na(coef)) {
      coef = 0
     }
    coeff[i] = coef
}, error = function(e){
     coeff[i] = 0
   })
}
coeff[is.na(coeff)] <- 0


for (i in 1:length(states)) {
  tmpd = data[as.character(data$place) == states[i],]
  tmpdh = tmpd[as.character(tmpd$industry) == "Hightech",]
  tmpdr = tmpd[as.character(tmpd$industry) == "Retail",]
  result = tryCatch({
    summ =  summary(lm(tmpdh[,4] ~ tmpdr[,4]))
    coef1 = summ$coefficients[2,1]
    if (is.na(coef1)) {
      coef1 = 0
     }
    coeff1[i] = coef1
}, error = function(e){
     coeff1[i] = 0
   })
}
coeff1[is.na(coeff)] <- 0

In [18]:

%%R
print(coeff)

 [1]   0.7115698 -33.3277946   5.2028555   0.6179968   1.4091236   3.7505380
 [7]   4.0481171   0.0000000   0.8501121   3.5056637  -1.3427864   4.9685746
[13]   2.7673488   1.6425278   1.0953302   3.6375939  -0.3966061   2.2442435
[19]  -2.3782027   2.7932623   0.5867178   5.5326189   3.6334368   1.1383603
[25]   0.5982388  -0.9978697  -1.5984237   3.2264123   4.4290987   1.1664715
[31]   4.7783107   4.1818984   0.4620826   0.0000000   3.4752140   1.4214865
[37]   2.6611734   1.0382671   1.0254160   2.0751625  -5.2142857   0.5478073
[43]   3.8172396   3.6889752   0.0000000   5.6703849   3.3149219   0.0000000
[49]   2.0943237

In [19]:

%%R
print(coeff1)

 [1] 0.4185076 0.1175529 0.8385614 0.4382308 0.6172307 0.6180283 0.6458837
 [8]        NA 0.3764491 0.6714142 0.5649119 1.2376178 0.7354974 0.6844986
[15] 0.3980425 0.7260706 0.6534292 0.2252501 0.5521183 0.6087663 0.4467317
[22] 1.8388163 0.7190305 0.3670451 0.4633465 0.3972635 0.6216581 0.1872408
[29] 0.9429693 0.4899972 0.5546215 0.8722400 0.7799667        NA 0.7778213
[36] 0.5466315 0.6902064 0.4982729 0.5524925 0.8028175 0.7849462 0.6408214
[43] 0.7516520 0.5996115        NA 0.7877284 0.7626581        NA 0.9892499

The following code installs the packages required to create the map for visualization purposes.

In [20]:

%%R
install.packages("maps")
install.packages("ggplot2")
install.packages("data.table")

require(maps)
require(ggplot2)
require(data.table)

Installing package into ‘/home/oski/R/i686-pc-linux-gnu-library/3.0’
(as ‘lib’ is unspecified)
--- Please select a CRAN mirror for use in this session ---
trying URL 'http://cran.cnr.Berkeley.edu/src/contrib/maps_2.3-6.tar.gz'
Content type 'application/x-gzip' length 1430982 bytes (1.4 Mb)
opened URL
==================================================
downloaded 1.4 Mb


The downloaded source packages are in
	‘/tmp/RtmpdJoceA/downloaded_packages’
Installing package into ‘/home/oski/R/i686-pc-linux-gnu-library/3.0’
(as ‘lib’ is unspecified)
trying URL 'http://cran.cnr.Berkeley.edu/src/contrib/ggplot2_0.9.3.1.tar.gz'
Content type 'application/x-gzip' length 2330942 bytes (2.2 Mb)
opened URL
==================================================
downloaded 2.2 Mb


The downloaded source packages are in
	‘/tmp/RtmpdJoceA/downloaded_packages’
Installing package into ‘/home/oski/R/i686-pc-linux-gnu-library/3.0’
(as ‘lib’ is unspecified)
trying URL 'http://cran.cnr.Berkeley.edu/src/contrib/data.table_1.9.2.tar.gz'
Content type 'application/x-gzip' length 1021871 bytes (997 Kb)
opened URL
==================================================
downloaded 997 Kb


The downloaded source packages are in
	‘/tmp/RtmpdJoceA/downloaded_packages’
Loading required package: maps
Loading required package: ggplot2
Loading required package: data.table
data.table 1.9.2  For help type: help("data.table")

We created a data frame with state names and regression coefficients for each state. Then we used the ggplot function to display the map and assign the values of the regression coefficients to each state on the plot, using color-coding.

In [1]:

%%R
us.state.map <- map_data('state')
head(us.state.map)

states <- levels(as.factor(us.state.map$region))

mapdata = data.frame(states, coeff)

df <- data.frame(region = states, value = mapdata[,2], stringsAsFactors = FALSE)
print(head(df))

map.data <- merge(us.state.map, df, by="region", all = T)
map.data <- map.data[order(map.data$order),]

print(ggplot(map.data, aes(x=long, y =lat, group=group, fill = value)) + geom_polygon(colour = "white"))

ERROR: Cell magic `%%R` not found.

We repeated the step above for the retail industry so that we may generate a new map that displays the regression coefficients for each state based on retail data.

In [22]:

%%R
mapdata1 <- data.frame(states, coeff1)

df1 <- data.frame(region = states, value1 = mapdata1[,2], stringsAsFactors = FALSE)
print(head(df1))

map.data <- merge(us.state.map, df1, by="region", all = T)
map.data <- map.data[order(map.data$order),]

print(ggplot(map.data, aes(x=long, y =lat, group=group, fill = value1)) + geom_polygon(colour = "white"))

       region    value1
1     alabama 0.4185076
2     arizona 0.1175529
3    arkansas 0.8385614
4  california 0.4382308
5    colorado 0.6172307
6 connecticut 0.6180283

Besides the analysis above, we also used MATLAB to generate graphs showing percentage changes in employment of each city. As we have data for three industries in three years, we have 6 plot graphs representing the trends of the changes. The MATLAB code and graphs are shown as screenshots.

In [1]:

from IPython.core.display import Image
Image(filename=('../graphs/matlab_plots.jpg'))

Out[1]:

In [10]:

from IPython.core.display import Image
Image(filename=('../graphs/8090H.jpg'))

Out[10]:

In [11]:

Image(filename=('../graphs/8090M.jpg'))

Out[11]:

In [12]:

Image(filename=('../graphs/8090R.jpg'))

Out[12]:

In [13]:

Image(filename=('../graphs/9000H.jpg'))

Out[13]:

In [14]:

Image(filename=('../graphs/9000M.jpg'))

Out[14]:

In [2]:

Image(filename=('../graphs/9000R.jpg'))

Out[2]:

In []: