Team members responsible for this notebook:

List the team members contributing to this notebook, along with their responsabilities:

Daniel Zezula: * cleaned data*
Biying Li: ** cleaned data **
Tiffany Wong: * helped with cleaning and wrote markdown*
Tim Yau: ** helped with cleaning wrote markdown **
Tianyi Wu: helped with cleaning and wrote markdown

In [4]:

%load_ext rmagic

The rmagic extension is already loaded. To reload it, use:
  %reload_ext rmagic

Project¶

After trimming the data in Stata, the set of data was small enough to open in R.

The number of people working in each industry has been separated into four different levels of education, but since education is not a relevant factor for our project, the first step we took in cleaning the data is to aggregrate the levels of education to produce a total number of people working in each industry.

To do this, we took the following steps:

We first deleted the column with the education levels by only creating a subset of the data that did not include the education column
After deleting that column, we were left with several rows that had the same information, so using the aggregate function, we combined those to give the total number of people employed in each industry

We first decided on "manufacturing" and "retail" as the industries that we wanted to analyze the tech spillover in. To do this, we went through the data and classified each job as either "manufacturing", "retail", "hightech", or irrelevant to our research ("bad data"). We created 4 level attributes: “Manufacturing”, “Retail”, “Hightech”, and “bad data”. We assigned the variables in “manu” with level “Manufacturing”, variables in “retail” with level “Retail”, variables in “hightech” with level “Hightech” and else with “bad data”.

Afterwards, we dropped the values with level = “bad data", and the data left is what we will be doing the regression analysis with.

In [5]:

%%R
library(foreign)
IndData = read.dta(paste(getwd(),'/../','data/raw/industrydata.dta',sep=''))
IndData <- subset(IndData, select=c("year", "msa", "ind1990", "jobs"))
edudata <- aggregate(cbind(IndData$jobs) ~ IndData$year + IndData$msa + IndData$ind1990, FUN = sum)
colnames(edudata) <- c("year", "msa", "ind1990", "jobs")
      
print(head(edudata))

manu = c("Meat products","Food industries, n.s.","Apparel and accessories, except knit","Pulp, paper, and paperboard mills","Soaps and cosmetics","Miscellaneous plastics products","Footwear, except rubber and plastic","Furniture and fixtures","Iron and steel foundries")
retail = c("Department stores","Food stores, n.e.c.","Apparel and accessory stores, except shoe","Shoe stores","Furniture and home furnishings stores","Eating and drinking places","Book and stationery stores","Jewelry stores")
hightech = c("Computers and related equipment","Machinery, except electrical, n.e.c.","Radio, TV, and communication equipment","Electrical machinery, equipment, and supplies, n.e.c.","Aircraft and parts","Computer and data processing services","Computer and data processing services","Engineering, architectural, and surveying services","Machinery, n.s.","Motor vehicles and motor vehicle equipment")

lev = levels(edudata$ind1990)

for (i in 1:length(lev)) {
  if (lev[i] %in% manu) {
    lev[i] <- "Manufacturing"
  }
  else if (lev[i] %in% retail) {
    lev[i] <- "Retail"
  }
  else if (lev[i] %in% hightech) {
    lev[i] <- "Hightech"
  }
  else {
    lev[i] <- "bad data"
  }
}
levels(edudata$ind1990) <- lev

dataclean = edudata[as.character(edudata$ind1990) != "bad data",]
data <- aggregate(cbind(dataclean$jobs) ~ dataclean$year + dataclean$msa + dataclean$ind1990, FUN = sum)
colnames(data) <- c("year", "msa", "ind1990", "jobs")

FinalData <- aggregate(cbind(data[,4]) ~ data[,1] + data[,2] + data[,3], FUN = sum)

colnames(FinalData) <- c("year", "place", "industry", "jobs")

saveRDS(FinalData, file=paste(getwd(), '/../', 'data/cleaned/FinalData.rda', sep=""))
print(head(FinalData))

  year         msa       ind1990 jobs
1 1980 Abilene, TX Meat products  220
2 1990 Abilene, TX Meat products  322
3 2000 Abilene, TX Meat products   48
4 1980   Akron, OH Meat products  140
5 1990   Akron, OH Meat products   65
6 2000   Akron, OH Meat products   84
  year       place      industry  jobs
1 1980 Abilene, TX Manufacturing   440
2 1990 Abilene, TX Manufacturing   740
3 2000 Abilene, TX Manufacturing   214
4 1980   Akron, OH Manufacturing 11560
5 1990   Akron, OH Manufacturing  7673
6 2000   Akron, OH Manufacturing 10566

The first output above is a sample of our data with the aggregated education levels. It shows the number of people employed for each job in a certain state for the years 1980, 1990, and 2000.

The second output shows a sample of our data after we've dropped all the irrelevant jobs and classified all the other jobs as either "manufacturing", "retail", or "hightech". It shows the total number of people employed in the manufacturing, retail, and hightech industries in a certain state in the years 1980, 1990, and 2000.