Team members responsible for this notebook:
List the team members contributing to this notebook, along with their responsibilities:
Daniel Zezula: cleaned data
Biying Li: cleaned data
Tiffany Wong: helped with cleaning and wrote markdown
Tim Yau: helped with cleaning and wrote markdown
Tianyi Wu: helped with cleaning and wrote markdown
%load_ext rmagic
Project
After trimming the data in Stata, the dataset was small enough to open in R.
The data reports the number of people working in each industry separately for four levels of education. Since education is not a relevant factor for our project, our first cleaning step was to aggregate across the education levels to produce a total number of people working in each industry.
To do this, we took the following steps:
We dropped the education column by taking a subset of the data that keeps only the year, msa, ind1990, and jobs columns.
Dropping that column left several rows with the same year, msa, and industry, so we used the aggregate function to sum them into the total number of people employed in each industry.
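As a minimal sketch of these two steps on made-up numbers (the toy data frame, its values, and the education column name "educ" are assumptions for illustration, not the real data):

```r
# Toy data: the column name `educ` and all values are made up for illustration
toy <- data.frame(
  year    = c(1980, 1980, 1980, 1980),
  msa     = c("A", "A", "A", "A"),
  ind1990 = c("Shoe stores", "Shoe stores", "Meat products", "Meat products"),
  educ    = c("low", "high", "low", "high"),
  jobs    = c(10, 5, 20, 7)
)

# Step 1: keep every column except the education level
toy <- subset(toy, select = c("year", "msa", "ind1990", "jobs"))

# Step 2: rows that now share year, msa, and ind1990 are summed into one
toy_total <- aggregate(jobs ~ year + msa + ind1990, data = toy, FUN = sum)
print(toy_total)
```

Each industry ends up with a single row per year and msa, whose jobs value is the sum over the education levels.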
We chose "manufacturing" and "retail" as the industries in which to analyze the tech spillover. We went through the data and classified each industry category as "manufacturing", "retail", "hightech", or irrelevant to our research ("bad data"). In the code, these become four factor levels: "Manufacturing", "Retail", "Hightech", and "bad data". Industries listed in "manu" get the level "Manufacturing", those in "retail" get "Retail", those in "hightech" get "Hightech", and everything else gets "bad data".
Afterwards, we dropped the rows with the level "bad data"; the remaining data is what we will use for the regression analysis.
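The relabeling can be illustrated on a small made-up factor. This is only a sketch: the short lookup vectors (with a "_toy" suffix), the "Hospitals" entry, and the job counts are assumptions for illustration, and it uses a vectorized ifelse where the notebook's own code uses a for loop.

```r
# Toy factor of industry names; the values and the short lookup vectors
# below are illustrative, not the full lists used in the project
ind  <- factor(c("Shoe stores", "Meat products", "Aircraft and parts", "Hospitals"))
jobs <- c(15, 27, 40, 99)

manu_toy     <- c("Meat products")
retail_toy   <- c("Shoe stores")
hightech_toy <- c("Aircraft and parts")

# Map each original level to one of the four group labels
lev <- levels(ind)
new_lev <- ifelse(lev %in% manu_toy, "Manufacturing",
           ifelse(lev %in% retail_toy, "Retail",
           ifelse(lev %in% hightech_toy, "Hightech", "bad data")))
levels(ind) <- new_lev

# Drop the rows labelled "bad data"
keep <- as.character(ind) != "bad data"
print(data.frame(industry = ind[keep], jobs = jobs[keep]))
```

Replacing the levels of a factor relabels every row at once, which is why only the vector of level names needs to be edited rather than each row of the data.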
%%R
library(foreign)
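# Read the Stata file produced by the trimming step
IndData <- read.dta(file.path(getwd(), "..", "data", "raw", "industrydata.dta"))
# Keep only the columns we need, which drops the education column
IndData <- subset(IndData, select = c("year", "msa", "ind1990", "jobs"))
# Sum jobs across education levels for each year, msa, and industry
edudata <- aggregate(jobs ~ year + msa + ind1990, data = IndData, FUN = sum)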
colnames(edudata) <- c("year", "msa", "ind1990", "jobs")
print(head(edudata))
manu = c("Meat products","Food industries, n.s.","Apparel and accessories, except knit","Pulp, paper, and paperboard mills","Soaps and cosmetics","Miscellaneous plastics products","Footwear, except rubber and plastic","Furniture and fixtures","Iron and steel foundries")
retail = c("Department stores","Food stores, n.e.c.","Apparel and accessory stores, except shoe","Shoe stores","Furniture and home furnishings stores","Eating and drinking places","Book and stationery stores","Jewelry stores")
hightech = c("Computers and related equipment","Machinery, except electrical, n.e.c.","Radio, TV, and communication equipment","Electrical machinery, equipment, and supplies, n.e.c.","Aircraft and parts","Computer and data processing services","Engineering, architectural, and surveying services","Machinery, n.s.","Motor vehicles and motor vehicle equipment")
# Relabel each original industry level with its group
lev <- levels(edudata$ind1990)
for (i in seq_along(lev)) {
  if (lev[i] %in% manu) {
    lev[i] <- "Manufacturing"
  } else if (lev[i] %in% retail) {
    lev[i] <- "Retail"
  } else if (lev[i] %in% hightech) {
    lev[i] <- "Hightech"
  } else {
    lev[i] <- "bad data"
  }
}
# Assigning repeated labels merges the matching levels into one
levels(edudata$ind1990) <- lev
# Drop the industries that are irrelevant to our research
dataclean <- edudata[as.character(edudata$ind1990) != "bad data", ]
# Sum jobs within each year, msa, and industry group
FinalData <- aggregate(jobs ~ year + msa + ind1990, data = dataclean, FUN = sum)
colnames(FinalData) <- c("year", "place", "industry", "jobs")
# Written with saveRDS(), so read it back with readRDS() despite the .rda extension
saveRDS(FinalData, file = file.path(getwd(), "..", "data", "cleaned", "FinalData.rda"))
print(head(FinalData))
The first output above is a sample of our data after aggregating over the education levels. It shows the number of people employed in each industry in a given area (the "msa" column) for the years 1980, 1990, and 2000.
The second output shows a sample of our data after we dropped all the irrelevant industries and classified the rest as "manufacturing", "retail", or "hightech". It shows the total number of people employed in the manufacturing, retail, and hightech industries in a given area (the "place" column) in the years 1980, 1990, and 2000.
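Because the cleaned data is written with saveRDS, a later notebook would need readRDS (rather than load) to read it back. A minimal round-trip sketch, using a temporary file and a made-up data frame instead of the project's actual path and data:

```r
# Made-up stand-in for FinalData; values are illustrative only
toy_final <- data.frame(
  year     = c(1980, 1990),
  place    = c("A", "A"),
  industry = c("Retail", "Retail"),
  jobs     = c(15, 18)
)

# Write to a temporary file, then read it back with readRDS
tmp <- tempfile(fileext = ".rda")
saveRDS(toy_final, file = tmp)
restored <- readRDS(tmp)

# Check that the restored object matches the original data frame
print(identical(restored, toy_final))
```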