Team members responsible for this notebook:

Biying Li: **gathered and extracted data **
Tianyi Wu: ** gathered and extracted data **
Tiffany Wong: wrote markdown
Tim Yau: ** wrote markdown **

Data Gathering¶

In [2]:

%load_ext rmagic

Creating the Directory Structure¶

The following code creates the various directories that will store the data files used for analysis in this project:

In [1]:

%%bash
mkdir ../script ../data ../data/raw ../data/cleaned ../data/simulated ../visualizations
ls -r

README.md
NB4_project_report.ipynb
NB3_data_analysis.ipynb
NB2_data_cleaning.ipynb
NB1_data_gathering.ipynb

mkdir: cannot create directory `../script': File exists
mkdir: cannot create directory `../data': File exists
mkdir: cannot create directory `../data/raw': File exists
mkdir: cannot create directory `../data/cleaned': File exists
mkdir: cannot create directory `../data/simulated': File exists
mkdir: cannot create directory `../visualizations': File exists

Downloading and Displaying the Raw Data¶

The raw data was extracted from the Integrated Public Use Microdata Series (IPUMS) website which was referred to us by Professor Enrico Moretti. The URL from which we acquired the data is the following:

https://usa.ipums.org/usa/

The data was acquired in .dta format which is openable in Stata. However, the original, untrimmed data file contained approximately 600,000 entries and was too big to open in R (opening the file would crash R). Therefore, it was partially trimmed in Stata before we loaded it to R where it was further cleaned, the process of which will be discussed in the data cleaning notebook. The following image shows a sample of the original, uneditted data in Stata:

In [1]:

from IPython.display import Image
Image('../graphs/raw_data.png')

Out[1]:

To perform the initial trimming on the original data containing the massive ~600,000 entries, we ran Stata code that dropped row entries that did not contain an identifiable msa entry or an applicable ind1990 value, and then selected 25 specific industries to keep row entries for. After the trimming, we had approximately 62,000 row entries remaining, which we were able to export to R without R crashing. The following Stata code performed this initial trimming:

drop if msa == 0

(this drops all entries that have an unidentifiable msa value)

drop if ind1990 == 0

(this drops all entries that a non-applicable ind1990 value)

keep if ind1990 == 332 | ind1990 ==732 | ind1990 == 882 | ind1990 == 341 | ind1990 == 331 | ind1990 == 342 | ind1990 == 350 | ind1990 == 351 | ind1990 == 151 | ind1990 == 122 | ind1990 ==242 | ind1990 == 182 | ind1990 == 221 | ind1990 == 271 | ind1990 == 100 | ind1990 == 212 | ind1990 == 160 | ind1990 == 611 | ind1990 == 623 | ind1990 == 631 | ind1990 == 631 | ind1990 == 641 | ind1990 == 652 | ind1990 == 660 | ind1990 == 630 | ind1990 == 591

(this keeps all entries that belong to 25 different industries, all of which have a representative 3-digit code)

Now we import Pandas to create a data frame for the data.

In [2]:

import pandas as pd

We use Pandas to display a summary of the original raw data, list all existing column names and their corresponding number of entries.

In [3]:

pd.read_stata("../data/raw/data.dta")

Out[3]:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 539212 entries, 0 to 539211
Data columns (total 5 columns):
year       539212  non-null values
msa        539212  non-null values
ind1990    539212  non-null values
edu        539212  non-null values
jobs       539212  non-null values
dtypes: float32(1), float64(1), int64(1), object(2)

Loading Raw Data into Data Frames¶

Since the data is already in data frame format in the Stata file, we directly read the .dta file into the IPython Notebook and print a sample of the data in the form of a partial data frame.

In [7]:

%%R
library(foreign)
IndData <- read.dta("../data/raw/industrydata.dta")
print(head(IndData))

  year         msa                              ind1990 edu jobs
1 1980 Abilene, TX                        Meat products   1  140
2 1980 Abilene, TX                        Meat products   2   60
3 1980 Abilene, TX                        Meat products   3   20
4 1980 Abilene, TX Apparel and accessories, except knit   1  120
5 1980 Abilene, TX Apparel and accessories, except knit   2   20
6 1980 Abilene, TX Apparel and accessories, except knit   3    0

In this data frame, we list for each row entry the year (3 possible values: 1980, 1990, and 2000), msa which represents the city and state, ind1990 which represents the type of the industry (it is 1990 because that was when the industry codes/names were established, the industry codes were arbitrarily assigned), the education level (1-4), and the number of jobs meeting all the conditions in that row entry. Again, the displayed data frame shows only the first 6 entries out of approximately 62,000 entries in the entire trimmed data file.