Introduction

The economic well-being of individuals is reliant on their income, where income is defined as the money an individual (or household) receives on a regular basis. In the United States, the Census Bureau uses income (money received before expenses and deductions) to gauge the population’s range of poverty, wealth, and financial security (United States Census Bureau, 2016). There are a variety of factors that can influence one’s income, including socioeconomic drivers, education and vocation. This project examines some of the variables that are often related to income.

Description of Dataset

This project works with a dataset of adult incomes obtained from the University of California Irvine (UCI) Machine Learning Repository. The data was donated by Ronny Kohavi and Barry Becker (Silicon Graphics) and was originally extracted by Barry Becker from the 1994 Census database and used for machine learning predictions of whether a person makes over $50,000 per year based on personal factors.

This 1994 income census dataset consists of multivariate categorical and integer data that describe socioeconomic and personal classifiers of adults across the USA. Each instance (32,561) is an individual whose annual income was grouped as either above or below $50,000. Table 1 shows an overview of the 15 attributes (variables), including whether each is categorical or integer and a brief interpretation of the variable.



A couple of assumptions were made about these data based on information on the Census website. It is assumed that “capital gains” indicate non-cash financial benefits (e.g., food stamps, health benefits, subsidized housing or transportation, employer contributions to retirement programs, medical and educational expenses, etc.), and that “capital losses” include non-cash expenses (such as depreciated value of assets). We are also assuming that “education number” indicates the number of years allotted to education.

It is of note that these data are from 1994 census, and the income threshold of $50,000 held a different meaning for wealth than it holds today. As this dataset includes socioeconomic attributes, it’s worth noting that US-born white males comprise the majority of the data instances.

Exploratory Data Analysis

Load data

# read csv file
dat <- read.csv(here("data/downloaded_datafile"), header = F)

# rename columns
names(dat) <- c("age", "workclass", "fnlwgt", "education", "education-num", "martial_status", 
                "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss", 
                "hours-per-week", "native-country", "label")

Summary overview

summary(dat)
##       age                    workclass         fnlwgt       
##  Min.   :17.00    Private         :22696   Min.   :  12285  
##  1st Qu.:28.00    Self-emp-not-inc: 2541   1st Qu.: 117827  
##  Median :37.00    Local-gov       : 2093   Median : 178356  
##  Mean   :38.58    ?               : 1836   Mean   : 189778  
##  3rd Qu.:48.00    State-gov       : 1298   3rd Qu.: 237051  
##  Max.   :90.00    Self-emp-inc    : 1116   Max.   :1484705  
##                  (Other)          :  981                    
##          education     education-num                  martial_status 
##   HS-grad     :10501   Min.   : 1.00    Divorced             : 4443  
##   Some-college: 7291   1st Qu.: 9.00    Married-AF-spouse    :   23  
##   Bachelors   : 5355   Median :10.00    Married-civ-spouse   :14976  
##   Masters     : 1723   Mean   :10.08    Married-spouse-absent:  418  
##   Assoc-voc   : 1382   3rd Qu.:12.00    Never-married        :10683  
##   11th        : 1175   Max.   :16.00    Separated            : 1025  
##  (Other)      : 5134                    Widowed              :  993  
##             occupation            relationship                    race      
##   Prof-specialty :4140    Husband       :13193    Amer-Indian-Eskimo:  311  
##   Craft-repair   :4099    Not-in-family : 8305    Asian-Pac-Islander: 1039  
##   Exec-managerial:4066    Other-relative:  981    Black             : 3124  
##   Adm-clerical   :3770    Own-child     : 5068    Other             :  271  
##   Sales          :3650    Unmarried     : 3446    White             :27816  
##   Other-service  :3295    Wife          : 1568                              
##  (Other)         :9541                                                      
##       sex         capital-gain    capital-loss    hours-per-week 
##   Female:10771   Min.   :    0   Min.   :   0.0   Min.   : 1.00  
##   Male  :21790   1st Qu.:    0   1st Qu.:   0.0   1st Qu.:40.00  
##                  Median :    0   Median :   0.0   Median :40.00  
##                  Mean   : 1078   Mean   :  87.3   Mean   :40.44  
##                  3rd Qu.:    0   3rd Qu.:   0.0   3rd Qu.:45.00  
##                  Max.   :99999   Max.   :4356.0   Max.   :99.00  
##                                                                  
##         native-country     label      
##   United-States:29170    <=50K:24720  
##   Mexico       :  643    >50K : 7841  
##   ?            :  583                 
##   Philippines  :  198                 
##   Germany      :  137                 
##   Canada       :  121                 
##  (Other)       : 1709

The summary overview provides a snapshot of the data spread and averages. From this, we see this dataset includes a disproportionate number of middle-age, white, US-born, private-sector employees. There appears to be a fairly even distribution of individuals across occupational sectors and the majority of individuals work approximately 40 hours per week.

Clean data

The summary showed that there are many zero values for capital gains and losses. Because the income in this dataset is binary (above or below $50K) the capital gains and losses appear to be a more interesting metric in gauging wealth for the individuals in the Census. We will filter the data to include only instances when there was a non-zero vlaue for capital gains or losses.

# remove rows that contain zeroes for both capital gain and loss and merge capital-gain and capital-loss into a single variable, net
dat.filt <- dat %>% 
  filter(`capital-gain` != `capital-loss`) %>% 
  mutate(net = if_else(`capital-gain` == 0, 
                       as.numeric(`capital-loss`)*-1, # transform capital-loss to negative values 
                       as.numeric(`capital-gain`)))

# remove leading white spaces
dat.filt$race <- trimws(dat.filt$race)

# convert race to a factor
dat.filt$race <- factor(dat.filt$race, c("Other", "Asian-Pac-Islander", "Amer-Indian-Eskimo", "White", "Black"))

Filtered summary overview

Here we can view the filtered data summary and see that by filtering by capital gains and losses, the demographic has shifted to slightly older individuals represented by more men than women.

summary(dat.filt)
##       age                    workclass        fnlwgt       
##  Min.   :17.00    Private         :2714   Min.   :  19302  
##  1st Qu.:34.00    Self-emp-not-inc: 413   1st Qu.: 118346  
##  Median :42.00    Local-gov       : 321   Median : 175669  
##  Mean   :43.18    Self-emp-inc    : 284   Mean   : 187152  
##  3rd Qu.:51.00    ?               : 181   3rd Qu.: 234292  
##  Max.   :90.00    State-gov       : 164   Max.   :1033222  
##                  (Other)          : 154                    
##          education    education-num                  martial_status
##   HS-grad     :1086   Min.   : 1.00    Divorced             : 453  
##   Bachelors   : 971   1st Qu.: 9.00    Married-AF-spouse    :   2  
##   Some-college: 758   Median :10.00    Married-civ-spouse   :2777  
##   Masters     : 423   Mean   :11.03    Married-spouse-absent:  35  
##   Prof-school : 213   3rd Qu.:13.00    Never-married        : 769  
##   Assoc-voc   : 188   Max.   :16.00    Separated            :  81  
##  (Other)      : 592                    Widowed              : 114  
##               occupation           relationship                  race     
##   Prof-specialty   :850    Husband       :2454   Other             :  23  
##   Exec-managerial  :847    Not-in-family : 878   Asian-Pac-Islander: 137  
##   Sales            :512    Other-relative:  71   Amer-Indian-Eskimo:  31  
##   Craft-repair     :506    Own-child     : 258   White             :3755  
##   Adm-clerical     :362    Unmarried     : 274   Black             : 285  
##   Machine-op-inspct:196    Wife          : 296                            
##  (Other)           :958                                                   
##       sex        capital-gain    capital-loss    hours-per-week 
##   Female: 992   Min.   :    0   Min.   :   0.0   Min.   : 1.00  
##   Male  :3239   1st Qu.:    0   1st Qu.:   0.0   1st Qu.:40.00  
##                 Median : 3137   Median :   0.0   Median :40.00  
##                 Mean   : 8293   Mean   : 671.9   Mean   :43.42  
##                 3rd Qu.: 7688   3rd Qu.:1740.0   3rd Qu.:50.00  
##                 Max.   :99999   Max.   :4356.0   Max.   :99.00  
##                                                                 
##         native-country    label           net       
##   United-States:3850    <=50K:1781   Min.   :-4356  
##   ?            :  90    >50K :2450   1st Qu.:-1740  
##   Mexico       :  31                 Median : 3137  
##   Philippines  :  24                 Mean   : 7622  
##   India        :  21                 3rd Qu.: 7688  
##   Germany      :  20                 Max.   :99999  
##  (Other)       : 195

Relationship between education attainment and annual net gain

# generate boxplot of annual net gain across education levels
dat.filt %>% 
  ggplot(aes(x = education, y = net)) +
  geom_boxplot() +
  coord_flip() +
  scale_y_continuous(labels = scales::dollar_format()) +
  theme_bw(12) +
  labs(x = "Education attainment level",
       y = "Annual net gain",
       title = "Relationship between education attainment and annual net gain")

From the above boxplot, there seems to be minimal correlation between annual net gain and education attainment, however there seems to be a greater spread in annual net gain for individuals with at least a high school diploma. Professional school education demonstrated the highest median in annual net gain.

Relationship between race, gender and annual net gain

# generate violin plots of annual net gain across race and gender
dat.filt %>% 
  ggplot(aes(x = race,
             y = net, fill = sex)) +
  geom_violin() +
  coord_flip() +
  scale_y_continuous(labels = scales::dollar_format()) +
  theme_bw(12) +
  labs(x = "Ethnicity",
       y = "Annual net gain",
       title = "Relationship between race, gender, and annual net gain")

From the above violin plot, there doesn’t appear to be any significant differences in annual net gain between sex across all ethnic groups. Moreover, no obvious correlation between ethncity and annual net gain can be observed.

Correlation between work hours per week and annual net gain

# generate a box plot of annual net gain across work hours
dat.filt %>% 
  mutate(`work hours` = factor(case_when(`hours-per-week` <= 25 ~ "Short", # define a new variable to bin work hours per week into 4 categories
                                  `hours-per-week` > 25 & `hours-per-week` <= 50 ~ "Medium",
                                  `hours-per-week` > 50 & `hours-per-week` <= 75 ~ "Long",
                                  TRUE ~ "Very Long"),
                               levels = c("Short", "Medium", "Long", "Very Long"))) %>% 
  ggplot(aes(x = `work hours`, y = net)) +
  geom_boxplot() +
  theme_bw(12) +
  guides(fill = F) +
  scale_y_continuous(labels = scales::dollar_format()) +
  labs(x = "Work Hours",
       y = "Annual net gain",
       title = "Relationship between work hours per week and annual net gain")

From the above boxplot, there appears to be an increase in annual net gain from short to long work hours. However, the differences may not be significant because greater variance in annual net gain is observed for individuals with long work hours.

Research Questions

In this study, we will explore the relationships between personal attributes and quantitative income-related variables with the goal of identifying relationships and interesting patterns. Specifically, we will focus on addressing the following exploratory research questions:

  1. Is there an observable relationship between personal attributes data and income level?
  2. Does the number of hours worked per week relate more to occupation, sex, race, age, or is there no clear relationship?
  3. What is the relationship between education and hours worked per week (e.g. does a person work fewer hours if they have completed more schooling)?

Plan of Action

The variables that effect income may be confounding and are unlikely to be direct, therefore these data may not be appropriate for linear regression analyses. We will focus on exploring the relationships variables and identifying relationships and patterns.

References

United States Census Bureau, 2016. Income and Poverty, ‘about income’. https://www.census.gov/topics/income-poverty/income/about.html

University of California Irvine, Machine Learning Repository. https://archive.ics.uci.edu/ml/datasets/adult.