Part I - Exploration of 2019 Wage Income in Minnesota

by Brian Allan Woodcock

July 2022

Outline

Introduction

Preliminary Wrangling

Univariate Exploration

Bivariate Exploration

Multivariate Exploration

Conclusions

Introduction

The aim of this exploratory investigation is to look at earned income in Minnesota, primarily wage income, in 2019 across different factors such as age, sex, marital status, race/ethnicity, and educational attainment.

The data was obtained from IPUMS ("Integrated Public Use Microdata Series") USA (https://usa.ipums.org/usa/index.shtml), a source for US Census microdata that provides easy access to the microdata along with documentation and harmonization of variables across time periods. (Microdata consists of individual records rather than "summary" or "aggregate" data.) The original source of the data is the US Census Bureau -- in particular, their American Community Survey (ACS), a yearly sample of the American population that began in 2000 for the purpose of obtaining more detailed information than what is available via the decennial census.

The IPUMS samples are cluster samples: households or dwellings are sampled first, and individuals enter the sample only as members of those households. The samples are also stratified -- i.e., they divide the population into strata (based on characteristics such as geography, household size, race, and group quarters membership) and then sample separately from each stratum. To protect individual confidentiality, geographic identifiers are restricted to the state level and certain individual variables, such as income variables, are top-coded. That is, for responses at or above a state-specific threshold, the actual reported value is not present in the dataset; rather, it is replaced by the state mean of all values greater than or equal to the threshold for that variable. In visualizations, the top-coded values usually stand out as anomalous at the extreme high end of the range of values.
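The top-coding rule just described can be illustrated with a small sketch. The helper `top_code`, the income values, and the threshold are all invented for illustration; they are not the actual ACS thresholds or data.

```python
import numpy as np

def top_code(values, threshold):
    """Replace every value >= threshold with the mean of those values."""
    values = np.asarray(values, dtype=float)
    above = values >= threshold
    coded = values.copy()
    if above.any():
        # All cases at or above the threshold share one replacement value.
        coded[above] = values[above].mean()
    return coded

# Invented incomes and threshold, for illustration only.
incomes = [20_000, 55_000, 300_000, 500_000, 700_000]
coded = top_code(incomes, threshold=300_000)
print(coded)
```

After top-coding, every high value collapses to a single number, which is why top-coded values tend to appear as an isolated spike at the far right of a histogram.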

The ACS oversamples areas with smaller populations. Each month a systematic sample is drawn to represent each U.S. county or county equivalent. The selected monthly sample is mailed the ACS survey at the beginning of the month. Nonrespondents are contacted via telephone for a computer assisted telephone interview (CATI) one month later. One third of the nonrespondents to the mail or telephone survey are contacted in person for a computer assisted personal interview (CAPI) one month following the CATI attempt. Weights for the household and person-level data adjust for the mixed geographic sampling rates, nonresponse adjustments, and individual sampling probabilities. (From: https://usa.ipums.org/usa/chapter2/chapter2.shtml)

It must be emphasized that the ACS yearly samples are weighted samples. Because the sample design oversamples areas with smaller populations, not every sample case represents the same number of people in the overall population: each case represents anywhere from 20 to 1000 people in the complete population for the given year. The "weight" variables indicate how many persons in the population are represented by each sample case. The weights are therefore necessary for the sample to be (at least approximately) representative of the distribution of features within the overall U.S. population, and hence for producing statistical estimates -- i.e., proportions, means, medians, and ratios. Weights not only require special attention when analyzing the data; they also create challenges for producing visualizations when the plotting function has no built-in parameter for weights.
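To make the role of the weights concrete, here is a minimal sketch of a weighted mean and a weighted median. The function names and the three toy cases are assumptions for illustration; the weighted median uses one common convention (the first value at which cumulative weight reaches half the total), and other conventions exist.

```python
import numpy as np

def weighted_mean(values, weights):
    """Mean in which each case counts `weight` times."""
    return np.average(values, weights=weights)

def weighted_median(values, weights):
    """First value at which cumulative weight reaches half the total weight."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cumulative = np.cumsum(weights)
    idx = np.searchsorted(cumulative, 0.5 * cumulative[-1])
    return values[idx]

# Toy data: the low-wage case represents far more people than the others.
wages = [30_000, 50_000, 90_000]
perwt = [100, 20, 20]

print(weighted_mean(wages, perwt))
print(weighted_median(wages, perwt), np.median(wages))
```

In this toy example the unweighted median is the middle case, while the weighted median falls on the heavily weighted low-wage case, which shows how ignoring weights can shift an estimate.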

The IPUMS USA website allows for the selection of samples (based on years) along with the selection of variables to create a data set for downloading. The variables are well documented and the data is fairly clean. Learning how to use the IPUMS resources takes some investment of time and effort, but there are helpful videos for getting started as well as useful documentation on the website. Two selection choices were made for obtaining data from the 2019 ACS sample. Minnesota was chosen, first, to narrow the scope of the investigation and thereby decrease the size of the dataset and, second, because this is the author's state of residence. Also, only persons 20 years and older were selected, since the aim is to investigate income of adults in the workforce.

References

Steven Ruggles, Sarah Flood, Sophia Foster, Ronald Goeken, Megan Schouweiler and Matthew Sobek. IPUMS USA: Version 12.0 [accessed 5/11/2022]. Minneapolis, MN: IPUMS, 2022. https://doi.org/10.18128/D010.V12.0

U.S. Census Bureau. American Community Survey Operations Plan. Release 1: March 2003. https://usa.ipums.org/usa/resources/codebooks/ACS_codebook.pdf



Preliminary Wrangling

We observe that none of the variables have null values. Variable values -- whether the variable is numerical or categorical -- are encoded numerically, either as integers or as floats.

What is the structure of your dataset?

The 2019 data has 43,143 rows. Weights, given by the variable perwt (for person weight as compared with household weight), can be used to obtain estimates that are representative of the population as a whole. There are 41 variables.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is earned income (incearn), especially wage income (incwage). These are both continuous, numerical variables as opposed to categorical values using numerical value encoding.

What features in the dataset do you think will help support your investigation into your feature(s) of interest?

The aim of the investigation is foremost to explore wage income and differences in wage income across a variety of factors such as age (age), sex (sex), marital status (marst), race/ethnicity (rachsing), and educational attainment (educ). In some cases, such as race/ethnicity, there is a constellation of variables related to the general topic (some providing more detail than others or touching on a unique question). It would be unmanageable to investigate all of the variables in the constellation so a decision is needed in these cases as to just which ones to focus on. In some cases, too, feature engineering may be needed to obtain just the level of detail desired.

Besides the feature of interest and the factors by which to investigate that feature, there are some variables that need to be investigated because they play an enabling role for the investigation. Since the samples are weighted, the person weight variable (perwt) should be looked at so that it is well-understood. Employment status (empstat) and labor force involvement (labforce) should likewise be well-understood, since it may be necessary to exclude certain people in the sample from the investigation because they are unemployed or consider themselves no longer a part of the labor force -- i.e., they don't fit the traditional profile of a wage worker.



Univariate Exploration

In this section, we will become familiar with -- and sometimes invent (by means of feature engineering) -- the variables for the investigation. How these variables then relate to one another will be explored further in the sections on bivariate and multivariate exploration.

Person Weight -- perwt

Person weight (perwt) indicates how many persons in the U.S. population are represented by a given person in an IPUMS sample. The weights are necessary in order to obtain nationally representative statistics.

Question: What does the distribution of person weights look like?

Observations

The distribution of person weights is clearly right-skewed. Most of the distribution is contained in the lower values. In order to observe the lower end better, we will display the log transform of the person weights along the x-axis.
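A log transform of this kind can be sketched as follows. The weights here are synthetic (drawn from a lognormal distribution as a stand-in for perwt), so the counts say nothing about the real data; the point is only that binning the log10 values spreads out the crowded low end.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic right-skewed weights standing in for perwt (not the real data).
perwt = np.round(rng.lognormal(mean=4.0, sigma=0.6, size=1_000)).astype(int) + 1

# Bin the log10-transformed weights so the x-axis is in log units.
log_perwt = np.log10(perwt)
counts, edges = np.histogram(log_perwt, bins=20)

# Translate the bin edges back to the original weight scale for labeling.
edges_in_persons = 10 ** edges
print(counts)
print(np.round(edges_in_persons, 1))
```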

Observations



Age Distribution in 2019

Aim: Examine the age distribution of Minnesotans (20 years old or more) in 2019 using the age variable. In order for the 2019 ACS sample to be representative, the person weightings will be used. The use of the person weightings yields estimates that are the size of population totals (as though obtained through a census) even though they come from a sample. This should be kept in mind.

Question: What does the age distribution of Minnesotans (20 years old or more) in 2019 look like?
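As a rough sketch of how person weights turn sample cases into population-sized estimates, the weights can be summed within each age. The four-row frame below is invented; only the column names `age` and `perwt` come from the dataset.

```python
import pandas as pd

# Invented sample cases; only the column names match the dataset.
sample = pd.DataFrame({
    "age":   [25, 25, 40, 67],
    "perwt": [80, 120, 150, 60],
})

# Summing weights (rather than counting rows) estimates persons per age.
pop_by_age = sample.groupby("age")["perwt"].sum()
print(pop_by_age)
```

Here two sample cases aged 25 stand for an estimated 200 people, which is why the resulting totals are on the scale of the population rather than the sample.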

Observations



Age Categories -- agecat

Aim: Create age categories for less fine-grained age visualization.
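One way such categories might be built is with `pandas.cut`. The bin edges and labels below are assumptions chosen to give decade groups (20s, 30s, ...) with an open-ended top group; the categories actually used in the analysis may differ.

```python
import pandas as pd

ages = pd.Series([20, 29, 30, 45, 59, 60, 83, 95])

# Assumed decade bins; right=False makes each interval [lower, upper).
bins = [20, 30, 40, 50, 60, 70, 80, 200]
labels = ["20s", "30s", "40s", "50s", "60s", "70s", "80+"]
agecat = pd.cut(ages, bins=bins, labels=labels, right=False)

print(agecat.tolist())
```

Using `right=False` keeps a 30-year-old in the "30s" group rather than the "20s" group, and the result is an ordered categorical, so plots keep the groups in age order.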

Observations



Sex

Aim: Examine Minnesota population of persons 20 years old or more by sex using the sex variable.

Encoding Interpretation
1 Male
2 Female

Question: What is the 2019 sex distribution of persons 20 years old or more?

Observations

The interest in this variable concerns the question: Are there differences in income based on sex?



Marital Status

Question: What does the marital status distribution of the 2019 population (20 years old or more) look like?

Aim: Examine the population distribution by marital status using the marst variable.

Encoding Interpretation
1 Married, spouse present
2 Married, spouse absent
3 Separated
4 Divorced
5 Widowed
6 Never married/single

Observations

Over half were married with the spouse present. Slightly more than a quarter were "Never married/single". Almost 12% were currently divorced. The interest in marital status is whether it correlates in any interesting way with income.

Any observed differences could then suggest hypotheses that might be further investigated.



Race/Ethnicity

Question: What does the 2019 distribution of the population (20 years old or more) by race/ethnicity look like?

Aim: Examine the distribution of the population by race using the variable rachsing.

Encoding Interpretation
1 White
2 Black/African American
3 American Indian/Alaska Native
4 Asian/Pacific Islander
5 Hispanic/Latino

The variable rachsing chosen here to represent race is a variable constructed from other variables by IPUMS USA. In the ACS, there are separate questions for race and ethnicity (i.e., Hispanic origin or not). This variable assigns people to a single race/ethnicity category. It provides a simplified race/ethnicity designation, using information from the survey variables race and hispan (Hispanic origin) as well as others.

All people who reported Hispanic origins are classified as Hispanic regardless of their race response. Non-Hispanic single-race respondents (American Indian/Alaska Native, Asian and/or Pacific Islander, Black/African American, and White) are classified according to their race response. Non-Hispanic people who reported “some other race” in combination with one of these race groups were classified as if they had not reported “some other race.” Non-Hispanic multiple-race people and non-Hispanic people who reported only “some other race” are assigned to a single category by means of an equation using the individual’s age, sex, region, and the urbanization level and racial diversity of their geographic region to predict which single race the person would have chosen if asked to choose only one. (For more information see: https://usa.ipums.org/usa-action/variables/RACHSING#description_section )

Observations



Educational Attainment

Question: What does the distribution of the 2019 population by educational attainment look like?

Aim: Examine the distribution of the population by educational attainment.

The variable educ indicates respondents' educational attainment, as measured by the highest year of school or degree completed. Note that completion differs from the highest year of school attendance; for example, respondents who attended 10th grade but did not finish were classified as having completed 9th grade. (There is a more detailed version of this variable. Further investigation into educational attainment would be possible by using the detailed version.)

Encoding Interpretation
00 N/A or no schooling
01 Nursery school to grade 4
02 Grade 5, 6, 7, or 8
03 Grade 9
04 Grade 10
05 Grade 11
06 Grade 12
07 1 year of college
08 2 years of college
09 3 years of college
10 4 years of college
11 5+ years of college

We will reduce the number of levels of educational attainment by collapsing some of the lower levels, thereby creating another variable for educational attainment -- educat.
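A collapse of this kind might look like the sketch below. The particular grouping (all levels below grade 12 merged into one) and the label strings are assumptions made for illustration; the grouping used for educat in the analysis is not spelled out in this text.

```python
import pandas as pd

educ = pd.Series([0, 3, 6, 7, 8, 10, 11])

# Assumed collapse: codes 0-5 (everything below grade 12) become one level.
collapse = {code: "Below grade 12" for code in range(0, 6)}
collapse.update({
    6: "Grade 12",
    7: "1yr college",
    8: "2yr college",
    10: "4yr college",
    11: "5+yrs college",
})

levels = ["Below grade 12", "Grade 12", "1yr college",
          "2yr college", "4yr college", "5+yrs college"]
# An ordered categorical keeps the levels in attainment order for plotting.
educat = pd.Categorical(educ.map(collapse), categories=levels, ordered=True)
print(list(educat))
```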

Notice that there are no counts for educ value 9, "3 years of college". It is unused.

Observations



Employment

Question: Which employment classifications are useful for the investigation of wage income?

When looking at something like wage income, we may want to exclude people who are unemployed or not in the labor force. We may also want to distinguish between the self-employed and the wage workers. There are three variables concerning employment that may be useful to understand for setting those conditions.

empstat -- Employment Status

Encoding Interpretation
0 N/A
1 Employed
2 Unemployed
3 Not in labor force

labforce -- Labor Force Status

Encoding Interpretation
0 N/A
1 No, not in the labor force
2 Yes, in the labor force

classwkr -- Class of Worker

Encoding Interpretation
0 N/A
1 Self-employed
2 Works for wages

Notice that there are no empstat 0 cases. This means that there are no cases with employment status as 'N/A'. Notice also that there are no cases of 'Employed' but also 'N/A' for worker classification.
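Remarks like these typically come from a cross-tabulation of the employment variables. The sketch below shows one, along with one possible way to build a combined label of the kind the empclass variable (used later) provides. The toy rows and the helper `make_empclass` are hypothetical; the actual construction of empclass is not shown in this text.

```python
import pandas as pd

# Toy rows using the documented encodings of empstat and classwkr.
sample = pd.DataFrame({
    "empstat":  [1, 1, 2, 3, 1],
    "classwkr": [2, 1, 2, 0, 2],
})

# Counts of each (employment status, class of worker) combination.
table = pd.crosstab(sample["empstat"], sample["classwkr"])
print(table)

empstat_labels = {1: "Employed", 2: "Unemployed", 3: "Not in Labor Force"}
classwkr_labels = {1: "Self-Employed", 2: "Works for Wages"}

def make_empclass(row):
    """Hypothetical combined label: split only the employed by worker class."""
    status = empstat_labels.get(row["empstat"], "N/A")
    if row["empstat"] == 1:
        return f"{status} & {classwkr_labels.get(row['classwkr'], 'N/A')}"
    return status

sample["empclass"] = sample.apply(make_empclass, axis=1)
print(sample["empclass"].tolist())
```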

Observations



Earned Income -- Wages and Business Income

There are several income variables, too many to examine in detail for this project. For this investigation, we will focus on earned income (incearn) in contrast to income from social security, welfare, investments, or other such sources. Earned income breaks down into two components: wage income (incwage) and business income (incbus00). To begin, we will focus on wage income.

The variable incwage reports each respondent's total pre-tax wage and salary income -- that is, money received as an employee -- for the previous year. Sources of income in incwage include wages, salaries, commissions, cash bonuses, tips, and other income received from an employer. Payments-in-kind or reimbursements for business expenses are not included. (From: https://usa.ipums.org/usa-action/variables/INCWAGE#description_section)

For the ACS, respondents are asked about their income in the past 12 months. However, since respondents are surveyed throughout the year, responses from Jan 2019 will describe income from 2018 whereas responses from Dec 2019 will describe income largely from 2019. Thus, the reference period for respondents in the same ACS year will not be the same. It may be possible to make some adjustments for this, but IPUMS USA has tested this and found that adjustment factors across adjacent years are too small to affect results appreciably. (For a lengthier discussion, see: https://usa.ipums.org/usa/acsincadj.shtml.)

Investigating Wage Income -- incwage

From above, we see that the variable incwage has values ranging from \$0 to \$476,000. The latter is the top-coded value for this numerical variable.

We notice that several \$0 values occur. In what we see here, they occur for those "Not in Labor Force". Are there a lot of \$0 values, even for other empclass values? We may want to exclude these from our analysis.

Also, non-zero incwage values occur in some cases for people who declared that they are "Unemployed" or "Not in Labor Force". There may be all kinds of explanations for why these people earned and reported wages, but clearly they do not fit the standard profile of the employed wage earner, which is foremost what we would like to investigate. So, we will focus on those cases whose empclass value is 'Employed & Works for Wages'.

We will create two masks then to use to filter the cases according to our interests for investigation.
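Boolean masks of this sort might be written as in the sketch below. The mask names and the toy rows are assumptions; the sketch builds the employed-wage-worker condition directly from empstat and classwkr rather than from empclass.

```python
import pandas as pd

# Toy rows; encodings follow the empstat and classwkr tables above.
sample = pd.DataFrame({
    "empstat":  [1, 1, 3, 1],
    "classwkr": [2, 2, 0, 1],
    "incwage":  [42_000, 0, 1_500, 30_000],
})

# Mask 1: employed and working for wages.
employed_wage_mask = (sample["empstat"] == 1) & (sample["classwkr"] == 2)
# Mask 2: reported zero wage income.
zero_wage_mask = sample["incwage"] == 0

# Combining masks isolates anomalies or the cases to keep.
anomalies = sample[employed_wage_mask & zero_wage_mask]
kept = sample[employed_wage_mask & ~zero_wage_mask]
print(len(anomalies), kept["incwage"].tolist())
```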

We see two such cases of people who identified as employed, wage workers but listed \$0 wage earnings. Both were seniors and received a substantial portion of their total reported income from Social Security. We will treat these as anomalies. This observation justifies excluding these cases of people who reported \$0 wage income from our analysis, an example of data cleaning.

Excluding cases of zero wage income, we will examine and compare the incwage distribution of employed, wage workers with those who reported some other employment and worker classification.

Observations

Both distributions are right-skewed. We see also that most of those who reported non-zero wage income self-identified as employed wage workers. On the other hand, those who self-identified in some other way but had non-zero wage income tended to report wage income figures toward the low end.

Investigating Earned Income (incearn) and Business Income (incbus00)

For the ACS, the variable incearn is the sum of the variables incwage and incbus00. That is, earned income is just the sum of wage income and business income. As with incwage, questions about income in the ACS concern the past 12 months.

The variable incbus00 reports each respondent's net pre-income-tax self-employment income from a business, professional practice, or farm, for the past 12 months. The figure is the amount earned after subtracting business expenses from gross receipts. It includes any money earned working for one's own concern(s). No distinction was made between incorporated and unincorporated businesses. (https://usa.ipums.org/usa-action/variables/INCBUS00#description_section)

As we saw, those who identified as employed sometimes identified their worker classification as "Self-Employed" as opposed to "Works for Wages". Let us create another mask for those people -- "Employed & Self-Employed". We might expect that those who identified as self-employed would report their earnings as business income, i.e. incbus00. Is that the case? Do many also earn wages?

We will create the following masks:

Observations

NOTE: These observations primarily concern the unweighted sample, not the population. However, inasmuch as the representativeness of the sample depends on weights being applied to unweighted cases -- and, hence, depends also on the quantity of unweighted cases -- the unweighted sample represents the population in a very rough-grained way.

From now on, when looking at wage and business income, we will be concerned to look at the non-zero responses on those variables.

Observations

Question

We have seen that people report wage income under a variety of circumstances, some circumstances suggesting that they do not fit the profile of a typical employed wage worker -- for example, someone who identifies as "Not in Labor Force" and yet reported wage income. This suggests that, in analyzing wage income, we should restrict our attention to only certain employment classifications. Clearly, those who indicated that they were employed and that they work for wages should be included -- namely, those cases with the value "Employed & Works for Wages" of the variable empclass -- but should others be included as well? In particular, should those who identified as employed and, moreover, as self-employed be included? We see above that a substantial proportion of those people report wage income and the magnitude of individual income reported is not trivial (in other words, it is not primarily low wages, which could be interpreted as supplementary to their business income, that is reported).

The category of those who identified as "Employed" (empstat value of 1) divides into those who identified as "Works for Wages" (classwkr value of 2) and those who identified as "Self-Employed" (classwkr value of 1).

Decision for Subsequent Analyses

For simplicity in the subsequent analyses, we must decide when looking at wage income whether to include those who identified as self-employed and thus examine the whole class of people who identified as "Employed" or merely those who identified as "Employed & Works for Wages". Regardless, we will continue to exclude cases that reported zero wage income. We have already seen that those who identified as self-employed often report substantial wage incomes. Moreover, we see here that including those people in our analysis does not alter the shape of the distribution of wage income appreciably. It simply adds more cases per bin, especially at the lower levels (say, below \$110k). So, in the following analyses we will restrict cases only to those who indicated that they were "Employed" (and reported non-zero wage income), regardless of whether they self-identified as "Works for Wages" or as "Self-Employed".

Create DataFrame of Employed Wage Earners

For the following analyses, then, we will create a dataframe consisting only of people who self-identified as employed and with a non-zero wage income. We will call these people employed wage earners. We will also drop columns that we are unlikely to use going forward.
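The filtering step just described can be sketched as follows. The frame name `employed_wage_earners`, the toy rows, and the placeholder column `unused_col` are hypothetical; which columns were actually dropped is not stated here.

```python
import pandas as pd

# Toy rows; `unused_col` stands in for any column not needed going forward.
sample = pd.DataFrame({
    "empstat":    [1, 1, 2, 1],
    "classwkr":   [2, 1, 2, 2],
    "incwage":    [52_000, 18_000, 4_000, 0],
    "perwt":      [90, 110, 75, 60],
    "unused_col": [0, 0, 0, 0],
})

# Keep employed people (wage workers or self-employed) with non-zero wages.
employed_wage_earners = (
    sample[(sample["empstat"] == 1) & (sample["incwage"] > 0)]
    .drop(columns=["unused_col"])
    .reset_index(drop=True)
)
print(employed_wage_earners)
```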

Note

This data frame is the basis for the upcoming analyses. So, for the subsequent analyses, the effective sample size is 25,597 cases. Weightings are applied to these cases to provide representative statistics.

Observations

Using a log scale, we can see the distribution of lower incomes better. We see that the main peak extends from about \$30,000 to \$100,000.

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations? Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Person weight (perwt) was the first variable explored since that is used in many of the subsequent visualizations in order to make the sample representative of the population. This variable was strongly right-skewed. In order to better visualize the distribution of the lower values (where most of the action was present), a log transform was used.

A second age variable was engineered from the age variable available from the IPUMS dataset. The IPUMS age variable is a numerical variable conveying the age in years of the respondent. A second, categorical, age variable (agecat) was created in order to consider age groups such as people in their 20s, 30s, 40s, etc.

It had been decided at the start to investigate earned income, and in particular income earned from wages as opposed to income from a business. How those income variables were related was explored. It became clear that some people who reported wage income did not fit the standard image of an employee working for wages -- for example, people who reported not being in the labor force. Variables on employment status, labor force status, and worker classification were investigated, and even engineered, to better understand and make a decision on how to constrain the cases of interest in looking at wage income.

It was decided that the income variable that will be the focus of interest in the subsequent analyses is wage income (incwage) and that only cases with an employment status of "Employed" and who reported a non-zero wage would qualify for inclusion in the domain of the analysis. Wage income (incwage) is right-skewed, as one might expect, and has a top-coded value of \$476,000. It will be explored using the following variables in the upcoming analyses:

The age distribution exhibits an interesting "divot" in the middle representing people in their mid-40s to mid-50s. In 2019, these people would have been born roughly between 1964 and 1974, during the Vietnam War, a possible explanation for the "divot".

The sex ratio (ratio of males to females) was found to be slightly less than one.

Roughly a quarter were never married/single. And a little more than half were married with the spouse present.

Whites made up a majority of the population in Minnesota in 2019. The minority populations of Blacks, Asians, and Hispanics were of roughly equal size. The American Indian population was much smaller than each of the other minority populations.

Educational attainment in Minnesota was generally high, although roughly 4-5% had not attained at least a 12th grade education.



Bivariate Exploration

Create Expanded DataFrame in Place of Weights

Not all data visualization tools include parameters to add weights. We can get around this by creating an expanded dataframe that contains as many duplicates of a row as the weight for that row.
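A minimal sketch of this expansion uses `Index.repeat` to duplicate each row as many times as its weight. The two-row frame is invented. With real ACS weights the expanded frame has roughly as many rows as the estimated population, so memory use can become a concern.

```python
import pandas as pd

# Invented cases: the first represents three people, the second one person.
sample = pd.DataFrame({"incwage": [30_000, 80_000], "perwt": [3, 1]})

# Repeat each row label `perwt` times, then select those labels.
expanded = sample.loc[sample.index.repeat(sample["perwt"])].reset_index(drop=True)

print(len(expanded))
print(expanded["incwage"].mean(), expanded["incwage"].median())
```

Unweighted statistics and plots computed on the expanded frame then behave like weighted ones on the original; here the plain mean of the expanded column equals the weighted mean of the two original cases.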



Wage Income by Age

Question: What is the distribution of wage income by age?

Observations

From this scatterplot, you can see the climb in higher wages at the lower ages (from, say, 20 to 30) and the decline in higher wages in the higher ages (say, 65 and onward). There is a lot of overplotting, so the density of wages at age-wage locations in the plot is not visible.

Observations

These heatmaps have the advantage over the previous scatterplot of making visible the density of people at various age-wage locations. One notices bands of green running horizontally about every \$10k, which stand out prominently at \$30k, \$40k, \$50k, and \$60k. What also stands out are the bottom wages in dark blue at the bottom left corner that are earned by many people ages 20-22. Could these people be college students with part-time, low-paying jobs?
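The grid behind a weighted age-wage heatmap can be sketched with `numpy.histogram2d`, which accepts a `weights` argument. The five cases and the coarse bin edges below are invented for illustration; this is not necessarily how the heatmaps in this analysis were produced.

```python
import numpy as np

# Invented cases: age, wage, and person weight for each.
age = np.array([21, 21, 45, 45, 63])
wage = np.array([8_000, 9_000, 50_000, 52_000, 30_000])
perwt = np.array([100, 50, 80, 70, 40])

# Coarse illustrative bins; a real heatmap would use much finer ones.
age_edges = np.array([20, 40, 60, 80])
wage_edges = np.array([0, 25_000, 50_000, 75_000])

# Each cell holds the summed weights, i.e. estimated persons in that cell.
density, _, _ = np.histogram2d(age, wage, bins=[age_edges, wage_edges], weights=perwt)
print(density)
```

The resulting array can be passed to any heatmap-style plotting function; rows correspond to age bins and columns to wage bins.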

Observations

Observations

These frequency-scaled violinplots illustrate the removal of wage earners from the labor force starting with the 60s category and continuing with higher age categories. That the inner boxplots sink lower starting in the 60s shows that higher wage earners have left the labor force in greater proportion.

Transfer of People Out of the Labor Force

Question: What does the transfer of people out of the labor force at higher ages look like?

Observations

This graphic illustrates the transfer of people out of the labor force starting in the 60s age category. This transfer occurs alongside the overall decrease in people with increasing age due to mortality. It helps to explain the decrease in wage incomes at the higher age categories; people retire and leave the labor force. If those that retire early, or retire at all, are predominantly those in better financial shape, then that would leave mostly lower income wage earners in the labor force and it would explain the drop in wage income for the population at higher ages that was illustrated in previous graphics.



Wage Income by Sex

Question: How does the distribution of wage income vary with sex?

Observations

The previous three histogram visualizations are quite similar. The first two show count frequencies, the first using clustered bars/bins and the second using overlapping histograms. The third differs from the second in being a relative frequency histogram with the bin heights corresponding to the overall proportion in each group. The second and third visualizations are only slightly different because the quantities of males and females are so close. A relative frequency histogram is preferable for making comparisons about proportionate differences between the two groups.
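The difference between count and relative-frequency histograms can be sketched numerically. Within each group, the weighted bin counts are divided by the group's total weight so that each group's bars sum to one. The toy rows and the two wide bins are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows; sex uses the documented encoding (1 = Male, 2 = Female).
sample = pd.DataFrame({
    "sex":     [1, 1, 1, 2, 2, 2],
    "incwage": [20_000, 60_000, 90_000, 15_000, 30_000, 70_000],
    "perwt":   [100, 100, 200, 150, 150, 100],
})

edges = np.array([0, 50_000, 100_000])
rel_freq = {}
for sex_code, group in sample.groupby("sex"):
    # Weighted counts per bin, then normalized within the group.
    counts, _ = np.histogram(group["incwage"], bins=edges, weights=group["perwt"])
    rel_freq[sex_code] = counts / counts.sum()

print(rel_freq)
```

Because each group is normalized separately, the comparison is about proportions within a group and is unaffected by one group being larger than the other.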

What we see in all three cases is that females outnumber males (by count or proportion) at lower incomes and males outnumber females (by count or proportion) at higher incomes.

Observations

Although this is an interesting observation, it should not be interpreted as straightforward evidence of wage inequality. Other factors would need to be considered to make that judgement, such as comparing men versus women in the same occupation, the same job title, and doing equivalent work.



Wage Income by Marital Status

Question: Does the distribution of wage income vary by marital status?

Observations

Observations

These scaled violinplots with inner boxplots give a visual representation of the relative size of each marital category. Ordering by the median value helps one to see relative differences in wage income among the categories.



Wage Income by Race/Ethnicity

Question: Does the distribution of wage income vary by race/ethnicity?

Observations



Wage Income by Educational Attainment

Question: Does the distribution of wage income vary by educational attainment?

Observations



Distribution of Levels of Educational Attainment by Race/Ethnicity

Question: Could proportionate differences in educational attainment explain observed differences in average wages based on race/ethnicity?
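A proportionate comparison of this kind can be sketched by summing weights within each race/ethnicity and education cell and then dividing each row by its total. The group labels, education labels, and weights below are placeholders, not results from the data.

```python
import pandas as pd

# Placeholder groups and levels; the weights are invented.
sample = pd.DataFrame({
    "race":   ["A", "A", "B", "B"],
    "educat": ["Grade 12", "4yr college", "Grade 12", "4yr college"],
    "perwt":  [100, 300, 150, 50],
})

# Weighted persons per (race, education) cell, one row per race group.
weighted = sample.groupby(["race", "educat"])["perwt"].sum().unstack(fill_value=0)
# Divide each row by its total so rows sum to 1.
proportions = weighted.div(weighted.sum(axis=1), axis=0)
print(proportions)
```

Row-normalizing makes groups of very different sizes directly comparable, which is the comparison the question above calls for.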

Observations



Distribution of Levels of Educational Attainment by Age Category Under 50

We showed a positive relationship between wage income and educational attainment above. Given background knowledge about what drives salary increases and about the demand for highly educated workers, we take this positive relationship to indicate that educational attainment is a causal factor for higher wage income. Taking this as a given, we can investigate the following question.

Question: Does the observed increase in wages from the 20s to the 30s and then the 40s track with (and, therefore, possibly result from) increasing educational attainment as people pass through these age categories?

Observations

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Wage income is the main feature of interest in this study. As mentioned above, in the process of investigating how wage incomes vary with other factors, new questions were raised about relationships between some of the other variables -- for example:

(a) How did the proportionate distribution among the levels of educational attainment change with age category for people under the age of 50?

(b) What is the proportionate distribution of the levels of educational attainment among the various categories of race/ethnicity?

For the first question, the proportions in the 2yr college and 5+yrs college categories increased from the 20s to the 30s, and then remained fairly similar from the 30s to the 40s. For the second question, Whites and Asians had roughly the same proportion of people with 4 years of college or more, while Blacks, Hispanics, and American Indians had similar, albeit lower, proportions on the same (combined) measure of educational attainment.
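A proportionate distribution of this kind can be sketched with pandas as follows. This is only an illustration under stated assumptions: the column names `RACE_ETH` and `EDUC_LEVEL` are hypothetical, `PERWT` is assumed to be the person-weight column, and the rows are toy values rather than ACS data.

```python
import pandas as pd

# Toy rows for illustration only; these are not values from the ACS sample.
# RACE_ETH and EDUC_LEVEL are hypothetical column names; PERWT is assumed
# to hold the person weight.
df = pd.DataFrame({
    "RACE_ETH": ["White", "White", "Asian", "Asian", "Black", "Black"],
    "EDUC_LEVEL": ["4yr college", "12th grade"] * 3,
    "PERWT": [60, 40, 30, 20, 10, 30],
})

def weighted_proportions(data, row, col, weight):
    """Share of each `col` level within each `row` group, based on summed weights."""
    totals = data.groupby([row, col])[weight].sum().unstack(fill_value=0)
    # Divide each row by its own total so that every row sums to 1.
    return totals.div(totals.sum(axis=1), axis=0)

props = weighted_proportions(df, "RACE_ETH", "EDUC_LEVEL", "PERWT")
print(props.round(2))
```

Because the weights are summed before dividing, each row describes the estimated population shares for that group rather than the raw sample counts.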



Multivariate Exploration

Wage Income by Educational Attainment and Age Category Under 50

Let us continue the investigation of whether the increase in wages from the 20s to the 30s and then the 40s is due to changes in educational attainment.

Question: Are there noticeable increases in average wage income with age across each level of educational attainment?
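One way to approach this question is to bin ages into decade categories and compute a weighted average wage for each combination of education level and age category. The sketch below is not necessarily the method used in this notebook. The column names `AGE`, `INCWAGE`, `PERWT`, and `EDUC_LEVEL` are assumptions, and the rows are toy values.

```python
import pandas as pd

# Toy rows for illustration only; column names are assumptions.
df = pd.DataFrame({
    "AGE": [23, 27, 34, 38, 45, 49],
    "EDUC_LEVEL": ["4yr college"] * 6,
    "INCWAGE": [30000, 40000, 55000, 65000, 70000, 80000],
    "PERWT": [100, 100, 50, 150, 100, 100],
})

# Left-closed bins: [20, 30), [30, 40), [40, 50).
df["AGE_CAT"] = pd.cut(
    df["AGE"], bins=[20, 30, 40, 50], right=False, labels=["20s", "30s", "40s"]
)

# Weighted mean = sum(weight * wage) / sum(weight) within each group.
df["WEIGHTED_WAGE"] = df["INCWAGE"] * df["PERWT"]
sums = df.groupby(["EDUC_LEVEL", "AGE_CAT"], observed=True)[
    ["WEIGHTED_WAGE", "PERWT"]
].sum()
avg_wage = (sums["WEIGHTED_WAGE"] / sums["PERWT"]).unstack("AGE_CAT")
print(avg_wage)
```

Reading across a row of `avg_wage` then shows how the average wage changes with age within a single level of educational attainment.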

Observations



The Impact of Education on Wage Income across Racial/Ethnic Categories

Question: Is there a steady increase in average wage incomes with increasing educational attainment across each of the major racial/ethnic categories?

Observations

Observations

Observations on the Heatmap Tables



Sex Differences in Wage Income

Question: Do sex differences in wage income exist regardless of factors such as age category, educational attainment, race/ethnicity, and marital status?

Let us look at wage income by sex and age category, sex and educational attainment, sex and race/ethnicity, and sex and marital status.

Observations

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Were there any interesting or surprising interactions between features?

It was observed that average wage incomes did not necessarily increase with every increase in educational attainment for all racial/ethnic categories. In particular, average wage income decreased when comparing people with a 12th-grade education to those with one year of college in the case of Asians, American Indians, and Hispanics.
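A check for such non-monotonic steps can be sketched by differencing a table of average wages across ordered education levels. The table below uses made-up numbers and placeholder group names ("Group A", "Group B"). It does not reproduce the heatmap values from this study.

```python
import pandas as pd

# Education levels in increasing order (an illustrative subset).
levels = ["12th grade", "1yr college", "2yr college", "4yr college"]

# Made-up average wages for two placeholder groups; not values from the study.
avg_wage = pd.DataFrame(
    [[30000, 28000, 35000, 50000],
     [32000, 34000, 38000, 55000]],
    index=["Group A", "Group B"],
    columns=levels,
)

# Change in average wage relative to the previous education level.
steps = avg_wage.diff(axis=1)

# True wherever the average fell; the first column is NaN and so never flagged.
decreases = steps.lt(0)

flagged = [
    (group, level)
    for group in decreases.index
    for level in decreases.columns
    if decreases.loc[group, level]
]
print(flagged)
```

Each flagged pair names a group and the education level at which its average wage dropped below that of the preceding level.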



Conclusions

This investigation has looked at wage income in Minnesota as reported in the 2019 ACS (American Community Survey) administered by the US Census Bureau. Since wage income was the focus, only people aged 20 or older were included. The data was obtained from IPUMS USA, an online source of US Census microdata that also provides documentation and harmonization of variables across time periods. The 2019 ACS aims to be a 1% sample of the US population, obtained using techniques of clustering and stratification. As a result, each record in the sample carries a weight, and the weights must be applied for results to be representative of the population. Some visualization tools have built-in parameters for weights, so that the data graphic is representative of the population, and some do not. For the latter case, an expanded data frame (having 2,746,113 rows) was created in which each row of the original data set was repeated as many times as its weight indicated. This expanded data frame could then be fed into the visualization tool to produce a representative data graphic.
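The row-expansion step described above can be sketched in pandas as follows. This is a minimal illustration, assuming integer weights stored in a column named `PERWT` and a wage column named `INCWAGE`; the three rows are toy values, and the original notebook may have built its expanded data frame differently.

```python
import pandas as pd

# Toy sample for illustration; PERWT is assumed to be an integer person weight.
sample = pd.DataFrame({
    "INCWAGE": [20000, 50000, 90000],
    "PERWT": [3, 1, 2],
})

def expand_by_weight(data, weight="PERWT"):
    """Repeat each row as many times as its (integer) weight."""
    repeated_index = data.index.repeat(data[weight])
    return data.loc[repeated_index].reset_index(drop=True)

expanded = expand_by_weight(sample)
print(len(expanded))               # one row per unit of weight
print(expanded["INCWAGE"].mean())  # matches the weighted mean of the sample
```

The unweighted mean of the expanded frame equals the weighted mean of the original sample, which is why a plotting tool without a weights parameter can still produce a representative graphic from it. The trade-off is memory, since the expanded frame grows to the sum of the weights.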