Exploratory Data Analysis of 2013 Behavioral Risk Factor Data

Note: Created for the Coursera course “Introduction to Probability and Data” in the Statistics with R sequence by Duke University.

Setup

Load packages

library(ggplot2)
library(dplyr)

Load data

load("brfss2013.RData")

Part 1: Data

What is the Behavioral Risk Factor Surveillance System (BRFSS)? It is a collaborative project between the Centers for Disease Control and Prevention (CDC) and all of the states in the United States (US) as well as participating US territories.

What is the objective of the BRFSS? The BRFSS objective is to collect uniform, state-specific data on preventive health practices and risk behaviors that are linked to chronic diseases, injuries, and preventable infectious diseases that affect the adult population. Factors assessed by the BRFSS in 2013 included tobacco use, HIV/AIDS knowledge and prevention, exercise, immunization, health status, healthy days — health-related quality of life, health care access, inadequate sleep, hypertension awareness, cholesterol awareness, chronic health conditions, alcohol consumption, fruits and vegetables consumption, arthritis burden, and seatbelt use.

How was the BRFSS 2013 data collected? Both landline telephone- and cellular telephone-based surveys were conducted. In conducting the BRFSS landline telephone survey, interviewers collected data from a randomly selected adult in a household. In conducting the cellular telephone version of the BRFSS questionnaire, interviewers collected data from an adult who participated by using a cellular telephone and resided in a private residence or college housing. In 2012, an estimated 97.5% of US households had telephone service with telephone coverage ranging from 95.3% in New Mexico to 98.6% in Connecticut. The percentage of cellular-only households was 39.4%. Given that this is a telephone based survey, one cause for concern is obviously the percent of each population with non-telephone coverage.

Random sampling was used to collect the data. In some cases, simple random sampling was used (e.g., Guam and Puerto Rico). More often stratified sampling was used in which random sampling occurred within specified strata. In order to address non-coverage as well as underrepresentation, a weighting process was used which involved two aspects – design weighting and a method called “raking”. So, weighting was used to address bias and shortcomings from sampling. Unweighted data make the unwarranted assumption that each record has equal probability of being selected. It is recommended that the final weight be used in the analysis of the data in order to make generalizations from the sample to the population.

The BRFSS project is purely concerned with sampling. So, although generalizations can be applied to populations under the restriction that the weighting recommendation be followed, conclusions about causality are not licensed. No experimentation with random assignment was used which would be necessary in order to justify causal inferences.

Information obtained from: https://www.cdc.gov/brfss/annual_data/2013/pdf/Overview_2013.pdf

Part 2: Research questions

Research quesion 1:
Are unemployed females aged 18-64 more likely to have health care coverage than unemployed males? Health care coverage has been a topic of social significance in the U.S. for the past decade. The significance of this question is largely to see if there is a difference based on sex for those who are unemployed and below the usual retirement age.

Research quesion 2:
What is the difference between the mean number of days physical health was not good over the past 30 days as reported by those who said they were in excellent general health versus those who said they were in poor general health? Health and the perception of it is of universal significance. The purpose of this question is to investigate the relationship between respondents’ general answer as to the state of their health and the more detailed question regarding how their physical health had been over the previous 30 days. In other words, how does a general answer as to health relate to more specific information regarding physical health? One would expected them to be positively associated.

Research quesion 3: What are the top three auxiliary forms of exercise reported by female dancers? By male dancers? How do the lists compare? This inquiry is of personal interest. The author of this study enjoys dancing (Swing, Salsa, Blues dancing) and was curious about alternative forms of exercise employed by other dancers.

Part 3: Exploratory data analysis

Research quesion 1: Are unemployed females aged 18-64 more likely to have health care coverage than unemployed males? Three variables from the brfss2013 data set are used for this analysis: employment status (employ1), respondents sex (sex), and health care coverage respondents aged 18-64 (X_hcvu651). The first step in the analysis involves creating a new variable regarding employment status with binary “Yes”/“No” values. The variable in the original data set about employment included more information about employment than necessary for the purposes of the analysis. As shown in the following table.

brfss2013 %>%
  filter(!is.na(employ1)) %>%
  group_by(employ1) %>%
  summarise(count = n())

## # A tibble: 8 × 2
##   employ1                           count
##   <fct>                             <int>
## 1 Employed for wages               202200
## 2 Self-employed                     39832
## 3 Out of work for 1 year or more    14074
## 4 Out of work for less than 1 year  12242
## 5 A homemaker                       31647
## 6 A student                         12682
## 7 Retired                          138259
## 8 Unable to work                    37453

For the purpose of this analysis, NAs were filtered out and from the remaining options a person was interpreted to have indicated they were employed by the following responses: “Employed for wages” and “Self-employed”. These two responses were interpreted as a “Yes” to the question “Are you currently employed?” for the creation of the new variable employ_status. All other responses as shown in the above table (except NAs which had been filtered out) were interpreted as a “No” answer to the question of employment status for the new variable employ_status.

brfss2013 <- brfss2013 %>%
  mutate(employ_status = ifelse(employ1 == "Employed for wages" | employ1 == "Self-employed", "Yes", "No"))
brfss2013 %>%
  filter(!is.na(employ1)) %>%
  group_by(employ_status) %>%
  summarise(count = n())

## # A tibble: 2 × 2
##   employ_status  count
##   <chr>          <int>
## 1 No            246357
## 2 Yes           242032

The second step of the analysis involved creating a data set focused on female respondents whose employment status is interpreted as unemployed by the methodology explained above. The variable X_hcvu651 from brfss2013 contained information on those aged 18-24 who indicated either that they do or don’t have health care coverage. NAs were filtered out since among those were people outside of the targeted age range.

f_unempl_hc <- brfss2013 %>%
  select(sex, employ_status, X_hcvu651) %>%
  filter(sex == "Female", employ_status == "No", !is.na(X_hcvu651))
f_unempl_hc %>%
  group_by(X_hcvu651) %>%
  summarise(count = n())

## # A tibble: 2 × 2
##   X_hcvu651                        count
##   <fct>                            <int>
## 1 Have health care coverage        56795
## 2 Do not have health care coverage 13736

From this table it is easy to calculate the percentage of unemployed females with health care coverage. The answer is 56795/70531 = 0.8052488, or 80.5%. This can be illustrated in a bar plot.

ggplot(data = f_unempl_hc, aes(x = X_hcvu651)) +
  geom_bar(aes(y = (..count..)/sum(..count..)))

## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.

The third step in the analysis involved the creation of a data set focused on unemployed males, following the design of the one for the females.

m_unempl_hc <- brfss2013 %>%
  select(sex, employ_status, X_hcvu651) %>%
  filter(sex == "Male", employ_status == "No", !is.na(X_hcvu651))
m_unempl_hc %>%
  group_by(X_hcvu651) %>%
  summarise(count = n())

## # A tibble: 2 × 2
##   X_hcvu651                        count
##   <fct>                            <int>
## 1 Have health care coverage        28655
## 2 Do not have health care coverage  8273

The calculated percentage of unemployed males with health care coverage came out to be 28655/36928 = 0.7759695, or 77.6%. This can be visualized via a bar plot.

ggplot(data = m_unempl_hc, aes(x = X_hcvu651)) +
  geom_bar(aes(y = (..count..)/sum(..count..)))

The end result is that 80.5% of unemployed females reported having health care coverage as compared with 77.6% of males. These percentages are not remarkably different (a difference of 2.9%). This suggests that the answer to the original research question is that, for ages 18-64, unemployed females are only slightly more likely to have health care coverage than unemployed males. This is a preliminary analysis, however. One might be concerned with how the “unemployed” category was determined from the data given that it consists of a disjunction of various responses. One might also be concerned about the legitimacy of filtering out NAs in the analysis.

Research quesion 2: What is the difference between the mean number of days physical health was not good over the past 30 days as reported by those who said they were in excellent general health versus those who said they were in poor general health? This analysis involves two variables: general health (genhlth) and number of days physical health not good (physhlth).

The first step in the analysis was to create a data set focused on those who claimed their general health is excellent. Among these respondents, data about the number of days of bad physical health among the past 30 days was plotted in a bar plot. The result is right-skewed with a single large peak at 0.

genhlth_excellent <- brfss2013 %>%
  select(genhlth, physhlth) %>%
  filter(genhlth == "Excellent", !is.na(physhlth))
ggplot(data = genhlth_excellent, aes(x = as.factor(physhlth))) +
  geom_bar()

Moreover, statistics were calculated. The median number of days of bad health (among the previous 30) turned out to be 0. The mean was about 0.89, a number close to 0.

genhlth_excellent %>%
  summarise(e_mean = mean(physhlth), e_median = median(physhlth), e_sd = sd(physhlth),
            e_min = min(physhlth), e_max = max(physhlth))

##      e_mean e_median     e_sd e_min e_max
## 1 0.8868884        0 3.562395     0    30

Step two of the analysis involved focusing on those who claimed to be in poor general health. A plot of number of days of bad physical health among the past 30 days shows a left-skewed distribution with a large peak at 30.

genhlth_poor <- brfss2013 %>%
  select(genhlth, physhlth) %>%
  filter(genhlth == "Poor", !is.na(physhlth))
ggplot(data = genhlth_poor, aes(x = as.factor(physhlth))) +
  geom_bar()

Calculating statistics resulted in a median of 30 and a mean of 23.06.

genhlth_poor %>%
  summarise(p_mean = mean(physhlth), p_median = median(physhlth), p_sd = sd(physhlth),
            p_min = min(physhlth), p_max = max(physhlth))

##     p_mean p_median     p_sd p_min p_max
## 1 23.06016       30 10.33902     0    30

In answer to the original research question, the difference in the means (0.89 versus 23.06) is roughly 22. This is a large difference as was expected. As with the previous analysis, NAs were filtered out.

Research quesion 3:

What are the top three auxiliary forms of exercise reported by female dancers? By male dancers? How do the lists compare? Three variables were involved in this analysis: respondents sex (sex), type of physical activity (exract11), and other type of physical activity (exract21). For answering the question, a person is being classified as a dancer if “Dancing-ballet, ballroom, Latin, hip hop, etc” was their response to the question, “What type of physical activity or exercise did you spend the most time doing during the past month?” (exract11) or if that answer was their response to the follow-up question, “What other type of physical activity gave you the next most exercise during the past month?” (exract21). In other words, for the purpose of the analysis, someone is classified as a dancer if either their primary (exract11) or secondary (exract21) form of exercise they reported for the preceding month was dance.

The first step in the analysis was to create a new data frame focused upon those who qualified as dancers according to the criterion just described.

dancers <- brfss2013 %>%
  select(sex, exract11, exract21) %>%
  filter(exract11 == "Dancing-ballet, ballroom, Latin, hip hop, etc" | exract21 == "Dancing-ballet, ballroom, Latin, hip hop, etc")

Some people listed dance as both their primary exercise and their secondary. In fact, the total number of people who did this was 7. For the analysis, which aims to look at auxiliary exercises for dancers, these people were filtered out.

dancers %>%
  filter(exract11 == "Dancing-ballet, ballroom, Latin, hip hop, etc", exract21 == "Dancing-ballet, ballroom, Latin, hip hop, etc") %>%
  group_by(exract21) %>%
  summarise(count = n())

## # A tibble: 1 × 2
##   exract21                                      count
##   <fct>                                         <int>
## 1 Dancing-ballet, ballroom, Latin, hip hop, etc     7

People who listed dance as both their primary and secondary exercise were filtered out in a new data frame.

dancers2 <- dancers %>%
  filter(xor(exract11 == "Dancing-ballet, ballroom, Latin, hip hop, etc",  exract21 == "Dancing-ballet, ballroom, Latin, hip hop, etc"))

A new variable was created called alt_exercise by using an ifelse construction. If someone listed “Dancing-ballet, ballroom, Latin, hip hop, etc” as their primary exercise, then the value for their alt_exercise was set to be what they listed as their secondary, otherwise the value was set to their primary exercise (because their secondary must have been “Dancing-ballet, ballroom, Latin, hip hop, etc” due to the previous filtering).

dancers2 <- dancers2 %>%
  mutate(alt_exercise = ifelse(exract11 == "Dancing-ballet, ballroom, Latin, hip hop, etc", as.character(exract21), as.character(exract11)))

We can then see what the frequency is for alt_exercise filtered by females.

dancers2 %>%
  group_by(alt_exercise) %>%
  filter(sex == "Female") %>%
  summarise(count = n()) %>% 
  arrange(desc(count)) %>%
  top_n(n =7)

## Selecting by count

## # A tibble: 7 × 2
##   alt_exercise                                   count
##   <chr>                                          <int>
## 1 Walking                                         2795
## 2 No other activity                                405
## 3 Running                                          259
## 4 Other                                            184
## 5 Aerobics video or class                          157
## 6 Gardening (spading, weeding, digging, filling)   148
## 7 Yoga                                             100

The top three listed were “Walking”, “No other activity”, and “Running”. We can then filter by males.

dancers2 %>%
  group_by(alt_exercise) %>%
  filter(sex == "Male", ) %>%
  summarise(count = n()) %>% 
  arrange(desc(count)) %>%
  top_n(n =7)

## Selecting by count

## # A tibble: 7 × 2
##   alt_exercise                                              count
##   <chr>                                                     <int>
## 1 Walking                                                     311
## 2 No other activity                                            73
## 3 Weight lifting                                               32
## 4 Running                                                      30
## 5 Other                                                        28
## 6 Bicycling                                                    21
## 7 Yard work (cutting/gathering wood, trimming hedges, etc.)    18

The top three alternates were “Walking”, “No other activity”, and “Weight Lifting”. The male and female lists agree on their top two results – “Walking” and “No other activity”. One might question whether “No other activity” should be included because it is not an activity. If one removes that option, then the third activity in the list for females is “Other” and for males it is “Running”.

In order to plot these results a new data frame was created that only included activities with a frequency greater than 40.

dancers2_filtered <- dancers2[!table(dancers2$alt_exercise)[dancers2$alt_exercise] < 40,]

The results appear in the table below, separating by sex. Clearly, the bars for females are much larger than the males because of the greater proportion of females who listed dance as their primary or secondary exercise as opposed to males.

ggplot(data = dancers2_filtered, aes(x = as.factor(alt_exercise), fill = sex)) +
  geom_bar(position = "dodge") + 
  theme(axis.text.x= element_text(angle=90, hjust = 1, vjust = 0.5))