Exploring income and physical activity disparities in the US
INFO 511 - Fall 2024 - Final Project
Abstract
In this Final Project for INFO 511: Fundamentals of Data Science, we set out to investigate the potential relationship between physical activity and income level. Through analysis of historical health survey data, we attempt to better understand whether there is a visible trend or relationship between socioeconomic status and leisure-time physical activity in US adults.
Introduction
Understanding the relationship between socioeconomic status and health behaviors is necessary for addressing disparities in public health outcomes. Our project seeks to understand whether higher-income populations consistently have more time for physical activity than lower-income populations, using a dataset from the Centers for Disease Control and Prevention (CDC). The dataset comes from the Behavioral Risk Factor Surveillance System project and was obtained from phone surveys conducted between 2011 and 2023. The full dataset offers insights into physical activity, nutrition, and obesity trends among U.S. residents aged 18 and older. For this project, we focus on the survey questions related to physical activity. The data is stratified by factors such as age, education, gender, income, and race/ethnicity.
Research Question
Do higher-income populations have more time for physical activity than lower-income populations?
We hypothesize that they do: higher-income populations have more time for physical activity, so populations will engage in more physical activity as their income level increases (a positive relationship).
Data
Dataset: Nutrition, Physical Activity, and Obesity - Behavioral Risk Factor Surveillance System
This dataset is hosted by the Centers for Disease Control and Prevention (CDC) and was obtained from the Behavioral Risk Factor Surveillance System, a CDC project consisting of health-related phone surveys. The original dataset consists of 104,000 rows and 33 columns. Descriptions of all columns are available on the link above. Each row represents a combination of a year, a state, a survey question, and a stratification group, together with the percent of individuals in that group who were positively identified for that question. Data_Value contains the corresponding value collected for each survey question. The categories for stratification are Age Range, Education, Gender, Income, Race/Ethnicity, and Total. This dataset includes observations for the years 2011-2023. Percentages are not included for groups with insufficient sample sizes.
The main columns of interest for our research question are:
- YearStart and YearEnd: The year the data was collected. The two columns hold the same value in every row.
- LocationAbbr and LocationDesc: Contain the abbreviation and the full name of the location where the data was collected.
- Topic: Contains the topic the variable being measured falls into. For our research question, we are interested in the topic “Physical Activity - Behavior”.
- Question: What is being measured. Within “Physical Activity - Behavior” there are 5 questions, which are listed in the “Data Cleaning and Wrangling, EDA” section below.
- Data_Value: The value being measured by the survey. For the questions we are interested in, this is a percentage.
- StratificationCategory1: The variable the data is being stratified by. Depending on the value in this column, the row will contain a value in the matching column, such as “Race”, “Age (years)”, or “Income”. For our research question we are interested in the levels in the Income column.
- Income: Contains the income level as a string representing an income range, such as ‘Less than $15,000’, ‘$35,000 - $49,999’, etc.
Data Cleaning and Wrangling, EDA
The columns YearStart and YearEnd always contain the same values, so one of these columns can be dropped. We are only interested in the rows containing questions related to physical activity. Specifically, we are interested in the rows where the Question column reads “Percent of adults who engage in no leisure-time physical activity”. Five measurements related to physical activity were collected from the phone surveys from which the dataset was derived, though not all of these questions were asked every year. The following statements are how the measurements are described in the Question column, not how the questions were presented to participants over the phone. The exact wording of the questions is available on the Behavioral Risk Factor Surveillance System website.
Percent of adults who engage in no leisure-time physical activity
Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)
Percent of adults who engage in muscle-strengthening activities on 2 or more days a week
Percent of adults who achieve at least 300 minutes a week of moderate-intensity aerobic physical activity or 150 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)
Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic physical activity and engage in muscle-strengthening activities on 2 or more days a week
There are multiple columns containing the categories for stratification, such as education levels in the column Education. To answer our research question, we are most interested in the Income column. The data can also be separated by US state using the LocationAbbr (state abbreviation, e.g. “AZ”) or LocationDesc (full state name, e.g. “Arizona”) columns. There are also national measurements, using the abbreviation “US” or full text “National”.
Some of the columns contained missing values. Rows with a missing Data_Value were dropped because the percentage values in this column were the focus of our analysis, and we were not interested in the years where this question was not measured. One abnormality of note in this dataset was the column Data_Value_Unit. This column was described as containing the unit of measurement for Data_Value, but instead it contained years followed by a period. This may be indicative of a data entry error.
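The filtering steps described in this section can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the code used for the project. The rows in SAMPLE_CSV are made-up toy values, the helper name clean_rows is hypothetical, and the sketch assumes that income-stratified rows carry the literal value "Income" in StratificationCategory1.

```python
import csv
import io

# Toy rows for illustration only; these are NOT values from the real dataset.
SAMPLE_CSV = """YearStart,YearEnd,LocationAbbr,Question,Data_Value,StratificationCategory1,Income
2011,2011,US,Percent of adults who engage in no leisure-time physical activity,40.0,Income,"Less than $15,000"
2011,2011,US,Percent of adults who engage in no leisure-time physical activity,15.0,Income,"$75,000 or greater"
2011,2011,US,Percent of adults who engage in no leisure-time physical activity,,Income,"$25,000 - $34,999"
2011,2011,US,Percent of adults who engage in muscle-strengthening activities on 2 or more days a week,30.0,Income,"$75,000 or greater"
2011,2011,AZ,Percent of adults who engage in no leisure-time physical activity,25.0,Total,
"""

TARGET_QUESTION = "Percent of adults who engage in no leisure-time physical activity"


def clean_rows(rows):
    """Keep income-stratified rows for the target question that have a value."""
    cleaned = []
    for row in rows:
        if row["Question"] != TARGET_QUESTION:
            continue  # a different physical activity measurement
        if row["StratificationCategory1"] != "Income":
            continue  # assumed label for income-stratified rows
        if row["Data_Value"].strip() == "":
            continue  # drop rows with a missing percentage
        cleaned.append({
            "Year": int(row["YearStart"]),  # YearEnd is redundant, so it is dropped
            "LocationAbbr": row["LocationAbbr"],
            "Income": row["Income"],
            "Data_Value": float(row["Data_Value"]),
        })
    return cleaned


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
cleaned = clean_rows(rows)
for row in cleaned:
    print(row)
```

On the toy input, only the two income-stratified rows for the target question that have a percentage survive; the row with a missing Data_Value, the muscle-strengthening row, and the Total row are filtered out.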
Encoding income as a numeric value
The income level of the participants is encoded in the dataset as a string value. The income levels were placed into the following bins:
Less than $15,000
$15,000 - $24,999
$25,000 - $34,999
$35,000 - $49,999
$50,000 - $74,999
$75,000 or greater
This can be used as an ordinal variable, but linear regression requires a numeric variable. For this project, we encoded the income as the first number in the range, such as 15,000 for $15,000 - $24,999.
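One way to perform this encoding is sketched below, using only the Python standard library. The function name encode_income is hypothetical. The text above does not say how the open-ended lowest bin, “Less than $15,000”, was handled; mapping it to 0 (its lower bound) is an assumption made here so that it does not collide with the $15,000 - $24,999 bin.

```python
import re

INCOME_BINS = [
    "Less than $15,000",
    "$15,000 - $24,999",
    "$25,000 - $34,999",
    "$35,000 - $49,999",
    "$50,000 - $74,999",
    "$75,000 or greater",
]


def encode_income(label):
    """Map an income bin label to the numeric lower bound of its range."""
    if label.startswith("Less than"):
        # Assumption: the open-ended lowest bin is encoded as 0.
        return 0
    # Take the first dollar amount in the label, e.g. "$15,000" -> 15000.
    match = re.search(r"\$([\d,]+)", label)
    if match is None:
        raise ValueError(f"no dollar amount found in {label!r}")
    return int(match.group(1).replace(",", ""))


encoded = [encode_income(label) for label in INCOME_BINS]
print(encoded)
```

Because each bin maps to a distinct, increasing number, the encoded values preserve the ordering of the income categories. The spacing between bins is uneven, which is worth keeping in mind when interpreting a regression slope.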
Visualization of Dataset
Overview of National Levels of Leisure-Time Physical Activity
Figure 1: Percent of US adults who do not engage in leisure-time physical activity, from 2011-2023
This violin plot shows the distribution and density of the data for each income group, with the y-axis representing the percentage of adults not engaging in physical activity and the x-axis denoting income categories. The lowest income category, “Less than $15,000”, has the highest percentage of adults who do not engage in any leisure-time physical activity, while the highest income category, “$75,000 or greater”, has the lowest.
This plot provides a general overview of the trends in the physical activity of American adults as related to income. Further analysis of this dataset shows the same association between higher income and higher percentages of adults who engage in physical activity.
Figure 2: Percent of U.S. national population engaging in each exercise class over years
Our team also wanted to assess whether the income and activity disparity carried through the other questions about exercise in the survey. This facet grid helps compare the disparities between income groups for all exercise categories measured in the survey. On the far left, you can see that there is a large disparity between the lowest and highest income bracket when it comes to no leisure time physical activity. However, that gap starts to shrink as we get to different types of exercise.
It is important to note that the only question in the survey that explicitly asked participants to consider only non-job-related physical activity is the question about no leisure-time physical activity. This may be one reason the disparity shrinks so drastically in the other exercise categories.
Figure 3: Percent of sample population who engage in no leisure time activity versus income levels
This scatter plot shows the percent of the sample population who does not engage in leisure time activity on the Y-axis, and their income on the X-axis. The income is encoded as the minimum income in the range as a numeric value, so it can be used as a numeric variable for linear regression.
The linear regression line shows a downward trend, with lower percentages engaging in no leisure-time activity associated with higher incomes. The R squared value of 0.7 indicates that income accounts for roughly 70% of the variance in the percentage, a moderately strong relationship.
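For readers who want to see how a regression slope and R squared value of this kind are computed, here is a minimal least-squares sketch in plain Python. The function name fit_line is hypothetical, and the x and y values are made-up toy numbers chosen only to show a downward trend; they are not taken from the dataset and will not reproduce the R squared value reported above.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x, plus R squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products around the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R squared = 1 - (residual sum of squares / total sum of squares).
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared


# Toy data: numerically encoded income vs. percent with no leisure-time activity.
toy_income = [0, 15000, 25000, 35000, 50000, 75000]
toy_percent = [40.0, 35.0, 30.0, 26.0, 22.0, 15.0]

slope, intercept, r_squared = fit_line(toy_income, toy_percent)
print(f"slope={slope:.6f}, intercept={intercept:.2f}, R^2={r_squared:.3f}")
```

A negative slope corresponds to the downward trend described above: as the encoded income rises, the percentage reporting no leisure-time activity falls.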
Results
According to our analysis, the lowest income bracket population had the highest percentage of people who engaged in no leisure-time physical activity. The disparity between the lowest and highest income bracket populations was around 25 percentage points. The regression analysis demonstrates a strong negative correlation between income and the percent of a population who engage in no leisure-time physical activity.
That disparity starts to shrink as we get to different types of exercise, but a gap of about 10 percentage points exists in most other exercise categories.
Discussion
Our work supports our hypothesis that people with higher incomes generally have more time to engage in exercise in their free time. While the percent of those not engaging in physical activity has increased in the past 10 years, the lowest income brackets continue to have the largest percentage of their population who engage in no leisure-time physical activity.
While the disparity in other exercise categories was smaller than for no leisure-time physical activity, it is important to note that the only question in the survey that explicitly asked participants to consider only non-job-related physical activity is the question about no leisure-time physical activity. This may be one reason the disparity shrinks so drastically in the other exercise categories.
To further investigate the disparities in physical activity using this dataset, future work could break down the data by other demographics, such as age, ethnicity, education, and location. It would also be insightful to address some of the problems with this data, such as certain questions being missing for certain years and insufficient sample sizes for some rows.