Data, Data, Data
Oh My
Dataset 1: Nutrition, physical activity, and obesity
Introduction and data
This dataset was provided by the Centers for Disease Control and Prevention (CDC), National Center for Chronic Disease Prevention and Health Promotion, Division of Nutrition, Physical Activity, and Obesity (DNPAO). The data were collected through health-related telephone surveys that gather state-level data about U.S. residents. The dataset is used in the DNPAO's Data, Trends, and Maps database, which provides both state and national data on these topics.
Description of contents
This dataset includes 104,272 rows and 33 columns. Each row represents a combination of year, state, survey question, and stratification group, along with the percent of individuals in that group who are positively identified for that question. The categories for stratification are Age Range, Education, Gender, Income, Race/Ethnicity, and Total.
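The non-null counts reported for the six stratification columns in the `info()` output further down sum to 104,272, the total row count, which suggests that each row fills exactly one of them. A minimal sketch of how that could be checked is below. The `toy` frame and its values are illustrative only and are not taken from the CDC file; the column names follow the `info()` listing.

```python
import pandas as pd

# Column names as listed in the info() output of the dataset.
STRAT_COLS = ["Total", "Age(years)", "Education", "Gender", "Income", "Race/Ethnicity"]

# Illustrative rows only (not from the CDC file): each row fills one column.
toy = pd.DataFrame(
    [
        {"Income": "$15,000 - $24,999"},
        {"Gender": "Female"},
        {"Age(years)": "35 - 44"},
        {"Total": "Total"},
    ],
    columns=STRAT_COLS,
)

# Count how many stratification columns are filled in each row.
filled_per_row = toy[STRAT_COLS].notna().sum(axis=1)

# True when every row belongs to exactly one stratification category.
one_stratum_per_row = bool((filled_per_row == 1).all())
print(one_stratum_per_row)
```

Running the same two lines on the full data frame would confirm or refute the one-category-per-row reading before any filtering by income.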
Survey questions fall into the categories of “Fruits and Vegetables - Behavior”, “Obesity/Weight Status”, and “Physical Activity - Behavior”. Examples of survey questions include “Percent of adults who engage in muscle-strengthening activities on 2 or more days a week” and “Percent of adults aged 18 years and older who have obesity”.
This dataset includes observations for the years 2011-2023. Percentages and data are not included for groups with insufficient sample sizes.
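In the `info()` output, `Data_Value` has 93,505 non-null entries and `Data_Value_Footnote` has 10,767, which together account for all 104,272 rows. That is consistent with suppressed values being flagged by a footnote, although it is worth verifying. The sketch below shows one way to do so on a small made-up frame; the numbers and the footnote text are placeholders, not the CDC's wording.

```python
import numpy as np
import pandas as pd

# Illustrative rows only; the footnote text is a placeholder.
toy = pd.DataFrame(
    {
        "Data_Value": [32.3, np.nan, 28.1, np.nan],
        "Sample_Size": [2581.0, np.nan, 1153.0, np.nan],
        "Data_Value_Footnote": [
            None,
            "insufficient sample size",
            None,
            "insufficient sample size",
        ],
    }
)

missing_value = toy["Data_Value"].isna()
has_footnote = toy["Data_Value_Footnote"].notna()

# True when the rows without a percentage are exactly the footnoted rows.
suppression_is_flagged = bool((missing_value == has_footnote).all())

# Keep only rows with a reported percentage for later analysis.
reported = toy.loc[~missing_value].reset_index(drop=True)
print(suppression_is_flagged, len(reported))
```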
Ethical Concerns
There are no particular ethical concerns regarding working with this data. This dataset is aggregated, and numbers are excluded in instances where the sample size is too small. This removes concerns surrounding the personal identification of individuals within this dataset. This dataset is publicly available for anyone to download, and the licensing agreement states that it is free to be shared, created, and adapted, as long as it is attributed as the data source when publicly displayed or published. This removes concerns surrounding unfair or illegal acquisition and use of the data.
Research Question
Research Question: Do higher-income populations consistently have more time for physical activity than lower-income populations?
Additional research questions include: How has the relationship between amount of physical activity and income changed over time? How does this vary between groups? And how has the amount of physical activity that lower-income populations have time for changed over the years?
The target population for this research question is U.S. residents aged 18 and over, as represented by the dataset.
This question is important because it may highlight factors that correlate with differences in physical health across the population. If some groups turn out to be more strongly tied to physical health and activity than others, further research can identify ways to give those groups more assistance with nutrition and adopting healthier lifestyles.
The research topic of interest here is whether there is a relationship between an individual's income and the amount of physical activity they are able to make time for. This falls within a larger area of interest: differences in physical activity and nutrition across groups and subsets of the population.
We hypothesize that individuals with lower income levels will have less time for physical activity, so that a larger percentage of the low-income population will fall into the “Percent of adults who engage in no leisure-time physical activity” group than of the higher-income population.
The variables in this research question are mostly categorical: the questions themselves and the groupings (income range) are both categorical. The percentage of respondents who fall into each category is a quantitative variable, and there is a time element (years) that can be used as well.
Glimpse of data
(index) | YearStart | YearEnd | LocationAbbr | LocationDesc | Datasource | Class | Topic | Question | Data_Value_Unit | Data_Value_Type | ... | GeoLocation | ClassID | TopicID | QuestionID | DataValueTypeID | LocationID | StratificationCategory1 | Stratification1 | StratificationCategoryId1 | StratificationID1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2011 | 2011 | AK | Alaska | BRFSS | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | 2011.0 | Value | ... | (64.845079957001, -147.722059036) | OWS | OWS1 | Q036 | VALUE | 2 | Race/Ethnicity | 2 or more races | RACE | RACE2PLUS |
1 | 2011 | 2011 | AK | Alaska | BRFSS | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | 2011.0 | Value | ... | (64.845079957001, -147.722059036) | OWS | OWS1 | Q036 | VALUE | 2 | Race/Ethnicity | Other | RACE | RACEOTH |
2 | 2011 | 2011 | AK | Alaska | BRFSS | Physical Activity | Physical Activity - Behavior | Percent of adults who achieve at least 150 min... | 2011.0 | Value | ... | (64.845079957001, -147.722059036) | PA | PA1 | Q044 | VALUE | 2 | Gender | Female | GEN | FEMALE |
3 | 2011 | 2011 | AK | Alaska | BRFSS | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | 2011.0 | Value | ... | (64.845079957001, -147.722059036) | OWS | OWS1 | Q036 | VALUE | 2 | Age (years) | 35 - 44 | AGEYR | AGEYR3544 |
4 | 2011 | 2011 | AK | Alaska | BRFSS | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | 2011.0 | Value | ... | (64.845079957001, -147.722059036) | OWS | OWS1 | Q037 | VALUE | 2 | Income | $15,000 - $24,999 | INC | INC1525 |
5 rows × 33 columns
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104272 entries, 0 to 104271
Data columns (total 33 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 YearStart 104272 non-null int64
1 YearEnd 104272 non-null int64
2 LocationAbbr 104272 non-null object
3 LocationDesc 104272 non-null object
4 Datasource 104272 non-null object
5 Class 104272 non-null object
6 Topic 104272 non-null object
7 Question 104272 non-null object
8 Data_Value_Unit 88872 non-null float64
9 Data_Value_Type 104272 non-null object
10 Data_Value 93505 non-null float64
11 Data_Value_Alt 93505 non-null float64
12 Data_Value_Footnote_Symbol 10767 non-null object
13 Data_Value_Footnote 10767 non-null object
14 Low_Confidence_Limit 93505 non-null float64
15 High_Confidence_Limit 93505 non-null float64
16 Sample_Size 93505 non-null float64
17 Total 3724 non-null object
18 Age(years) 22344 non-null object
19 Education 14896 non-null object
20 Gender 7448 non-null object
21 Income 26068 non-null object
22 Race/Ethnicity 29792 non-null object
23 GeoLocation 102340 non-null object
24 ClassID 104272 non-null object
25 TopicID 104272 non-null object
26 QuestionID 104272 non-null object
27 DataValueTypeID 104272 non-null object
28 LocationID 104272 non-null int64
29 StratificationCategory1 104272 non-null object
30 Stratification1 104272 non-null object
31 StratificationCategoryId1 104272 non-null object
32 StratificationID1 104272 non-null object
dtypes: float64(6), int64(3), object(24)
memory usage: 26.3+ MB
Analysis plan
Detailed initial exploration of the data, focusing on the main variables, will be followed by data cleaning and wrangling. These initial steps will identify missing observations and handle data type conversions. The variables needed to answer the primary research question are the “Question”, “Income”, and “StratificationCategoryId1” columns, along with the “Data_Value” and “YearStart” columns. New columns that group some of these variables may also be created, for example by collapsing the “Income” values into “Low Income” and “High Income”. At this point there is no plan to merge in any external data.
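The wrangling step could look roughly like the sketch below. It is an illustration under stated assumptions, not the final method: only the bracket “$15,000 - $24,999” and the category ID “INC” appear in the data preview, so the other bracket labels, the low/high cutoff, the helper name `add_income_group`, and the new column name `IncomeGroup` are all hypothetical, and the `toy` values are made up.

```python
import pandas as pd

NO_ACTIVITY = "Percent of adults who engage in no leisure-time physical activity"

# Hypothetical grouping: only "$15,000 - $24,999" appears in the data preview;
# the other bracket labels and the low/high cutoff are assumptions.
INCOME_GROUP = {
    "Less than $15,000": "Low Income",
    "$15,000 - $24,999": "Low Income",
    "$50,000 - $74,999": "High Income",
    "$75,000 or greater": "High Income",
}

# Illustrative rows only (not from the CDC file).
toy = pd.DataFrame(
    {
        "YearStart": [2011, 2011, 2011, 2011],
        "Question": [
            NO_ACTIVITY,
            NO_ACTIVITY,
            NO_ACTIVITY,
            "Percent of adults aged 18 years and older who have obesity",
        ],
        "StratificationCategoryId1": ["INC", "INC", "GEN", "INC"],
        "Income": ["$15,000 - $24,999", "$75,000 or greater", None, "$15,000 - $24,999"],
        "Data_Value": [35.0, 15.0, 25.0, 30.0],
    }
)


def add_income_group(df):
    """Keep income-stratified rows for the no-activity question and label them."""
    mask = (df["StratificationCategoryId1"] == "INC") & (df["Question"] == NO_ACTIVITY)
    out = df.loc[mask].copy()
    out["IncomeGroup"] = out["Income"].map(INCOME_GROUP)
    # Brackets left out of the mapping become NaN and are dropped here.
    return out.dropna(subset=["IncomeGroup", "Data_Value"]).reset_index(drop=True)


wrangled = add_income_group(toy)
print(wrangled[["YearStart", "Income", "IncomeGroup", "Data_Value"]])
```

Using an explicit dictionary rather than a numeric cutoff keeps the grouping decision visible and easy to revise once the actual bracket labels have been inspected.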
Once the data has been wrangled, it can be visualized in a range of plots, and summary statistics can be compiled by group. These visualizations and statistics will show how the relationship between income and physical activity differs between groups and changes over time. From there, additional statistics and plots can be produced to explore the patterns further.
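As a rough sketch of the group summaries, the example below averages the percentage reporting no leisure-time physical activity by year and income group, then takes the difference between groups. The `IncomeGroup` column is a hypothetical product of the wrangling step and the values are invented for illustration. Averaging state percentages without weighting by sample size is a simplification made here for brevity.

```python
import pandas as pd

# Illustrative values only: one row per year, state, and income group.
toy = pd.DataFrame(
    {
        "YearStart": [2011, 2011, 2011, 2011, 2012, 2012, 2012, 2012],
        "LocationAbbr": ["AK", "AK", "AL", "AL", "AK", "AK", "AL", "AL"],
        "IncomeGroup": ["Low Income", "High Income"] * 4,
        "Data_Value": [30.0, 14.0, 40.0, 20.0, 28.0, 14.0, 38.0, 18.0],
    }
)

# Unweighted mean percentage per year and income group.
by_year = toy.pivot_table(
    index="YearStart", columns="IncomeGroup", values="Data_Value", aggfunc="mean"
)

# Positive gap = a larger share of the low-income group reports no activity.
by_year["Gap"] = by_year["Low Income"] - by_year["High Income"]
print(by_year)
```

A gap that stays positive across years would be consistent with the hypothesis, while its trend over time speaks to the additional research questions.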