Data, Data, Data

Oh My

Final Project: Milestone 2
Author: The Outliers
Affiliation: School of Information, University of Arizona

import numpy as np
import seaborn as sns
import pandas as pd

Dataset 1: Nutrition, physical activity, and obesity

Introduction and data

This dataset was provided by the Centers for Disease Control and Prevention (CDC), National Center for Chronic Disease Prevention and Health Promotion, Division of Nutrition, Physical Activity, and Obesity (DNPAO). The data were collected through health-related telephone surveys that gather state-level data about U.S. residents. The dataset feeds the DNPAO's Data, Trends, and Maps database, which provides both state and national data on these topics.

Description of contents

This dataset has 104,272 rows and 33 columns. Each row represents a combination of a year, a state, a survey question, and the percentage of respondents to whom that question applies, along with the stratification group that the percentage describes. The categories for stratification are Age Range, Education, Gender, Income, Race/Ethnicity, and Total.

Survey questions fall into the categories of “Fruits and Vegetables - Behavior”, “Obesity/Weight Status”, and “Physical Activity - Behavior”. Examples of survey questions include “Percent of adults who engage in muscle-strengthening activities on 2 or more days a week” and “Percent of adults aged 18 years and older who have obesity”.

This dataset includes observations for the years 2011-2023. Percentages and data are not included for groups with insufficient sample sizes.
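The extent of the suppressed values can be checked directly. In the `nutrition.info()` output, the 93,505 non-null `Data_Value` entries and the 10,767 non-null `Data_Value_Footnote` entries sum to the 104,272 total rows, which suggests that a footnote stands in wherever a value is withheld. The sketch below is one way to confirm that; the helper name `summarize_suppressed` and the toy frame are hypothetical and use made-up values, not survey data.

```python
import pandas as pd


def summarize_suppressed(df: pd.DataFrame) -> dict:
    """Count rows with and without a reported Data_Value."""
    missing = df["Data_Value"].isna()
    has_footnote = df["Data_Value_Footnote"].notna()
    return {
        "rows": len(df),
        "reported": int((~missing).sum()),
        "suppressed": int(missing.sum()),
        # True when every missing value carries a footnote and no reported value does
        "footnoted_when_missing": bool((missing == has_footnote).all()),
    }


# Toy frame with made-up values, only to show the call (not real survey data).
toy = pd.DataFrame(
    {
        "Data_Value": [30.1, None, 25.4, None],
        "Data_Value_Footnote": [None, "toy footnote", None, "toy footnote"],
    }
)
print(summarize_suppressed(toy))
```

On the real data the same call would be `summarize_suppressed(nutrition)`, assuming the frame has been loaded as in the "Glimpse of data" section.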

Ethical Concerns

There are no particular ethical concerns regarding working with this data. The dataset is aggregated, and values are excluded where the sample size is too small, which removes concerns about identifying individuals. The dataset is publicly available for anyone to download, and the licensing agreement states that users are free to share, create, and adapt it, as long as it is attributed as the data source when results are publicly displayed or published. This removes concerns about unfair or illegal acquisition and use of the data.

Research Question

  • Research Question: Do higher-income populations consistently have more time for physical activity than lower-income populations?

    • Additional research questions include: How has the relationship between amount of physical activity and income changed over time? How does this vary between groups? And how has the amount of physical activity that lower-income populations have time for changed over the years?

    • The target population for this research question is U.S. residents aged 18 and over, as represented by the dataset.

  • This question is important because it may highlight factors that correlate with differences in physical health across the population. If some groups turn out to be tied to physical health and activity more strongly than others, further research can identify ways to give those groups more assistance with nutrition and with adopting healthier lifestyles.

  • The research topic of interest here is whether there is a relationship between an individual's income and the amount of physical activity they are able to make time for. This falls within a larger area of interest: differences in physical activity and nutrition across groups and subsets of the population.

  • We hypothesize that individuals with lower income levels will have less time for physical activity, so that a larger percentage of the low-income population will fall into the “Percent of adults who engage in no leisure-time physical activity” group compared to individuals with higher incomes.

  • The variables in this research question are mostly categorical: both the questions themselves and the groupings (income range) are categorical. The percentage of respondents who fall into each category is a quantitative variable, and there is a time element (years) that can be used as well.
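The variables listed above can be pulled out of the full table with a simple filter. The following is a minimal sketch, assuming that the question text quoted in the hypothesis matches the `Question` column exactly and that income rows are marked with `"Income"` in `StratificationCategory1` (as the fifth row of the data preview suggests). The helper name `income_inactivity` and the toy rows are hypothetical, with made-up values.

```python
import pandas as pd

# Question text as quoted in the hypothesis; an exact match in the data is assumed.
NO_ACTIVITY = "Percent of adults who engage in no leisure-time physical activity"


def income_inactivity(df: pd.DataFrame) -> pd.DataFrame:
    """Keep income-stratified rows for the no-leisure-time-activity question."""
    mask = (df["StratificationCategory1"] == "Income") & (df["Question"] == NO_ACTIVITY)
    cols = ["YearStart", "LocationAbbr", "Income", "Data_Value"]
    # Drop rows whose percentage was withheld for insufficient sample size.
    return df.loc[mask, cols].dropna(subset=["Data_Value"]).reset_index(drop=True)


# Toy rows with made-up values, only to exercise the filter.
toy = pd.DataFrame(
    {
        "YearStart": [2011, 2011, 2011, 2012],
        "LocationAbbr": ["AK", "AK", "AK", "AK"],
        "Question": [
            NO_ACTIVITY,
            NO_ACTIVITY,
            "Percent of adults aged 18 years and older who have obesity",
            NO_ACTIVITY,
        ],
        "StratificationCategory1": ["Income", "Gender", "Income", "Income"],
        "Income": ["$15,000 - $24,999", None, "$15,000 - $24,999", "$15,000 - $24,999"],
        "Data_Value": [30.0, 25.0, 28.0, None],
    }
)
subset = income_inactivity(toy)
print(subset)
```

Only the first toy row survives: the others belong to a different stratification, a different question, or have no reported value.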

Glimpse of data

nutrition = pd.read_csv("data/nutrition.csv")
nutrition.head()
YearStart YearEnd LocationAbbr LocationDesc Datasource Class Topic Question Data_Value_Unit Data_Value_Type ... GeoLocation ClassID TopicID QuestionID DataValueTypeID LocationID StratificationCategory1 Stratification1 StratificationCategoryId1 StratificationID1
0 2011 2011 AK Alaska BRFSS Obesity / Weight Status Obesity / Weight Status Percent of adults aged 18 years and older who ... 2011.0 Value ... (64.845079957001, -147.722059036) OWS OWS1 Q036 VALUE 2 Race/Ethnicity 2 or more races RACE RACE2PLUS
1 2011 2011 AK Alaska BRFSS Obesity / Weight Status Obesity / Weight Status Percent of adults aged 18 years and older who ... 2011.0 Value ... (64.845079957001, -147.722059036) OWS OWS1 Q036 VALUE 2 Race/Ethnicity Other RACE RACEOTH
2 2011 2011 AK Alaska BRFSS Physical Activity Physical Activity - Behavior Percent of adults who achieve at least 150 min... 2011.0 Value ... (64.845079957001, -147.722059036) PA PA1 Q044 VALUE 2 Gender Female GEN FEMALE
3 2011 2011 AK Alaska BRFSS Obesity / Weight Status Obesity / Weight Status Percent of adults aged 18 years and older who ... 2011.0 Value ... (64.845079957001, -147.722059036) OWS OWS1 Q036 VALUE 2 Age (years) 35 - 44 AGEYR AGEYR3544
4 2011 2011 AK Alaska BRFSS Obesity / Weight Status Obesity / Weight Status Percent of adults aged 18 years and older who ... 2011.0 Value ... (64.845079957001, -147.722059036) OWS OWS1 Q037 VALUE 2 Income $15,000 - $24,999 INC INC1525

5 rows × 33 columns

nutrition.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104272 entries, 0 to 104271
Data columns (total 33 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   YearStart                   104272 non-null  int64  
 1   YearEnd                     104272 non-null  int64  
 2   LocationAbbr                104272 non-null  object 
 3   LocationDesc                104272 non-null  object 
 4   Datasource                  104272 non-null  object 
 5   Class                       104272 non-null  object 
 6   Topic                       104272 non-null  object 
 7   Question                    104272 non-null  object 
 8   Data_Value_Unit             88872 non-null   float64
 9   Data_Value_Type             104272 non-null  object 
 10  Data_Value                  93505 non-null   float64
 11  Data_Value_Alt              93505 non-null   float64
 12  Data_Value_Footnote_Symbol  10767 non-null   object 
 13  Data_Value_Footnote         10767 non-null   object 
 14  Low_Confidence_Limit        93505 non-null   float64
 15  High_Confidence_Limit       93505 non-null   float64
 16  Sample_Size                 93505 non-null   float64
 17  Total                       3724 non-null    object 
 18  Age(years)                  22344 non-null   object 
 19  Education                   14896 non-null   object 
 20  Gender                      7448 non-null    object 
 21  Income                      26068 non-null   object 
 22  Race/Ethnicity              29792 non-null   object 
 23  GeoLocation                 102340 non-null  object 
 24  ClassID                     104272 non-null  object 
 25  TopicID                     104272 non-null  object 
 26  QuestionID                  104272 non-null  object 
 27  DataValueTypeID             104272 non-null  object 
 28  LocationID                  104272 non-null  int64  
 29  StratificationCategory1     104272 non-null  object 
 30  Stratification1             104272 non-null  object 
 31  StratificationCategoryId1   104272 non-null  object 
 32  StratificationID1           104272 non-null  object 
dtypes: float64(6), int64(3), object(24)
memory usage: 26.3+ MB

Analysis plan

We will begin with a detailed exploration of the data, focusing on the main variables involved, followed by data cleaning and wrangling. These initial steps will identify missing observations and handle any data type conversions. The variables needed to answer the main research question are the “Question”, “Income”, “StratificationCategoryId1”, “Data_Value”, and “YearStart” columns. New columns that group some of these variables may also be created, such as grouping the “Income” values into “Low Income” and “High Income”. At this point there is no plan to bring in and merge any external data.
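The income grouping could be implemented as a label lookup. The sketch below is illustrative only: apart from “$15,000 - $24,999”, which appears in the data preview, the bracket labels are assumptions about how the “Income” column is coded, and the cutoff between the two groups is a placeholder rather than a decision the team has made. The helper name `add_income_group` is hypothetical.

```python
import pandas as pd

# Assumed bracket labels and a placeholder cutoff; verify both against
# nutrition["Income"].unique() before relying on them.
LOW_INCOME = ["Less than $15,000", "$15,000 - $24,999", "$25,000 - $34,999"]
HIGH_INCOME = ["$35,000 - $49,999", "$50,000 - $74,999", "$75,000 or greater"]

INCOME_GROUPS = {
    **{label: "Low Income" for label in LOW_INCOME},
    **{label: "High Income" for label in HIGH_INCOME},
}


def add_income_group(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with an IncomeGroup column; unknown labels become NaN."""
    out = df.copy()
    out["IncomeGroup"] = out["Income"].map(INCOME_GROUPS)
    return out


# Toy labels only; the last one is deliberately absent from the lookup.
toy = pd.DataFrame(
    {"Income": ["Less than $15,000", "$15,000 - $24,999", "$75,000 or greater", "Data not reported"]}
)
grouped = add_income_group(toy)
print(grouped)
```

Leaving unmatched labels as missing, rather than forcing them into a group, makes any unexpected coding in the column easy to spot with `grouped["IncomeGroup"].isna().sum()`.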

Once the data has been wrangled, it can be visualized in an assortment of plots for analysis, and summary statistics by group can be compiled as well. These visualizations and statistics will show how the relationship between income and physical activity compares between groups and over time. From there, additional metrics and plots can be added to explore the patterns in more depth.
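As a rough sketch of the summary statistics by group, the percentages could be averaged per year and income group and then compared. This assumes a wrangled frame that already has an `IncomeGroup` column with the values “Low Income” and “High Income”; that column, the helper names, and the toy numbers are all hypothetical. The mean here is an unweighted average of the reported percentages, which is a simplification, since it ignores differences in sample size between rows.

```python
import pandas as pd


def yearly_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Mean reported percentage and row count per year and income group."""
    return (
        df.groupby(["YearStart", "IncomeGroup"])["Data_Value"]
        .agg(mean_pct="mean", n_rows="count")
        .reset_index()
    )


def income_gap(summary: pd.DataFrame) -> pd.DataFrame:
    """One row per year, with the low-minus-high difference in mean percentage."""
    wide = summary.pivot(index="YearStart", columns="IncomeGroup", values="mean_pct")
    wide["gap"] = wide["Low Income"] - wide["High Income"]
    return wide


# Toy values, made up to show the shape of the output.
toy = pd.DataFrame(
    {
        "YearStart": [2011, 2011, 2011, 2012, 2012, 2012],
        "IncomeGroup": ["Low Income", "Low Income", "High Income",
                        "Low Income", "High Income", "High Income"],
        "Data_Value": [30.0, 34.0, 20.0, 28.0, 18.0, 22.0],
    }
)
summary = yearly_summary(toy)
trend = income_gap(summary)
print(trend)
```

The long-format `summary` table is also a convenient input for plotting, for example `sns.lineplot(data=summary, x="YearStart", y="mean_pct", hue="IncomeGroup")`.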