Project Proposal
Overview
We have selected three datasets: the College Scorecard Dataset, the Massachusetts Public Schools 2017 Records, and the Stroke Dataset.
1. College Scorecard Dataset
Source: https://collegescorecard.ed.gov/data/
The data was collected by the U.S. Department of Education from 1996 to 2023 through institutional reporting by U.S. schools. The dataset used for this project has been modified so that it can be uploaded to GitHub.
Observations include geographic data for each school, admission rates, institutional characteristics, enrollment, student aid, costs, and student outcomes.
No ethical concerns are present: the data is reported at the institutional level rather than the individual level, and it includes schools from across the country.
Research Question: What relationships exist between institutional characteristics, student aid, and school spending on the one hand and student outcomes (graduation rate, degree level, etc.) on the other?
This question offers an overview of the higher education system in the United States and its effectiveness. We hypothesize that student aid and school spending are correlated with student outcomes.
This question is important because students' outcomes in school are a factor in their later success in the workforce, and they indicate to prospective students the quality of the education an institution provides.
Variables such as school expenditure, student debt, and graduate income are quantitative, while variables such as degree type, institution name, and city are qualitative.
Data Analysis
import pandas as pd
college = pd.read_csv("data/collge_data_tenative.csv")
print(college.head())
print(college.info())
Unnamed: 0 INSTNM CITY ADM_RATE \
0 0 Alabama A & M University Normal 0.6840
1 1 University of Alabama at Birmingham Birmingham 0.8668
2 2 Amridge University Montgomery NaN
3 3 University of Alabama in Huntsville Huntsville 0.7810
4 4 Alabama State University Montgomery 0.9660
SAT_AVG UGDS UG COSTT4_A COSTT4_P TUITIONFEE_IN TUITIONFEE_OUT \
0 920.0 5196.0 NaN 23167.0 NaN 10024.0 18634.0
1 1291.0 12776.0 NaN 26257.0 NaN 8832.0 21216.0
2 NaN 228.0 NaN NaN NaN NaN NaN
3 1259.0 6985.0 NaN 25777.0 NaN 11878.0 24770.0
4 963.0 3296.0 NaN 21900.0 NaN 11068.0 19396.0
GRAD_DEBT_MDN WDRAW_DEBT_MDN FAMINC
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6484 entries, 0 to 6483
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 6484 non-null int64
1 INSTNM 6484 non-null object
2 CITY 6484 non-null object
3 ADM_RATE 1956 non-null float64
4 SAT_AVG 1089 non-null float64
5 UGDS 5716 non-null float64
6 UG 0 non-null float64
7 COSTT4_A 3316 non-null float64
8 COSTT4_P 1951 non-null float64
9 TUITIONFEE_IN 3771 non-null float64
10 TUITIONFEE_OUT 3771 non-null float64
11 GRAD_DEBT_MDN 0 non-null float64
12 WDRAW_DEBT_MDN 0 non-null float64
13 FAMINC 0 non-null float64
dtypes: float64(11), int64(1), object(2)
memory usage: 709.3+ KB
None
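The summary above shows several columns (UG, GRAD_DEBT_MDN, WDRAW_DEBT_MDN, FAMINC) with zero non-null values, so a first step could be to measure missingness per column and drop columns that carry no data. The following is a minimal sketch on made-up stand-in rows, not the real file; the helper name `summarize_missing` is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the college data; values are illustrative only.
college_demo = pd.DataFrame({
    "INSTNM": ["School A", "School B", "School C", "School D"],
    "ADM_RATE": [0.68, 0.87, np.nan, 0.78],
    "SAT_AVG": [920.0, 1291.0, np.nan, np.nan],
    "GRAD_DEBT_MDN": [np.nan, np.nan, np.nan, np.nan],
})


def summarize_missing(df):
    """Return the fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


missing_share = summarize_missing(college_demo)
print(missing_share)

# Drop columns that contain no data at all (one possible cleaning step).
college_clean = college_demo.dropna(axis=1, how="all")
print(college_clean.columns.tolist())
```

On the real file, passing `index_col=0` to `pd.read_csv` would likely absorb the `Unnamed: 0` column, assuming it is a saved row index.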
2. Massachusetts Public Schools 2017 Records
Source: https://profiles.doe.mass.edu/
This data was collected in 2017 by the Massachusetts Department of Elementary and Secondary Education from reports submitted by public elementary, middle, and high schools.
Observations include enrollment and graduation rates, class sizes, demographic and socioeconomic information about students, classes offered, and AP testing records.
No ethical concerns are present.
Research Question: Is there a relationship between a school’s demographic makeup, funding, and class size and its graduation rates and higher education preparation?
This question is important because the quality of education from elementary through high school is an important factor in students' success in higher education, their careers, and their overall quality of life.
Variables such as school type, location, and evaluation are qualitative variables. Variables such as demographic makeup, enrollment numbers, class size and expenditure are quantitative variables.
Data Analysis
MA_schools = pd.read_csv("data/MA_Public_Schools_2017.csv")
print(MA_schools.head())
print(MA_schools.info())
School Code School Name School Type Function \
0 10505 Abington High Public School Principal
1 10003 Beaver Brook Elementary School Public School Principal
2 10002 Center Elementary School Public School Principal
3 10405 Frolio Middle School Public School Principal
4 10015 Woodsdale Elementary School Public School Principal
Contact Name Address 1 Address 2 Town State Zip \
0 Teresa Sullivan-Cruz 201 Gliniewicz Way NaN Abington MA 2351
1 Catherine Zinni 1 Ralph Hamlin Lane NaN Abington MA 2351
2 Lora Monachino 201 Gliniewicz Way NaN Abington MA 2351
3 Matthew MacCurtain 201 Gliniewicz Way NaN Abington MA 2351
4 Jonathan Hawes 128 Chestnut Street NaN Abington MA 2351
... MCAS_10thGrade_English_Incl. in SGP(#) \
0 ... 111.0
1 ... NaN
2 ... NaN
3 ... NaN
4 ... NaN
Accountability and Assistance Level \
0 Level 1
1 Level 3
2 Insufficient data
3 Level 2
4 Level 2
Accountability and Assistance Description \
0 Meeting gap narrowing goals
1 Among lowest performing 20% of subgroups
2 NaN
3 Not meeting gap narrowing goals
4 Not meeting gap narrowing goals
School Accountability Percentile (1-99) \
0 42.0
1 34.0
2 NaN
3 40.0
4 52.0
Progress and Performance Index (PPI) - All Students \
0 76.0
1 69.0
2 NaN
3 63.0
4 65.0
Progress and Performance Index (PPI) - High Needs Students \
0 75.0
1 73.0
2 NaN
3 64.0
4 67.0
District_Accountability and Assistance Level \
0 Level 3
1 Level 3
2 Level 3
3 Level 3
4 Level 3
District_Accountability and Assistance Description \
0 One or more schools in the district classified...
1 One or more schools in the district classified...
2 One or more schools in the district classified...
3 One or more schools in the district classified...
4 One or more schools in the district classified...
District_Progress and Performance Index (PPI) - All Students \
0 63.0
1 63.0
2 63.0
3 63.0
4 63.0
District_Progress and Performance Index (PPI) - High Needs Students
0 60.0
1 60.0
2 60.0
3 60.0
4 60.0
[5 rows x 302 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1861 entries, 0 to 1860
Columns: 302 entries, School Code to District_Progress and Performance Index (PPI) - High Needs Students
dtypes: float64(265), int64(19), object(18)
memory usage: 4.3+ MB
None
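The Zip column in the preview above prints as 2351, while Massachusetts ZIP codes start with a zero, which suggests the values were parsed as integers. A minimal sketch of two possible remedies follows, using a made-up two-row CSV held in memory rather than the real file. Which one applies depends on whether the file itself stores the leading zero, which is an assumption here.

```python
import io

import pandas as pd

# Made-up two-row CSV mimicking a few columns of the real file.
csv_text = (
    "School Code,School Name,Town,Zip\n"
    "10505,Abington High,Abington,02351\n"
    "10003,Beaver Brook Elementary School,Abington,02351\n"
)

# Default parsing infers Zip as an integer and drops the leading zero.
as_int = pd.read_csv(io.StringIO(csv_text))

# Reading Zip as a string keeps the value exactly as written in the file.
as_str = pd.read_csv(io.StringIO(csv_text), dtype={"Zip": str})

# If the zero is already lost, left-pad the digits back to five characters.
padded = as_int["Zip"].astype(str).str.zfill(5)

print(as_int["Zip"].tolist())
print(as_str["Zip"].tolist())
print(padded.tolist())
```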
3. Stroke Dataset
Source: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset
This data comes from a study published in China in 2020: Pathan, M. S., Zhang, J., John, D., Nag, A., & Dev, S. (2020). Identifying Stroke Indicators Using Rough Sets. IEEE Access, 8. doi:10.1109/ACCESS.2020.3039439.
Observations include demographic information such as age, gender, marital status, and employment status, as well as medical information such as BMI, glucose levels, smoking history, etc.
A possible ethical concern is whether personal health information could be traced back to individual participants. However, the data has been de-identified.
Research Question: Is it possible to predict the likelihood of an individual having a stroke based on their demographic information and medical history?
This question is important because identifying key risk factors for stroke can support prevention and more effective monitoring of patients.
Variables such as gender, smoking status, employment status, and marital status are qualitative. Variables such as age, BMI, and average glucose level are quantitative.
Analysis Plan:
Step 1: Correlation plot and sensitivity analysis to choose variables
Step 2: Model testing and selection
Step 3: Model visualization
Step 4: Presentation
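Step 1 of the analysis plan could start with something like the sketch below, which one-hot encodes a qualitative column and ranks predictors by their Pearson correlation with the stroke indicator. The rows are made up and only borrow column names from the dataset. Pearson correlation is one possible screening choice, not necessarily the method that will be used.

```python
import pandas as pd

# Made-up rows using column names from the stroke dataset; not real data.
stroke_demo = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0, 25.0, 33.0],
    "hypertension": [0, 0, 0, 0, 0, 1],
    "avg_glucose_level": [228.69, 202.21, 105.92, 171.23, 90.0, 85.5],
    "ever_married": ["Yes", "Yes", "Yes", "No", "No", "Yes"],
    "stroke": [1, 1, 1, 1, 0, 0],
})

# One-hot encode the qualitative column so it can enter a correlation.
encoded = pd.get_dummies(
    stroke_demo, columns=["ever_married"], drop_first=True, dtype=float
)

# Pearson correlation of every predictor with the stroke indicator.
corr_with_stroke = (
    encoded.corr()["stroke"].drop("stroke").sort_values(ascending=False)
)
print(corr_with_stroke)
```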
Data Analysis
stroke = pd.read_csv("data/healthcare-dataset-stroke-data.csv")
print(stroke.head())
print(stroke.info())
id gender age hypertension heart_disease ever_married \
0 9046 Male 67.0 0 1 Yes
1 51676 Female 61.0 0 0 Yes
2 31112 Male 80.0 0 1 Yes
3 60182 Female 49.0 0 0 Yes
4 1665 Female 79.0 1 0 Yes
work_type Residence_type avg_glucose_level bmi smoking_status \
0 Private Urban 228.69 36.6 formerly smoked
1 Self-employed Rural 202.21 NaN never smoked
2 Private Rural 105.92 32.5 never smoked
3 Private Urban 171.23 34.4 smokes
4 Self-employed Rural 174.12 24.0 never smoked
stroke
0 1
1 1
2 1
3 1
4 1
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 5110 non-null int64
1 gender 5110 non-null object
2 age 5110 non-null float64
3 hypertension 5110 non-null int64
4 heart_disease 5110 non-null int64
5 ever_married 5110 non-null object
6 work_type 5110 non-null object
7 Residence_type 5110 non-null object
8 avg_glucose_level 5110 non-null float64
9 bmi 4909 non-null float64
10 smoking_status 5110 non-null object
11 stroke 5110 non-null int64
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB
None
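The summary above shows that bmi has 4909 non-null values out of 5110, so missing BMI values need handling before modeling. The sketch below shows one option, median imputation, along with a check of how balanced the stroke classes are. The rows are made up, and median imputation is an assumption here, not a method the proposal has chosen.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names come from the stroke dataset.
bmi_demo = pd.DataFrame({
    "bmi": [36.6, np.nan, 32.5, 34.4, 24.0],
    "stroke": [1, 1, 0, 0, 0],
})

# Fill missing BMI values with the median of the observed values.
bmi_median = bmi_demo["bmi"].median()
bmi_demo["bmi_filled"] = bmi_demo["bmi"].fillna(bmi_median)

# Share of each outcome class, useful for spotting class imbalance.
class_share = bmi_demo["stroke"].value_counts(normalize=True)

print(bmi_demo)
print(class_share)
```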