Project Proposal

Author

Affiliation

Coding Wildcats

School of Information, University of Arizona

import numpy as np
import seaborn as sns

Overview

We have selected three datasets: the College Scorecard Dataset, the Massachusetts Public Schools 2017 Records, and the Stroke Dataset.

1. College Scorecard Dataset

Source: https://collegescorecard.ed.gov/data/

The Department of Education collected this data from 1996 to 2023 through institutional reporting by U.S. schools. The dataset used for this project has been modified so that it could be uploaded to GitHub.

Observations include geographic data for each school, admission rates, institutional characteristics, enrollment, student aid, costs, and student outcomes.

No ethical concerns are present: the data is collected at the department level, not the individual level, and includes schools from across the country.

Research Question: What relationships are there among institutional characteristics, student aid, school spending, and student outcomes (graduation rate, degree level, etc.)?

This question represents an overview of the higher education system and its effectiveness in the United States. We hypothesize that student aid and school spending are correlated with student outcomes.

This question is important because students' success in school is an important factor in their success in the workforce, and it gives future students an indicator of the quality of that education.

Variables such as school expenditure, student debt, and graduate income are quantitative, while variables such as degree type, institution name, and city are qualitative.

Data Analysis

import pandas as pd
college = pd.read_csv("data/collge_data_tenative.csv")
print(college.head())
print(college.info())

   Unnamed: 0                               INSTNM        CITY  ADM_RATE  \
0           0             Alabama A & M University      Normal    0.6840   
1           1  University of Alabama at Birmingham  Birmingham    0.8668   
2           2                   Amridge University  Montgomery       NaN   
3           3  University of Alabama in Huntsville  Huntsville    0.7810   
4           4             Alabama State University  Montgomery    0.9660   

   SAT_AVG     UGDS  UG  COSTT4_A  COSTT4_P  TUITIONFEE_IN  TUITIONFEE_OUT  \
0    920.0   5196.0 NaN   23167.0       NaN        10024.0         18634.0   
1   1291.0  12776.0 NaN   26257.0       NaN         8832.0         21216.0   
2      NaN    228.0 NaN       NaN       NaN            NaN             NaN   
3   1259.0   6985.0 NaN   25777.0       NaN        11878.0         24770.0   
4    963.0   3296.0 NaN   21900.0       NaN        11068.0         19396.0   

   GRAD_DEBT_MDN  WDRAW_DEBT_MDN  FAMINC  
0            NaN             NaN     NaN  
1            NaN             NaN     NaN  
2            NaN             NaN     NaN  
3            NaN             NaN     NaN  
4            NaN             NaN     NaN  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6484 entries, 0 to 6483
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Unnamed: 0      6484 non-null   int64  
 1   INSTNM          6484 non-null   object 
 2   CITY            6484 non-null   object 
 3   ADM_RATE        1956 non-null   float64
 4   SAT_AVG         1089 non-null   float64
 5   UGDS            5716 non-null   float64
 6   UG              0 non-null      float64
 7   COSTT4_A        3316 non-null   float64
 8   COSTT4_P        1951 non-null   float64
 9   TUITIONFEE_IN   3771 non-null   float64
 10  TUITIONFEE_OUT  3771 non-null   float64
 11  GRAD_DEBT_MDN   0 non-null      float64
 12  WDRAW_DEBT_MDN  0 non-null      float64
 13  FAMINC          0 non-null      float64
dtypes: float64(11), int64(1), object(2)
memory usage: 709.3+ KB
None
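The `info()` output above reports zero non-null values for `UG`, `GRAD_DEBT_MDN`, `WDRAW_DEBT_MDN`, and `FAMINC`. A minimal sketch of how such columns could be flagged and dropped before analysis follows. It runs on a small toy frame that reuses a few of the column names (the values are made up for illustration), and the helper name `summarize_missing` is hypothetical.

```python
import numpy as np
import pandas as pd


def summarize_missing(df: pd.DataFrame) -> pd.Series:
    """Return the share of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


# Toy stand-in for the college data (values are illustrative, not real rows).
toy = pd.DataFrame(
    {
        "INSTNM": ["A", "B", "C", "D"],
        "ADM_RATE": [0.68, 0.87, np.nan, 0.78],
        "UG": [np.nan] * 4,
        "FAMINC": [np.nan] * 4,
    }
)

missing_share = summarize_missing(toy)
print(missing_share)

# Drop columns that are entirely empty; they carry no information.
cleaned = toy.dropna(axis="columns", how="all")
print(cleaned.columns.tolist())
```

On the real data the same two calls would be applied to `college`. Partially populated columns such as `ADM_RATE` and `SAT_AVG` would remain and would need a separate decision.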

2. Massachusetts Public Schools 2017 Records

Source: https://profiles.doe.mass.edu/

This data was collected by the Massachusetts Department of Education in 2017 from reports submitted by public elementary, middle, and high schools.

Observations include enrollment and graduation rates, class sizes, demographic and socioeconomic information of students, classes offered and AP testing records.

No ethical concerns are present.

Research Question: Is there a relationship between a school’s demographic makeup, funding, and class size and its graduation rates and higher education preparation?

This question is important because the quality of education for elementary through high school students is an important factor in their success in higher education, their careers, and their overall quality of life.

Variables such as school type, location, and evaluation are qualitative variables. Variables such as demographic makeup, enrollment numbers, class size and expenditure are quantitative variables.
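A first-pass split into qualitative and quantitative columns can be made from the stored dtypes, with manual corrections for identifier columns that happen to be stored as numbers. The sketch below runs on a toy frame that reuses a few column names from the file (the values are illustrative). Treating `School Code` as an identifier is our assumption here.

```python
import pandas as pd

# Toy stand-in using a few column names from the MA schools file
# (values are illustrative only).
toy = pd.DataFrame(
    {
        "School Code": [10505, 10003],
        "School Type": ["Public School", "Public School"],
        "Town": ["Abington", "Abington"],
        "School Accountability Percentile (1-99)": [42.0, 34.0],
    }
)

# First pass: split columns by stored dtype.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
text_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Identifier columns are stored as integers but are not measurements,
# so they are moved to the qualitative group by hand.
id_cols = ["School Code"]  # assumption: chosen manually
numeric_cols = [c for c in numeric_cols if c not in id_cols]
qualitative_cols = text_cols + id_cols

print(numeric_cols)
print(qualitative_cols)
```

The dtype is only a starting point: codes, ZIP codes, and similar fields still need to be reviewed by hand.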

Data Analysis

MA_schools = pd.read_csv("data/MA_Public_Schools_2017.csv")
print(MA_schools.head())
print(MA_schools.info())

   School Code                     School Name    School Type   Function  \
0        10505                   Abington High  Public School  Principal   
1        10003  Beaver Brook Elementary School  Public School  Principal   
2        10002        Center Elementary School  Public School  Principal   
3        10405            Frolio Middle School  Public School  Principal   
4        10015     Woodsdale Elementary School  Public School  Principal   

           Contact Name            Address 1 Address 2      Town State   Zip  \
0  Teresa Sullivan-Cruz   201 Gliniewicz Way       NaN  Abington    MA  2351   
1       Catherine Zinni  1 Ralph Hamlin Lane       NaN  Abington    MA  2351   
2        Lora Monachino   201 Gliniewicz Way       NaN  Abington    MA  2351   
3    Matthew MacCurtain   201 Gliniewicz Way       NaN  Abington    MA  2351   
4        Jonathan Hawes  128 Chestnut Street       NaN  Abington    MA  2351   

   ... MCAS_10thGrade_English_Incl. in SGP(#)  \
0  ...                                  111.0   
1  ...                                    NaN   
2  ...                                    NaN   
3  ...                                    NaN   
4  ...                                    NaN   

  Accountability and Assistance Level  \
0                             Level 1   
1                             Level 3   
2                   Insufficient data   
3                             Level 2   
4                             Level 2   

  Accountability and Assistance Description  \
0               Meeting gap narrowing goals   
1  Among lowest performing 20% of subgroups   
2                                       NaN   
3           Not meeting gap narrowing goals   
4           Not meeting gap narrowing goals   

  School Accountability Percentile (1-99)  \
0                                    42.0   
1                                    34.0   
2                                     NaN   
3                                    40.0   
4                                    52.0   

   Progress and Performance Index (PPI) - All Students  \
0                                               76.0     
1                                               69.0     
2                                                NaN     
3                                               63.0     
4                                               65.0     

   Progress and Performance Index (PPI) - High Needs Students  \
0                                               75.0            
1                                               73.0            
2                                                NaN            
3                                               64.0            
4                                               67.0            

   District_Accountability and Assistance Level  \
0                                       Level 3   
1                                       Level 3   
2                                       Level 3   
3                                       Level 3   
4                                       Level 3   

   District_Accountability and Assistance Description  \
0  One or more schools in the district classified...    
1  One or more schools in the district classified...    
2  One or more schools in the district classified...    
3  One or more schools in the district classified...    
4  One or more schools in the district classified...    

   District_Progress and Performance Index (PPI) - All Students  \
0                                               63.0              
1                                               63.0              
2                                               63.0              
3                                               63.0              
4                                               63.0              

   District_Progress and Performance Index (PPI) - High Needs Students  
0                                               60.0                    
1                                               60.0                    
2                                               60.0                    
3                                               60.0                    
4                                               60.0                    

[5 rows x 302 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1861 entries, 0 to 1860
Columns: 302 entries, School Code to District_Progress and Performance Index (PPI) - High Needs Students
dtypes: float64(265), int64(19), object(18)
memory usage: 4.3+ MB
None
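In the `head()` output above, `Zip` appears as `2351`, which suggests the column was parsed as an integer and the leading zero of the Massachusetts ZIP code was lost. The sketch below shows two possible ways to handle this. It uses a tiny in-memory CSV rather than the real file, and it assumes the raw file stores the leading zero, which we have not verified. Option 2 works either way.

```python
import io

import pandas as pd

# Tiny in-memory CSV standing in for the real file (the row is illustrative).
csv_text = "School Name,Town,Zip\nAbington High,Abington,02351\n"

# Default parsing infers an integer column and drops the leading zero.
as_int = pd.read_csv(io.StringIO(csv_text))
print(as_int["Zip"].iloc[0])

# Option 1: ask pandas to keep the column as text when reading.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"Zip": str})
print(as_text["Zip"].iloc[0])

# Option 2: repair an already-loaded integer column by left-padding to 5 digits.
repaired = as_int["Zip"].astype(str).str.zfill(5)
print(repaired.iloc[0])
```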

3. Stroke Dataset

Source: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset

This data comes from a study published in China in 2020: Pathan, M. S., Zhang, J., John, D., Nag, A., & Dev, S. (2020). Identifying Stroke Indicators Using Rough Sets. IEEE Access, 8. doi:10.1109/ACCESS.2020.3039439.

Observations include demographic information such as age, gender, marital status, and employment status, as well as medical information such as BMI, glucose levels, smoking history, etc.

A possible ethical concern is whether personal health information (PHI) could be traced to individual participants. However, the data has been de-identified.

Research Question: Is it possible to predict the likelihood of an individual having a stroke based on their demographic information and medical history?

This question is important because identifying the key risk factors for stroke can support prevention and more effective monitoring of patients.

Variables such as gender, smoking status, employment status, and marital status are qualitative. Variables such as age, BMI, and average glucose level are quantitative.

Analysis Plan:

Step 1: Correlation plot/sensitivity analysis; choose variables

Step 2: Model testing and selection

Step 3: Model visualization

Step 4: Presentation
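As a rough illustration of Step 1, the sketch below ranks numeric features by their correlation with the binary `stroke` outcome and keeps those above a cutoff. It runs on toy rows that reuse column names from the stroke data (the values are illustrative). The `THRESHOLD` value is an arbitrary assumption, not a choice made in this proposal.

```python
import pandas as pd

# Toy rows reusing column names from the stroke data (values are illustrative).
toy = pd.DataFrame(
    {
        "age": [67.0, 61.0, 30.0, 25.0, 79.0, 40.0],
        "avg_glucose_level": [228.69, 202.21, 85.0, 90.5, 174.12, 95.0],
        "hypertension": [0, 0, 0, 0, 1, 0],
        "stroke": [1, 1, 0, 0, 1, 0],
    }
)

# Pearson correlation of each numeric feature with the binary outcome
# (for a 0/1 target this equals the point-biserial correlation).
corr_with_target = (
    toy.corr(numeric_only=True)["stroke"]
    .drop("stroke")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_target)

# Keep features whose absolute correlation exceeds a hypothetical cutoff.
THRESHOLD = 0.3  # assumption: arbitrary value for illustration
selected = corr_with_target[corr_with_target.abs() > THRESHOLD].index.tolist()
print(selected)
```

A linear correlation can miss non-linear effects, so a screen like this would be one input to variable selection rather than the final decision.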

Data Analysis

stroke = pd.read_csv("data/healthcare-dataset-stroke-data.csv")
print(stroke.head())
print(stroke.info())

      id  gender   age  hypertension  heart_disease ever_married  \
0   9046    Male  67.0             0              1          Yes   
1  51676  Female  61.0             0              0          Yes   
2  31112    Male  80.0             0              1          Yes   
3  60182  Female  49.0             0              0          Yes   
4   1665  Female  79.0             1              0          Yes   

       work_type Residence_type  avg_glucose_level   bmi   smoking_status  \
0        Private          Urban             228.69  36.6  formerly smoked   
1  Self-employed          Rural             202.21   NaN     never smoked   
2        Private          Rural             105.92  32.5     never smoked   
3        Private          Urban             171.23  34.4           smokes   
4  Self-employed          Rural             174.12  24.0     never smoked   

   stroke  
0       1  
1       1  
2       1  
3       1  
4       1  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB
None
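The `info()` output shows that `bmi` has 4909 non-null values out of 5110 rows, and that five columns are stored as text. One possible preparation step before modeling, sketched below on toy rows, is to fill missing BMI values with the median and one-hot encode the qualitative columns. Median imputation is our assumption for illustration; other strategies may fit better.

```python
import numpy as np
import pandas as pd

# Toy rows reusing column names from the stroke data (values are illustrative).
toy = pd.DataFrame(
    {
        "gender": ["Male", "Female", "Male", "Female"],
        "smoking_status": ["formerly smoked", "never smoked", "never smoked", "smokes"],
        "bmi": [36.6, np.nan, 32.5, 34.4],
        "stroke": [1, 1, 0, 0],
    }
)

# Fill missing BMI values with the median of the observed values.
bmi_median = toy["bmi"].median()
prepared = toy.assign(bmi=toy["bmi"].fillna(bmi_median))

# One-hot encode the qualitative columns so a numeric model can use them.
prepared = pd.get_dummies(prepared, columns=["gender", "smoking_status"], dtype=int)

print(prepared.columns.tolist())
print(prepared["bmi"].tolist())
```

When models are later compared on held-out data, the median should be computed on the training split only, so that no information leaks from the test rows.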