Data, Data, Data

Oh My

Final Project: Milestone 2
Author: The Outliers

Affiliation: School of Information, University of Arizona

import numpy as np
import seaborn as sns
import pandas as pd

Dataset 1: Nutrition, physical activity, and obesity

Introduction and data

This dataset was provided by the Centers for Disease Control and Prevention (CDC), National Center for Chronic Disease Prevention and Health Promotion, Division of Nutrition, Physical Activity, and Obesity (DNPAO). The data was collected through health-related telephone surveys that gather state-level data about U.S. residents. It is used in the DNPAO's Data, Trends, and Maps database, which provides both state and national data on these topics.

Description of contents

This dataset includes 104,272 rows and 33 columns. Each row represents a combination of year, state, survey question, and stratification group, along with the percentage of individuals who are positively identified for that question. The categories for stratification are Age Range, Education, Gender, Income, Race/Ethnicity, and Total.

Survey questions fall into the categories of “Fruits and Vegetables - Behavior”, “Obesity/Weight Status”, and “Physical Activity - Behavior”. Examples of survey questions include “Percent of adults who engage in muscle-strengthening activities on 2 or more days a week” and “Percent of adults aged 18 years and older who have obesity”.

This dataset includes observations for the years 2011-2023. Percentages and related values are not included for groups with insufficient sample sizes; in total, 10,767 of the 104,272 rows have no Data_Value.

Ethical Concerns

There are no particular ethical concerns regarding working with this data. This dataset is aggregated, and numbers are excluded in instances where the sample size is too small. This removes concerns surrounding the personal identification of individuals within this dataset. This dataset is publicly available for anyone to download, and the licensing agreement states that it is free to be shared, created, and adapted, as long as it is attributed as the data source when publicly displayed or published. This removes concerns surrounding unfair or illegal acquisition and use of the data.

Research Question

  • Research Question: Do higher-income populations consistently have more time for physical activity than lower-income populations?

    • Additional research questions include: How has the relationship between amount of physical activity and income changed over time? How does this vary between groups? And how has the amount of physical activity that lower-income populations have time for changed over the years?

    • The target population for this research question is U.S. residents aged 18 and over, as represented by the dataset.

  • This question is important because it may highlight areas that correlate with differences in physical health across the population. If some groups are found to be more strongly tied to physical health and activity than others, further research can identify ways to give these groups more assistance with nutrition and adopting healthier lifestyles.

  • The research topic of interest here is whether or not there is a relationship between the amount of income that an individual makes and the amount of physical activity they are able to make time for. This falls in a larger category of interest surrounding differences between physical activity and nutrition for different groups and subsets of the population.

  • We hypothesize that individuals with lower income levels will have less time for physical activity, so that a larger percentage of the low-income population will fall into the “Percent of adults who engage in no leisure-time physical activity” group compared to individuals with higher incomes.

  • The variables in this research question are mostly categorical: the questions themselves and the groupings (income range) are both categorical. The percentage of respondents who fall into each category is a quantitative variable, and there is a time element (years) that can be used as well.

Glimpse of data

nutrition = pd.read_csv("data/nutrition.csv")
nutrition.head()
YearStart YearEnd LocationAbbr LocationDesc Datasource Class Topic Question Data_Value_Unit Data_Value_Type ... GeoLocation ClassID TopicID QuestionID DataValueTypeID LocationID StratificationCategory1 Stratification1 StratificationCategoryId1 StratificationID1
0 2011 2011 AK Alaska BRFSS Obesity / Weight Status Obesity / Weight Status Percent of adults aged 18 years and older who ... 2011.0 Value ... (64.845079957001, -147.722059036) OWS OWS1 Q036 VALUE 2 Race/Ethnicity 2 or more races RACE RACE2PLUS
1 2011 2011 AK Alaska BRFSS Obesity / Weight Status Obesity / Weight Status Percent of adults aged 18 years and older who ... 2011.0 Value ... (64.845079957001, -147.722059036) OWS OWS1 Q036 VALUE 2 Race/Ethnicity Other RACE RACEOTH
2 2011 2011 AK Alaska BRFSS Physical Activity Physical Activity - Behavior Percent of adults who achieve at least 150 min... 2011.0 Value ... (64.845079957001, -147.722059036) PA PA1 Q044 VALUE 2 Gender Female GEN FEMALE
3 2011 2011 AK Alaska BRFSS Obesity / Weight Status Obesity / Weight Status Percent of adults aged 18 years and older who ... 2011.0 Value ... (64.845079957001, -147.722059036) OWS OWS1 Q036 VALUE 2 Age (years) 35 - 44 AGEYR AGEYR3544
4 2011 2011 AK Alaska BRFSS Obesity / Weight Status Obesity / Weight Status Percent of adults aged 18 years and older who ... 2011.0 Value ... (64.845079957001, -147.722059036) OWS OWS1 Q037 VALUE 2 Income $15,000 - $24,999 INC INC1525

5 rows × 33 columns

nutrition.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104272 entries, 0 to 104271
Data columns (total 33 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   YearStart                   104272 non-null  int64  
 1   YearEnd                     104272 non-null  int64  
 2   LocationAbbr                104272 non-null  object 
 3   LocationDesc                104272 non-null  object 
 4   Datasource                  104272 non-null  object 
 5   Class                       104272 non-null  object 
 6   Topic                       104272 non-null  object 
 7   Question                    104272 non-null  object 
 8   Data_Value_Unit             88872 non-null   float64
 9   Data_Value_Type             104272 non-null  object 
 10  Data_Value                  93505 non-null   float64
 11  Data_Value_Alt              93505 non-null   float64
 12  Data_Value_Footnote_Symbol  10767 non-null   object 
 13  Data_Value_Footnote         10767 non-null   object 
 14  Low_Confidence_Limit        93505 non-null   float64
 15  High_Confidence_Limit       93505 non-null   float64
 16  Sample_Size                 93505 non-null   float64
 17  Total                       3724 non-null    object 
 18  Age(years)                  22344 non-null   object 
 19  Education                   14896 non-null   object 
 20  Gender                      7448 non-null    object 
 21  Income                      26068 non-null   object 
 22  Race/Ethnicity              29792 non-null   object 
 23  GeoLocation                 102340 non-null  object 
 24  ClassID                     104272 non-null  object 
 25  TopicID                     104272 non-null  object 
 26  QuestionID                  104272 non-null  object 
 27  DataValueTypeID             104272 non-null  object 
 28  LocationID                  104272 non-null  int64  
 29  StratificationCategory1     104272 non-null  object 
 30  Stratification1             104272 non-null  object 
 31  StratificationCategoryId1   104272 non-null  object 
 32  StratificationID1           104272 non-null  object 
dtypes: float64(6), int64(3), object(24)
memory usage: 26.3+ MB

Analysis plan

Initial detailed exploration of the data (focusing on the main variables involved) will be followed by data cleaning and some wrangling. Identification of missing observations, data type conversions, etc. will be addressed in these initial steps. The variables needed to answer the main research question include the “Question”, “Income”, and “StratificationCategoryId1” columns, as well as the “Data_Value” and “YearStart” columns. New columns that group some of these variables may also be created, such as grouping the “Income” values into “Low Income” and “High Income”. At this point there is no plan to bring in and merge any external data.

Once the data has been wrangled, it can be visualized in an assortment of plots for analysis, and summary statistics by group can be compiled as well. These visualizations and statistics will allow for insight as to how the relationship between income and physical activity compares between groups, as well as over time. From here, additional metrics can be run, plots created, etc. to explore and dive in further.
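As a rough illustration of the wrangling step described above, the sketch below filters to the income stratification and the no-leisure-time-activity question, labels each bracket as “Low Income” or “High Income”, and averages Data_Value by year and group. The `LOW_INCOME` set and the function name `inactivity_by_income` are assumptions made for illustration; the actual cutoff has not been decided, and the “Less than $15,000” label should be checked against the values that appear in the Income column.

```python
import pandas as pd

# Question text as quoted in the hypothesis.
NO_ACTIVITY = "Percent of adults who engage in no leisure-time physical activity"

# ASSUMPTION: which brackets count as "Low Income" is a placeholder choice,
# and the first label below is assumed rather than confirmed from the data.
LOW_INCOME = {"Less than $15,000", "$15,000 - $24,999"}


def inactivity_by_income(df: pd.DataFrame) -> pd.DataFrame:
    """Mean percent with no leisure-time activity, per year and income group."""
    subset = df[
        (df["StratificationCategoryId1"] == "INC")
        & (df["Question"] == NO_ACTIVITY)
    ].dropna(subset=["Data_Value"])  # rows without a reported value are dropped

    # Every bracket not listed in LOW_INCOME is labeled "High Income",
    # so any unexpected label would need separate handling.
    subset = subset.assign(
        IncomeGroup=subset["Income"].map(
            lambda bracket: "Low Income" if bracket in LOW_INCOME else "High Income"
        )
    )
    return (
        subset.groupby(["YearStart", "IncomeGroup"])["Data_Value"]
        .mean()
        .unstack("IncomeGroup")
    )
```

Calling `inactivity_by_income(nutrition)` would return one row per year with a column for each income group, which could then be plotted as two lines over time.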

Dataset 2: Private sector AI activity indicators

Introduction and data

The dataset was constructed by the Emerging Technology Observatory (ETO), a project of the Center for Security and Emerging Technology at Georgetown University. The information about company stages of development and company metadata was generated using data from Crunchbase, a tool for business insights. Publications data was obtained from another project by the ETO, their Merged Academic Corpus, which contains information about scholarly articles. Machine learning models were used to identify AI related publications. Patent information was obtained from 1790 Analytics, PATSTAT, and The Lens. Workforce related data was compiled and processed from LinkedIn profiles by Revelio Labs.

Description of contents

The dataset contains indicators of business and R&D activity related to AI for many private sector companies. The dataset contains 5 tables, core, ticker, alias, id, and yearly_publication_counts. We chose to focus on the core table for the purposes of this project.

The core table includes metadata and metrics related to publication, patent, and worker data regarding AI. The metadata includes a company name, unique numeric id, region of the world, stage of development, business sector, and description of the company, among other information.

The table also contains counts of AI research publications, average percent increase in publications per year over the past 3 years, percent of total publications that were related to AI, and metrics related to citations and presence in conferences. The counts of publications are also divided into the categories of computer vision, NLP, and robotics.

The patent information is separated into total patents, average percent increase in patents per year over the past 3 years, percentage of total patents related to AI, and granted patents. The patents are separated into subcategories. There are 2 main subcategories, ‘AI use cases’ and ‘AI applications and techniques’, which are further divided into more subcategories. These include energy, transportation, security for ‘AI use cases’, and computer vision, language processing, and distributed AI for ‘AI applications and techniques’.

The workforce data consists of 2 columns containing the number of AI workers and ‘Tech Team 1’ workers. Tech Team 1 is defined as “anyone with technical skills and a reasonable probability of working with AI” and is derived from Revelio Labs’ taxonomy of highly technical roles and their responsibilities. More information on this definition can be found in the dataset’s documentation. The AI workers metric is defined as a subset of Tech Team 1 with a high probability of working with AI.

Ethics of working with this dataset

The data was acquired from secondary sources (Crunchbase, PATSTAT, LinkedIn) so users of this dataset must be aware of the limitations of each individual source to make informed ethical decisions when making inferences using it. This could include ethical concerns regarding the use of LinkedIn users’ data, when these users did not explicitly consent to it being used for this purpose, although they did agree with the site’s terms of service.

The dataset authors address some limitations of the dataset in the description and there may be some ethical concerns related to making broad generalizations based on this dataset, which does not represent the entire population of all companies working with AI. One concern is that the workforce metrics are incomplete because they are based on LinkedIn data. As a result, the dataset could be biased towards United States residents and against other regions of the world, as LinkedIn is less popular in other countries and is blocked in some countries like Russia and China.

Another ethical concern could be that the dataset is heavily focused on large and established companies, and neglects other areas of AI development such as small companies, non-patented innovation, and open source projects. The dataset is also heavily reliant on publicly available and self-reported data, which could leave less public companies underrepresented, such as those doing work related to sensitive topics or focused on proprietary innovations.

There may also be bias in the model used to identify AI-related publications, which could lead to an incomplete or misleading representation of actual AI research activity.

Research Question

  • What relationships are there between the number of AI publications and patents and the sector of the companies represented in this dataset? 

    • Target population: Companies represented in dataset
  • This question is important because AI is a rapidly growing field of research and investment. Understanding the sectors in which the most innovation is occurring can help guide business decisions of companies and investors. For example, investors who are interested in new technologies may want to invest in companies with higher numbers of AI publications and patents, while investors who have ethical concerns regarding the use of AI may be interested in fields with less development in this area. 

  • The topic of research is private-sector AI research activity, specifically related to the number of patents and research publications.

  • We hypothesize that the sectors ‘Software & IT Services’, ‘Banking and Investment Services’, and ‘Financial Technology (Fintech) & Infrastructure’ will have high numbers of AI-related publications and patents, while ‘Food & Beverages’, ‘Food & Drug Retailing’, and ‘Personal & Household Products & Services’ will have lower levels of publications and patents.

  • Most of the variables are numeric and quantitative as they represent numerical quantities describing levels of AI related business activity, such as the variables in the subcategories Publications, Patents, and Workforce. The categorical qualitative variables include Name, Country, Region, and Sector.

Glimpse of data

ai = pd.read_csv("data/private_sector_ai_indicators.csv")
ai.head()
Name ID Country Website Groups Aggregated subsidiaries Region Stage Sector Description ... Patents: AI applications and techniques: Language processing Patents: AI applications and techniques: Measuring and testing Patents: AI applications and techniques: Planning and scheduling Patents: AI applications and techniques: Robotics Patents: AI applications and techniques: Speech processing Workforce: AI workers Workforce: Tech Team 1 workers City State/province PARAT link
0 Accenture 803 Ireland https://www.accenture.com/ S&P 500 NaN Europe Mature Software & IT Services Accenture is a professional services company, ... ... 33 13 129 0 23 13610 166212 Dublin Dublin https://parat.eto.tech/company/803-accenture
1 Cognizant 806 United States https://www.cognizant.com S&P 500 NaN North America Mature Software & IT Services Cognizant is a professional services company, ... ... 1 0 7 0 5 5226 130530 Teaneck New Jersey https://parat.eto.tech/company/806-cognizant
2 Amazon 23 United States http://amazon.com S&P 500, Global Big Tech Amazon Advertising, Amazon Web Services North America Mature Retailers Amazon is a global tech firm with a focus on e... ... 23 179 131 9 265 14164 128587 Seattle Washington https://parat.eto.tech/company/23-amazon
3 IBM 115 United States http://www.ibm.com/ S&P 500, Global Big Tech NaN North America Mature Software & IT Services IBM is an IT technology and consulting firm pr... ... 386 296 828 4 471 6113 117515 Armonk New York https://parat.eto.tech/company/115-ibm
4 Microsoft 163 United States http://www.microsoft.com S&P 500, Global Big Tech, GenAI Contenders NaN North America Mature Software & IT Services Microsoft is a software corporation that devel... ... 214 61 550 0 365 5245 104414 Redmond Washington https://parat.eto.tech/company/163-microsoft

5 rows × 61 columns

ai.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 691 entries, 0 to 690
Data columns (total 61 columns):
 #   Column                                                             Non-Null Count  Dtype  
---  ------                                                             --------------  -----  
 0   Name                                                               691 non-null    object 
 1   ID                                                                 691 non-null    int64  
 2   Country                                                            690 non-null    object 
 3   Website                                                            690 non-null    object 
 4   Groups                                                             529 non-null    object 
 5   Aggregated subsidiaries                                            15 non-null     object 
 6   Region                                                             690 non-null    object 
 7   Stage                                                              691 non-null    object 
 8   Sector                                                             691 non-null    object 
 9   Description                                                        686 non-null    object 
 10  Description source                                                 686 non-null    object 
 11  Description link                                                   686 non-null    object 
 12  Description date                                                   686 non-null    object 
 13  Publications: AI publications                                      691 non-null    int64  
 14  Publications: Recent AI publication growth                         336 non-null    float64
 15  Publications: AI publication percentage                            691 non-null    float64
 16  Publications: AI publications in top conferences                   691 non-null    int64  
 17  Publications: Citations to AI research                             691 non-null    int64  
 18  Publications: CV publications                                      691 non-null    int64  
 19  Publications: NLP publications                                     691 non-null    int64  
 20  Publications: Robotics publications                                691 non-null    int64  
 21  Publications: Total publications                                   691 non-null    int64  
 22  Patents: AI patents                                                691 non-null    int64  
 23  Patents: AI patents: recent growth                                 331 non-null    float64
 24  Patents: AI patent percentage                                      691 non-null    float64
 25  Patents: Granted AI patents                                        691 non-null    int64  
 26  Patents: Total patents                                             691 non-null    int64  
 27  Patents: AI use cases: Agriculture                                 691 non-null    int64  
 28  Patents: AI use cases: Banking and finance                         691 non-null    int64  
 29  Patents: AI use cases: Business                                    691 non-null    int64  
 30  Patents: AI use cases: Computing in government                     691 non-null    int64  
 31  Patents: AI use cases: Document management and publishing          691 non-null    int64  
 32  Patents: AI use cases: Education                                   691 non-null    int64  
 33  Patents: AI use cases: Energy                                      691 non-null    int64  
 34  Patents: AI use cases: Entertainment                               691 non-null    int64  
 35  Patents: AI use cases: Industry and manufacturing                  691 non-null    int64  
 36  Patents: AI use cases: Life sciences                               691 non-null    int64  
 37  Patents: AI use cases: Military                                    691 non-null    int64  
 38  Patents: AI use cases: Nanotechnology                              691 non-null    int64  
 39  Patents: AI use cases: Networking                                  691 non-null    int64  
 40  Patents: AI use cases: Personal devices and computing              691 non-null    int64  
 41  Patents: AI use cases: Physical sciences and engineering           691 non-null    int64  
 42  Patents: AI use cases: Security                                    691 non-null    int64  
 43  Patents: AI use cases: Semiconductors                              691 non-null    int64  
 44  Patents: AI use cases: Telecommunications                          691 non-null    int64  
 45  Patents: AI use cases: Transportation                              691 non-null    int64  
 46  Patents: AI applications and techniques: Analytics and algorithms  691 non-null    int64  
 47  Patents: AI applications and techniques: Computer vision           691 non-null    int64  
 48  Patents: AI applications and techniques: Control                   691 non-null    int64  
 49  Patents: AI applications and techniques: Distributed AI            691 non-null    int64  
 50  Patents: AI applications and techniques: Knowledge representation  691 non-null    int64  
 51  Patents: AI applications and techniques: Language processing       691 non-null    int64  
 52  Patents: AI applications and techniques: Measuring and testing     691 non-null    int64  
 53  Patents: AI applications and techniques: Planning and scheduling   691 non-null    int64  
 54  Patents: AI applications and techniques: Robotics                  691 non-null    int64  
 55  Patents: AI applications and techniques: Speech processing         691 non-null    int64  
 56  Workforce: AI workers                                              691 non-null    int64  
 57  Workforce: Tech Team 1 workers                                     691 non-null    int64  
 58  City                                                               690 non-null    object 
 59  State/province                                                     679 non-null    object 
 60  PARAT link                                                         691 non-null    object 
dtypes: float64(4), int64(42), object(15)
memory usage: 329.4+ KB

Analysis plan

Some data cleaning will be needed to identify missing data and outliers. Two variables that will require decisions about the scope of the data are Publications: Recent AI publication growth and Patents: AI patents: recent growth, which have only 336 and 331 non-null values, respectively, compared with 691 for most other variables. The variables involved in answering the research question are the company name/ID, Sector, and the variables within the Publications and Patents categories. There are no plans to integrate any other data sources, but if necessary the other tables in the dataset may contain useful information. After the data cleaning, wrangling, and EDA process, the data can be visualized and summary statistics computed to understand the relationships between publications/patents and sector.
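One possible shape for the per-sector summary is sketched below, using the column names shown in the `ai.info()` output. The function names `summarize_by_sector` and `missing_share` and the choice of sum and median as statistics are assumptions for illustration, not a fixed part of the plan.

```python
import pandas as pd

PUBS = "Publications: AI publications"
PATENTS = "Patents: AI patents"
GROWTH_COLS = [
    "Publications: Recent AI publication growth",
    "Patents: AI patents: recent growth",
]


def summarize_by_sector(df: pd.DataFrame) -> pd.DataFrame:
    """Company counts plus totals and medians of AI output for each sector."""
    summary = df.groupby("Sector").agg(
        companies=("Name", "count"),
        total_publications=(PUBS, "sum"),
        median_publications=(PUBS, "median"),
        total_patents=(PATENTS, "sum"),
        median_patents=(PATENTS, "median"),
    )
    # Sectors with the most AI publications come first.
    return summary.sort_values("total_publications", ascending=False)


def missing_share(df: pd.DataFrame) -> pd.Series:
    """Fraction of missing values in each of the two growth columns."""
    return df[GROWTH_COLS].isna().mean()
```

Medians are reported alongside totals because a few very large companies could dominate a sector's sum; `missing_share(ai)` gives a quick view of how much of each growth column is unavailable before deciding whether to use it.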

Dataset 3: Bachelor’s degrees by field of study

Introduction and data

The Digest of Education Statistics includes data summarized from surveys administered by the National Center for Education Statistics (NCES) and other government agencies, as well as from annual reports. The data used here was collected through surveys of postsecondary institutions participating in Title IV federal financial aid programs. The tables are published on the NCES website as part of the Digest.

Description of contents

There are 6 datasets that will be loaded, providing a combination of data on bachelor’s degrees conferred by either field of study, state, sex, or race/ethnicity. The datasets provide data either from a range of time (~1970-2021) or only two years (2019/20-2020/21). Below is an outline of each table and its contents.

  • Table 319.30. Bachelor’s degrees conferred by postsecondary institutions, by field of study and state or jurisdiction: Academic year 2020-21
  • Table 322.10. Bachelor’s degrees conferred by postsecondary institutions, by field of study: Selected academic years, 1970-71 through 2020-21
  • Table 322.20. Bachelor’s degrees conferred by postsecondary institutions, by race/ethnicity and sex of student: Selected academic years, 1976-77 through 2020-21
  • Table 322.30. Bachelor’s degrees conferred by postsecondary institutions, by race/ethnicity and field of study: Academic years 2019-20 and 2020-21
  • Table 322.40. Bachelor’s degrees conferred to males by postsecondary institutions, by race/ethnicity and field of study: Academic years 2019-20 and 2020-21
  • Table 322.50. Bachelor’s degrees conferred to females by postsecondary institutions, by race/ethnicity and field of study: Academic years 2019-20 and 2020-21

Ethical Concerns

Because the data was collected from surveys, institutions that did not participate are not represented, which could hide some patterns. Otherwise, there are no glaring ethical concerns with the data.

Research Question

  • How do gender and racial/ethnic diversity levels compare between those in newer and fast growing majors and those in more well established and traditional majors?

    • The target audience is universities and colleges.
  • This question is important because it examines diversity trends within academia. Shifts in representation within a field can highlight progress in educational inclusion and access. Furthermore, the answer can inform policies or raise awareness to support diversity in fields lacking representation.

  • We hypothesize that newer, fast-growing majors have higher levels of diversity than more traditional majors.

  • The categorical variables are the field of study, gender, and race/ethnicity. The quantitative variables include the year and the degree counts for each field of study, year, and demographic group.
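Comparing diversity across fields needs counts expressed on a common scale, since fields differ greatly in size. One simple option, sketched below as an assumption rather than the project's chosen measure, is each group's percentage share of the degrees within a field, similar to the percentage distribution columns in Table 322.20. The function name `group_shares` is hypothetical.

```python
import pandas as pd


def group_shares(counts: pd.DataFrame) -> pd.DataFrame:
    """Convert degree counts (rows = fields, columns = groups) to row-wise percentages."""
    row_totals = counts.sum(axis=1)
    # Divide each row by its own total so every field sums to 100.
    return counts.div(row_totals, axis=0) * 100
```

The input is expected to contain only the mutually exclusive group columns; a “Total” column would need to be dropped first so that it is not counted twice.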

Glimpse of data

Table 319.30. Bachelor’s degrees conferred by postsecondary institutions, by field of study and state or jurisdiction: Academic year 2020-21

tabn319_30 = pd.read_excel("data/education/tabn319.30.xlsx")
tabn319_30.head()
Table 319.30. Bachelor's degrees conferred by postsecondary institutions, by field of study and state or jurisdiction: Academic year 2020-21 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10 Unnamed: 11
0 State or jurisdiction Total Humanities\1\ Psychology Social sciences and history Natural sciences and mathematics\2\ Computer and information sciences and support ... Engineering\3\ Education Business\4\ Health professions and related programs Other fields\5\
1 1 2 3 4 5 6 7 8 9 10 11 12
2 United States 2066445 263894 126944 160827 187829 104874 145041 89398 391375 268018 328245
3 Alabama 34821 2785 1523 1593 2851 1195 3832 1813 8272 4513 6444
4 Alaska 1812 273 110 124 159 29 166 57 320 261 313
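In the output above, the real column names sit in row 0 and the column numbers in row 1, while the header is the table title followed by “Unnamed” labels. A minimal cleaning sketch for tables with a single header row (such as 319.30 and 322.10) is shown below. The function name `promote_header` and its defaults are assumptions for illustration, and tables with multi-row headers (such as 322.20 and 322.30) would need additional handling.

```python
import re

import pandas as pd

# Matches footnote markers such as \1\ or \2,3\ that appear in the labels.
FOOTNOTE = re.compile(r"\\[\d,]+\\")


def promote_header(raw: pd.DataFrame, header_row: int = 0, skip_rows: int = 2) -> pd.DataFrame:
    """Use one of the first rows as column names and drop the leading rows."""
    labels = [FOOTNOTE.sub("", str(label)).strip() for label in raw.iloc[header_row]]
    tidy = raw.iloc[skip_rows:].copy()
    tidy.columns = labels

    # Every column except the first holds counts; placeholders such as
    # "---" become NaN instead of raising an error.
    value_cols = labels[1:]
    tidy[value_cols] = tidy[value_cols].apply(pd.to_numeric, errors="coerce")
    return tidy.reset_index(drop=True)
```

Applied as `promote_header(tabn319_30)`, this would give one row per state with named numeric columns; any note or footnote rows at the bottom of the spreadsheet would still need to be removed separately.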

Table 322.10. Bachelor’s degrees conferred by postsecondary institutions, by field of study: Selected academic years, 1970-71 through 2020-21

tabn322_10 = pd.read_excel("data/education/tabn322.10.xlsx")
tabn322_10.head()
Table 322.10. Bachelor's degrees conferred by postsecondary institutions, by field of study: Selected academic years, 1970-71 through 2020-21 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10 Unnamed: 11 Unnamed: 12 Unnamed: 13 Unnamed: 14 Unnamed: 15 Unnamed: 16 Unnamed: 17 Unnamed: 18
0 Field of study 1970-71 1975-76 1980-81 1985-86 1990-91 1995-96 2000-01 2005-06 2011-12 2012-13 2013-14 2014-15 2015-16 2016-17 2017-18 2018-19 2019-20 2020-21
1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
2 Total 839730 925746 935140 987823 1094538 1164792 1244171 1485104 1792163 1840381 1870150 1894969 1920750 1956114 1980665 2013086 2038682 2066445
3 Agriculture and natural resources\1\ 12674 19402 21886 17191 13363 21757 23766 23497 31629 34304 35953 37028 37827 38782 40334 41373 41858 41925
4 Architecture and related services 5570 9146 9455 9119 9781 8352 8480 9515 9727 9757 9149 9090 8825 8579 8464 8806 9045 9296

Table 322.20. Bachelor’s degrees conferred by postsecondary institutions, by race/ethnicity and sex of student: Selected academic years, 1976-77 through 2020-21

tabn322_20 = pd.read_excel("data/education/tabn322.20.xlsx")
tabn322_20.head()
Table 322.20. Bachelor's degrees conferred by postsecondary institutions, by race/ethnicity and sex of student: Selected academic years, 1976-77 through 2020-21 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10 Unnamed: 11 Unnamed: 12 Unnamed: 13 Unnamed: 14 Unnamed: 15 Unnamed: 16
0 Year and sex Number of degrees conferred to U.S. citizens, ... Number of degrees conferred to U.S. citizens, ... Number of degrees conferred to U.S. citizens, ... Number of degrees conferred to U.S. citizens, ... Number of degrees conferred to U.S. citizens, ... Number of degrees conferred to U.S. citizens, ... Number of degrees conferred to U.S. citizens, ... Number of degrees conferred to U.S. citizens, ... Number of degrees conferred to U.S. citizens, ... Percentage distribution of degrees conferred t... Percentage distribution of degrees conferred t... Percentage distribution of degrees conferred t... Percentage distribution of degrees conferred t... Percentage distribution of degrees conferred t... Percentage distribution of degrees conferred t... Percentage distribution of degrees conferred t...
1 Year and sex Total Total White Black Hispanic Asian/Pacific Islander American Indian/Alaska Native Two or more races\1\ Nonresident Total White Black Hispanic Asian/Pacific Islander American Indian/Alaska Native Two or more races\1\
2 1 2 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
3 Total NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 1976-77 919549 \2,3\ 807688 58636 18743 13793 3326 --- 15714 100 89.525663 6.499325 2.07751 1.528842 0.36866 ---

Table 322.30. Bachelor’s degrees conferred by postsecondary institutions, by race/ethnicity and field of study: Academic years 2019-20 and 2020-21

tabn322_30 = pd.read_excel("data/education/tabn322.30.xlsx")
tabn322_30.head()
Table 322.30. Bachelor's degrees conferred by postsecondary institutions, by race/ethnicity and field of study: Academic years 2019-20 and 2020-21 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 ... Unnamed: 11 Unnamed: 12 Unnamed: 13 Unnamed: 14 Unnamed: 15 Unnamed: 16 Unnamed: 17 Unnamed: 18 Unnamed: 19 Unnamed: 20
0 Field of study 2019-20 2019-20 2019-20 2019-20 2019-20 2019-20 2019-20 2019-20 2019-20 ... 2020-21 2020-21 2020-21 2020-21 2020-21 2020-21 2020-21 2020-21 2020-21 2020-21
1 Field of study Total White Black Hispanic Asian/Pacific Islander Asian/Pacific Islander Asian/Pacific Islander American Indian/Alaska Native Two or more races ... Total White Black Hispanic Asian/Pacific Islander Asian/Pacific Islander Asian/Pacific Islander American Indian/Alaska Native Two or more races Nonresident
2 Field of study Total White Black Hispanic Total Asian Pacific Islander American Indian/Alaska Native Two or more races ... Total White Black Hispanic Total Asian Pacific Islander American Indian/Alaska Native Two or more races Nonresident
3 1 2 3 4 5 6 7 8 9 10 ... 12 13 14 15 16 17 18 19 20 21
4 All fields, total 2038682 1184082 197491 302663 161468 157085 4383 9154 77621 ... 2066445 1172187 206527 324848 169261 164845 4416 9545 81369 102708

5 rows × 21 columns

Table 322.40. Bachelor’s degrees conferred to males by postsecondary institutions, by race/ethnicity and field of study: Academic years 2019-20 and 2020-21

tabn322_40 = pd.read_excel("data/education/tabn322.40.xlsx")
tabn322_40.head()
Table 322.40. Bachelor's degrees conferred to males by postsecondary institutions, by race/ethnicity and field of study: Academic years 2019-20 and 2020-21 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 ... Unnamed: 11 Unnamed: 12 Unnamed: 13 Unnamed: 14 Unnamed: 15 Unnamed: 16 Unnamed: 17 Unnamed: 18 Unnamed: 19 Unnamed: 20
0 Field of study 2019-20 2019-20 2019-20 2019-20 2019-20 2019-20 2019-20 2019-20 2019-20 ... 2020-21 2020-21 2020-21 2020-21 2020-21 2020-21 2020-21 2020-21 2020-21 2020-21
1 Field of study Total White Black Hispanic Asian/Pacific Islander Asian/Pacific Islander Asian/Pacific Islander American Indian/Alaska Native Two or more races ... Total White Black Hispanic Asian/Pacific Islander Asian/Pacific Islander Asian/Pacific Islander American Indian/Alaska Native Two or more races Nonresident
2 Field of study Total White Black Hispanic Total Asian Pacific Islander American Indian/Alaska Native Two or more races ... Total White Black Hispanic Total Asian Pacific Islander American Indian/Alaska Native Two or more races Nonresident
3 1 2 3 4 5 6 7 8 9 10 ... 12 13 14 15 16 17 18 19 20 21
4 All fields, total 861384 509079 70346 117230 72916 71005 1911 3344 31620 ... 860764 499092 72092 123256 75704 73835 1869 3407 33003 54210

5 rows × 21 columns

Table 322.50. Bachelor’s degrees conferred to females by postsecondary institutions, by race/ethnicity and field of study: Academic years 2019-20 and 2020-21

tabn322_50 = pd.read_excel("data/education/tabn322.50.xlsx")
tabn322_50.head()
Table 322.50. Bachelor's degrees conferred to females by postsecondary institutions, by race/ethnicity and field of study: Academic years 2019-20 and 2020-21 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 ... Unnamed: 11 Unnamed: 12 Unnamed: 13 Unnamed: 14 Unnamed: 15 Unnamed: 16 Unnamed: 17 Unnamed: 18 Unnamed: 19 Unnamed: 20
0 Field of study 2019-20 2019-20 2019-20 2019-20 2019-20 2019-20 2019-20 2019-20 2019-20 ... 2020-21 2020-21 2020-21 2020-21 2020-21 2020-21 2020-21 2020-21 2020-21 2020-21
1 Field of study Total White Black Hispanic Asian/Pacific Islander Asian/Pacific Islander Asian/Pacific Islander American Indian/Alaska Native Two or more races ... Total White Black Hispanic Asian/Pacific Islander Asian/Pacific Islander Asian/Pacific Islander American Indian/Alaska Native Two or more races Nonresident
2 Field of study Total White Black Hispanic Total Asian Pacific Islander American Indian/Alaska Native Two or more races ... Total White Black Hispanic Total Asian Pacific Islander American Indian/Alaska Native Two or more races Nonresident
3 1 2 3 4 5 6 7 8 9 10 ... 12 13 14 15 16 17 18 19 20 21
4 All fields, total 1177298 675003 127145 185433 88552 86080 2472 5810 46001 ... 1205681 673095 134435 201592 93557 91010 2547 6138 48366 48498

5 rows × 21 columns

Analysis plan

The tables will need to be integrated into a single tabular format containing the information needed to answer the research question. There is little to no missing data, and the individual tables are already highly structured, but significant wrangling will likely be needed to integrate them: as the previews above show, each sheet spreads its column labels across several header rows. After the tables are integrated, summary statistics can be calculated and the data can be visualized to answer the research question.
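As a rough illustration of the wrangling involved, the sketch below reshapes one sheet into long format. It assumes the row positions seen in the `head()` previews for Tables 322.30 to 322.50 (row 0 for the academic year, rows 1 and 2 for the race/ethnicity labels, row 3 for column numbers, data from row 4 onward). The function name `tidy_degree_table`, the output column names, and the small `raw` frame with made-up numbers are all hypothetical and only stand in for the real sheets.

```python
import pandas as pd


def tidy_degree_table(raw: pd.DataFrame, sex: str) -> pd.DataFrame:
    """Reshape one wide sheet with stacked header rows into long format.

    Assumes the layout seen in the previews: row 0 = academic year,
    rows 1-2 = race/ethnicity labels, row 3 = column numbers, row 4+ = data.
    """
    n_cols = raw.shape[1]
    keys = [f"c{i}" for i in range(1, n_cols)]  # temporary column keys

    # One lookup row per data column, built from the stacked header rows.
    # Both label rows are kept because "Total" can appear more than once in row 2.
    header = pd.DataFrame({
        "col": keys,
        "academic_year": raw.iloc[0, 1:].astype(str).str.strip().to_numpy(),
        "race_group": raw.iloc[1, 1:].astype(str).str.strip().to_numpy(),
        "race_subgroup": raw.iloc[2, 1:].astype(str).str.strip().to_numpy(),
    })

    # Keep only the data rows and give the columns simple temporary names
    body = raw.iloc[4:].copy()
    body.columns = ["field_of_study"] + keys

    # Wide -> long, then attach the header labels to each value
    long = body.melt(id_vars="field_of_study", var_name="col", value_name="degrees")
    long = long.merge(header, on="col").drop(columns="col")

    # Non-numeric placeholders (for example "---") become NaN
    long["degrees"] = pd.to_numeric(long["degrees"], errors="coerce")
    long["sex"] = sex
    return long


# Tiny stand-in for a raw sheet; the numbers are made up for illustration
raw = pd.DataFrame([
    ["Field of study", "2019-20", "2019-20", "2020-21", "2020-21"],
    ["Field of study", "Total", "White", "Total", "White"],
    ["Field of study", "Total", "White", "Total", "White"],
    [1, 2, 3, 4, 5],
    ["All fields, total", 100, 60, 110, 62],
    ["Biology", 10, 6, 12, 7],
])
long = tidy_degree_table(raw, sex="total")
print(long)
```

Under the same assumptions, the three sheets could each be passed through this function with a different `sex` label and then combined with `pd.concat` into one table.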

Once the tables are in a usable format, two questions will need to be answered. First, we will need to identify which majors are traditional and which are fast-growing; that is, which majors have historically been, and continue to be, conferred in high numbers, and which have only recently begun to grow. Second, we will investigate the demographic makeup of graduates in different fields of study. Because a large number of majors are listed, only a select few majors of interest (i.e., the most traditional and the newest) will be examined.
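One possible way to make the traditional versus fast-growing distinction concrete is sketched below. It assumes a long-format table with hypothetical columns `field_of_study`, `academic_year`, and `degrees`, labels the fields with the largest recent counts as "traditional", and labels the fields with the largest percentage growth among the rest as "fast-growing". The function name `classify_fields`, the `top_n` cutoff, and the example numbers are illustrative assumptions, not the method the project has settled on.

```python
import pandas as pd


def classify_fields(long: pd.DataFrame, top_n: int = 1) -> pd.DataFrame:
    """Label fields as traditional, fast-growing, or other (illustrative rule)."""
    # One row per field, one column per academic year
    wide = long.pivot_table(
        index="field_of_study",
        columns="academic_year",
        values="degrees",
        aggfunc="sum",
    )

    # Assumes year labels such as "2019-20" sort chronologically as strings
    years = sorted(wide.columns)
    first, last = years[0], years[-1]

    summary = pd.DataFrame({
        "degrees_last": wide[last],
        "pct_change": (wide[last] - wide[first]) / wide[first] * 100,
    })
    summary["label"] = "other"

    # Traditional: the largest fields in the most recent year
    traditional = summary["degrees_last"].nlargest(top_n).index
    summary.loc[traditional, "label"] = "traditional"

    # Fast-growing: the largest relative growth among the remaining fields
    fast = summary.drop(index=traditional)["pct_change"].nlargest(top_n).index
    summary.loc[fast, "label"] = "fast-growing"
    return summary


# Made-up counts for five placeholder fields across two academic years
example = pd.DataFrame({
    "field_of_study": ["Field A", "Field B", "Field C", "Field D", "Field E"] * 2,
    "academic_year": ["2019-20"] * 5 + ["2020-21"] * 5,
    "degrees": [1000, 800, 50, 100, 30, 1010, 790, 80, 105, 45],
})
summary = classify_fields(example, top_n=1)
print(summary)
```

A rule based on only two academic years is sensitive to noise in small fields, so a minimum-size filter or a longer time span may be worth considering before settling on the majors of interest.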