Data, Data, Data
Oh My
Dataset 1: Nutrition, physical activity, and obesity
Introduction and data
This dataset was provided by the Centers for Disease Control and Prevention (CDC), National Center for Chronic Disease Prevention and Health Promotion, Division of Nutrition, Physical Activity, and Obesity (DNPAO). The data was collected through health-related telephone surveys that gather state-level data about U.S. residents. The dataset feeds the DNPAO's Data, Trends, and Maps database, which provides both state and national data on these topics.
Description of contents
This dataset includes 104,272 rows and 33 columns. Each row gives, for one combination of year, state, survey question, and stratification group, the percent of individuals who are positively identified for that question. The categories for stratification are Age Range, Education, Gender, Income, Race/Ethnicity, and Total.
Survey questions fall into the categories of “Fruits and Vegetables - Behavior”, “Obesity/Weight Status”, and “Physical Activity - Behavior”. Examples of survey questions include “Percent of adults who engage in muscle-strengthening activities on 2 or more days a week” and “Percent of adults aged 18 years and older who have obesity”.
This dataset includes observations for the years 2011-2023. Percentages and data are not included for groups with insufficient sample sizes.
Ethical Concerns
There are no particular ethical concerns regarding working with this data. This dataset is aggregated, and numbers are excluded in instances where the sample size is too small. This removes concerns surrounding the personal identification of individuals within this dataset. This dataset is publicly available for anyone to download, and the licensing agreement states that it is free to be shared, created, and adapted, as long as it is attributed as the data source when publicly displayed or published. This removes concerns surrounding unfair or illegal acquisition and use of the data.
Research Question
Research Question: Do higher-income populations consistently have more time for physical activity than lower-income populations?
Additional research questions include: How has the relationship between amount of physical activity and income changed over time? How does this vary between groups? And how has the amount of physical activity that lower-income populations have the time to do changed over the years?
The target population for this research question is U.S. residents 18 and over, as represented by the dataset.
This question is important because it may highlight areas that correlate with differences in physical health across the population. If some groups turn out to be more strongly tied to physical health and activity than others, further research can identify ways to give those groups more assistance with nutrition and adopting healthier lifestyles.
The research topic of interest here is whether there is a relationship between an individual's income and the amount of physical activity they are able to make time for. This falls within a larger category of interest surrounding differences in physical activity and nutrition across groups and subsets of the population.
We hypothesize that individuals with lower income levels have less time for physical activity, so a larger percentage of the low-income population will fall into the “Percent of adults who engage in no leisure-time physical activity” group than of individuals with higher incomes.
The variables in this research question are mostly categorical: the questions themselves and the groupings (income range) are both categorical. The percentage of respondents who fall into each category is a quantitative variable, and there is a time element (years) that can be used as well.
Glimpse of data
| | YearStart | YearEnd | LocationAbbr | LocationDesc | Datasource | Class | Topic | Question | Data_Value_Unit | Data_Value_Type | ... | GeoLocation | ClassID | TopicID | QuestionID | DataValueTypeID | LocationID | StratificationCategory1 | Stratification1 | StratificationCategoryId1 | StratificationID1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2011 | 2011 | AK | Alaska | BRFSS | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | 2011.0 | Value | ... | (64.845079957001, -147.722059036) | OWS | OWS1 | Q036 | VALUE | 2 | Race/Ethnicity | 2 or more races | RACE | RACE2PLUS |
1 | 2011 | 2011 | AK | Alaska | BRFSS | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | 2011.0 | Value | ... | (64.845079957001, -147.722059036) | OWS | OWS1 | Q036 | VALUE | 2 | Race/Ethnicity | Other | RACE | RACEOTH |
2 | 2011 | 2011 | AK | Alaska | BRFSS | Physical Activity | Physical Activity - Behavior | Percent of adults who achieve at least 150 min... | 2011.0 | Value | ... | (64.845079957001, -147.722059036) | PA | PA1 | Q044 | VALUE | 2 | Gender | Female | GEN | FEMALE |
3 | 2011 | 2011 | AK | Alaska | BRFSS | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | 2011.0 | Value | ... | (64.845079957001, -147.722059036) | OWS | OWS1 | Q036 | VALUE | 2 | Age (years) | 35 - 44 | AGEYR | AGEYR3544 |
4 | 2011 | 2011 | AK | Alaska | BRFSS | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | 2011.0 | Value | ... | (64.845079957001, -147.722059036) | OWS | OWS1 | Q037 | VALUE | 2 | Income | $15,000 - $24,999 | INC | INC1525 |
5 rows × 33 columns
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104272 entries, 0 to 104271
Data columns (total 33 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 YearStart 104272 non-null int64
1 YearEnd 104272 non-null int64
2 LocationAbbr 104272 non-null object
3 LocationDesc 104272 non-null object
4 Datasource 104272 non-null object
5 Class 104272 non-null object
6 Topic 104272 non-null object
7 Question 104272 non-null object
8 Data_Value_Unit 88872 non-null float64
9 Data_Value_Type 104272 non-null object
10 Data_Value 93505 non-null float64
11 Data_Value_Alt 93505 non-null float64
12 Data_Value_Footnote_Symbol 10767 non-null object
13 Data_Value_Footnote 10767 non-null object
14 Low_Confidence_Limit 93505 non-null float64
15 High_Confidence_Limit 93505 non-null float64
16 Sample_Size 93505 non-null float64
17 Total 3724 non-null object
18 Age(years) 22344 non-null object
19 Education 14896 non-null object
20 Gender 7448 non-null object
21 Income 26068 non-null object
22 Race/Ethnicity 29792 non-null object
23 GeoLocation 102340 non-null object
24 ClassID 104272 non-null object
25 TopicID 104272 non-null object
26 QuestionID 104272 non-null object
27 DataValueTypeID 104272 non-null object
28 LocationID 104272 non-null int64
29 StratificationCategory1 104272 non-null object
30 Stratification1 104272 non-null object
31 StratificationCategoryId1 104272 non-null object
32 StratificationID1 104272 non-null object
dtypes: float64(6), int64(3), object(24)
memory usage: 26.3+ MB
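The non-null counts above suggest that the missing Data_Value entries line up with the footnoted rows: 104,272 minus 93,505 is 10,767, which equals the number of rows with a Data_Value_Footnote. The sketch below shows one way this could be confirmed row by row. The toy frame and its footnote wording are made up for illustration; only the column names and the three counts come from the output above.

```python
import pandas as pd

# Counts copied from the info() output above.
total_rows, non_null_values, non_null_footnotes = 104272, 93505, 10767
print(total_rows - non_null_values == non_null_footnotes)

# Toy stand-in for the real frame (values and footnote text are illustrative).
toy = pd.DataFrame({
    "Data_Value": [31.5, None, 24.0, None],
    "Data_Value_Footnote": [None, "insufficient sample", None, "insufficient sample"],
})

# A value should be missing exactly where a footnote explains why.
missing_value = toy["Data_Value"].isna()
has_footnote = toy["Data_Value_Footnote"].notna()
same_rows = bool((missing_value == has_footnote).all())
print(same_rows)
```

If the same check holds on the full dataset, rows without a Data_Value can be dropped for the analysis without losing any reported percentages.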
Analysis plan
Initial detailed exploration of the data (focusing on the main variables involved) will be followed by data cleaning and some wrangling. Identification of missing observations, data type conversions, etc. will be addressed in these initial steps. The variables needed to answer the main research question are the “Question”, “Income”, “StratificationCategoryId1”, “Data_Value”, and “YearStart” columns. New columns that group some of these variables may also be created, such as grouping the “Income” values into “Low Income” and “High Income”. At this point there is no plan to bring in and merge any external data.
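The filtering and grouping step described above might look roughly like the following sketch. The income labels other than “$15,000 - $24,999”, the “Data not reported” label, and the low/high cutoff are all assumptions made for illustration and would need to be checked against the actual values in the “Income” column; the toy numbers are made up.

```python
import pandas as pd

NO_ACTIVITY = "Percent of adults who engage in no leisure-time physical activity"

# Assumed income labels and an assumed low/high cutoff; compare with
# df["Income"].unique() on the real data before relying on them.
INCOME_GROUPS = {
    "Less than $15,000": "Low Income",
    "$15,000 - $24,999": "Low Income",
    "$25,000 - $34,999": "Low Income",
    "$35,000 - $49,999": "High Income",
    "$50,000 - $74,999": "High Income",
    "$75,000 or greater": "High Income",
}

# Toy rows shaped like the dataset; the numbers are made up.
df = pd.DataFrame({
    "YearStart": [2011, 2011, 2011, 2011, 2011],
    "Question": [NO_ACTIVITY, NO_ACTIVITY, NO_ACTIVITY, NO_ACTIVITY,
                 "Percent of adults aged 18 years and older who have obesity"],
    "StratificationCategoryId1": ["INC", "INC", "INC", "GEN", "INC"],
    "Income": ["$15,000 - $24,999", "$75,000 or greater", "Data not reported",
               None, "$15,000 - $24,999"],
    "Data_Value": [30.0, 15.0, 22.0, 24.0, 31.0],
})

# Keep only the target question, stratified by income.
is_target = (df["Question"] == NO_ACTIVITY) & (df["StratificationCategoryId1"] == "INC")
income_rows = df.loc[is_target].copy()

# Labels missing from the mapping become NaN and are dropped with missing values.
income_rows["IncomeGroup"] = income_rows["Income"].map(INCOME_GROUPS)
income_rows = income_rows.dropna(subset=["IncomeGroup", "Data_Value"])
print(income_rows[["YearStart", "Income", "IncomeGroup", "Data_Value"]])
```

Using an explicit mapping rather than a default "else" branch keeps unexpected labels out of both groups instead of silently counting them as high income.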
Once the data has been wrangled, it can be visualized in an assortment of plots for analysis, and summary statistics by group can be compiled as well. These visualizations and statistics will give insight into how the relationship between income and physical activity compares between groups, as well as over time. From here, additional metrics and plots can be produced to explore the data further.
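As a rough sketch of the group summaries described above, the following assumes a hypothetical “IncomeGroup” column has already been derived and uses made-up numbers. It takes an unweighted mean across rows, which is a simplification: the dataset also carries a Sample_Size column that a fuller analysis might use for weighting.

```python
import pandas as pd

# Toy input: one row per state/year/income group; numbers are made up.
toy = pd.DataFrame({
    "YearStart":   [2011, 2011, 2011, 2011, 2012, 2012, 2012, 2012],
    "IncomeGroup": ["Low Income", "Low Income", "High Income", "High Income"] * 2,
    "Data_Value":  [30.0, 32.0, 15.0, 17.0, 28.0, 30.0, 16.0, 16.0],
})

# Mean percent reporting no leisure-time activity, per year and income group.
summary = (
    toy.groupby(["YearStart", "IncomeGroup"])["Data_Value"]
    .mean()
    .unstack("IncomeGroup")
)

# A positive gap means the low-income group reports more inactivity.
summary["Gap"] = summary["Low Income"] - summary["High Income"]
print(summary)
```

Tracking the “Gap” column across years is one simple way to see whether the difference between income groups is widening or narrowing.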
Dataset 2: Private sector AI activity indicators
Introduction and data
The dataset was constructed by the Emerging Technology Observatory (ETO), a project of the Center for Security and Emerging Technology at Georgetown University. The information about company stages of development and company metadata was generated using data from Crunchbase, a tool for business insights. Publications data was obtained from another ETO project, the Merged Academic Corpus, which contains information about scholarly articles. Machine learning models were used to identify AI-related publications. Patent information was obtained from 1790 Analytics, PATSTAT, and The Lens. Workforce-related data was compiled and processed from LinkedIn profiles by Revelio Labs.
Description of contents
The dataset contains indicators of business and R&D activity related to AI for many private sector companies. It contains 5 tables: core, ticker, alias, id, and yearly_publication_counts. We chose to focus on the core table for the purposes of this project.
The core table includes metadata and metrics related to publication, patent, and worker data regarding AI. The metadata includes a company name, unique numeric id, region of the world, stage of development, business sector, and description of the company, among other information.
The table also contains counts of AI research publications, average percent increase in publications per year over the past 3 years, percent of total publications that were related to AI, and metrics related to citations and presence in conferences. The counts of publications are also divided into the categories of computer vision, NLP, and robotics.
The patent information is separated into total patents, average percent increase in patents per year over the past 3 years, percentage of total patents related to AI, and granted patents. The patents are also separated into 2 main subcategories, ‘AI use cases’ and ‘AI applications and techniques’, which are further divided into more subcategories. These include energy, transportation, and security for ‘AI use cases’, and computer vision, language processing, and distributed AI for ‘AI applications and techniques’.
The workforce data consists of 2 columns containing the number of AI workers and ‘Tech Team 1’ workers. Tech Team 1 is defined as “anyone with technical skills and a reasonable probability of working with AI” and is derived from Revelio Labs’ taxonomy of highly technical roles and their responsibilities. More information on this definition can be found in the dataset’s documentation. The AI workers metric is defined as a subset of Tech Team 1 with a high probability of working with AI.
Ethics of working with this dataset
The data was acquired from secondary sources (Crunchbase, PATSTAT, LinkedIn), so users of this dataset must be aware of the limitations of each individual source in order to make informed ethical decisions when drawing inferences from it. This could include ethical concerns regarding the use of LinkedIn users’ data, since these users did not explicitly consent to it being used for this purpose, although they did agree to the site’s terms of service.
The dataset authors address some limitations of the dataset in the description and there may be some ethical concerns related to making broad generalizations based on this dataset, which does not represent the entire population of all companies working with AI. One concern is that the workforce metrics are incomplete because they are based on LinkedIn data. As a result, the dataset could be biased towards United States residents and against other regions of the world, as LinkedIn is less popular in other countries and is blocked in some countries like Russia and China.
Another ethical concern could be that the dataset is heavily focused on large and established companies, and neglects other areas of AI development such as small companies, non-patented innovation, and open source projects. The dataset also relies heavily on publicly available and self-reported data, which could leave less public companies underrepresented, such as those doing work on sensitive topics or focused on proprietary innovations.
There may also be bias in the model used to identify AI-related publications, which could lead to an incomplete or misleading representation of actual AI research activity.
Research Question
What relationships are there between the number of AI publications and patents and the sector of the companies represented in this dataset?
- Target population: Companies represented in dataset
This question is important because AI is a rapidly growing field of research and investment. Understanding the sectors in which the most innovation is occurring can help guide business decisions of companies and investors. For example, investors who are interested in new technologies may want to invest in companies with higher numbers of AI publications and patents, while investors who have ethical concerns regarding the use of AI may be interested in fields with less development in this area.
The topic of research is private-sector AI research activity, specifically related to the number of patents and research publications.
We hypothesize that the sectors ‘Software & IT Services’, ‘Banking and Investment Services’, and ‘Financial Technology (Fintech) & Infrastructure’ will have high numbers of AI-related publications and patents, while ‘Food & Beverages’, ‘Food & Drug Retailing’, and ‘Personal & Household Products & Services’ will have lower levels of publications and patents.
Most of the variables are numeric and quantitative, as they represent numerical quantities describing levels of AI-related business activity, such as the variables in the subcategories Publications, Patents, and Workforce. The categorical qualitative variables include Name, Country, Region, and Sector.
Glimpse of data
| | Name | ID | Country | Website | Groups | Aggregated subsidiaries | Region | Stage | Sector | Description | ... | Patents: AI applications and techniques: Language processing | Patents: AI applications and techniques: Measuring and testing | Patents: AI applications and techniques: Planning and scheduling | Patents: AI applications and techniques: Robotics | Patents: AI applications and techniques: Speech processing | Workforce: AI workers | Workforce: Tech Team 1 workers | City | State/province | PARAT link |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Accenture | 803 | Ireland | https://www.accenture.com/ | S&P 500 | NaN | Europe | Mature | Software & IT Services | Accenture is a professional services company, ... | ... | 33 | 13 | 129 | 0 | 23 | 13610 | 166212 | Dublin | Dublin | https://parat.eto.tech/company/803-accenture |
1 | Cognizant | 806 | United States | https://www.cognizant.com | S&P 500 | NaN | North America | Mature | Software & IT Services | Cognizant is a professional services company, ... | ... | 1 | 0 | 7 | 0 | 5 | 5226 | 130530 | Teaneck | New Jersey | https://parat.eto.tech/company/806-cognizant |
2 | Amazon | 23 | United States | http://amazon.com | S&P 500, Global Big Tech | Amazon Advertising, Amazon Web Services | North America | Mature | Retailers | Amazon is a global tech firm with a focus on e... | ... | 23 | 179 | 131 | 9 | 265 | 14164 | 128587 | Seattle | Washington | https://parat.eto.tech/company/23-amazon |
3 | IBM | 115 | United States | http://www.ibm.com/ | S&P 500, Global Big Tech | NaN | North America | Mature | Software & IT Services | IBM is an IT technology and consulting firm pr... | ... | 386 | 296 | 828 | 4 | 471 | 6113 | 117515 | Armonk | New York | https://parat.eto.tech/company/115-ibm |
4 | Microsoft | 163 | United States | http://www.microsoft.com | S&P 500, Global Big Tech, GenAI Contenders | NaN | North America | Mature | Software & IT Services | Microsoft is a software corporation that devel... | ... | 214 | 61 | 550 | 0 | 365 | 5245 | 104414 | Redmond | Washington | https://parat.eto.tech/company/163-microsoft |
5 rows × 61 columns
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 691 entries, 0 to 690
Data columns (total 61 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 691 non-null object
1 ID 691 non-null int64
2 Country 690 non-null object
3 Website 690 non-null object
4 Groups 529 non-null object
5 Aggregated subsidiaries 15 non-null object
6 Region 690 non-null object
7 Stage 691 non-null object
8 Sector 691 non-null object
9 Description 686 non-null object
10 Description source 686 non-null object
11 Description link 686 non-null object
12 Description date 686 non-null object
13 Publications: AI publications 691 non-null int64
14 Publications: Recent AI publication growth 336 non-null float64
15 Publications: AI publication percentage 691 non-null float64
16 Publications: AI publications in top conferences 691 non-null int64
17 Publications: Citations to AI research 691 non-null int64
18 Publications: CV publications 691 non-null int64
19 Publications: NLP publications 691 non-null int64
20 Publications: Robotics publications 691 non-null int64
21 Publications: Total publications 691 non-null int64
22 Patents: AI patents 691 non-null int64
23 Patents: AI patents: recent growth 331 non-null float64
24 Patents: AI patent percentage 691 non-null float64
25 Patents: Granted AI patents 691 non-null int64
26 Patents: Total patents 691 non-null int64
27 Patents: AI use cases: Agriculture 691 non-null int64
28 Patents: AI use cases: Banking and finance 691 non-null int64
29 Patents: AI use cases: Business 691 non-null int64
30 Patents: AI use cases: Computing in government 691 non-null int64
31 Patents: AI use cases: Document management and publishing 691 non-null int64
32 Patents: AI use cases: Education 691 non-null int64
33 Patents: AI use cases: Energy 691 non-null int64
34 Patents: AI use cases: Entertainment 691 non-null int64
35 Patents: AI use cases: Industry and manufacturing 691 non-null int64
36 Patents: AI use cases: Life sciences 691 non-null int64
37 Patents: AI use cases: Military 691 non-null int64
38 Patents: AI use cases: Nanotechnology 691 non-null int64
39 Patents: AI use cases: Networking 691 non-null int64
40 Patents: AI use cases: Personal devices and computing 691 non-null int64
41 Patents: AI use cases: Physical sciences and engineering 691 non-null int64
42 Patents: AI use cases: Security 691 non-null int64
43 Patents: AI use cases: Semiconductors 691 non-null int64
44 Patents: AI use cases: Telecommunications 691 non-null int64
45 Patents: AI use cases: Transportation 691 non-null int64
46 Patents: AI applications and techniques: Analytics and algorithms 691 non-null int64
47 Patents: AI applications and techniques: Computer vision 691 non-null int64
48 Patents: AI applications and techniques: Control 691 non-null int64
49 Patents: AI applications and techniques: Distributed AI 691 non-null int64
50 Patents: AI applications and techniques: Knowledge representation 691 non-null int64
51 Patents: AI applications and techniques: Language processing 691 non-null int64
52 Patents: AI applications and techniques: Measuring and testing 691 non-null int64
53 Patents: AI applications and techniques: Planning and scheduling 691 non-null int64
54 Patents: AI applications and techniques: Robotics 691 non-null int64
55 Patents: AI applications and techniques: Speech processing 691 non-null int64
56 Workforce: AI workers 691 non-null int64
57 Workforce: Tech Team 1 workers 691 non-null int64
58 City 690 non-null object
59 State/province 679 non-null object
60 PARAT link 691 non-null object
dtypes: float64(4), int64(42), object(15)
memory usage: 329.4+ KB
Analysis plan
Some data cleaning will be needed to identify missing data and outliers. Two variables will require decisions about the scope of the data used to answer the research question: “Publications: Recent AI publication growth” and “Patents: AI patents: recent growth” have only 336 and 331 non-null values, respectively, compared to 691 for most other variables. The variables involved in answering the research question are the company name/ID, the sector, and the variables within the Publications and Patents categories. There are no plans to integrate any other data sources, but if necessary the other tables in the dataset may contain useful information. After the data cleaning, wrangling, and EDA process, the data can be visualized and numerical statistics can be computed to understand the relationships between publications/patents and sector.
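A per-sector summary along the lines described above could be sketched as follows. The company names and numbers are made up; only the column names come from the core table. The median is used here as one reasonable choice, on the assumption that a few very large companies would otherwise dominate sector averages.

```python
import pandas as pd

# Toy stand-in for the core table; names and numbers are made up.
core = pd.DataFrame({
    "Name": ["Company A", "Company B", "Company C", "Company D"],
    "Sector": ["Software & IT Services"] * 3 + ["Retailers"],
    "Publications: AI publications": [10, 30, 20, 5],
    "Patents: AI patents": [4, 8, 6, 2],
    "Publications: Recent AI publication growth": [0.5, None, 0.1, None],
})

# One row per sector: company count, typical activity, and how many
# companies actually have a growth value to work with.
by_sector = core.groupby("Sector").agg(
    companies=("Name", "count"),
    median_ai_publications=("Publications: AI publications", "median"),
    median_ai_patents=("Patents: AI patents", "median"),
    growth_values_available=("Publications: Recent AI publication growth", "count"),
)
print(by_sector.sort_values("median_ai_publications", ascending=False))
```

The `growth_values_available` column makes the missing-data problem visible per sector, which helps decide whether the growth variables can be used at all for smaller sectors.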
Dataset 3: Bachelor’s degrees by field of study
Introduction and data
The Digest of Education Statistics compiles data from surveys administered by the National Center for Education Statistics (NCES) and other government agencies, as well as from annual reports. The data used here was collected through surveys of postsecondary institutions participating in Title IV federal financial aid programs. The tables are published in the Digest of Education Statistics.
Description of contents
There are 6 datasets that will be loaded, providing a combination of data on bachelor’s degrees conferred by field of study, state, sex, or race/ethnicity. The datasets cover either a range of academic years (roughly 1970-71 through 2020-21), the two academic years 2019-20 and 2020-21, or, for Table 319.30, 2020-21 alone. Below is an outline of each table and its contents.
- Table 319.30. Bachelor’s degrees conferred by postsecondary institutions, by field of study and state or jurisdiction: Academic year 2020-21
- Table 322.10. Bachelor’s degrees conferred by postsecondary institutions, by field of study: Selected academic years, 1970-71 through 2020-21
- Table 322.20. Bachelor’s degrees conferred by postsecondary institutions, by race/ethnicity and sex of student: Selected academic years, 1976-77 through 2020-21
- Table 322.30. Bachelor’s degrees conferred by postsecondary institutions, by race/ethnicity and field of study: Academic years 2019-20 and 2020-21
- Table 322.40. Bachelor’s degrees conferred to males by postsecondary institutions, by race/ethnicity and field of study: Academic years 2019-20 and 2020-21
- Table 322.50. Bachelor’s degrees conferred to females by postsecondary institutions, by race/ethnicity and field of study: Academic years 2019-20 and 2020-21
Ethical Concerns
Because the data was collected from surveys, institutions that did not participate may be missing from the data, and their absence may follow a pattern. Otherwise, there are no glaring ethical concerns with the data.
Research Question
How do gender and racial/ethnic diversity levels compare between newer, fast-growing majors and more well-established, traditional majors?
- The target audience is universities and colleges.
This question is important because it examines diversity trends within academia. Shifts in representation in a field can highlight progress in educational inclusion and access. Furthermore, this question can inform policies or awareness efforts to support diversity in fields lacking representation.
We hypothesize that newer, fast-growing majors have higher levels of diversity than more traditional majors.
The categorical variables are the field of study, gender, and race/ethnicity. The quantitative variables include the year and the counts for each field of study by year or demographic feature.
Glimpse of data
Table 319.30. Bachelor’s degrees conferred by postsecondary institutions, by field of study and state or jurisdiction: Academic year 2020-21
| | Table 319.30. Bachelor's degrees conferred by postsecondary institutions, by field of study and state or jurisdiction: Academic year 2020-21 | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Unnamed: 9 | Unnamed: 10 | Unnamed: 11 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | State or jurisdiction | Total | Humanities\1\ | Psychology | Social sciences and history | Natural sciences and mathematics\2\ | Computer and information sciences and support ... | Engineering\3\ | Education | Business\4\ | Health professions and related programs | Other fields\5\ |
1 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
2 | United States | 2066445 | 263894 | 126944 | 160827 | 187829 | 104874 | 145041 | 89398 | 391375 | 268018 | 328245 |
3 | Alabama | 34821 | 2785 | 1523 | 1593 | 2851 | 1195 | 3832 | 1813 | 8272 | 4513 | 6444 |
4 | Alaska | 1812 | 273 | 110 | 124 | 159 | 29 | 166 | 57 | 320 | 261 | 313 |
Table 322.10. Bachelor’s degrees conferred by postsecondary institutions, by field of study: Selected academic years, 1970-71 through 2020-21
| | Table 322.10. Bachelor's degrees conferred by postsecondary institutions, by field of study: Selected academic years, 1970-71 through 2020-21 | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Unnamed: 9 | Unnamed: 10 | Unnamed: 11 | Unnamed: 12 | Unnamed: 13 | Unnamed: 14 | Unnamed: 15 | Unnamed: 16 | Unnamed: 17 | Unnamed: 18 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Field of study | 1970-71 | 1975-76 | 1980-81 | 1985-86 | 1990-91 | 1995-96 | 2000-01 | 2005-06 | 2011-12 | 2012-13 | 2013-14 | 2014-15 | 2015-16 | 2016-17 | 2017-18 | 2018-19 | 2019-20 | 2020-21 |
1 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
2 | Total | 839730 | 925746 | 935140 | 987823 | 1094538 | 1164792 | 1244171 | 1485104 | 1792163 | 1840381 | 1870150 | 1894969 | 1920750 | 1956114 | 1980665 | 2013086 | 2038682 | 2066445 |
3 | Agriculture and natural resources\1\ | 12674 | 19402 | 21886 | 17191 | 13363 | 21757 | 23766 | 23497 | 31629 | 34304 | 35953 | 37028 | 37827 | 38782 | 40334 | 41373 | 41858 | 41925 |
4 | Architecture and related services | 5570 | 9146 | 9455 | 9119 | 9781 | 8352 | 8480 | 9515 | 9727 | 9757 | 9149 | 9090 | 8825 | 8579 | 8464 | 8806 | 9045 | 9296 |
Table 322.20. Bachelor’s degrees conferred by postsecondary institutions, by race/ethnicity and sex of student: Selected academic years, 1976-77 through 2020-21
| | Table 322.20. Bachelor's degrees conferred by postsecondary institutions, by race/ethnicity and sex of student: Selected academic years, 1976-77 through 2020-21 | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Unnamed: 9 | Unnamed: 10 | Unnamed: 11 | Unnamed: 12 | Unnamed: 13 | Unnamed: 14 | Unnamed: 15 | Unnamed: 16 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Year and sex | Number of degrees conferred to U.S. citizens, ... | Number of degrees conferred to U.S. citizens, ... | Number of degrees conferred to U.S. citizens, ... | Number of degrees conferred to U.S. citizens, ... | Number of degrees conferred to U.S. citizens, ... | Number of degrees conferred to U.S. citizens, ... | Number of degrees conferred to U.S. citizens, ... | Number of degrees conferred to U.S. citizens, ... | Number of degrees conferred to U.S. citizens, ... | Percentage distribution of degrees conferred t... | Percentage distribution of degrees conferred t... | Percentage distribution of degrees conferred t... | Percentage distribution of degrees conferred t... | Percentage distribution of degrees conferred t... | Percentage distribution of degrees conferred t... | Percentage distribution of degrees conferred t... |
1 | Year and sex | Total | Total | White | Black | Hispanic | Asian/Pacific Islander | American Indian/Alaska Native | Two or more races\1\ | Nonresident | Total | White | Black | Hispanic | Asian/Pacific Islander | American Indian/Alaska Native | Two or more races\1\ |
2 | 1 | 2 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
3 | Total | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 1976-77 | 919549 | \2,3\ | 807688 | 58636 | 18743 | 13793 | 3326 | --- | 15714 | 100 | 89.525663 | 6.499325 | 2.07751 | 1.528842 | 0.36866 | --- |
Table 322.30. Bachelor’s degrees conferred by postsecondary institutions, by race/ethnicity and field of study: Academic years 2019-20 and 2020-21
| | Table 322.30. Bachelor's degrees conferred by postsecondary institutions, by race/ethnicity and field of study: Academic years 2019-20 and 2020-21 | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Unnamed: 9 | ... | Unnamed: 11 | Unnamed: 12 | Unnamed: 13 | Unnamed: 14 | Unnamed: 15 | Unnamed: 16 | Unnamed: 17 | Unnamed: 18 | Unnamed: 19 | Unnamed: 20 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Field of study | 2019-20 | 2019-20 | 2019-20 | 2019-20 | 2019-20 | 2019-20 | 2019-20 | 2019-20 | 2019-20 | ... | 2020-21 | 2020-21 | 2020-21 | 2020-21 | 2020-21 | 2020-21 | 2020-21 | 2020-21 | 2020-21 | 2020-21 |
1 | Field of study | Total | White | Black | Hispanic | Asian/Pacific Islander | Asian/Pacific Islander | Asian/Pacific Islander | American Indian/Alaska Native | Two or more races | ... | Total | White | Black | Hispanic | Asian/Pacific Islander | Asian/Pacific Islander | Asian/Pacific Islander | American Indian/Alaska Native | Two or more races | Nonresident |
2 | Field of study | Total | White | Black | Hispanic | Total | Asian | Pacific Islander | American Indian/Alaska Native | Two or more races | ... | Total | White | Black | Hispanic | Total | Asian | Pacific Islander | American Indian/Alaska Native | Two or more races | Nonresident |
3 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 |
4 | All fields, total | 2038682 | 1184082 | 197491 | 302663 | 161468 | 157085 | 4383 | 9154 | 77621 | ... | 2066445 | 1172187 | 206527 | 324848 | 169261 | 164845 | 4416 | 9545 | 81369 | 102708 |
5 rows × 21 columns
Table 322.40. Bachelor’s degrees conferred to males by postsecondary institutions, by race/ethnicity and field of study: Academic years 2019-20 and 2020-21
| | Table 322.40. Bachelor's degrees conferred to males by postsecondary institutions, by race/ethnicity and field of study: Academic years 2019-20 and 2020-21 | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Unnamed: 9 | ... | Unnamed: 11 | Unnamed: 12 | Unnamed: 13 | Unnamed: 14 | Unnamed: 15 | Unnamed: 16 | Unnamed: 17 | Unnamed: 18 | Unnamed: 19 | Unnamed: 20 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Field of study | 2019-20 | 2019-20 | 2019-20 | 2019-20 | 2019-20 | 2019-20 | 2019-20 | 2019-20 | 2019-20 | ... | 2020-21 | 2020-21 | 2020-21 | 2020-21 | 2020-21 | 2020-21 | 2020-21 | 2020-21 | 2020-21 | 2020-21 |
1 | Field of study | Total | White | Black | Hispanic | Asian/Pacific Islander | Asian/Pacific Islander | Asian/Pacific Islander | American Indian/Alaska Native | Two or more races | ... | Total | White | Black | Hispanic | Asian/Pacific Islander | Asian/Pacific Islander | Asian/Pacific Islander | American Indian/Alaska Native | Two or more races | Nonresident |
2 | Field of study | Total | White | Black | Hispanic | Total | Asian | Pacific Islander | American Indian/Alaska Native | Two or more races | ... | Total | White | Black | Hispanic | Total | Asian | Pacific Islander | American Indian/Alaska Native | Two or more races | Nonresident |
3 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 |
4 | All fields, total | 861384 | 509079 | 70346 | 117230 | 72916 | 71005 | 1911 | 3344 | 31620 | ... | 860764 | 499092 | 72092 | 123256 | 75704 | 73835 | 1869 | 3407 | 33003 | 54210 |
5 rows × 21 columns
Table 322.50. Bachelor’s degrees conferred to females by postsecondary institutions, by race/ethnicity and field of study: Academic years 2019-20 and 2020-21
| | Table 322.50. Bachelor's degrees conferred to females by postsecondary institutions, by race/ethnicity and field of study: Academic years 2019-20 and 2020-21 | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Unnamed: 9 | ... | Unnamed: 11 | Unnamed: 12 | Unnamed: 13 | Unnamed: 14 | Unnamed: 15 | Unnamed: 16 | Unnamed: 17 | Unnamed: 18 | Unnamed: 19 | Unnamed: 20 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Field of study | 2019-20 | 2019-20 | 2019-20 | 2019-20 | 2019-20 | 2019-20 | 2019-20 | 2019-20 | 2019-20 | ... | 2020-21 | 2020-21 | 2020-21 | 2020-21 | 2020-21 | 2020-21 | 2020-21 | 2020-21 | 2020-21 | 2020-21 |
1 | Field of study | Total | White | Black | Hispanic | Asian/Pacific Islander | Asian/Pacific Islander | Asian/Pacific Islander | American Indian/Alaska Native | Two or more races | ... | Total | White | Black | Hispanic | Asian/Pacific Islander | Asian/Pacific Islander | Asian/Pacific Islander | American Indian/Alaska Native | Two or more races | Nonresident |
2 | Field of study | Total | White | Black | Hispanic | Total | Asian | Pacific Islander | American Indian/Alaska Native | Two or more races | ... | Total | White | Black | Hispanic | Total | Asian | Pacific Islander | American Indian/Alaska Native | Two or more races | Nonresident |
3 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 |
4 | All fields, total | 1177298 | 675003 | 127145 | 185433 | 88552 | 86080 | 2472 | 5810 | 46001 | ... | 1205681 | 673095 | 134435 | 201592 | 93557 | 91010 | 2547 | 6138 | 48366 | 48498 |
5 rows × 21 columns
Analysis plan
The tables will need to be integrated into a single tabular format containing the information needed to answer the research question. The individual tables are already highly structured and have little missing data, but significant wrangling will likely be needed to integrate them into a usable format: the glimpses above show multi-row headers, footnote markers such as \1\, and placeholder values such as ---. After the tables are integrated, summary statistics can be calculated and the data can be visualized to answer the research question.
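As a minimal sketch of the kind of wrangling involved, the example below cleans a small frame shaped like the first rows of Table 319.30 as shown in the glimpse: the real header sits in row 0, a row of column numbers follows, and some labels carry footnote markers. The helper name `clean_label` and the regular expression are illustrative choices, not part of the source data or of any library.

```python
import re
import pandas as pd

# Rows copied from the Table 319.30 glimpse (first four columns only).
raw = pd.DataFrame([
    ["State or jurisdiction", "Total", "Humanities\\1\\", "Psychology"],
    [1, 2, 3, 4],
    ["United States", 2066445, 263894, 126944],
    ["Alabama", 34821, 2785, 1523],
    ["Alaska", 1812, 273, 110],
])

# Footnote markers look like \1\ or \2,3\ in these tables.
FOOTNOTE = re.compile(r"\\[\d,]+\\")

def clean_label(label):
    """Remove footnote markers and surrounding whitespace from a label."""
    return FOOTNOTE.sub("", str(label)).strip()

# Promote row 0 to the header and drop the column-number row.
header = [clean_label(x) for x in raw.iloc[0]]
tidy = raw.iloc[2:].reset_index(drop=True)
tidy.columns = header

# Convert count columns to numbers; unparseable entries become NaN.
for col in header[1:]:
    tidy[col] = pd.to_numeric(tidy[col], errors="coerce")
print(tidy)

# Placeholders such as "---" (seen in Table 322.20) are coerced to NaN.
placeholder_demo = pd.to_numeric(pd.Series(["3326", "---"]), errors="coerce")
print(placeholder_demo)
```

The tables with three header rows (322.30 through 322.50) would need the header rows combined rather than a single row promoted, so this sketch covers only the simplest case.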
Once the tables are in a usable format, two questions will need to be answered. First, we will need to identify which majors are traditional and which are fast-growing, that is, which majors have historically had, and continue to have, high numbers of degrees conferred, and which have only recently begun to grow. Second, we will investigate the demographic makeup of graduates in different fields of study. Because a large number of majors are listed, only a select few majors of interest (i.e. the most traditional and the newest) will be examined.
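One possible way to separate fast-growing fields from slower ones is to compare each field's growth with the growth in total degrees over the same period. The sketch below applies that rule to the three rows visible in the Table 322.10 glimpse. The rule itself and the label wording are assumptions for illustration; the final analysis may use a different window or threshold, and fields with no degrees recorded in the earliest year would need separate handling.

```python
import pandas as pd

# First and last year columns for the rows shown in the Table 322.10 glimpse.
degrees = pd.DataFrame(
    {"1970-71": [839730, 12674, 5570], "2020-21": [2066445, 41925, 9296]},
    index=[
        "Total",
        "Agriculture and natural resources",
        "Architecture and related services",
    ],
)

# Growth ratio: degrees conferred in 2020-21 per degree conferred in 1970-71.
growth = degrees["2020-21"] / degrees["1970-71"]
overall = growth["Total"]
fields = growth.drop("Total")

# Illustrative rule: a field is "fast-growing" if it outpaced total degrees.
labels = fields.gt(overall).map({True: "fast-growing", False: "slower than overall"})
print(round(overall, 2))
print(labels)
```

Comparing against the overall ratio rather than a fixed threshold accounts for the fact that total degrees conferred more than doubled over this period.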