Flight Matrix

INFO 511 - Fall 2024 - Final Project

Project description

Author: Lean Mean Learning Machines

Affiliation: School of Information, University of Arizona

Abstract

This study analyzes flight data from major Belgian airports using the flights.csv dataset, focusing on patterns and trends in flight departures and arrivals. The primary research question is how the total number of monthly and yearly flight departures and arrivals at Brussels Airport compares to other airports in Belgium, and whether seasonal trends significantly affect flight volumes. The study also examines how flights are distributed across Belgium’s airports. To answer these questions, we conducted a comprehensive data exploration, preceded by data preprocessing to ensure consistency and accuracy.

Introduction

This project focuses on analyzing flight data from January 2016 to January 2020 for major airports in Belgium, with an emphasis on Brussels Airport, one of the largest and busiest in Europe. Specifically, this project investigates how the total number of flights at Brussels Airport compares to other airports in Belgium and examines seasonal trends over the year. Insights from this study could assist airport management in anticipating seasonal changes, optimizing resource allocation, and improving operational efficiency.

Data Description

This data comes from Eurocontrol, an international organization for air traffic control management throughout Europe. Air Traffic Flow Management (ATFM) collects information on daily flights arriving at and departing from European airports (Eurostat, "Commercial air transport in August 2021: in recovery"). We sourced the data from the GitHub repository associated with the TidyTuesday project (rfordatascience/tidytuesday, data/2022/2022-07-12/readme.md). The data contains 15 columns and 7306 rows for flights from 2016 to 2019. The columns contain information such as the date of the flight, airport designator, airport name, country of the airport, and the numbers of departures from and arrivals at the airport.

Key variables we have applied in our research are:

Quantitative Variables:

  • FLT_DEP_1 - Contains total number of departing flights

  • FLT_ARR_1 - Contains total number of arriving flights

Categorical Variables:

  • APT_NAME - Contains airport names
  • STATE_NAME - Contains name of the country
  • MONTH_MON - Contains the month in which the records of departures and arrivals took place

Data Cleaning and Preprocessing

To ensure data integrity and analytical clarity, several preprocessing steps were undertaken:

  • Date Conversion: The variable YEAR_MONTH was converted to a date-time format for temporal analysis.

  • Derived Variables: New variables, such as total monthly flights (sum of departures and arrivals), were created to facilitate trend analysis.

  • Handling Missing Values: Observations with incomplete data for key variables were excluded.

  • Grouping and Aggregation: Data was aggregated by airport and by month and year to examine trends and seasonal patterns.
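The preprocessing steps above can be sketched in pandas as follows. This is a minimal illustration, not the project's actual code: the small `raw` table is made-up stand-in data, the `YEAR_MONTH` values are assumed to be strings such as "2016-01", and the names `preprocess` and `FLT_TOT` are hypothetical.

```python
import pandas as pd

# Made-up stand-in for flights.csv (assumption: YEAR_MONTH is a "YYYY-MM" string).
raw = pd.DataFrame({
    "YEAR_MONTH": ["2016-01", "2016-01", "2016-01", "2016-02", "2016-02"],
    "APT_NAME": ["Brussels", "Brussels", "Charleroi", "Brussels", "Charleroi"],
    "FLT_DEP_1": [300, 310, 90, 280, None],  # one missing departure count
    "FLT_ARR_1": [305, 295, 95, 285, 80],
})


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the four preprocessing steps and return monthly totals per airport."""
    df = df.copy()
    # Date conversion: parse YEAR_MONTH into a datetime (first day of the month).
    df["YEAR_MONTH"] = pd.to_datetime(df["YEAR_MONTH"], format="%Y-%m")
    # Handling missing values: drop rows lacking either key count.
    df = df.dropna(subset=["FLT_DEP_1", "FLT_ARR_1"])
    # Derived variable: total flights = departures + arrivals (hypothetical name).
    df["FLT_TOT"] = df["FLT_DEP_1"] + df["FLT_ARR_1"]
    # Grouping and aggregation: one row per airport and month.
    return df.groupby(["APT_NAME", "YEAR_MONTH"], as_index=False)[
        ["FLT_DEP_1", "FLT_ARR_1", "FLT_TOT"]
    ].sum()


monthly = preprocess(raw)
print(monthly)
```

In this sketch the Charleroi row for February 2016 is dropped because its departure count is missing, so the result has three airport-month rows.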

Exploratory Data Analysis

As shown in Figure 1, Brussels Airport consistently had the highest flight activity, highlighting its dominant role in Belgian air traffic. The chart also exhibits clear seasonal peaks during the summer and declines in winter.

Figure 1. Monthly departures and arrivals at Brussels Airport and other major airports from 2016 to 2019.

Figure 2. The average monthly flight activity for Brussels Airport and other major airports.

Figure 2, which plots the average monthly trend, also shows strong seasonality at Brussels Airport: flight activity increases gradually from January to July, peaks in the summer, and then declines toward December.

Together, these two figures indicate a clear relationship between monthly flight activity and the season, and this relationship is observed across airports.

Methodology

Overview of Analysis

This analysis examines seasonal trends and monthly flight activity at Brussels Airport compared to other major Belgian airports for the years 2016 to 2019. It includes exploratory data analysis, statistical tests, and modeling to check for differences in flight activity among the airports.

Seasonal Analysis

Goal: To find seasonal trends in flight activity at Brussels Airport and compare them with other airports.

Steps:

  1. Combining the monthly data (like grouping departures, arrivals and total flights) for all years to show seasonal patterns.
  2. Calculating monthly averages for Brussels and other airports.
  3. Plotting line charts to show the trends, with peaks in summer and dips in winter.

This helps us to understand the travel trends and plan for busy or slow times. See Figure 1.
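As a rough sketch of the seasonal steps listed above, the snippet below averages each calendar month across years for Brussels and for all other airports combined. The flight counts are invented for illustration, and the names `seasonal_profile`, `GROUP`, `MONTH`, and `FLT_TOT` are assumptions rather than names taken from the project code.

```python
import pandas as pd

# Invented monthly totals for two years and three airports (illustration only).
monthly = pd.DataFrame({
    "APT_NAME": ["Brussels"] * 4 + ["Charleroi"] * 4 + ["Antwerp"] * 4,
    "YEAR_MONTH": pd.to_datetime(
        ["2016-01-01", "2017-01-01", "2016-07-01", "2017-07-01"] * 3
    ),
    "FLT_TOT": [600, 620, 900, 940, 100, 110, 150, 170, 20, 30, 40, 40],
})


def seasonal_profile(df: pd.DataFrame, hub: str = "Brussels") -> pd.DataFrame:
    """Average flights per calendar month for the hub and for all other airports."""
    df = df.copy()
    # Keep the hub's name; relabel every other airport as one combined group.
    df["GROUP"] = df["APT_NAME"].where(df["APT_NAME"] == hub, "Other airports")
    df["MONTH"] = df["YEAR_MONTH"].dt.month
    # Step 1: total flights per group for each year-month.
    per_month = df.groupby(["GROUP", "YEAR_MONTH", "MONTH"], as_index=False)[
        "FLT_TOT"
    ].sum()
    # Step 2: average each calendar month across the years.
    return per_month.groupby(["GROUP", "MONTH"])["FLT_TOT"].mean().unstack("GROUP")


profile = seasonal_profile(monthly)
print(profile)
```

With matplotlib installed, calling `profile.plot()` on the result would draw one line per group, which corresponds to step 3.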

Total number of monthly flight departures and arrivals at Brussels Airport

Goal: To compare monthly flight activity at Brussels Airport with other major airports.

Steps:

  1. Combining the monthly data (like grouping departures, arrivals and total flights) for all years to show seasonal patterns.
  2. Comparing Brussels data with average monthly activity of other airports.
  3. Creating a bar and line plot to highlight the difference in flight activity.

Figure 2 shows how Brussels Airport dominates flight activity compared to the other airports across consecutive years.

Hypothesis Testing

Goal: In this section, we test whether there are significant variations in flight distributions among Belgian airports. The test assesses whether the differences in flight activity among airports are due to random chance or reflect meaningful patterns. We hypothesize that the flight distribution patterns are not due to random chance.

Steps:

  1. Creating a contingency table containing summarized flight activity by airport and month.
  2. Calculating the expected frequencies from the marginal totals of the contingency table, under the assumption that flight activity is evenly distributed among the airports.
  3. Comparing the observed values with the expected values using the chi-squared test statistic.
  4. Using the test statistic and the corresponding p-value to determine whether the observed differences are statistically significant.
  5. Creating a bar plot to compare the observed and expected frequencies for each airport, making the deviations clear.
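For reference, the comparison in step 3 is presumably the standard Pearson chi-squared statistic. The notation below is our own and does not appear in the project: O_ij is the observed count for airport i in month j, E_ij is the corresponding expected count derived from the marginal totals, N is the grand total, and r and c are the numbers of airports and months.

```latex
E_{ij} = \frac{\left(\sum_{k} O_{ik}\right)\left(\sum_{k} O_{kj}\right)}{N},
\qquad
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{\left(O_{ij} - E_{ij}\right)^2}{E_{ij}},
\qquad
\text{df} = (r - 1)(c - 1)
```

A large value of the statistic relative to the chi-squared distribution with these degrees of freedom yields a small p-value.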

Results

A. How does the total number of monthly flight departures and arrivals at Brussels Airport compare to other major airports in Belgium?

By aggregating the data by the variables YEAR_MONTH and APT_NAME, we obtained the number of departures, arrivals, and total flights for each airport in each month of each year. We then separated the Brussels Airport data from the other Belgian airports, giving two sets for comparison. Grouping each set by YEAR_MONTH gave us the sum of flights in each month of each year for Brussels and for all other airports, respectively. Plotting this data on a line graph over time gives Figure 1. This graph highlights two key aspects: 1) Brussels is the busiest airport in Belgium by number of flights, with more than double the number of flights per month of the other airports (with the exception of April 2016; see Discussion for more information). 2) There appear to be seasonal patterns from 2016 to 2019 for both Brussels Airport and the other Belgian airports, with the peak season in summer and the lowest number of flights in winter.
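One way to check the "more than double" observation is to compute the ratio of Brussels flights to the combined flights of the other airports for each month. The sketch below uses made-up numbers, and the column names `FLT_TOT`, `brussels`, `others`, and `ratio` are hypothetical.

```python
import pandas as pd

# Made-up monthly totals per airport (illustration only, not the real data).
totals = pd.DataFrame({
    "YEAR_MONTH": ["2016-03"] * 3 + ["2016-04"] * 3 + ["2016-05"] * 3,
    "APT_NAME": ["Brussels", "Charleroi", "Liege"] * 3,
    "FLT_TOT": [18000, 5000, 3000, 9000, 6000, 3500, 19000, 5500, 3200],
})

# Split Brussels from the rest, then sum each set by month.
is_hub = totals["APT_NAME"] == "Brussels"
brussels = totals[is_hub].groupby("YEAR_MONTH")["FLT_TOT"].sum()
others = totals[~is_hub].groupby("YEAR_MONTH")["FLT_TOT"].sum()

comparison = pd.DataFrame({"brussels": brussels, "others": others})
comparison["ratio"] = comparison["brussels"] / comparison["others"]

# Months in which Brussels does NOT have more than double the other airports' flights.
below_double = comparison[comparison["ratio"] <= 2]
print(below_double)
```

In this invented example only one month falls below the threshold, mirroring the kind of exception noted for April 2016.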

C. Hypothesis testing

We used a chi-squared test to examine the flight distribution across different airports in Belgium. Our hypothesis in this section is that the majority of flights are at Brussels Airport. We chose the chi-squared test instead of the h-test because our data has enough categories for it. We used the chi2_contingency function from the scipy.stats package, with the expectation that flights are equally distributed across all airports in Belgium. The function compared the expected values to the observed values and returned the following results:

  • Chi-Square Statistic: 2323.9454386944262

  • p-value: 0.0

With our significance level set at 0.05, we are able to reject the null hypothesis: our results show that the flights are not evenly distributed. The reported p-value of 0.0 means the true p-value is too small to be represented at floating-point precision, not that it is exactly zero. The chi-squared statistic of approximately 2323.95 tells us that the differences from the expected values are large.
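A minimal sketch of such a test with chi2_contingency is shown below. The 3x3 table is invented for illustration and is not the project's contingency table. Note that chi2_contingency derives the expected counts from the table's row and column totals, so it tests whether the airport and the month are independent, rather than comparing against a uniform split.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Invented contingency table: rows are airports, columns are months.
observed = np.array([
    [600, 900, 650],  # hub airport
    [150, 160, 150],  # second airport
    [50, 40, 60],     # third airport
])

# Returns the statistic, p-value, degrees of freedom, and expected counts.
chi2, p_value, dof, expected = chi2_contingency(observed)

alpha = 0.05  # significance level
reject_null = p_value < alpha

print(f"Chi-square statistic: {chi2:.2f}")
print(f"p-value: {p_value:.4g}")
print(f"Degrees of freedom: {dof}")
print(f"Reject null hypothesis at alpha={alpha}: {reject_null}")
```

Printing the p-value with a format such as `.4g` shows very small values in scientific notation. A result displayed as 0.0 indicates that the value has underflowed to zero in floating point.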

D. Statistical Analysis

In this section, we show the differences between the expected and observed values for flight distributions. Figure 3 is a bar plot of the deviation of the observed counts from the expected counts for each airport in each month of the year, aggregated over the years 2016 to 2019. Each bar is labeled as month, airport, and a deviation of 0 means the observed count matches the expected count.

Figure 3. Deviation of the observed flight counts from the expected counts (a difference of 0) for each airport in each month of the year, from 2016 to 2019.
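The deviations plotted in a chart like Figure 3 could be computed along the following lines. This is a sketch with invented counts. The expected counts are derived from the row and column totals, which is the same definition chi2_contingency uses for its expected frequencies, and the "month, airport" label format is assumed from the description above.

```python
import numpy as np
import pandas as pd

# Invented observed counts: rows are airports, columns are months.
observed = pd.DataFrame(
    [[600, 900, 650], [150, 160, 150], [50, 40, 60]],
    index=["Brussels", "Charleroi", "Antwerp"],
    columns=["Jan", "Jul", "Oct"],
)

# Expected count = row total * column total / grand total.
row_totals = observed.sum(axis=1).to_numpy()
col_totals = observed.sum(axis=0).to_numpy()
grand_total = observed.to_numpy().sum()
expected = pd.DataFrame(
    np.outer(row_totals, col_totals) / grand_total,
    index=observed.index,
    columns=observed.columns,
)

# Deviation of observed from expected; 0 means a perfect match.
deviation = observed - expected

# Flatten to one value per "month, airport" label, ready for a bar plot.
deviations = pd.Series({
    f"{month}, {airport}": deviation.loc[airport, month]
    for airport in deviation.index
    for month in deviation.columns
})
print(deviations.round(2))
```

With matplotlib installed, `deviations.plot(kind="bar")` would draw one bar per month and airport combination. Because the expected counts share the observed marginal totals, the deviations sum to zero.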

Discussion

Summary of Results

This project provides insights into flight dynamics at Belgian airports, with a specific focus on Brussels Airport, the largest and busiest airport in Belgium. The results show that Brussels Airport consistently has the highest monthly flight activity compared to other major airports in Belgium. Seasonal trends are evident, with notable peaks during the summer months and declines in winter. These findings support the hypothesis that Brussels Airport plays a central role in Belgian air traffic and highlight the importance of seasonality in air traffic management. Statistical analysis further confirms significant differences in flight activity across airports, underscoring Brussels Airport’s dominance in departures and arrivals.

Limitations and Possible Improvements

  • Data scope: This project analyzes flight data spanning 2016 to 2019. It intentionally excludes the COVID period, which had a significantly negative impact on the global air industry. To address this limitation, the analysis could be repeated on a larger dataset to study long-term trends.
  • Outliers: This analysis focuses on overall trends and does not account for specific outliers. For example, in April 2016 (Fig. 1), flight activity decreased at Brussels Airport while increasing at other airports. This anomaly aligns with the terrorist attacks at Brussels Airport, as reported by BBC News ("Brussels explosions: What we know about airport and metro attacks").
  • Statistical Constraints: Hypothesis testing assumes independence among groups, which may not fully reflect the interconnected nature of air traffic systems. More advanced methods could be applied to uncover deeper relationships in the data.
  • External Variables: The analysis focuses only on outcomes (flight counts) and does not account for external factors, such as economic conditions or weather events, which may affect flight activity. To improve it, several datasets would need to be combined to include additional variables, such as passenger volumes, flight delays, or economic indicators.

Future Work

In the future, we can do further analysis on:

  • The impact of flight status on passenger satisfaction.

  • Predicting flight activity based on historical records.

  • The influence of mobile device services (such as online check-in) on flight operations.

In conclusion, this project examines historical flight records for Brussels Airport in Belgium, highlighting the airport's critical role in Belgian air traffic. The results demonstrate the significance of seasonality in flight operations and provide a foundation for further exploration and practical applications in air traffic management and airport planning.