Abstract

EUROCONTROL’s Performance Review Commission launched the 2024 PRC Data Challenge in July 2024 with the aim of engaging with data scientists and aviation enthusiasts for the development of an open model to estimate an aircraft’s take-off weight. The dataset for the challenge represents a unique instance of otherwise difficult-to-obtain flight information and could be reused for educational purposes or to further improve the outcome of the challenge.

Introduction

True to its values of openness, transparency, and reproducibility, the EUROCONTROL Performance Review Commission (PRC), established in 1998 by EUROCONTROL’s Permanent Commission, provides objective information and independent advice to EUROCONTROL’s governing bodies on the performance of European Air Traffic Management (ATM). These insights are based on extensive research, data analysis, and consultation with stakeholders. In 2023, the PRC decided to promote a data challenge that could help tackle the emerging issue of quantifying the impact of aviation on climate.

The PRC decided to focus the challenge on predicting the Actual Take-Off Weight (ATOW). ATOW is an essential input parameter for modeling the amount of fuel burnt during a flight and the gaseous emissions produced, such as carbon dioxide (CO$_2$), nitrogen oxides (NO$_x$), and sulfur dioxide (SO$_2$), among others. It was also important that the result of the challenge could be freely used with openly available input data. The collaboration with OpenSky Network (OSN) and fellow researchers from TU Delft and ONERA made it possible to design the challenge and the companion data set that are described in the following sections.

Background

During the design of the challenge, our initial hypothesis was that ATOW should depend on the following factors:

Parameters related to the origin and destination:
- The geographical distance between the two airports of a flight influences how much fuel an aircraft has to carry.
- Aerodrome of Departure (ADEP) or Aerodrome of Destination (ADES) may dictate Air Traffic Management (ATM) procedures like Standard Instrument Departure Route (SID)¹ and Standard Arrival Route (STAR)² that influence the trajectory flown and hence the extra fuel required.
- Both ADEP and ADES affect how an Aircraft Operator (AO) might plan and execute flights, for example, in selecting the potential airports for diversions, which can affect the decision on extra fuel to be carried on-board.
Information related to time:
- Depending on the time of day or day of the week when flights are planned, the flights may experience longer taxi times or measures influencing the capacity, such as re-routing, holding, and vectoring, all of which would affect the fuel decision.
- Seasonal trends, such as the International Air Transport Association (IATA) season schedule³, local time, and flight duration, could also affect the weight of the flight.
Information on the aircraft (airframe): the International Civil Aviation Organization (ICAO) type⁴ implies different aircraft performance profiles and hence different amounts of fuel needed.
Airline: Policies vary for different airlines, which can affect the take-off weight. For the same city-pair, airlines could select a different alternate aerodrome to be used in case of diversion due to technical issues. Airlines could also have different fuel tanking policies.
Operational data: The actual flown route length differs from the great-circle distance because of ATM constraints such as regularly allocated military areas. This parameter could refine the ATOW estimation. A similar effect, due to taxiway constraints, applies to taxi-out operations.
The 4D trajectory itself: The Automatic Dependent Surveillance–Broadcast (ADS-B) trajectory data contains a lot of information that helps to classify the way a flight has been flown. For example, the rate of climb and the maximum cruise level both depend on the aircraft’s weight.
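To make the distance-related hypotheses concrete, the following minimal Python sketch compares a flown route length with the great-circle distance between two airports. It is only an illustration: the function names, the spherical Earth model, and the chosen mean radius are our assumptions and not part of the dataset or of the challenge tooling. Distances are returned in nautical miles to match the unit used for the flown distance in the dataset.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius (spherical approximation)
KM_PER_NM = 1.852            # one nautical mile in kilometres (exact by definition)


def great_circle_nm(lat1, lon1, lat2, lon2):
    """Haversine distance in nautical miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a)) / KM_PER_NM


def route_extension(flown_nm, lat1, lon1, lat2, lon2):
    """Ratio of flown route length to great-circle distance (hypothetical feature)."""
    return flown_nm / great_circle_nm(lat1, lon1, lat2, lon2)
```

A ratio noticeably above 1 would indicate a route lengthened by constraints such as those mentioned above; whether such a feature actually improves an ATOW model is left to experimentation.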

Method

Based on the previous hypothesis and availability of the data sources, we constructed the dataset for the PRC Data Challenge. It consists of:

Actual Take-off Weight (ATOW) data: Flight information from EUROCONTROL’s Network Manager (NM) augmented with derived Take-Off Weight (TOW) from airlines. The airline information is anonymized. We have extracted a total of 527,162 flights that were flown throughout Europe in 2022. This represents 6.1% of the flights from the EUROCONTROL airspace.
Trajectory data: State vectors from the OpenSky Network [Schäfer et al. 2014] for the above flights, augmented with meteorological items from Copernicus ERA5 [Hersbach et al.] via the fastmeteo library [Sun 2025].

Due to data disclosure constraints, we could not disclose the identity of the airline operators or of the airframe (ICAO transponder code or registration number), so these parameters are not included in the open dataset.

Flight list with take-off weight data

The flight list used in the data challenge is derived from EUROCONTROL data, containing scheduled and non-scheduled flights, where we removed flights such as military, general aviation, sensitive, and state flights. The resulting bare flight list accounted for around 8,686,000 flights in 2022.

We further removed:

Flights with the same origin and destination airport
Flights with an unknown airport, i.e. where ADEP or ADES has the value ZZZZ or Air Filed (AFIL)⁵
Flights without a callsign or ICAO transponder address, which are required to match ADS-B trajectories
Flights with no complete weight data, such as missing fuel weight, or only having fuel weight
Flights from airlines that have not shared or agreed to share the take-off weight data
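The removal rules above can be sketched as a single predicate over flight records. This is a minimal illustration, not the code used to build the dataset: the record layout, the field names fuel_kg, zero_fuel_kg and tow_shared, and the sample records are all hypothetical.

```python
UNKNOWN_AIRPORTS = {"ZZZZ", "AFIL"}


def keep_flight(f):
    """Return True when a flight record survives all five removal rules."""
    if f["adep"] == f["ades"]:                        # same origin and destination
        return False
    if {f["adep"], f["ades"]} & UNKNOWN_AIRPORTS:     # unknown airport
        return False
    if not f.get("callsign") or not f.get("icao24"):  # cannot be matched to ADS-B
        return False
    # Hypothetical weight fields: both parts must be present for complete weight data.
    if f.get("fuel_kg") is None or f.get("zero_fuel_kg") is None:
        return False
    if not f.get("tow_shared", False):                # airline did not share TOW
        return False
    return True


# Made-up records, one per rule plus one that passes.
base = {"adep": "EHAM", "ades": "LEMD", "callsign": "ABC123", "icao24": "abc123",
        "fuel_kg": 9000, "zero_fuel_kg": 56000, "tow_shared": True}
flights = [
    {**base, "id": "ok"},
    {**base, "id": "same_airport", "ades": "EHAM"},
    {**base, "id": "unknown_airport", "adep": "ZZZZ"},
    {**base, "id": "no_icao24", "icao24": None},
    {**base, "id": "fuel_only", "zero_fuel_kg": None},
    {**base, "id": "not_shared", "tow_shared": False},
]
kept = [f["id"] for f in flights if keep_flight(f)]
```

Only the record labeled "ok" remains in kept; each of the other records trips exactly one rule.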

After filtering, 1,006,051 flights, containing take-off weight information, have been retained for the data challenge.

Trajectories from ADS-B data

Based on this list of flights with take-off weight information, we extracted the relevant ADS-B trajectories from OpenSky’s historical data. The parameters used for extracting state vectors are:

icao24
callsign
date (the date of Actual Off-Block Time (AOBT))
start (five minutes before AOBT)
stop (thirty minutes after actual Arrival Time (ARVT))
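The time window defined by these parameters can be computed as in the following sketch. The function name and the returned dictionary layout are our assumptions; only the offsets (five minutes before AOBT, thirty minutes after ARVT) come from the list above, and the timestamps in the example are made up.

```python
from datetime import datetime, timedelta, timezone


def extraction_window(aobt, arvt):
    """Return date, start and stop of the state vector query for one flight."""
    return {
        "date": aobt.date(),                    # date of the Actual Off-Block Time
        "start": aobt - timedelta(minutes=5),   # five minutes before AOBT
        "stop": arvt + timedelta(minutes=30),   # thirty minutes after ARVT
    }


# Made-up flight departing shortly before midnight UTC.
aobt = datetime(2022, 1, 1, 23, 58, tzinfo=timezone.utc)
arvt = datetime(2022, 1, 2, 1, 30, tzinfo=timezone.utc)
window = extraction_window(aobt, arvt)
```

Note that the window can span two calendar days while date stays tied to the off-block time, which matters when historical data is partitioned by day.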

The data extraction provided 527,162 trajectories, with the relevant flight list, which became the final ground truth flight dataset for the challenge.
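The successive reductions reported in this section can be summarized with a few lines of arithmetic. The sketch below only reuses the counts stated in the text; the bare flight list figure is approximate, so the derived shares are approximate as well.

```python
bare_flights = 8_686_000      # bare flight list for 2022 (approximate)
with_tow = 1_006_051          # flights retained after filtering, with take-off weight
with_trajectory = 527_162     # flights for which an ADS-B trajectory was extracted

share_with_tow = 100 * with_tow / bare_flights                # about 11.6 %
share_matched = 100 * with_trajectory / with_tow              # about 52.4 %
share_final = 100 * with_trajectory / bare_flights            # about 6.1 %

print(f"{share_with_tow:.1f}% {share_matched:.1f}% {share_final:.1f}%")
```

The last figure is consistent with the 6.1% share of flights quoted for the ATOW data.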

For the purpose of automatic ranking, we split the dataset into different training and testing sets; the proportions are shown in Figure 1. The split between training and testing is random. We evaluated the distribution of the aircraft types to ensure consistency between the training and testing datasets.
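A random split followed by a check of the aircraft type distribution could look like the sketch below. It is not the procedure actually used for the challenge: the function names, the seed, the default fractions (chosen to approximate the reported part sizes), and the synthetic flights are assumptions made for illustration.

```python
import random
from collections import Counter


def random_split(flights, fractions=(0.7, 0.2, 0.1), seed=0):
    """Shuffle a copy of the flights and cut it into three consecutive parts."""
    rng = random.Random(seed)
    shuffled = flights[:]
    rng.shuffle(shuffled)
    n_a = round(len(shuffled) * fractions[0])
    n_b = round(len(shuffled) * fractions[1])
    return shuffled[:n_a], shuffled[n_a:n_a + n_b], shuffled[n_a + n_b:]


def type_shares(flights):
    """Share of each aircraft type among the given flights."""
    counts = Counter(f["aircraft_type"] for f in flights)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


def max_share_gap(part, reference):
    """Largest absolute difference in aircraft type share between two sets."""
    ref, got = type_shares(reference), type_shares(part)
    return max(abs(got.get(t, 0.0) - s) for t, s in ref.items())


# Synthetic flights with two illustrative type codes.
flights = [{"flight_id": i, "aircraft_type": "A320" if i < 600 else "B738"}
           for i in range(1000)]
part_a, part_b, part_c = random_split(flights)
gaps = [max_share_gap(p, flights) for p in (part_a, part_b, part_c)]
```

Small values in gaps indicate that each part reproduces the overall type mix; a stratified split would be an alternative if some types were too rare for a purely random split to represent them.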

The differences between the datasets are:

Part A: The training dataset, train.csv. It was named challenge_set.csv in the 2024 PRC Data Challenge. It consists of 369,013 rows, one per flight. This is the dataset from which to learn and build the machine learning model: it contains the tow column with the ATOW values.
Part B: The initial testing dataset, test.csv. It was named submission_set.csv in the 2024 PRC Data Challenge. It consists of 105,959 rows. This dataset was used for submissions and ranking up to around one week before the deadline. Participants submitted it with their predicted ATOW values in the tow column; the true values were not disclosed during the competition.
Part B + C: The final test dataset, test_final.csv. It was named final_submission_set.csv in the 2024 PRC Data Challenge. It consists of 158,149 rows. This dataset was used for the final ranking in the last phase of the challenge. It added 52,190 rows to the test dataset, test.csv.
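The row counts of the three parts can be cross-checked against each other and against the 527,162 extracted trajectories, as in this short sketch. The variable names are ours; the counts are those stated above.

```python
parts = {"A": 369_013, "B": 105_959, "C": 52_190}

final_test_rows = parts["B"] + parts["C"]   # rows in the final test dataset
total_rows = sum(parts.values())            # rows in the full ground truth dataset

# Percentage of the full dataset held by each part, rounded to one decimal.
shares = {name: round(100 * n / total_rows, 1) for name, n in parts.items()}
print(final_test_rows, total_rows, shares)
```

The parts therefore correspond to roughly 70%, 20% and 10% of the flights.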

Parameters in the final dataset

After the end of the data challenge, we deliver the full ground truth dataset in flight_list.csv. It consists of all the 527,162 rows, i.e. A + (B + C), inclusive of tow values.

The parameter names, description and units are listed as follows:

Flight identifications:
- flight_id: unique flight ID generated using the traffic library
- callsign: obfuscated callsign of the flight
Origin and destination airports:
- adep: departure airport ICAO code
- ades: arrival airport ICAO code
- name_adep: departure airport name
- country_code_adep: departure country code
- name_ades: arrival airport name
- country_code_ades: arrival country code
Date and time:
- date: date of flight (UTC)
- actual_offblock_time: Actual offblock time (UTC)
- arrival_time: Arrival time (UTC)
Aircraft information:
- aircraft_type: ICAO aircraft typecode
- wtc: wake turbulence category, see footnote in Table [tbl-aircraft-types]
Airline information:
- airline: obfuscated airline code
Operational parameters:
- flight_duration: flight duration (in minutes)
- taxiout_time: taxi-out time (in minutes)
- flown_distance: route length (in nautical miles)
- tow: estimated take-off weight (in kg)
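As a reading aid, the sketch below parses a file with the columns listed above using only the Python standard library. The sample row is entirely made up, and the timestamp format (ISO 8601 with an explicit UTC offset) is an assumption that should be verified against the published files.

```python
import csv
import io
from datetime import datetime

COLUMNS = [
    "flight_id", "callsign", "adep", "ades", "name_adep", "country_code_adep",
    "name_ades", "country_code_ades", "date", "actual_offblock_time",
    "arrival_time", "aircraft_type", "wtc", "airline", "flight_duration",
    "taxiout_time", "flown_distance", "tow",
]

# Made-up sample standing in for flight_list.csv.
SAMPLE = ",".join(COLUMNS) + "\n" + ",".join([
    "1", "a1b2c3", "EHAM", "LEMD", "Amsterdam", "NL", "Madrid", "ES",
    "2022-01-01", "2022-01-01T10:00:00+00:00", "2022-01-01T12:35:00+00:00",
    "A320", "M", "x9y8z7", "140", "15", "800", "65000",
]) + "\n"


def parse_row(row):
    """Convert the time and numeric columns of one CSV row to Python types."""
    out = dict(row)
    for key in ("actual_offblock_time", "arrival_time"):
        out[key] = datetime.fromisoformat(row[key])       # assumed ISO 8601
    for key in ("flight_duration", "taxiout_time", "flown_distance", "tow"):
        out[key] = float(row[key])                        # min, min, NM, kg
    return out


reader = csv.DictReader(io.StringIO(SAMPLE))
missing = set(COLUMNS) - set(reader.fieldnames)   # columns absent from the file
flights = [parse_row(r) for r in reader]
```

For the actual file, io.StringIO(SAMPLE) would be replaced by an open file handle; given the size of the archive, a dataframe library may be more practical for the trajectory data.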

In terms of ICAO aircraft types, there are 30 distinct ones in the dataset; the top 10 account for around 82% of the total flights, see Figure 2 and Table [tbl-aircraft-types].

Figure 2: The distribution of the aircraft types. The top 10 aircraft types account for more than 80% of the flights in the dataset.

In terms of city pairs, there are 2,836 (undirected) city pairs in the dataset. The top 132 cover 50% of the traffic, see Figure 3.

Figure 3: All city pairs in the dataset (a); the top 132 pairs accounting for 50% of the flights (b); and all connections with at least 100 flights (c).
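The city pair statistics can be reproduced along the lines of the following sketch, which counts undirected pairs and finds how many of the busiest pairs are needed to reach a given share of the flights. Using the adep and ades airport codes as a proxy for cities is our assumption, as are the function name and the toy input.

```python
from collections import Counter


def pairs_for_coverage(flights, target=0.5):
    """Return (number of undirected pairs, number of top pairs reaching target)."""
    # Sorting the two codes makes A-B and B-A count as the same pair.
    counts = Counter(tuple(sorted((f["adep"], f["ades"]))) for f in flights)
    total = sum(counts.values())
    covered = 0
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered >= target * total:
            return len(counts), rank
    return len(counts), len(counts)


# Toy input with placeholder airport codes: 10 flights over 3 undirected pairs.
toy = (
    [{"adep": "AAAA", "ades": "BBBB"}] * 3
    + [{"adep": "BBBB", "ades": "AAAA"}] * 2
    + [{"adep": "CCCC", "ades": "DDDD"}] * 3
    + [{"adep": "EEEE", "ades": "FFFF"}] * 2
)
n_pairs, n_top = pairs_for_coverage(toy, target=0.5)
```

On the toy input the busiest pair alone carries half of the flights; applied to the full flight list, the same computation would be expected to return figures comparable to those quoted above, provided airports map one-to-one to cities.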

The dataset shows the typical seasonality of summer peak and winter trough but not for all aircraft types, see Figure 4.

Figure 4: Monthly number of flights per aircraft type.

Data Archive

The data set for the 2024 PRC Data Challenge is available at
https://doi.org/10.4121/8cb8484b-dbe7-4750-8b87-a5b1dbc621b4

The overall size is around 286 GiB, mainly due to the trajectory files. The dataset is licensed under CC BY 4.0.

Supplementary tables

Acknowledgement

We are grateful for the support we received from the EUROCONTROL PRC and particularly the support from the Commissioner José Miguel de Pablo Guerrero.

Author contributions


Enrico Spinielli: Conceptualization, Data, Writing - Original draft
Junzi Sun: Conceptualization, Data curation, Writing - Original draft
Martin Strohmeier: Conceptualization
Xavier Olive: Conceptualization
Quinten Goens: Conceptualization
Rainer Koelle: Conceptualization
Allan Tart: Conceptualization
John Fitzgerald: Conceptualization

Open data statement

The open dataset can be downloaded from:
https://doi.org/10.4121/8cb8484b-dbe7-4750-8b87-a5b1dbc621b4

Reproducibility statement

The source code of all the competition teams can be found at:
https://github.com/PRC-Data-Challenge-2024/

Hersbach, H., Bell, B., Berrisford, P., et al. ERA5 hourly data on pressure levels from 1940 to present. https://cds.climate.copernicus.eu/datasets/reanalysis-era5-pressure-levels?tab=overview.

Sun, J. 2025. open-aviation/fastmeteo. https://github.com/open-aviation/fastmeteo.

Schäfer, M., Strohmeier, M., Lenders, V., Martinovic, I., and Wilhelm, M. 2014. Bringing Up OpenSky: A Large-scale ADS-B Sensor Network for Research. Proceedings of the 13th International Symposium on Information Processing in Sensor Networks, 83–94.

¹ A SID is a standard Air Traffic Service (ATS) route identified in an instrument departure procedure by which aircraft should proceed from the take-off phase to the en-route phase.
² A STAR is a standard ATS route identified in an approach procedure by which aircraft should proceed from the en-route phase to an initial approach fix.
³ The IATA Summer schedule for a given year begins on the last Sunday of March and ends on the last Saturday of October of the same year. The IATA Winter schedule begins on the Sunday after the last Saturday of October and ends on the Saturday before the last Sunday of March of the next year.
⁴ And possibly the engine types and age, but these data points are not reliably or openly available and as such were not included in the dataset.
⁵ An AFIL is recorded by air traffic controllers and encodes a flight plan received from an aircraft already in flight.