The study of the environmental transition of the aviation sector calls for prospective traffic scenarios. Detailed traffic and emissions inventories are often needed to refine the available analyses and to enable the simulation of regionalised scenarios. In past studies, these inventories are generally based on commercial, proprietary traffic data, making their dissemination problematic and reducing the reproducibility of the science produced. Open-source alternatives do exist, but with limited geographical coverage. This paper presents a method to aggregate different sources of flight information, in order to obtain an open-source air traffic dataset for 2019. Then, missing flight information is identified and completed using an airline route database built from Wikipedia parsing and related socio-economic data. After that, several reference datasets are used to evaluate the accuracy of the extended open-source dataset. Despite varying accuracy for different routes, major traffic flows are reasonably well estimated at the country and continental levels. Finally, the CO2 emissions are obtained using an existing aircraft performance surrogate model, and their accuracy is assessed against the results of previous studies.
In the context of climate change, it is necessary to quickly reduce greenhouse gas emissions to limit the consequences of global warming, which requires the implementation of mitigation strategies across all business sectors [Climate Change 2022]. Although commercial aviation currently contributes only 2.6% of those emissions, the trend is for this proportion to increase [Cames et al. 2015]. This is due to the expected growth in air traffic [Airbus 2023; Boeing 2023] and the limited technological options to increase aircraft efficiency [Delbecq et al. 2023]. Achieving deeper decarbonisation requires in particular the use of Sustainable Aviation Fuels (SAF) such as biofuels, electrofuels or hydrogen if produced using low-carbon energies [Group 2021; NLR et al. 2021]. However, this raises other issues, such as the consumption of resources like biomass and electricity, which other sectors are also looking to as means of decarbonisation [Ansell 2023].
Some of these mitigation measures are likely to be more expensive for the air transport industry. For example, the widespread use of sustainable aviation fuels could increase the operating expenses of airlines by around 40% by 2050 [Salgas et al. 2023b], meaning that public policies such as taxes or subsidies will be necessary to foster their adoption. Different options exist to allow the implementation of such policies, such as blending mandates or aircraft efficiency regulations. The various low-carbon energies considered are not all equivalent, from both a sustainability and an economic point of view [Salgas et al. 2023b]. Choosing the adequate option could lead to substantial improvements in the energy transition efficiency [Salgas et al. 2023a]. This highlights the need for a detailed multidisciplinary evaluation of different prospective energy transition scenarios. Such work is done by both industrial actors [Group 2021; NLR et al. 2021] and academia [Grewe et al. 2021; Planès et al. 2021; Klöwer et al. 2021; Dray et al. 2022; Bergero et al. 2023; Sacchi et al. 2023].
A common requirement for prospective scenarios is to have emissions inventories in a base year from which trends can be projected to estimate future emissions. Such inventories could be based on detailed commercial flight schedule databases that are not open-source [Graver et al. 2020] or on the total fuel consumption of the sector, for instance from International Energy Agency (IEA) [Agency 2023]. This solution, despite being open-source and allowing free dissemination of data, is not detailed enough to capture the geographical disparities of air transport and the associated different growth perspectives [Airbus 2023; Boeing 2023]. Similarly, regional analysis capacities could be relevant when it comes to biomass or electricity characteristics or to better replicate the various coverages of existing legislative measures.
This work presents a methodology for estimating air traffic flows for a given year (2019) with an acceptable level of accuracy, based exclusively on open-source data. This paper is also part of the development of AeroMAPS, a dedicated open-source prospective scenario simulator used for instance in [Planès et al. 2023], whose regionalised assessment capacity is intended to be developed in the future.
First, the data sources used are introduced and the data pipeline used is presented step by step in Section 2. The resulting dataset is then evaluated and validated in Section 3, before concluding remarks and perspectives are given in Section 4.
The overall process for obtaining an open-source database is described in this section. 2019 has been chosen as the reference year for building this database, as the following 3 years were largely disrupted by the consequences of COVID-19 [International Air Transport Association 2023]. The main objective of the process is to obtain, for each air route, the associated traffic volume and, if possible, the aircraft used, in order ultimately to estimate the associated CO2 emissions. The traffic metric used is the number of seats available on each route, rather than the number of passengers, because the former is a better proxy of the number of flights. This section summarises the overall process shown in Figure 1. The different steps are briefly described in the following paragraphs, and in more detail in dedicated sections.
There are numerous open-source datasets available, but none of them provide global coverage, which is only available from commercial sources. They are described in Section 2.2. As a first step, the chosen approach is to combine those datasets in order to achieve the greatest possible spatial coverage. In order to address overlapping sources, a prioritisation logic based on source characteristics is introduced.
Although the combined dataset is more complete than the individual sources, it remains incomplete. To fill this gap, a specific method is proposed in Section 2.3. A comprehensive but disaggregated data source is used: the collaborative encyclopaedia, Wikipedia. In fact, there is a recommended design pattern for airport pages which includes a section that lists all the destinations served from the airport, along with the airlines [(community) 2009]. This information is easily accessible to the author of an airport’s Wikipedia article, as it is often available on the website of the airport. Automated retrieval of these lists is relatively easy, provided that the list of Wikipedia URLs associated with commercial airports is available. This process, described in Section 2.3.1, provides a much more comprehensive list of routes than was previously available. However, there is no information on the seating capacity associated with each route or the frequency of flights. It must therefore be estimated. To do this, the open-source data mentioned above is used again, this time to train a regression model. It uses economic, geographical and statistical data associated with the airport and/or country of origin and destination of each flight. The sources and models used are described in Sections 2.3.2 and 2.3.3. Once the training is complete, it is possible to determine the traffic on each route found previously.
Concerning traffic data, the final step is to aggregate this estimation with open-source data, prioritising the latter where available. The method is described in Section 2.4.
Lastly, CO2 emissions are computed as explained in Section 2.5. If the aircraft model is known, a surrogate aircraft performance model [Seymour et al. 2020] is used, and otherwise, the fleet average consumption per seat at the corresponding distance is used.
Various open-source flight data are available online, as shown in Table 1, but each one has its own limitations. The estimation of air transport CO2 emissions often does not require the detailed trajectories of individual flights. Instead, a relatively aggregated level of data is usually sufficient.
The first category of sources comes from administrations (Adm.). The format offered by [Statistics 2022] is particularly well suited, with each data item compiling all the monthly flights of a given airline, with a given aircraft type, on a given route and with the associated payload. However, the extent of the database is limited to flights going to and from the United States of America (USA). A relatively similar dataset is made available by the World Bank [Bank 2023b], and includes all the international flights. The drawback in this case is that the database does not provide information on the airlines and aircraft used. It is not a maintained database: only a single edition exists, whose most recent data items correspond to 2019, the year studied in this paper. Brazilian [Aviação Civil (ANAC) 2023] and Australian [Government 2023] civil aviation authorities also provide such information with various levels of detail, as can be seen in Table 1.
The other category of sources comes from radar or ADS-B monitoring of flights. This is offered by [Schäfer et al. 2014; Eurocontrol 2023], the latter being open-access for academia but not fully open-source. The former is completely open-source and based on a collaborative ADS-B collection network but still lacks coverage. Those radar sources do not feature payload information but, contrary to most administrative sources, they provide information on the aircraft used and its operator. In this case, the payload can be retrieved using the average seating capacity of each aircraft type (using an aircraft database made available for academia) and average load factors.
| Source | Coverage | Collection | Route | A/C Type | Airline | Payload | Ref. |
|---|---|---|---|---|---|---|---|
| BTS T-100 | To/from USA | Adm. | ✓ | ✓ | ✓ | ✓ | [Statistics 2022] |
| OpenSky | Global (partial) | Radar | ✓ | ✓ | ✓ | - | [Schäfer et al. 2014] |
| Eurocontrol | To/from EU | Radar | ✓ | ✓ | ✓ | - | [Eurocontrol 2023] |
| World Bank | International | Adm. | ✓ | - | - | ✓ | [Bank 2023b] |
| ANAC | Brazil | Adm. | ✓ | - | ✓ | ✓ | [Aviação Civil (ANAC) 2023] |
| AUS Stats | Australia | Adm. | ✓ | - | - | ✓ | [Government 2023] |
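The conversion of radar flight counts into seats can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the capacities in `AVERAGE_SEATS`, the load factor and the function name `seats_offered` are placeholders introduced here.

```python
# Illustrative average seating capacities per aircraft type (placeholder
# values, NOT taken from the aircraft database used in the paper).
AVERAGE_SEATS = {"A320": 180, "B738": 189}


def seats_offered(flight_counts, average_seats):
    """Convert flight counts into seats offered per route.

    flight_counts maps (origin, destination, aircraft_type) to a number of
    flights. Entries whose aircraft type has no known capacity are skipped.
    """
    seats = {}
    for (origin, destination, aircraft), flights in flight_counts.items():
        capacity = average_seats.get(aircraft)
        if capacity is None:
            continue  # unknown aircraft type: this capacity is lost
        route = (origin, destination)
        seats[route] = seats.get(route, 0) + flights * capacity
    return seats


flights = {
    ("TLS", "CDG", "A320"): 10,
    ("TLS", "CDG", "B738"): 2,
    ("TLS", "CDG", "ZZZZ"): 1,  # unknown type, ignored
}
route_seats = seats_offered(flights, AVERAGE_SEATS)

# Passengers (payload) follow from an average load factor (assumed value).
LOAD_FACTOR = 0.8
route_passengers = {r: s * LOAD_FACTOR for r, s in route_seats.items()}
```

Skipping flights with an unknown aircraft type keeps the sketch simple, at the cost of lowering the aggregated capacity of the affected routes.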
As explained in the introduction of this section, the Wikipedia pages of airports are used as a source to establish a complete route network, without however knowing the traffic on each route. The first step is to reference all airports served by commercial airlines available on Wikipedia. A community-based list of airports served is available for each continent [(community) 2015]. The related URLs of airport Wikipedia pages are retrieved by parsing the list using an HTML analysis library. For each airport found, another Python script opens the URL and analyses the HTML content of the page to find the "Airline and Destinations" section. This section contains a table with the Wikipedia URLs of all the airports served by each airline from the explored airport. This data is stored and, after iterating over all the airports, a complete route database can be established. The associated code and further explanations on this data parsing step are given in the associated Jupyter Notebooks (see Reproducibility statement).
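The parsing step described above can be illustrated with a short sketch. It uses Python's standard `html.parser` rather than the library used by the authors, and it runs on an inline HTML sample instead of a live page. The anchor id `Airlines_and_destinations`, the two-column layout (airline, then destinations) and the class name are assumptions made for the illustration; nested tables are not handled.

```python
from html.parser import HTMLParser


class DestinationTableParser(HTMLParser):
    """Collect destination links per airline from the first table that
    follows a section anchor (assumed layout, see lead-in)."""

    def __init__(self, section_id):
        super().__init__()
        self.section_id = section_id  # hypothetical id of the section heading
        self.in_section = False
        self.in_table = False
        self.finished = False
        self.cell = -1  # index of the current cell in the current row
        self.airline = ""
        self.routes = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("id") == self.section_id:
            self.in_section = True
        if not self.in_section or self.finished:
            return
        if tag == "table":
            self.in_table = True
        elif tag == "tr" and self.in_table:
            self.cell, self.airline = -1, ""
        elif tag in ("td", "th") and self.in_table:
            self.cell += 1
        elif tag == "a" and self.in_table and self.cell == 1:
            href = attrs.get("href") or ""
            if href.startswith("/wiki/"):
                self.routes.setdefault(self.airline.strip(), []).append(href)

    def handle_data(self, data):
        if self.in_table and self.cell == 0:
            self.airline += data  # text of the first cell is the airline name

    def handle_endtag(self, tag):
        if tag == "table" and self.in_table:
            self.in_table = False
            self.finished = True  # only the first table is parsed


SAMPLE_HTML = """
<h2 id="Airlines_and_destinations">Airlines and destinations</h2>
<table>
  <tr><th>Airlines</th><th>Destinations</th></tr>
  <tr><td><a href="/wiki/Airline_A">Airline A</a></td>
      <td><a href="/wiki/Airport_X">X</a>, <a href="/wiki/Airport_Y">Y</a></td></tr>
  <tr><td>Airline B</td><td><a href="/wiki/Airport_X">X</a></td></tr>
</table>
<table><tr><td>Other</td><td><a href="/wiki/Ignored">no</a></td></tr></table>
"""

parser = DestinationTableParser("Airlines_and_destinations")
parser.feed(SAMPLE_HTML)
routes = parser.routes
```

Repeating this over every airport page and resolving each collected URL to an airport code would yield the list of origin-destination pairs per airline.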
Note that an important limitation of the process lies in the dynamic nature of Wikipedia pages, which are constantly updated by the community. No viable option was found to extract the data at a given point in the past, and therefore the parsed route database is used as it was in April 2023. Another important point to note when working with Wikipedia is that, despite the community’s proofreading efforts, errors may still persist in some airports. Others may be missing from the list of airports used in the first place, which was subsequently merged with the list of destinations to reduce errors.
The previously established list of airports is enriched by adding relevant features to build a regression model, using routes included in the open-source data to train a model.
The first relevant set of features is directly related to each airport. Besides the destinations collected in the previous step, the passenger traffic, aircraft movements and airport codes of each airport are also collected from its Wikipedia page. A Wikidata item is also linked to each airport page. Similar data (for example, annual passenger traffic, ICAO and IATA codes) can be found there, with the advantage that Wikidata fields are dated, whereas Wikipedia only gives "last-year available" information. Not every airport has all these features and the list is filtered to retain only airports with an IATA code. Airport geographical coordinates, countries and continental codes are added by merging an airport database [OurAirports 2023] with the airport list. When it comes to estimating the traffic on a given route, besides the airport traffic itself, the population living near the airports could be relevant, especially when the airport traffic information is not available from Wikidata or Wikipedia sources. A global population dataset [Kontur 2023] is used to determine the population in the vicinity of airports at three levels (in hexagons inscribed in 30, 70 and 150 km radius circles). Similarly, the number of competing airports in the same vicinity regions is calculated.
The second group of features relates to the countries themselves. Given the correlation between the number of kilometres flown per inhabitant and the country’s wealth [Gössling and Humpe 2020], some socio-economic metrics are used. The Gross Domestic Product (GDP) per capita (Purchasing Power Parity) is used to capture the raw wealth of each country [Bank 2023a]. The inequality structure is captured by the Gini coefficient of incomes [Bank 2023a] and the Inequality-adjusted Human Development Index (IHDI) [United Nations Devlopment Programm 2023]. Since tourist countries are more likely to be served by more flights, the number of inbound and outbound tourists, as well as the share of tourism in the exports of each country, are also used as features [Bank 2023a]. The surface area and insularity of each country are also used, because the size of a country is expected to affect the number of domestic flights [(community) 2023; Bank 2023a].
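As an illustration of the vicinity features, the sketch below counts competing airports around a given airport at the three radii. It is a simplification under stated assumptions: plain circles based on the great-circle (haversine) distance are used instead of inscribed hexagons, the coordinates are approximate, and the function names are introduced here.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (haversine formula)."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def competing_airports(code, airports, radii_km=(30, 70, 150)):
    """Number of other airports within each radius of airport `code`."""
    lat, lon = airports[code]
    distances = [
        great_circle_km(lat, lon, other_lat, other_lon)
        for other, (other_lat, other_lon) in airports.items()
        if other != code
    ]
    return {radius: sum(d <= radius for d in distances) for radius in radii_km}


# Approximate (latitude, longitude) pairs, for illustration only.
AIRPORTS = {
    "CDG": (49.01, 2.55),
    "ORY": (48.72, 2.38),
    "BVA": (49.45, 2.11),
    "TLS": (43.63, 1.37),
}
counts = competing_airports("CDG", AIRPORTS)
```

The same distance function could be reused to sum population cells around an airport, provided each cell is reduced to a representative point.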
Those airport/country-related features are added to the previous airport dataset. The completed airport database is merged into the route database and route-related features are added to the dataset. This final group of features deals with bilateral relations between airports on the different routes. These include the bilateral trade flows [Trade Statistics], the number of airlines serving each route (established during the Wikipedia list parsing) and the great circle distance between the two airports.
The route database is completed by a dependent variable: in this case, the number of seats available on each route. To do so, the open-source datasets described in Section 2.2 are used. They are merged into the Wikipedia-parsed route database. Note that, for the sake of simplicity, no attempt was made to infer the aircraft type used or its operator.
For the values taken by the dependent variable, the various administrative sources are better suited. Indeed, using radar sources requires converting a number of flights into a number of seats offered per route with average aircraft capacities and load factors, which reduces the data accuracy. Moreover, in the case of [Schäfer et al. 2014], a general trend towards traffic underestimation was found when the route concerned was also included in another dataset. Figure 2 demonstrates this trend by comparing radar data with BTS data, and the same trend is observed when comparing with the World Bank dataset. It could be explained by the fact that only some of the flights on a given route were detected by the ADS-B collection network. Indeed, either the origin/destination or aircraft could be unknown (or mismatched) for some flights on a route, resulting in a capacity underestimation once data is aggregated and compared to a complete administrative source. Note that OpenSky coverage has surely been improved since 2019. The same phenomenon can be seen for Eurocontrol data although with a different pattern: either the data is well correlated, or not at all, suggesting that some individual flights were included in the Eurocontrol dataset without being in its nominal coverage zone. Note that the joint coverage of BTS and Eurocontrol datasets is limited to Europe-USA flights, explaining the relative scarcity of data points in the comparison of Figure 2.
The priority order retained is thus [Statistics 2022; Bank 2023b; Aviação Civil (ANAC) 2023; Government 2023; Eurocontrol 2023; Schäfer et al. 2014], using first the administrative sources, although they are less detailed, and then the radar sources. This ordering is possible because only the dependent variable is of interest at this point and not the other details provided by the dataset.
After merging those sources, 41% of the Wikipedia-scraped routes remain traffic-undetermined and will be estimated. To do so, several regression methods are tested using the 59% traffic-determined routes and a typical 80%-20% train-test data split.
First, a linear regression is performed using Eq. [eq:linear], where \(S_{AB}\) are the seats offered between cities \(A\) and \(B\), \(\alpha_i\) the regression coefficient relative to feature \(F_i\). The results are inadequate despite an acceptable r² for the purpose (between 0.3 and 0.5, depending on the train/test split). Indeed, a simple linear regression induces some negative estimates. These values would have to be forced to zero afterwards to avoid "negative" seats available. Moreover, very highly frequented routes are strongly underestimated, as can be seen in Figure 3. Since no particular caution was taken in selecting non-collinear features, regularisation techniques (lasso regularisation) are tested without improving the metrics. In fact, without regularisation, many coefficients are already null. The relationship between the dependent variable and the predictors therefore appears to be non-linear.
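The prioritisation of overlapping sources can be sketched as a first-match merge. The source labels, route keys and seat values below are hypothetical; only the ordering mirrors the priority list given above.

```python
# Priority order for the dependent variable (administrative sources first).
PRIORITY = ["BTS", "WorldBank", "ANAC", "AUS", "Eurocontrol", "OpenSky"]


def merge_by_priority(sources, priority):
    """Keep, for each route, the value from the highest-priority source.

    sources maps a source label to {route: seats}. The result maps each
    route to (seats, source_label).
    """
    merged = {}
    for name in priority:
        for route, seats in sources.get(name, {}).items():
            # setdefault only writes when no higher-priority value exists
            merged.setdefault(route, (seats, name))
    return merged


# Hypothetical seat counts, for illustration only.
sources = {
    "OpenSky": {("JFK", "LHR"): 900, ("TLS", "CDG"): 500},
    "BTS": {("JFK", "LHR"): 1000},
    "Eurocontrol": {("TLS", "CDG"): 650},
}
merged = merge_by_priority(sources, PRIORITY)
```

Keeping the source label next to each value makes it possible to report afterwards how many seats come from each dataset.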
\[\label{eq:linear} S_{AB, lin} = \alpha_0 +\alpha_1 F_1 + ... + \alpha_n F_n\]
Then, a reduced gravity model, given in Eq. [eq:gravity] where \(P_K\) is the population in city \(K\), \(I_K\) the income per inhabitant, \(D_{AB}\) the distance between two cities and \(x_P, x_I, x_D\) the relative log-linear regression coefficients, is tested to simply account for potential non-linearities, but it is also insufficient (Figure 3). A very low r²=0.05 is found. More features could have been added to improve this, although with strong restrictions on the data entries used (no zeros allowed). This path was not investigated.
\[\label{eq:gravity} S_{AB, log-lin} = \frac{(P_A\times P_B)^{x_P} \times (I_A\times I_B)^{x_I}}{D_{AB}^{x_D}}\]
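For reference, taking the logarithm of Eq. [eq:gravity] gives the form in which such a model is usually fitted by ordinary least squares. This is a standard step and is assumed here rather than taken from the original text:

\[\ln S_{AB, log-lin} = x_P \ln(P_A \times P_B) + x_I \ln(I_A \times I_B) - x_D \ln D_{AB}\]

Since the logarithm is undefined at zero, every quantity entering the model must be strictly positive, which is the restriction on data entries mentioned above.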
Instead, more sophisticated machine learning methods are tested, starting with a random forest [Breiman 2001]. This combines several weak random regression trees to produce a reliable estimator of the dependent variable that is not very prone to overfitting. The r² is improved to around 0.7 (depending on the train-test split). Compared to the linear regression, there are no more negative estimates. However, like the previous regressors, the random forest regressor does not handle missing values (NaN). Therefore, data entries with missing features must be removed from the dataset, or the missing feature must be imputed arbitrarily (using for instance a defined quantile: here the first 1000-quantile was used, following the idea that missing features are more likely to be found on small airports).
Some regression algorithms that also handle missing values could be used to avoid this intervention on the dataset. It is the case of XGBoost [Chen and Guestrin 2016], a tree-based gradient-boosted regression method. XGBoost was tested and provided fast and slightly improved results over the random forest. The fast training speed allowed testing on several random train-test splits and the r² is between 0.65 and 0.75 on most tests. A specific loss function (following a Tweedie distribution) was chosen to prevent negative estimates and to allow large amounts of routes to have low traffic.
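A configuration fragment selecting a Tweedie objective in XGBoost could look as follows. Only the choice of a Tweedie loss comes from the text; the variance power of 1.5 is an assumption (the value used by the authors is not stated), as is the dictionary name.

```python
# Sketch of XGBoost learning-task parameters (values are assumptions).
xgb_params = {
    # Tweedie regression with a log link: predictions are strictly positive
    "objective": "reg:tweedie",
    # Must lie strictly between 1 (Poisson-like) and 2 (gamma-like)
    "tweedie_variance_power": 1.5,
}
```

With the `xgboost` Python package, such parameters can typically be passed as keyword arguments to `xgboost.XGBRegressor`, and missing feature values encoded as NaN are then handled by the tree-building algorithm without imputation.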
Due to its ease of use and to avoid missing feature imputation (not having a feature is information in itself), the XGBoost regressor is selected to estimate the number of seats on unknown routes. It is trained a last time on the whole known dataset, before its final usage on traffic-undetermined routes. The relative importance of each feature used by the regression model is given in 5.
This section presents the final dataset aggregation method which consists of two stages.
In a first step, the open-source databases mentioned in Section 2.2 are used again, but this time as a way to construct an aggregated database combining the different sources rather than to complete the Wikipedia-built route database. The aggregation logic differs from the one given in Section 2.3.3. In this previous case, the accuracy of the dependent variable (number of seats available per route) was of interest rather than the exact model of aircraft flying each route. However, if one is interested in CO2 emissions as a primary goal, it is more relevant to have access to the aircraft type, rather than the exact payload. Indeed, the surrogate fuel burn model used, which is the one preferred for accurate estimations (see Section 2.5), requires the aircraft type and the flight distance, not the payload. Moreover, having as much aircraft and airline information as possible could be of interest in performing airline network analysis for instance. This means that the radar sources should be favoured over the administrative sources. To avoid the underestimation phenomenon described in Section 2.3.3 while keeping aircraft information, radar sources are prioritised, but with an administrative source as a validation backup. This backup is used to decide if the radar data is chosen or not for each route in the aggregated dataset. If the gap with the backup is too high, the administrative source is used, losing the aircraft-type information. The priority order is [Statistics 2022; Schäfer et al. 2014; Eurocontrol 2023; Aviação Civil (ANAC) 2023; Bank 2023b; Government 2023].
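The radar-with-backup rule can be sketched as below. The threshold `max_relative_gap` and the function name are hypothetical: the text only states that the administrative source is used when the gap is "too high", without giving a value.

```python
def choose_source(radar_seats, admin_seats, max_relative_gap=0.2):
    """Pick radar data unless it deviates too much from the backup.

    Returns (seats, origin). At least one of the two inputs is expected to
    be provided. The 20% default threshold is an assumed value.
    """
    if radar_seats is None:
        return admin_seats, "administrative"
    if admin_seats is None or admin_seats == 0:
        return radar_seats, "radar"  # no backup available for validation
    gap = abs(radar_seats - admin_seats) / admin_seats
    if gap <= max_relative_gap:
        return radar_seats, "radar"  # aircraft-type information is kept
    return admin_seats, "administrative"  # aircraft-type information is lost


close_case = choose_source(950, 1000)
far_case = choose_source(400, 1000)
```

Here `close_case` keeps the radar value because the gap is 5%, while `far_case` falls back to the administrative value because the gap is 60%.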
In a second step, the estimated data from Section 2.3.3 is used to complete the route database with the additional routes missing in the open-source databases. Since this work aims at a relatively reliable estimate of air traffic at the regional level rather than the route level, an extra scaling step is performed on these estimated data. It consists of using once again the passenger traffic information of each airport \(PAX_{AP_i}\) parsed on Wikipedia. Once all flights \(PAX_{FL_{AP_i-AP_j}}\) of open-source data going to or from this airport are accounted for, there could be a residual traffic \(\Delta_{PAX_{AP_i}}\) as shown in Eq. [eq:delta]. This value should correspond to the estimated data. In this case, \(\gamma_{PAX,AP_i}\) of Eq. [eq:scaling] would be equal to 1. If this is not the case, in an iterative process, each route capacity is multiplied by its origin and destination airport scaling factors as shown in Eq. [eq:scaled]. Bounds are specified to restrict this relatively rough process. It converges very quickly (8 rounds) to a minimal residual three times lower than originally. The effect on airport residuals is shown in Figure 4. Note that the process could degrade the route-level accuracy by altering the results obtained with the estimator.
\[\label{eq:delta} \Delta_{PAX_{AP_i}} = PAX_{AP_i}~- \sum_{Open-source, AP_j} PAX_{FL_{AP_i-AP_j}}\]
\[\label{eq:scaling} \gamma_{PAX,AP_i}= \Delta_{PAX_{AP_i}}~/\sum_{Estimated, AP_j} PAX_{FL_{AP_i-AP_j}}\]
\[\label{eq:scaled} PAX_{FL_{AP_i-AP_j}}^{*} = PAX_{FL_{AP_i-AP_j}} \cdot \gamma_{PAX,AP_i} \cdot \gamma_{PAX,AP_j}\]
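One round of the scaling defined by Eqs. [eq:delta], [eq:scaling] and [eq:scaled] can be sketched as follows. The bounds on the scaling factors, the data layout and the function name are assumptions made for the illustration, and the numbers are hypothetical.

```python
def scaling_round(airport_pax, open_routes, estimated_routes, bounds=(0.5, 2.0)):
    """Apply one round of airport-based scaling to the estimated routes.

    Routes are keyed by (airport_i, airport_j) and count towards both
    endpoints. Airports without a known total keep a factor of 1.
    """
    open_total, est_total = {}, {}
    for totals, routes in ((open_total, open_routes), (est_total, estimated_routes)):
        for (a, b), value in routes.items():
            totals[a] = totals.get(a, 0.0) + value
            totals[b] = totals.get(b, 0.0) + value

    gamma = {}
    for airport, pax in airport_pax.items():
        estimated = est_total.get(airport, 0.0)
        if estimated <= 0.0:
            continue  # nothing to scale at this airport
        residual = pax - open_total.get(airport, 0.0)  # Eq. [eq:delta]
        factor = residual / estimated  # Eq. [eq:scaling]
        gamma[airport] = min(max(factor, bounds[0]), bounds[1])  # assumed bounds

    return {
        (a, b): value * gamma.get(a, 1.0) * gamma.get(b, 1.0)  # Eq. [eq:scaled]
        for (a, b), value in estimated_routes.items()
    }


# Hypothetical traffic values, for illustration only.
airport_pax = {"A": 520.0, "B": 110.0, "C": 100.0}
open_routes = {("A", "D"): 300.0}
estimated_routes = {("A", "B"): 100.0, ("A", "C"): 100.0}
scaled = scaling_round(airport_pax, open_routes, estimated_routes)
```

In this example the factors are 1.1 for A and B and 1.0 for C, so the two estimated routes become approximately 121 and 110. An iterative process would call the function repeatedly on its own output.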
Finally, the corrected estimated data is aggregated on top of the open-source data. The number of seats attributed to each source is given in Table 2.
| Source | BTS | OpenSky | Eurocontrol | W.Bank | ANAC | AUS. | Estimation | Total |
|---|---|---|---|---|---|---|---|---|
| Mn Seats (%) | 1295 (23.2) | 207 (3.7) | 1346 (24.1) | 862 (15.5) | 133 (2.4) | 69 (1.2) | 1657 (29.7) | 5570 |
| Bn ASK (%) | 3027 (28.4) | 551 (5.2) | 2738 (25.7) | 2338 (21.9) | 172 (1.6) | 79 (0.7) | 1758 (16.5) | 10664 |
It is also interesting to compute the Available Seat Kilometres (ASK), which is a widely used traffic metric. It is obtained by multiplying the number of seats available on each route by the corresponding great-circle distance.
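The ASK computation amounts to a weighted sum, as the short sketch below shows. The route keys, seat counts and distances are hypothetical, and the function name is introduced here.

```python
def available_seat_km(seats_by_route, distance_km_by_route):
    """Sum of seats multiplied by great-circle distance over all routes."""
    return sum(
        seats * distance_km_by_route[route]
        for route, seats in seats_by_route.items()
    )


# Hypothetical routes: one short and dense, one long and thin.
seats = {("AAA", "BBB"): 1000, ("AAA", "CCC"): 200}
distances_km = {("AAA", "BBB"): 500.0, ("AAA", "CCC"): 6000.0}
total_ask = available_seat_km(seats, distances_km)
```

The long route carries a fifth of the seats of the short one but contributes more than twice its ASK, which illustrates why the seat and ASK shares of a given source can differ.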
The previous work focused on estimating traffic data on each route. The process of estimating the associated fuel burn requires using an aircraft fuel burn model. Several models are available at different levels of fidelity. For instance, OpenAP [Sun et al. 2020] is a detailed open-source model, but requires the real flight path to determine the aircraft fuel burn. This level of information is not available in the current dataset. Therefore, a very simplified surrogate model, FEAT [Seymour et al. 2020], is used. It requires only the knowledge of the aircraft type and the distance of the flight to estimate the fuel burn. The aggregated error at the fleet level is reported to be below 5%.
Two cases are present in the aggregated dataset of this paper. Some of the collected sources have an aircraft model associated with each data entry: it is the case for [Statistics 2022; Schäfer et al. 2014; Eurocontrol 2023]. As illustrated in Table 2, these entries represent 51% of the total seats offered and 59.3% of the ASK. FEAT can be applied directly to the data items to compute the associated fuel burn. However, in the case of the other sources and of the estimated data (representing 49% of the seats and 40.7% of the ASK), the aircraft type information is not known. Therefore, a regression is performed on the aforementioned FEAT-computed data points to derive the fuel burn per seat as a function of the flight distance (Figure 5). To account for the fact that the data points represent a variable number of flights, this regression is weighted according to this variable. It gives a satisfactory level of fidelity for the use case considered, with a high weighted r² of 0.95, but it should be kept in mind that the regression is based on a surrogate model and not the actual fuel burn. Some outliers can be seen in Figure 5, especially at lower ranges, with a group of minor data entries whose fuel burn increases quickly compared to the trend. They are related to the nature of the [Statistics 2022] data, in which the seats offered are specific to each item. It therefore includes VIP charter flights with very low cabin density, for which the fuel burn per seat increases rapidly. This effect cannot be seen for the other sources, in which an average value per aircraft type is considered. The fuel burn corresponding to each remaining route is then computed using this regression. The associated equation is given by Eq. [eq:fuelreg].