Abstract

This paper introduces a gradient-based Smart Predict-then-Optimize (SPO) framework to solve the aircraft arrival scheduling problem (ASP) in the terminal maneuvering area (TMA). Traditional approaches to ASP typically separate arrival time prediction from scheduling optimization, which can lead to suboptimal scheduling decisions. We address this limitation by developing an end-to-end learning framework that directly integrates prediction with optimization objectives. Our methodology introduces the concept of traffic instances for simultaneous prediction of multiple aircraft arrival times, coupled with a Mixed Integer Programming (MIP) model for scheduling optimization. We evaluate our approach using real-world data from London Gatwick Airport, analyzing arrival flights from June to September 2024, organized into traffic instances. The framework incorporates weather data through the ATMAP algorithm, considering wind, visibility, precipitation, and dangerous phenomena. Experimental results show that the MLP+SPO+ framework is particularly effective in adapting to adverse weather conditions, balancing transit times with operational efficiency. In the minimum time interval scenario, MLP+SPO+ achieves around $85.0\%$ and $43.4\%$ lower costs than the First-Come-First-Serve (FCFS) cost and the optimized true cost, respectively. These findings suggest significant potential for improving arrival scheduling efficiency through integrated SPO approaches.

Introduction

The Aircraft Arrival Scheduling Problem (ASP) is a crucial challenge in Air Traffic Management (ATM). As global air traffic continues to grow, optimizing the sequence and timing of aircraft landings within the Terminal Maneuvering Area (TMA) has become paramount. Efficient arrival scheduling not only reduces fuel consumption and carbon emissions but also significantly improves overall air traffic flow, making it a key focus for both researchers and practitioners in the field. The ASP, classified as an NP-hard problem, has spurred the development of various approaches to tackle its complexity. Traditional methods like First Come First Serve (FCFS) have laid the groundwork, while advanced techniques such as the Trombone [Sprong et al. 2005; Sáez et al. 2020] and the Point Merge System (PMS) [Boursier et al. 2007] leverage geometric principles to further enhance efficiency. These innovations underscore the ongoing importance of solving the ASP to maintain safety, minimize delays, and optimize airport operations in increasingly congested airspace.

Approaches to the ASP have changed significantly in recent years as a result of increasing access to aeronautical data and rapid advances in machine learning (ML). Researchers have successfully applied diverse ML techniques to predict Estimated Time of Arrival (ETA) and arrival transit times with increasing accuracy. These advanced prediction models have not only enhanced our understanding of arrival patterns and potential delays but have also opened up new avenues for optimization. However, a significant gap remains: while ETA prediction has seen substantial progress, the integration of these ML-driven predictions into optimization algorithms for ASP remains relatively unexplored, particularly in terms of optimization performance. Traditional two-stage approaches focus on minimizing the prediction error of certain parameters, typically using metrics such as Mean Square Error (MSE) ( $\frac{1}{2} \| c - \hat{c} \|_{2}^{2}$ ) or Mean Absolute Error (MAE) ( $\| c - \hat{c} \|_{1}$ ). After hyperparameter tuning and a training-validation procedure, the predicted parameter ( $\hat{c}$ ) is passed to a downstream optimization model. While these approaches have yielded valuable insights, they face two significant limitations: (1) the emphasis on prediction error metrics fails to capture the quality of the resulting decisions; and (2) the disconnect between the prediction and optimization stages can lead to feasibility issues.

This study aims to address these limitations by applying the smart predict-then-optimize (SPO) framework to the ASP within the TMA. This approach is particularly relevant for the ASP because, even with fixed Standard Terminal Arrival Routes (STARs) and observable weather conditions, aircraft arrival transit times within TMAs can vary significantly due to unexpected factors that may lead to decision errors during the landing process. Our work pioneers the application of the gradient-based SPO framework in the air transportation domain. Furthermore, we apply this framework to address a critical challenge in ASP: accounting for adverse weather conditions.

The structure of this paper is as follows: Section 2 reviews related work, and Section 3 introduces our methodology. In Section 4, we briefly introduce our case study at London Gatwick Airport and the setup of our experiment. Section 5 presents the results and discussion, while Section 6 concludes this work.

Literature Review

Arrival scheduling is a critical factor in ensuring efficient operations within terminal maneuvering areas (TMAs). A central challenge involves assigning landing times to aircraft while adhering to separation criteria between successive arrivals.

Prior studies frame this as an aircraft landing scheduling problem (ASP), where each aircraft must land within a predetermined time window bounded by an earliest and latest time [Beasley et al. 2000]. These temporal constraints reflect operational realities:

The earliest landing time represents the soonest achievable arrival under ideal conditions (e.g., maximum permissible speed, direct routing), while
The latest landing time accounts for delay absorption capabilities via speed adjustments, path stretching, or holding patterns, constrained by fuel limits and airspace procedures.

This time window ensures efficient airspace utilization while accommodating uncertainties such as weather or traffic conflicts. Solutions aim to minimize deviations from target times and maintain safe separation, often derived from wake vortex categories or air traffic control (ATC) regulations. While early ASP formulations focused on single-runway allocation [Beasley et al. 2000], extensions to multi-runway systems have become increasingly relevant for high-density airports.

There are different approaches to solving this problem in the literature. Some studies focused on exact algorithms and optimization models [Beasley et al. 2000; Pohl et al. 2021], while others used heuristic and meta-heuristic algorithms to reduce solution time [Beasley et al. 2001; Sama et al. 2015; Xu 2017; Prakash et al. 2018]. One study developed a heuristic algorithm to increase the scheduling efficiency of arrival aircraft at London Heathrow. The results showed that the algorithm had the potential to improve the efficiency of decisions made by air traffic controllers [Beasley et al. 2001]. In order to reduce the workload of air traffic controllers and congestion at airports, a metaheuristic algorithm was applied to a good initial solution to take advantage of its short computing time, and the study was carried out at two Italian airports [Sama et al. 2015]. The use of an Ant Colony algorithm based on wake vortex modeling was investigated for the aircraft scheduling problem and compared with other methods. This study showed that the algorithm based on wake vortex modeling yielded better results than CPLEX, general ant colony algorithms, and an approximation algorithm [Xu 2017]. A data splitting algorithm was used to solve the aircraft sequencing problem. The model, a 0-1 mixed integer program, was formulated with many different realistic constraints. The algorithm had small run times, enabling real-time deployment of the concept [Prakash et al. 2018]. For more details concerning the aircraft scheduling problem, we refer the reader to two review studies on this topic [Messaoud 2021; Ikli et al. 2021].

In recent years, the landscape of arrival management research has been transformed by the increasing availability of aviation data, leading to a surge in ML-based approaches for arrival time prediction. Efforts to predict arrival flight times, and their contribution to different ATM solutions, are important for achieving more predictable, efficient, and greener operations in TMAs [Zhang et al. 2022]. ML plays an important role in reaching these goals and in providing better air traffic management. In the existing literature, there are different applications of ML algorithms focusing on Estimated Time of Arrival (ETA) / arrival flight time [Glina et al. 2012; Kern et al. 2015; Ayhan et al. 2018; Takacs 2014; Ma et al. 2022; Silvestre et al. 2024; Lui et al. 2025].

Quantile Regression Forests [Glina et al. 2012], a tree-based ensemble method, were employed to estimate landing times. A total of 4011 cases were split into 67% for training and 33% for testing. As stated in the research, the model was suitable for predicting landing times in real-time applications. Random Forest (RF) [Kern et al. 2015], a well-known tree-based method, was utilized to improve ETA prediction, with feature generation and selection as one of the main focus points. The study showed that the ML algorithm was more accurate than the Enhanced Traffic Management System in the US for 78% of all instances. Several regression models (linear, non-linear, and ensemble) and a Recurrent Neural Network [Ayhan et al. 2018] were tested for ETA prediction of commercial flights, and their results were compared with EUROCONTROL ETA predictions. One of the main outcomes of this study was higher accuracy with a smaller standard deviation, which made smaller ETA prediction windows possible. A Spatiotemporal Neural Network Model for ETA [Ma et al. 2022] was proposed with three main stages: trajectory pattern recognition, trajectory prediction, and arrival time prediction. One of its findings was that the MAE was typically lower for shorter travel times to the destination. A deep learning approach based on Long Short-Term Memory [Silvestre et al. 2024] was used to predict ETA from the 4D trajectory of the aircraft and weather data. Beyond the model's results, this research is notable for its application airport, Madrid Barajas-Adolfo Suárez (Spain). The proposed model outperformed RF, Gradient Boosting Machines (GBM), and Adaptive Boosting, which were selected as baselines in the study. Ridge Regression (RR) and GBM [Takacs 2014] were selected to predict the runway and gate arrival times of flights, using historical, weather, and air traffic control data provided in the GE Flight Quest data science contest.

Despite these significant advances in both optimization and prediction domains, several gaps remain in the current literature. Because most researchers handle these problems separately, there exists a disconnect between arrival time prediction and scheduling optimization. While both areas have seen remarkable progress independently, the potential benefits of integrating prediction capabilities into optimization frameworks remain largely unexplored. Few studies have explored this area, but they mostly used the predicted values directly for the downstream optimization [Du et al. 2023; Pang et al. 2024]. The relationship between prediction accuracy and operational efficiency improvements needs more thorough investigation. Traditional methods also often fail to capture the dynamic nature of the airport environment, where predictions and scheduling decisions need to be made and updated continuously in response to changing conditions.

Recent developments in computational frameworks offer promising directions for addressing these limitations. The SPO framework [Elmachtoub and Grigas 2022] provides a structured approach to integrating prediction and optimization, potentially offering a more coherent solution to the arrival scheduling problem. Similarly, learning-to-optimize techniques [Li and Malik 2016], which directly learn optimization strategies from data, may offer more robust solutions than traditional two-stage approaches. However, while these frameworks show theoretical promise, their practical application in the aviation context remains limited. Key challenges include adapting these frameworks to handle the specific constraints and objectives of airport operations and validating their performance under real-world conditions and operational constraints. Given these challenges and opportunities in the existing literature, this research proposes the SPO framework for the ASP inside TMAs. The following section details our proposed approach and its implementation.

Methodologies

Figure 1 presents the general schematic diagram of our proposed method. Starting from the raw flight data, we generate an input dataset $D$ through a series of preprocessing steps, including data trimming, cleaning, and re-alignment. $D$ consists of $K$ independent traffic instances with the same number of flights, where each instance is represented as a pair $(x, c)$ . For each instance, the input features $x$ are structured as a vector containing $m \times n_{t}$ features, where $m$ represents the number of input features for each flight and $n_{t}$ represents the number of flights in each traffic instance. The corresponding output costs $c$ are represented as a vector of length $n_{t}$ , where each element represents the cost associated with one flight in the traffic instance. Therefore, the input dataset can be denoted as $\{(x_{k}, c_{k})\}_{k = 1, \dots, K}$ .

The schematic diagram of end-to-end smart predict-then-optimize framework for aircraft arrival scheduling problem

Based on the dataset $D$ , we can implement the SPO framework [Tang and Khalil 2024]. Considering the ASP as an integer programming problem, we have several key elements: a feasible region $S$ , an optimal objective value $z^{*} (c)$ corresponding to objective coefficients $c$ , and an optimal solution $w^{*} (c)$ . This optimization model is embedded into a differentiable prediction model $g (x | θ)$ , such as a neural network, through the decision loss $L (\cdot)$ .

The core function of this framework is the gradient computation and the parameter updates through backpropagation. For each training instance, the gradient $\frac{\partial L}{\partial θ}$ is computed by applying the chain rule: $\frac{\partial L}{\partial θ} = \frac{\partial L}{\partial w^{*}} \frac{\partial w^{*}}{\partial \hat{c}} \frac{\partial \hat{c}}{\partial θ}$ . Here, $\frac{\partial L}{\partial w^{*}}$ measures how the decision loss changes with respect to the optimal solution, $\frac{\partial w^{*}}{\partial \hat{c}}$ captures the sensitivity of the optimal solution to changes in the objective coefficients, and $\frac{\partial \hat{c}}{\partial θ}$ represents how the predicted coefficients vary with the model parameters. Through this gradient chain, the framework enables end-to-end training where the optimization outcomes directly influence the prediction model’s parameter updates.

Aircraft arrival scheduling problem formulation

In this work, we formulate the ASP as a simple Mixed Integer Programming (MIP) model based on the classical single runway aircraft landing problem proposed by [Beasley et al. 2000]. We assume:

$A = \{1, \dots, n\}$ : Set of aircraft, where $n$ is the total number of aircraft
$i, j \in A$ : Aircraft indices
$T_{i}$ : The target (expected) landing time for aircraft $i$
$E_{i}$ : The earliest landing time for aircraft $i$
$L_{i}$ : The latest landing time for aircraft $i$
$s_{i, j}$ : The required separation time between $i$ and $j$ , where $i$ lands before $j$
$c_{i}$ : Delay costs for aircraft $i$ landing after the expected time $T_{i}$
$M$ : A large constant
${\hat{T}}_{i}$ : The predicted transit time for aircraft $i$

The decision variables in our models are:

$y_{i}$ : Actual landing time of aircraft $i$
$ω_{i}$ : Binary variable indicating if aircraft $i$ lands after its expected time: $ω_{i} = \begin{cases} 1 & \text{if } y_{i} > T_{i} \\ 0 & \text{otherwise} \end{cases}$
$δ_{i, j}$ : Binary variable for aircraft arrival sequencing: $δ_{i, j} = \begin{cases} 1 & \text{if aircraft } i \text{ lands before aircraft } j \\ 0 & \text{otherwise} \end{cases}$

The objective of this model is to minimize the sum of costs for all delayed aircraft: $\min \sum_{i \in A} c_{i} ω_{i}$

The model formulation is listed as follows: $\begin{aligned} \text{s.t.} \quad & E_{i} \leq y_{i} \leq L_{i} && \forall i \in A \\ & y_{i} - T_{i} \leq M \cdot ω_{i} && \forall i \in A \\ & y_{i} - T_{i} \geq - M \cdot (1 - ω_{i}) && \forall i \in A \\ & δ_{i, j} + δ_{j, i} = 1 && \forall i, j \in A, i \neq j \\ & y_{j} - y_{i} \geq s_{i, j} - M \cdot δ_{j, i} && \forall i, j \in A, i \neq j \\ & y_{i} \in \mathbb{R} && \forall i \in A \\ & ω_{i} \in \{0, 1\} && \forall i \in A \\ & δ_{i, j} \in \{0, 1\} && \forall i, j \in A, i \neq j \end{aligned}$

Our ASP seeks to minimize delay-related costs. At its core, the mathematical formulation employs a simple objective function that sums the costs across all delayed aircraft. Three sets of decision variables drive the model: continuous variables $y_{i}$ for landing times, binary indicators $ω_{i}$ for delays, and ordering variables $δ_{i, j}$ that establish the sequence of operations between aircraft pairs. Together, these variables capture all necessary scheduling decisions.

The time-window constraint $E_{i} \leq y_{i} \leq L_{i}$ ensures that each aircraft $i$ is scheduled within its feasible time window $[E_{i}, L_{i}]$ . The two big-M constraints on $y_{i} - T_{i}$ define whether an aircraft is delayed: if the actual landing time $y_{i}$ exceeds the expected time $T_{i}$ , the aircraft is considered delayed ( $ω_{i} = 1$ ). The two constraints work as a pair to force $ω_{i}$ to take the appropriate binary value. The ordering constraint $δ_{i, j} + δ_{j, i} = 1$ requires that, for any pair of aircraft $(i, j)$ , either $i$ precedes $j$ or $j$ precedes $i$ . The separation constraint $y_{j} - y_{i} \geq s_{i, j} - M \cdot δ_{j, i}$ works in conjunction with the ordering constraint to ensure proper separation between any pair of aircraft:

When aircraft $i$ lands before $j$ ( $δ_{i, j} = 1$ , $δ_{j, i} = 0$ ):
- The constraint becomes $y_{j} - y_{i} \geq s_{i, j}$ , which enforces the minimum separation time $s_{i, j}$ between landings.
When aircraft $j$ lands before $i$ ( $δ_{i, j} = 0$ , $δ_{j, i} = 1$ ):
- The constraint becomes $y_{j} - y_{i} \geq s_{i, j} - M$ ; the large $M$ term makes this constraint non-binding.
- Meanwhile, the complementary constraint $y_{i} - y_{j} \geq s_{j, i} - M \cdot δ_{i, j}$ becomes active.
- This enforces the minimum separation time $s_{j, i}$ between landings.

Thus, the pair of constraints ensures proper separation regardless of landing order, with $s_{i, j}$ applied when $i$ precedes $j$ and $s_{j, i}$ applied when $j$ precedes $i$ . The remaining constraints define the domains of the decision variables.
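To make the formulation concrete, the following is a minimal sketch that solves a tiny instance by enumerating landing orders instead of calling a MIP solver. It is not the solution method used in this work. The function name `solve_asp_bruteforce` and the numerical data are hypothetical. For a fixed order, landing every aircraft as early as the window and separation constraints allow is optimal, because each $ω_{i}$ can only switch from 0 to 1 as $y_{i}$ grows.

```python
from itertools import permutations


def solve_asp_bruteforce(E, T, L, s, c):
    """Enumerate landing orders of a tiny instance and return the cheapest.

    E, T, L: earliest, target and latest landing times (lists of length n).
    s[i][j]: required separation when aircraft i lands before aircraft j.
    c[i]:    cost incurred if aircraft i lands after its target time T[i].
    Returns (cost, order, y, omega), or None if no order is feasible.
    """
    n = len(T)
    best = None
    for order in permutations(range(n)):
        y = [0.0] * n
        feasible = True
        for pos, j in enumerate(order):
            # Earliest time that respects the window and every predecessor.
            t = E[j]
            for i in order[:pos]:
                t = max(t, y[i] + s[i][j])
            if t > L[j]:
                feasible = False
                break
            y[j] = t
        if not feasible:
            continue
        omega = [1 if y[i] > T[i] else 0 for i in range(n)]
        cost = sum(c[i] * omega[i] for i in range(n))
        if best is None or cost < best[0]:
            best = (cost, order, y, omega)
    return best


# Hypothetical three-aircraft instance (times in seconds).
T = [100, 110, 120]
E = [40, 50, 60]
L = [1900, 1910, 1920]
s = [[0, 60, 60], [60, 0, 60], [60, 60, 0]]
c = [3, 1, 2]
cost, order, y, omega = solve_asp_bruteforce(E, T, L, s, c)
print(cost, order, y, omega)
```

In this example only two aircraft can land on time, and the cheapest schedule delays the aircraft with the smallest cost. Enumeration grows factorially with $n$, so it is only meant for checking small cases against a MIP solver.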

The conventional delay cost definition is $c_{i} = c_{i}^{*} \cdot ({\hat{T}}_{i} - T_{i})$ , where $c_{i}^{*}$ denotes the unit-time delay cost for each aircraft type [Cook and Tanner 2011] and $({\hat{T}}_{i} - T_{i})$ is the delay. For our optimization framework, we can simplify this cost representation based on two observations. First, the expected arrival time $T_{i}$ is known before the prediction task begins. Second, the unit delay cost $c_{i}^{*}$ , which varies by aircraft type and is typically derived from extensive operational cost studies, is also predetermined. Given these fixed parameters, the delay cost $c_{i}$ increases linearly with the predicted arrival time ${\hat{T}}_{i}$ . This relationship enables us to streamline our cost representation by using ${\hat{T}}_{i}$ directly as our cost metric ( $c_{i} \approx {\hat{T}}_{i}$ ). While this simplification might appear to lose some granularity, it preserves the essential mathematical properties needed for optimization while reducing computational complexity.

Costs prediction via traffic instances

Traditional approaches to ETA prediction treat each flight independently. For each flight $i$ , $m$ input features (comprising pre-terminal flight data, meteorological conditions, and historical patterns) are used to forecast the estimated flight duration ${\hat{T}}_{i}$ . When integrating ML with the optimization framework, we need to reconceptualize the prediction task to align with the objective. In the SPO framework, the ML model iteratively attempts to minimize the decision loss, a task that requires optimization over multiple aircraft rather than a single one. To address this issue, we propose traffic instances, each of which refers to an air traffic scenario containing the same number of flights to be scheduled. Instead of mapping $m$ features to a single flight duration, we predict flight times for an entire traffic instance simultaneously. Each traffic instance contains $n_{t}$ flights, transforming our input dimension to $n_{t} \times m$ features and generating outputs that directly correspond to the costs ( $n_{t}$ values) for decision loss computation.

Input: Flight sequence $F = \{f_{1}, \dots, f_{m}\}$ ordered by entry time; instance size $N$ ; maximum time interval $Δ T_{max}$
Output: Set of non-overlapping instances $I$ , where:
- Each instance contains exactly $N$ flights
- All flights in an instance occur within $Δ T_{max}$
- No two instances share any flights

$I \leftarrow \emptyset$ , $i \leftarrow 0$
While a full group of $N$ flights remains from position $i$ :
    $G \leftarrow \{f_{i}, \dots, f_{i + N - 1}\}$
    $Δ T \leftarrow f_{i + N - 1}.time - f_{i}.time$
    If $Δ T \leq Δ T_{max}$ : $I \leftarrow I \cup \{G\}$ , $i \leftarrow i + N$
    Else: $i \leftarrow i + 1$
Return $I$

The algorithm above constructs strictly non-overlapping traffic instances from temporally ordered flights using a hybrid windowing strategy. For each candidate group of $N$ consecutive flights, the algorithm commits it as a valid instance only if its temporal span satisfies $Δ T \leq Δ T_{max}$ , then advances the window by $N$ flights to prevent overlap. If the group is rejected (i.e., $Δ T > Δ T_{max}$ ), the window slides forward by 1 flight to explore alternative groupings while preserving temporal density. This ensures 1) mutual exclusivity between instances (no shared flights) and 2) temporal coherence (all flights within $Δ T_{max}$ ). In addition, 3) leakage is prevented through day-stratified splitting, where all instances from a calendar day reside exclusively in either the training or the test set.
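The windowing logic described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation. The function name `build_traffic_instances` and the use of plain entry times in seconds are assumptions.

```python
def build_traffic_instances(entry_times, n_per_instance, max_span):
    """Group time-ordered flights into non-overlapping traffic instances.

    entry_times:    entry times sorted in ascending order (e.g. seconds).
    n_per_instance: number of flights N in every instance.
    max_span:       maximum allowed time between first and last entry.
    Returns a list of instances, each a list of flight indices.
    """
    instances = []
    i = 0
    while i + n_per_instance <= len(entry_times):
        last = i + n_per_instance - 1
        span = entry_times[last] - entry_times[i]
        if span <= max_span:
            instances.append(list(range(i, last + 1)))
            i += n_per_instance  # skip the whole group: no shared flights
        else:
            i += 1  # slide by one flight and try the next grouping
    return instances


# Hypothetical entry times: the first flight is isolated and gets skipped.
print(build_traffic_instances([0, 400, 410, 420, 430], 3, 60))
```

With the settings reported in this paper, the call would use `n_per_instance=15` and `max_span=45 * 60` if entry times are expressed in seconds.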

Based on the traffic instances, we can perform the prediction task via ML. Since the prediction model in this framework has to be differentiable, we propose two simple models as our baselines: Linear Regression (LR: $f (x) = W x + b$ ) and a Multi-Layer Perceptron (MLP): $\begin{aligned} f (x) &= f_{2} (\mathrm{ReLU} (f_{1} (x))), \quad \text{where} \\ f_{1} (x) &= W_{1} x + b_{1} && \text{(first layer)} \\ f_{2} (x) &= W_{2} x + b_{2} && \text{(second layer)} \\ \mathrm{ReLU} (x) &= \max (0, x) && \text{(activation function)} \end{aligned}$
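As a small illustration of the two-layer model defined above, the forward pass can be written with plain Python lists. This is a sketch with hypothetical toy weights. A real implementation would use an automatic differentiation library so that the decision loss can be backpropagated.

```python
def linear(W, b, x):
    """Affine map W x + b, with W given as a list of rows."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]


def relu(v):
    """Element-wise max(0, v)."""
    return [max(0.0, v_i) for v_i in v]


def mlp_forward(params, x):
    """f(x) = f2(ReLU(f1(x))) for params = (W1, b1, W2, b2)."""
    W1, b1, W2, b2 = params
    return linear(W2, b2, relu(linear(W1, b1, x)))


# Toy weights (hypothetical): 2 inputs, 2 hidden units, 1 output.
params = ([[1, -1], [-1, 1]], [0, 0], [[1, 1]], [0.5])
print(mlp_forward(params, [3, 1]))  # hidden = [2, 0], output = [2.5]
```

For a traffic instance, `x` would have $n_{t} \times m$ entries and the output would have $n_{t}$ entries, one predicted transit time per flight.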

As mentioned in Section 3.1, the output is the predicted transit times ${\hat{T}}_{i}$ for each traffic instance. For the input $x$ , we follow the common features in previous ETA prediction studies [Zhang et al. 2022; Wang et al. 2018; Lui et al. 2020], including the initial position (latitude, longitude, altitude) and operational state (heading, speed, descent rate) of each aircraft as it enters the terminal area.

Decision loss

The decision loss in our framework is based on the SPO loss introduced by [Elmachtoub and Grigas 2022]. This loss measures how well our predicted costs lead to optimal decisions compared to decisions made with true costs. The rigorous (unambiguous) SPO loss is defined as: $L_{S P O} (\hat{c}, c) = \max_{ω \in W^{*} (\hat{c})} (c^{T} ω) - z^{*} (c)$ where:

$W^{*} (\hat{c})$ is the set of optimal solutions using predicted costs $\hat{c}$
$z^{*} (c)$ is the optimal objective value using true costs $c$
The max operator accounts for multiple optimal solutions that could arise from $\hat{c}$

However, numerical studies in [Tang and Khalil 2024] demonstrate that this rigorous form yields similar results to a simplified version known as “regret”: $L_{S P O} (\hat{c}, c) = c^{T} ω^{*} (\hat{c}) - z^{*} (c)$ where:

$ω^{*} (\hat{c})$ is an optimal solution obtained using predicted costs $\hat{c}$

This measures the gap between the true cost of decisions made with predicted costs, $c^{T} ω^{*} (\hat{c})$ , and the best possible cost achievable with true costs, $z^{*} (c)$ . While we use this regret formulation for evaluation purposes, it is not directly suitable for training due to its computational intractability: the loss is generally non-convex and discontinuous in $\hat{c}$ , so it does not provide useful gradients. In the following section, we introduce a tractable surrogate of the SPO loss that enables gradient-based training while maintaining the spirit of optimizing decision loss.
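The regret can be illustrated on a toy problem in which the feasible region is given as an explicit list of solutions. This is a sketch under that assumption. In the ASP, the optimal solutions would come from the MIP solver instead. The helper names and the numbers are hypothetical.

```python
def dot(a, b):
    """Inner product of two equally long sequences."""
    return sum(a_i * b_i for a_i, b_i in zip(a, b))


def best_solution(cost, feasible):
    """Return one minimiser of cost^T w over an explicit list of solutions."""
    return min(feasible, key=lambda w: dot(cost, w))


def regret(c_pred, c_true, feasible):
    """c^T w*(c_hat) - z*(c): extra true cost caused by the prediction."""
    w_pred = best_solution(c_pred, feasible)
    w_true = best_solution(c_true, feasible)
    return dot(c_true, w_pred) - dot(c_true, w_true)


# Toy feasible set: exactly two of three aircraft are delayed.
feasible = [(0, 1, 1), (1, 0, 1), (1, 1, 0)]
c_true = [3, 1, 2]
print(regret([1, 3, 2], c_true, feasible))      # wrong ranking of costs
print(regret([2.5, 1.5, 2], c_true, feasible))  # inexact but same decision
```

The second prediction is not exact, yet it leads to the same decision as the true costs and therefore has zero regret. This is the distinction between prediction error and decision quality that motivates the decision loss.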

Smart predict-then-optimize plus (SPO+)

Since the SPO loss is intractable, [Elmachtoub and Grigas 2022] derived a convex surrogate upper bound on it, called SPO+: $L_{S P O +} (\hat{c}, c) = \max_{ω \in S} (c^{T} ω - 2 {\hat{c}}^{T} ω) + 2 {\hat{c}}^{T} ω^{*} (c) - z^{*} (c)$

The computation of SPO+ involves solving a modified optimization problem with costs ( $2 \hat{c} - c$ ) in the forward pass, where the loss is computed with appropriate sign adjustments for maximization problems [Tang and Khalil 2024]. The backward pass then enables end-to-end training by computing gradients based on the difference between true and predicted optimal solutions, scaled by 2 and adjusted for the optimization sense (minimization or maximization).
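The forward and backward computations described above can be sketched for a minimization problem whose feasible region is an explicit list of solutions. This is an assumption made for illustration. In practice the inner problem with costs $2 \hat{c} - c$ is solved by the optimization model. The subgradient used here, $2 (ω^{*} (c) - ω^{*} (2 \hat{c} - c))$ , follows [Elmachtoub and Grigas 2022]. Function names and numbers are hypothetical.

```python
def dot(a, b):
    """Inner product of two equally long sequences."""
    return sum(a_i * b_i for a_i, b_i in zip(a, b))


def argmin_cost(cost, feasible):
    """Return one minimiser of cost^T w over an explicit list of solutions."""
    return min(feasible, key=lambda w: dot(cost, w))


def spo_plus(c_pred, c_true, feasible):
    """Return (loss, subgradient w.r.t. c_pred) for a minimisation problem."""
    w_true = argmin_cost(c_true, feasible)               # w*(c)
    z_true = dot(c_true, w_true)                         # z*(c)
    shifted = [2 * p - t for p, t in zip(c_pred, c_true)]  # 2 c_hat - c
    w_shift = argmin_cost(shifted, feasible)             # w*(2 c_hat - c)
    # max_w (c - 2 c_hat)^T w equals -min_w (2 c_hat - c)^T w.
    loss = -dot(shifted, w_shift) + 2 * dot(c_pred, w_true) - z_true
    grad = [2 * (a - b) for a, b in zip(w_true, w_shift)]
    return loss, grad


feasible = [(0, 1, 1), (1, 0, 1), (1, 1, 0)]
c_true = [3, 1, 2]
loss, grad = spo_plus([1, 3, 2], c_true, feasible)
print(loss, grad)
```

For this toy case the SPO+ value (6) upper-bounds the regret of the same prediction (2), and a gradient step on $\hat{c}$ raises the first predicted cost and lowers the second, moving the prediction toward the true cost ranking. When $\hat{c} = c$ , both the loss and the subgradient vanish.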

Case study at London Gatwick Airport

In this paper, we conduct our case study at London Gatwick Airport. London Gatwick Airport (ICAO: EGKK) serves as a major international aviation hub in the United Kingdom. Operating with a single runway system, which is unique among airports of its size and traffic volume, Gatwick stands as London’s second-busiest airport and the second-largest single-runway airport globally, located approximately 29.5 miles south of Central London. In 2024 until October, it already handled traffic including both arrivals and departures¹.

Data description

of arrival flights (ADS-B data) at EGKK from June 2024 to September 2024 obtained from the OpenSky Network [Schäfer et al. 2014] are used in this study. For the local weather information, we refer to the Meteorological Terminal Aviation Routine Weather Report (METAR) of EGKK in 2024². METAR is a weather report that contains information for the area within a $16$ km radius around the airport. Raw METAR data offers a range of weather information, such as wind, temperature, visibility, and moisture. Based on the raw METAR data, we apply the air traffic management airport performance (ATMAP) weather algorithm [EUROCONTROL 2011; Lui et al. 2022] to extract a score for each weather component, including wind, visibility, precipitation, freeze condition, and dangerous phenomenon.

Figure 2 illustrates sample flights in the scope of our study, capturing the terminal maneuvering area where arriving aircraft perform final approach sequences. The flight trajectories used in this study align with Gatwick Airport’s approach procedures. Figure 3 presents the weather score distribution at EGKK in 2024. As the figure illustrates, wind is the most significant weather component at EGKK, consistently showing the highest scores throughout the observed period. The wind scores frequently reach values of 2.5 on the weather score scale. Precipitation also contributes to the overall weather conditions, but to a lesser extent. Freeze conditions are more frequent in winter but less relevant during the summer season. The impact of visibility appears relatively minor, with lower scores and frequency than other weather components. Dangerous phenomena are occasionally recorded but remain relatively rare events in the dataset.

Experiment setup

Table 1 summarizes the key parameters and configurations of our experimental setup. The study encompasses traffic instances from arrivals, with each instance involving 15 aircraft within a 45-minute time interval. The area of interest is confined to a 50 nautical mile radius around EGKK, providing comprehensive coverage of the TMA.

The key setup for the experiment
Period	June - September 2024
Number of aircraft
Number of traffic instances
Number of aircraft per instance	15
Maximum time interval per instance	45 minutes
Area of interest	50 Nautical miles around EGKK
Machine learning models	{Linear regression; Multi-layer perceptron}
Typical scenarios	{ $Wind_{max}$ , $Precipitation_{max}$ , $Visibility_{max}$ , $DangerousPhenomenon_{max}$ , $Time_{min}$ }
Input features	{Latitude, longitude, velocity, heading angle, vertical rate} at entry state
Output feature	Transit time
Loss function	SPO+, Mean Square Error (Two-stage approach)

As mentioned in Section 3.2, we implement two ML approaches for our analysis: LR and MLP. Because the parameters of the ASP model need to be pre-set and remain static during training, we select typical scenarios from the instances to define them. These scenarios are characterized by the maximum weather scores for wind ( $Wind_{max}$ ), precipitation ( $Precipitation_{max}$ ), visibility ( $Visibility_{max}$ ), and dangerous phenomena ( $DangerousPhenomenon_{max}$ ), along with the minimum time interval ( $Time_{min}$ ). Since our data span June to September 2024, we do not consider freeze conditions in this work. The typical scenario determines the parameter setting of the optimization model: $T_{i}$ is the expected transit time relative to the first entry aircraft within that instance, and the time window bounds $E_{i} = T_{i} - 60$ and $L_{i} = T_{i} + 1800$ follow an open-source ASP benchmark [Prakash et al. 2018; Ikli et al. 2021]³. While this benchmark simplifies aircraft-specific performance (e.g., it does not dynamically model BADA parameters), it provides a tractable framework for scheduling algorithms. The required separation time $s_{i, j}$ is derived from wake turbulence categories (WTC). Aircraft type codes are mapped to WTC classifications (Light, Medium, Heavy, Jumbo) using the Aircraft Database provided by the OpenSky Network [Schäfer et al. 2014]. The required separation time is then determined based on the WTC of the preceding and succeeding aircraft⁴.
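The time-window parameters can be sketched as follows. Two points are assumptions made for illustration and are not stated explicitly in the text: times are taken to be in seconds, and $T_{i}$ is interpreted as the entry offset from the first aircraft in the instance plus the expected transit time. The function name `time_windows` is hypothetical.

```python
def time_windows(entry_times, expected_transit, early=60, late=1800):
    """Return (T, E, L) relative to the first entry in the instance.

    Assumption: T_i = (entry_i - first entry) + expected transit time,
    with E_i = T_i - early and L_i = T_i + late (all in seconds).
    """
    t0 = min(entry_times)
    T = [(e - t0) + d for e, d in zip(entry_times, expected_transit)]
    E = [t - early for t in T]
    L = [t + late for t in T]
    return T, E, L


# Two hypothetical flights entering 120 s apart.
print(time_windows([1000, 1120], [900, 840]))
```

Under a different reading of $T_{i}$ , only the line computing `T` would change; the offsets of 60 and 1800 are the ones given above.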

The input feature space comprises five key aircraft parameters at the entry state: latitude, longitude, velocity, heading angle, and vertical rate. These parameters capture the essential initial conditions of each aircraft’s trajectory. The models are trained to predict the transit time as the output feature. For model training, we employ two distinct loss functions: SPO+ and Mean Square Error (MSE) in a two-stage approach. The ratio between the training and test sets is $8 : 2$ . The batch size is 32 and the number of epochs is 20.
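A day-stratified split with an approximate $8 : 2$ ratio could be implemented along the following lines. This is a sketch with a hypothetical function name and a fixed random seed, not the exact procedure used in the experiments. Whole calendar days are assigned to one side, so the ratio holds over days and only approximately over instances.

```python
import random


def split_by_day(instance_days, train_ratio=0.8, seed=0):
    """Assign whole calendar days to train or test; return two index lists.

    instance_days: one day label per traffic instance (e.g. "2024-06-01").
    """
    days = sorted(set(instance_days))
    random.Random(seed).shuffle(days)
    n_train = round(train_ratio * len(days))
    train_days = set(days[:n_train])
    train = [k for k, d in enumerate(instance_days) if d in train_days]
    test = [k for k, d in enumerate(instance_days) if d not in train_days]
    return train, test


# Nine hypothetical instances spread over five days.
labels = ["d1", "d1", "d1", "d2", "d2", "d3", "d4", "d5", "d5"]
train_idx, test_idx = split_by_day(labels)
print(train_idx, test_idx)
```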

Results and Discussion

In this section, we present the results and the corresponding discussion. First, Figure 5 illustrates the learning curves for both loss functions on the training sets using normalized loss values. The SPO+ and two-stage approaches exhibit distinctly different convergence behaviors during training. The SPO+ loss curves show a rapid initial decrease and stabilize at very low normalized loss values (below 0.1) across all scenarios by around iteration 250. This consistent convergence pattern appears similar for both the Linear Regression and MLP implementations.

The learning curves for two-stage approach and SPO+ on training sets

The two-stage approach, however, demonstrates markedly different behavior. While the Linear Regression variants show quick initial convergence, the MLP implementations maintain relatively high normalized loss values (fluctuating between 0.2 and 0.6) throughout training. The learning curves show considerable oscillation, particularly for the maximum dangerous phenomenon scenario, suggesting potential stability issues in the optimization process.

This performance discrepancy suggests that for subsequent analyses, focus should be directed toward three specific configurations: LR + Two-Stage, MLP + SPO+, and LR + SPO+. The MLP + Two-Stage configuration can be reasonably excluded from further investigation due to its demonstrated inferior convergence properties.

Following the first analysis of the learning curves, Figure 6 presents the normalized regret distribution on the test sets during the training process. Our experimental results demonstrate the effectiveness of end-to-end decision-focused learning approaches, particularly when combined with more expressive model architectures. The MLP + SPO+ implementation consistently achieves superior performance across most typical scenarios, exhibiting lower normalized regret compared to both LR + SPO+ and LR + Two-Stage approaches.
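As an illustration of the evaluation metric, the sketch below computes a normalized regret over a small set of instances. It assumes one common definition, namely total regret divided by the total absolute optimal true cost, which to our understanding matches the convention of the PyEPO library [Tang and Khalil 2024]. The brute-force feasible set and all numbers are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalized_regret(preds, trues, feasible):
    """Sum of per-instance regrets divided by the sum of |optimal true cost|."""
    total_regret = 0.0
    total_optimal = 0.0
    for pred, true in zip(preds, trues):
        # Decision induced by the predicted cost vector.
        w_pred = min(feasible, key=lambda w: dot(pred, w))
        # Best achievable true cost for this instance.
        z_true = min(dot(true, w) for w in feasible)
        total_regret += dot(true, w_pred) - z_true
        total_optimal += abs(z_true)
    return total_regret / total_optimal


# Hypothetical problem: pick exactly one of three options per instance.
feasible = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
trues = [[3.0, 5.0, 4.0], [2.0, 1.0, 6.0]]
preds = [[3.5, 5.0, 3.0], [2.0, 1.5, 6.0]]

score = normalized_regret(preds, trues, feasible)
print(score)
```

A value of zero means every predicted cost vector induces an optimal decision, regardless of how far the predictions are from the true costs.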

To rigorously assess the performance differences between approaches, we employed the Mann-Whitney U test, a non-parametric statistical test that evaluates whether two independent samples come from the same distribution. This test is particularly appropriate for our analysis as it makes no assumptions about the normality of the data and is well-suited for comparing the regret distributions. For sample sizes $n_1$ and $n_2$, a $U$ statistic further from its expected value under the null hypothesis, $n_1 n_2 / 2$, indicates greater separation between the distributions.
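For readers unfamiliar with the test, the $U$ statistic can be computed from rank sums as in the following sketch. The regret samples are hypothetical, and the code only computes the statistic; in practice a library routine such as `scipy.stats.mannwhitneyu` would also return the p-value.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U_x, U_y); the two statistics always sum to len(x) * len(y)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:n1])
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    return u_x, n1 * n2 - u_x


# Hypothetical normalized-regret samples for two methods.
regret_method_1 = [0.10, 0.12, 0.15]
regret_method_2 = [0.14, 0.20, 0.22, 0.25]

u1, u2 = mann_whitney_u(regret_method_1, regret_method_2)
print(u1, u2)
```

Here `u1` counts the pairs in which a value from the first sample exceeds a value from the second (ties count one half), so a value close to zero means the first method's regrets are almost always the smaller ones.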

Statistical analysis reveals particularly significant differences in the maximum wind scenario, where MLP + SPO+ significantly outperforms the two-stage approach ( $U = 130.0$ , $p = 0.024$ ). This advantage is also suggested, though not statistically significant at the $\alpha = 0.05$ level, in the maximum dangerous phenomenon scenario ( $U = 147.0$ , $p = 0.066$ ). These findings support the hypothesis that the ability to capture non-linear relationships proves beneficial in complex scenarios.

Normalized regret distributions across test sets for different scenarios.

Interestingly, while SPO+ generally shows favorable performance, the statistical tests reveal no significant differences between LR + SPO+ and LR + Two-Stage across most scenarios (all $p > 0.05$ ), with particularly similar performance in the maximum precipitation ( $U = 208.0$ , $p = 0.763$ ) and maximum visibility ( $U = 250.5$ , $p = 0.458$ ) scenarios. We further conducted Mann-Whitney U tests on the union of all data subsets (combining all weather scenarios). These aggregated results confirm no statistically significant differences between any of the methods; for example, SPO+ (MLP) vs. Two-Stage (LR) gives $U = 5248.0$ , $p = 0.549$ . This comprehensive analysis across all weather conditions further nuances our understanding of the relative performance of these approaches, suggesting that the advantages of SPO+ might be more subtle than initially apparent in certain contexts.

The variation in performance across different architectures and optimization frameworks provides valuable insights for practical implementations. The notable success of MLP + SPO+ not only demonstrates the advantage of end-to-end SPO learning but also highlights the importance of model expressiveness in capturing complex weather-related patterns. These findings suggest that while SPO+ generally provides stronger performance, the choice of underlying model architecture significantly influences the overall effectiveness of the optimization framework.

The mean cost comparison between FCFS, optimized true cost, and optimized predicted cost on test sets

The next analysis compares the optimized costs of the SPO+ and two-stage approaches based on the trained ML models. We input the features of the test sets to predict the costs and use these predictions to optimize each test instance via the ASP (Figure 7). With a cost of when optimizing using true landing times (representing the minimum achievable average cost under the specific scheduling constraints we defined), both MLP+SPO+ and LR+Two-Stage methods significantly outperform the FCFS baseline of . Interestingly, MLP+SPO+ shows particularly strong performance in scenarios optimized for minimum time interval, achieving a mean cost of compared to Two-Stage’s . This outperformance relative to the "optimal true cost" does not indicate a violation of optimization principles, but rather highlights a key insight: optimization using true landing times is not necessarily optimal for the complete operational context. The SPO+ approach can discover solutions that account for broader operational dynamics and uncertainty patterns that are not captured when directly optimizing with true landing times. The most significant insight emerges from examining performance across different weather conditions: while the Two-Stage approach maintains relatively uniform costs across all scenarios, the SPO+ method adapts to weather conditions, strategically accepting higher transit times under challenging conditions while finding better overall solutions.

This weather-responsive behavior of MLP+SPO+ represents a crucial advancement in arrival scheduling optimization. The systematically higher costs observed under extreme weather scenarios (ranging from to ) indicate that the model effectively incorporates weather-related risks into its decision-making process, making more conservative predictions when conditions are adverse. In contrast, the Two-Stage approach’s more uniform cost distribution suggests a limitation in capturing the complex interplay between weather conditions and optimal routing decisions. These findings indicate that while SPO+ might occasionally suggest higher transit times compared to the optimal true cost, these decisions reflect a trade-off between speed and safety, demonstrating the method’s capability to make more nuanced, context-aware aircraft arrival scheduling decisions.

In addition to the algorithmic analysis, we perform a delay assignment analysis to evaluate fairness in this model. We use the transit time difference for each aircraft and the number of position shifts in the maximum precipitation scenario, comparing MLP+SPO+ with optimization using true cost. This scenario is selected because it has the largest total cost for MLP+SPO+.

[Table:delay] reveals that MLP+SPO+ demonstrates improved fairness compared to optimization with true cost, as evidenced by lower mean transit time differences (18.60s vs. 43.62s), reduced standard deviation (181.67s vs. 236.69s), and fewer position shifts per instance (13 vs. 17) in the maximum precipitation scenario. With 15 aircraft per instance, MLP+SPO+ averages less than one position shift per aircraft ( $13 / 15 \approx 0.87$ ). However, since neither MLP+SPO+ nor the baseline explicitly incorporates fairness parameters, both methods exhibit high variability in transit time differences, reflected in the large standard deviations. This suggests that while MLP+SPO+ achieves better fairness outcomes implicitly through its learning framework, the absence of fairness-aware optimization leads to inconsistent treatment of individual aircraft. The results highlight the potential for further improvements by integrating fairness constraints directly into the model to reduce disparity and stabilize outcomes.
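The kind of delay assignment metrics discussed above can be computed as in the following sketch. The exact definitions are assumptions made for illustration: a position shift is taken as the absolute change in an aircraft's landing position relative to the FCFS order, summed over the instance, and the transit time difference is the scheduled minus the reference transit time. All callsigns and times are hypothetical.

```python
from statistics import mean, stdev


def position_shifts(fcfs_order, scheduled_order):
    """Total absolute displacement between two landing sequences (assumed definition)."""
    fcfs_position = {ac: i for i, ac in enumerate(fcfs_order)}
    return sum(abs(i - fcfs_position[ac]) for i, ac in enumerate(scheduled_order))


def transit_time_differences(reference_s, scheduled_s):
    """Per-aircraft scheduled minus reference transit time, in seconds."""
    return [scheduled_s[ac] - reference_s[ac] for ac in reference_s]


# Hypothetical four-aircraft instance.
fcfs_order = ["A", "B", "C", "D"]
scheduled_order = ["B", "A", "C", "D"]
reference_s = {"A": 600, "B": 620, "C": 700, "D": 710}
scheduled_s = {"A": 650, "B": 610, "C": 700, "D": 720}

shifts = position_shifts(fcfs_order, scheduled_order)
diffs = transit_time_differences(reference_s, scheduled_s)
summary = {
    "shifts_per_aircraft": shifts / len(fcfs_order),
    "mean_diff_s": mean(diffs),
    "std_diff_s": stdev(diffs),
}
print(summary)
```

Reporting the standard deviation alongside the mean matters here: a small mean difference can hide large delays imposed on a few aircraft, which is exactly the disparity a fairness constraint would target.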

Conclusion

This paper presents an application of the SPO framework to the Aircraft Arrival Scheduling Problem within Terminal Maneuvering Area. We developed an end-to-end learning approach that integrates arrival flight time prediction with scheduling optimization, specifically focusing on London Gatwick Airport operations. Our methodology introduces the concept of traffic instances for simultaneous prediction of multiple aircraft arrival times, coupled with a Mixed Integer Programming model for optimal aircraft arrival scheduling decisions.

The experimental results demonstrate several key findings. First, the MLP+SPO+ implementation outperforms traditional two-stage approaches across most scenarios, particularly under complex weather conditions. The framework adapts to varying weather conditions, strategically accepting higher transit times under adverse conditions while maintaining operational efficiency. When the minimum time interval is required, MLP+SPO+ yields costs around $43.4\%$ lower than the optimized true cost. Second, our analysis reveals that while simpler LR models with two-stage optimization can sometimes match SPO+ performance in specific scenarios (particularly low visibility conditions), the end-to-end approach generally provides more robust and adaptable solutions.

A critical consideration for practical implementation is balancing operational efficiency with ATC manageability and fairness to airlines. FCFS scheduling is conventionally favored for its simplicity and perceived fairness. Our proposed framework demonstrates that optimized sequences can achieve significant cost reductions without inherently compromising these priorities. Compared with benchmark optimization, MLP+SPO+ demonstrates enhanced fairness.

However, our study identifies important limitations and areas for refinement. Methodologically, our focus on isolating the SPO+ loss function’s impact led us to maintain consistency by using unnormalized inputs and Gradient Descent (GD) optimization across the compared methods (e.g., LR+SPO+ vs. LR+2S). While this consistency aids in evaluating the relative benefit of the SPO+ loss, it presents trade-offs. Using unnormalized inputs might not yield the absolute peak performance, particularly for MLP architectures known to benefit from normalization, although our results still confirmed the SPO+ advantage. Similarly, while GD (or other gradient-based methods) is inherent to optimizing the SPO+ loss, applying it to the LR+2S baseline (instead of standard OLS) ensures optimizer consistency for comparison but deviates from typical standalone LR practices. Furthermore, as our experiments suggested, optimal training, particularly concerning input normalization, appears sensitive to hyperparameter calibration, especially for LR models under GD where we encountered convergence challenges with normalization in our initial trials. Beyond these methodological considerations, a significant constraint remains the current SPO framework’s reliance on fixed optimization model structures (beyond objective costs), limiting adaptability to scenarios with varying constraints. Computational efficiency for larger instances and the lack of explicit fairness mechanisms, potentially leading to higher variation in delay assignment, are also key concerns.

Looking ahead, several promising research directions emerge. Extending the SPO framework itself, perhaps incorporating dynamic MIP parameter updates [Hu et al. 2023] and regret computations, is a key avenue. This could involve exploring diverse neural network architectures for traffic instance cost prediction. Crucially, a systematic investigation into the interplay between input normalization techniques, hyperparameter tuning, and model performance (both SPO+ and baselines) is warranted. This includes exploring individually optimized configurations, potentially using OLS for LR+2S baselines when comparing absolute achievable performance rather than isolating loss function effects. Improving computational efficiency, possibly through optimization problem relaxations, remains vital. The framework’s principles could also be extended to related scheduling or routing problems [Graham et al. 1979; Bianco et al. 1993], and transfer learning could enhance applicability across different airports. Lastly, systematically addressing fairness is essential. Future work should explicitly incorporate airline equity metrics (e.g., delay distribution thresholds) as constraints or weighted objectives in the optimization model, better aligning the framework with real-world ATC priorities while preserving its efficiency advantages.

These findings and identified future directions contribute to the growing body of research on ML applications in air traffic management, particularly in the critical area of arrival scheduling optimization. The demonstration of end-to-end SPO learning approaches suggests potential for further development and practical implementation in real-world airport operations.

Author contributions

Go Nam Lui: Conceptualization, methodology, formal analysis, data curation, software, resources, writing – original draft, writing – review & editing, visualization. Soner Demirel: Conceptualization, data curation, writing – original draft, writing – review & editing.

Funding statement

Go Nam Lui receives funding from UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee [grant number 10086651 (Lancaster University)]. Opinions expressed in this work reflect the authors’ views only, and the SESAR 3 JU and UKRI are not responsible for any use that may be made of the information contained herein.

Open data statement

All data analyzed during this study are publicly available in https://zenodo.org/records/14014439.

Reproducibility statement

The source code of this research is stored at https://github.com/harrylui1995/ASP_E2EPO.

References

Ayhan, S., Costas, P., and Samet, H. 2018. Predicting estimated time of arrival for commercial flights. Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, 33–42.

Beasley, J.E., Krishnamoorthy, M., Sharaiha, Y.M., and Abramson, D. 2000. Scheduling aircraft landings—the static case. Transportation science 34, 2, 180–197.

Beasley, J.E., Sonander, J., and Havelock, P. 2001. Scheduling aircraft landings at london heathrow using a population heuristic. Journal of the Operational Research Society 52, 5, 483–493.

Bianco, L., Mingozzi, A., and Ricciardelli, S. 1993. The traveling salesman problem with cumulative costs. Networks 23, 2, 81–91.

Boursier, L., Favennec, B., Hoffman, E., Trzmiel, A., Vergne, F., and Zeghal, K. 2007. Merging arrival flows without heading instructions. 7th USA/europe air traffic management r&d seminar, 1–8.

Cook, A.J. and Tanner, G. 2011. European airline delay cost reference values.

Du, Z., Zhang, J., and Kang, B. 2023. A data-driven method for arrival sequencing and scheduling problem. Aerospace 10, 1, 62.

Elmachtoub, A.N. and Grigas, P. 2022. Smart “predict, then optimize.” Management Science 68, 1, 9–26.

EUROCONTROL. 2011. Algorithm to describe weather conditions at European airports. https://www.eurocontrol.int/sites/default/files/publication/files/algorithm-met-technical-note.pdf.

Glina, Y., Jordan, R., and Ishutkina, M. 2012. A tree-based ensemble method for the prediction and uncertainty quantification of aircraft landing times. American meteorological society–10th conference on artificial intelligence applications to environmental science, new orleans, LA.

Graham, R.L., Lawler, E.L., Lenstra, J.K., and Kan, A.R. 1979. Optimization and approximation in deterministic sequencing and scheduling: A survey. In: Annals of discrete mathematics. Elsevier, 287–326.

Hu, X., Lee, J.C., and Lee, J.H. 2023. Predict+ optimize for packing and covering LPs with unknown parameters in constraints. Proceedings of the AAAI conference on artificial intelligence, 3987–3995.

Ikli, S., Mancel, C., Mongeau, M., Olive, X., and Rachelson, E. 2021. The aircraft runway scheduling problem: A survey. Computers & Operations Research 132, 105336.

Kern, C.S., Medeiros, I.P. de, and Yoneyama, T. 2015. Data-driven aircraft estimated time of arrival prediction. 2015 annual IEEE systems conference (syscon) proceedings, IEEE, 727–733.

Li, K. and Malik, J. 2016. Learning to optimize. arXiv preprint arXiv:1606.01885.

Lui, G.N., Hon, K.K., and Liem, R.P. 2022. Weather impact quantification on airport arrival on-time performance through a bayesian statistics modeling approach. Transportation Research Part C: Emerging Technologies 143, 103811.

Lui, G.N., Klein, T., and Liem, R.P. 2020. Data-driven approach for aircraft arrival flow investigation at terminal maneuvering area. AIAA aviation forum, 2869.

Lui, G.N., Nguyen, C.H., Hui, K.Y., Hon, K.K., and Liem, R.P. 2025. Enhancing aircraft arrival transit time prediction: A two-stage gradient boosting approach with weather and trajectory features. Journal of the Air Transport Research Society 4, 100062.

Ma, Y., Du, W., Chen, J., Zhang, Y., Lv, Y., and Cao, X. 2022. A spatiotemporal neural network model for estimated-time-of-arrival prediction of flights in a terminal maneuvering area. IEEE Intelligent Transportation Systems Magazine 15, 1, 285–299.

Messaoud, M.B. 2021. A thorough review of aircraft landing operation from practical and theoretical standpoints at an airport which may include a single or multiple runways. Applied Soft Computing 98, 106853.

Pang, Y., Zhao, P., Hu, J., and Liu, Y. 2024. Machine learning-enhanced aircraft landing scheduling under uncertainties. Transportation Research Part C: Emerging Technologies 158, 104444.

Pohl, M., Kolisch, R., and Schiffer, M. 2021. Runway scheduling during winter operations. Omega 102, 102325.

Prakash, R., Piplani, R., and Desai, J. 2018. An optimal data-splitting algorithm for aircraft scheduling on a single runway to maximize throughput. Transportation Research Part C: Emerging Technologies 95, 570–581.

Sáez, R., Prats, X., Polishchuk, T., and Polishchuk, V. 2020. Traffic synchronization in terminal airspace to enable continuous descent operations in trombone sequencing and merging procedures: An implementation study for frankfurt airport. Transportation Research Part C: Emerging Technologies 121, 102875.

Sama, M., D’Ariano, A., Toli, A., Pacciarelli, D., and Corman, F. 2015. A variable neighborhood search for optimal scheduling and routing of take-off and landing aircraft. 2015 international conference on models and technologies for intelligent transportation systems (MT-ITS), IEEE, 491–498.

Schäfer, M., Strohmeier, M., Lenders, V., Martinovic, I., and Wilhelm, M. 2014. Bringing up OpenSky: A large-scale ADS-b sensor network for research. IPSN-14 proceedings of the 13th international symposium on information processing in sensor networks, IEEE, 83–94.

Silvestre, J., Martínez-Prieto, M.A., Bregon, A., and Álvarez-Esteban, P.C. 2024. A deep learning-based approach for predicting in-flight estimated time of arrival. The Journal of Supercomputing, 1–35.

Sprong, K.R., Haltli, B.M., DeArmon, J.S., Bradley, S., et al. 2005. Improving flight efficiency through terminal area RNAV. 6th USA/europe air traffic management r&d seminar.

Takacs, G. 2014. Predicting flight arrival times with a multistage model. 2014 IEEE international conference on big data (big data), IEEE, 78–84.

Tang, B. and Khalil, E.B. 2024. PyEPO: A PyTorch-based end-to-end predict-then-optimize library for linear and integer programming. Mathematical Programming Computation.

Wang, Z., Liang, M., and Delahaye, D. 2018. A hybrid machine learning model for short-term estimated time of arrival prediction in terminal manoeuvring area. Transportation Research Part C: Emerging Technologies 95, 280–294.

Xu, B. 2017. An efficient ant colony algorithm based on wake-vortex modeling method for aircraft scheduling problem. Journal of Computational and Applied Mathematics 317, 157–170.

Zhang, J., Peng, Z., Yang, C., and Wang, B. 2022. Data-driven flight time prediction for arrival aircraft within the terminal area. IET Intelligent Transport Systems 16, 2, 263–275.