Generative Short-Term Aircraft Trajectory Prediction with Conditional Flow Matching

Benoit Figuet, Timothé Krauth, Steve Barry
This web version is automatically generated from the LaTeX source and may not include all elements. For complete details, please refer to the PDF version.

Abstract

Reliable short-term aircraft trajectory prediction is essential for safety and efficiency in Air Traffic Management (ATM). This work introduces a generative framework for probabilistic 4D trajectory forecasting based on Conditional Flow Matching (CFM), a recent deep generative modeling approach that combines stable likelihood-based training with efficient sampling. The model is trained on historical ADS–B data from the OpenSky Network to predict aircraft motion over a 60 s horizon, conditioned on the preceding 60 s of observations. The model generates ensembles of realistic future trajectories that capture the inherent uncertainty of aircraft motion and enable probabilistic assessment of potential conflicts. As an application, we estimate the probability of mid-air collision during a loss-of-separation event using Monte Carlo simulation over the generated trajectories, providing a quantitative risk measure. The results demonstrate that flow-based generative modeling offers a principled foundation for uncertainty-aware trajectory prediction and safety analysis in ATM.

Introduction

Reliable short-term aircraft trajectory prediction is fundamental to safe and efficient Air Traffic Management (ATM). Operational safety nets such as TCAS II [Munoz et al. 2013] and Short-Term Conflict Alert (STCA) [2017] rely on linear extrapolations to generate collision alerts. While simple and robust, such deterministic approaches cannot capture the uncertainty and variability inherent in real-world trajectories.

Effective short-term trajectory prediction (STTP) algorithms have immediate benefit to Air Navigation Service Providers (ANSPs) and regulators. A central task for ANSPs is assessing risk for numerous airspace occurrences, such as a loss of separation (LOS) or TCAS events, as well as thousands of conflicts detected by data mining all surveillance tracks. An essential component of this assessment is determining whether each detected conflict is real or a false positive, for instance in cases where aircraft were expected to turn as part of a published procedure before any potential collision.

As illustrated schematically in Figure 1, linear extrapolation can indicate a high-risk situation if no deviation occurs, yet it is often unclear whether the aircraft had intended to turn as part of its standard path. Historical trajectories can reveal whether an aircraft was following an established procedure or deviating from it, thus determining whether the risk was genuine or merely apparent. In practice, this distinction is rarely binary: large-scale surveillance data exhibit significant variability, and data-driven models are needed to capture the range of plausible futures consistent with observed intent.


Schematic illustration of conflict evaluation for short-term trajectory prediction. (a) A potential conflict between two aircraft is detected from surveillance tracks. (b) Linear extrapolation of the current trajectories indicates a possible collision at the projected point of highest risk. (c) Historical trajectories reveal that the aircraft were expected to turn as part of a standard procedure, suggesting a false positive. (d) Historical trajectories reveal that the red aircraft was expected to continue straight, suggesting a true positive. The history of trajectories provides probabilistic evidence of intent, enabling data-driven classification of real versus false conflicts.

In recent years, research has increasingly explored data-driven prediction methods to address these challenges. Liu and Hansen [Liu and Hansen 2018] proposed a deep generative convolutional recurrent network for multimodal trajectory prediction, while Krauth et al. [Krauth et al. 2021] introduced multivariate density models to synthesize realistic aircraft trajectories. Jarry et al. [Jarry et al. 2019] employed a Generative Adversarial Network (GAN) to learn the probability distributions of real aircraft approach paths, enabling the generation of realistic trajectories and the detection of atypical flight behaviors. Zeng et al. [Zeng et al. 2022] provide a comprehensive review of trajectory prediction techniques, emphasizing both progress and remaining challenges. Despite these advances, most models still predict a single deterministic trajectory, making uncertainty quantification difficult.

To address this limitation, Krauth et al. [Krauth et al. 2025] recently proposed a multi-objective CNN–LSTM architecture that predicts not only the expected trajectory but also spatio-temporal confidence areas, enabling the construction of 95% prediction intervals for each state component. These developments highlight a growing recognition that uncertainty-aware prediction is essential for robust ATM applications.

Building on these efforts, this paper introduces a generative framework for probabilistic short-term trajectory forecasting based on Conditional Flow Matching (CFM) [Lipman et al. 2023], which learns to transform random noise into plausible trajectories via ordinary differential equations (ODEs). Given one minute of observation, a Transformer-based conditional flow estimates the distribution of the trajectory for the next minute conditioned on observed inputs. The one-minute prediction horizon is chosen to align with the operational timescales of airborne safety nets such as TCAS, whose alerting logic typically operates within a 30–45 s look-ahead window [Munoz et al. 2013]. In practice, the proposed framework is not limited to this horizon and can be extended to longer prediction intervals as required by specific Air Traffic Management applications.

This formulation allows the generation of multiple plausible future trajectories that capture the stochastic nature of real aircraft motion, offering a data-driven means to assess both real and false conflicts within a probabilistic risk-assessment framework. We favor CFM over GANs [Goodfellow et al. 2020], VAEs [Kingma and Welling 2013], or standard diffusion models [Ho et al. 2020] because it affords explicit likelihood training and hence naturally calibrated uncertainties, uses a stable regression-based objective that avoids the adversarial instabilities of GANs and the heavy noise-schedule simulation burden of diffusion models, and enables faster inference via simpler flow paths with fewer integration steps [Tong et al. 2023].

Background: Conditional Flow Matching

Flow Matching (FM) is a framework for training generative models via continuous flows [Lipman et al. 2023; Liu et al. 2022]. The key idea is to describe the transformation from a simple base distribution (e.g., Gaussian noise) to a complex data distribution (e.g., aircraft trajectories) as the solution of an ordinary differential equation (ODE) driven by a time-dependent vector field $v_t$. A flow $\phi_t$, defined as the solution to this ODE, maps samples from the prior to the data space. Flow Matching provides a simulation-free method to learn this vector field by regressing it against a target vector field $u_t$ that generates a desired probability path $\{p_t\}_{t \in [0,1]}$ connecting a prior distribution $p_0$ to a target data distribution $p_1$.

This section summarizes the Flow Matching and Conditional Flow Matching results introduced by Lipman et al. [Lipman et al. 2023] and further developed in subsequent lecture notes [Holderrieth and Erives 2025].

Flow Matching

Let $p(x)$ be a simple, tractable prior distribution (e.g., a standard normal $\mathcal{N}(x \mid 0, I)$) and let $q(x_1)$ be the target data distribution from which we can draw samples. We consider a probability path $p_t$ such that $p_0 = p$ and $p_1$ approximates $q$. This path is generated by an unknown target vector field $u_t$. The goal is to train a neural network $v_t(x;\theta)$ to approximate $u_t$.

The Flow Matching (FM) objective is a regression loss defined as:
$$\mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t \sim U[0,1],\, x \sim p_t(x)} \left\lVert v_t(x;\theta) - u_t(x) \right\rVert^2. \label{eq:fm-loss}$$
Here $\lVert\cdot\rVert$ denotes the Euclidean ($\ell_2$) norm. Minimizing this objective forces the learned vector field $v_t$ to match the target field $u_t$. At inference, we can generate new samples by solving the initial value problem $\frac{d}{dt}X_t = v_t(X_t;\theta)$ from $t=0$ to $t=1$, with $X_0 \sim p(x)$. However, this objective is intractable because both the marginal path $p_t(x)$ and its vector field $u_t(x)$ are generally unknown.

Conditional Flow Matching

CFM reformulates the problem to be solvable in practice. The core idea is to construct the intractable marginal path $p_t(x)$ by marginalizing over a set of simpler, per-sample conditional probability paths $p_t(x \mid x_1)$:
$$p_t(x) = \int p_t(x \mid x_1)\, q(x_1)\, dx_1.$$
Each conditional path is designed to start from the prior at $t=0$ (i.e., $p_0(x \mid x_1) = p(x)$) and end in a distribution concentrated around a specific data sample $x_1$ at $t=1$. The corresponding marginal vector field $u_t(x)$ can also be expressed as an aggregation of the conditional vector fields $u_t(x \mid x_1)$.

The key insight of CFM is that the gradients of the intractable FM objective [eq:fm-loss] are identical to the gradients of a much simpler objective that uses the conditional paths directly. The CFM objective is:
$$\mathcal{L}_{\mathrm{CFM}}(\theta) = \mathbb{E}_{t \sim U[0,1],\, x_1 \sim q(x_1),\, x \sim p_t(x \mid x_1)} \left\lVert v_t(x;\theta) - u_t(x \mid x_1) \right\rVert^2. \label{eq:cfm-loss}$$
This loss is tractable because sampling from $p_t(x \mid x_1)$ and evaluating its vector field $u_t(x \mid x_1)$ can be done in closed form for well-chosen conditional paths.

Gaussian and Optimal Transport Paths

A general and effective choice for the conditional paths are Gaussian paths of the form:
$$p_t(x \mid x_1) = \mathcal{N}\big(x \mid \mu_t(x_1),\, \sigma_t(x_1)^2 I\big),$$
where the time-dependent mean $\mu_t(x_1)$ and standard deviation $\sigma_t(x_1)$ satisfy the boundary conditions $\mu_0(x_1)=0$, $\sigma_0(x_1)=1$ and $\mu_1(x_1)=x_1$, $\sigma_1(x_1)=\sigma_{\min}$, with $\sigma_{\min}$ a small positive constant. These paths define a smooth probability path from the base distribution at $t=0$ (typically standard normal noise) to a distribution concentrated around a particular data example $x_1$ at $t=1$. Intuitively, $\mu_t$ controls the drift toward $x_1$, while $\sigma_t$ controls how quickly uncertainty is removed along the path. In the remainder of this paper, we use the common simplifying choice $\sigma_{\min}=0$ (deterministic endpoint at $t=1$), which yields the linear interpolation in Eq. [eq:fm_target]. The vector field that generates this path is given by:
$$u_t(x \mid x_1) = \frac{\sigma_t'(x_1)}{\sigma_t(x_1)}\big(x - \mu_t(x_1)\big) + \mu_t'(x_1). \label{eq:cfm-vectorfield}$$
Here, primes denote derivatives with respect to the scalar flow-time $t$ (holding $x_1$ fixed).

A particularly powerful instance uses linear schedules for the mean and standard deviation, which corresponds to the Optimal Transport (OT) displacement interpolant between the Gaussians at $t=0$ and $t=1$. Setting
$$\mu_t(x_1) = t\,x_1 \quad \text{and} \quad \sigma_t(x_1) = 1 - (1-\sigma_{\min})\,t,$$
the target vector field in Equation [eq:cfm-vectorfield] simplifies to:
$$u_t(x \mid x_1) = \frac{x_1 - (1-\sigma_{\min})\,x}{1 - (1-\sigma_{\min})\,t}.$$
This vector field has a direction that is constant over time, making it simpler for a neural network to learn. The resulting paths move along straight lines from noise to data, leading to more efficient training and sampling.
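As a concrete illustration, the OT path and its target field fit in a few lines of numpy. With $\sigma_{\min}=0$ the target collapses to $x_1 - \varepsilon$ at every $t$, which is easy to verify numerically. Function names are ours, not from the paper's codebase.

```python
import numpy as np

def ot_path_sample(x1, eps, t, sigma_min=0.0):
    """Draw x_t from the Gaussian OT path: mean t*x1, std 1-(1-sigma_min)*t."""
    return t * x1 + (1.0 - (1.0 - sigma_min) * t) * eps

def ot_target_field(x, x1, t, sigma_min=0.0):
    """Conditional target vector field u_t(x | x1) for the OT path."""
    return (x1 - (1.0 - sigma_min) * x) / (1.0 - (1.0 - sigma_min) * t)
```

For $\sigma_{\min}=0$, substituting $x_t = t x_1 + (1-t)\varepsilon$ into the field gives $x_1 - \varepsilon$ regardless of $t$, i.e. the constant-direction property noted above.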

CFM in Practice:

We train a vector-field network $v_t(x;\theta)$ (optionally $v_t(x, c;\theta)$ with context $c$) that predicts the instantaneous velocity of a sample $x$ along a probability path from a simple prior $p_0$ to the data distribution. Conceptually, $v_t$ replaces the unknown target field $u_t$ and encodes “how to move” data at each time $t \in [0,1]$.

Learning is pure regression: minimize the Conditional Flow Matching loss in [eq:cfm-loss], which is an MSE between the network and a closed-form target field $u_t(x \mid x_1)$ defined by the chosen conditional path (e.g., the Gaussian/OT path of [eq:cfm-vectorfield]). This objective provides unbiased gradients for the intractable FM loss and requires no likelihoods, scores, or simulation of trajectories during training.

At each iteration, draw a random time $t$, a data example $x_1$, and a synthetic point $x \sim p_t(x \mid x_1)$ from the conditional path; compute the analytic target $u_t(x \mid x_1)$; and regress $v_t(x;\theta)$ toward it with MSE. Repeat over mini-batches with the optimizer of choice.

After training, generate by integrating the learned ODE $\frac{d}{dt}X_t = v_t(X_t;\theta)$ from $t=0$ to $t=1$ starting at $X_0 \sim p_0$ (e.g., standard normal). Any standard ODE solver (Euler/Heun/RK) with a modest number of steps suffices; $X_1$ is the synthesized sample.
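A minimal forward-Euler sampler illustrates this generation step. The `oracle` field below is the closed-form marginal field for a point-mass target with $\sigma_{\min}=0$, an analytically solvable stand-in for a trained network, so the integration transports any start point onto $x_1$.

```python
import numpy as np

def sample_flow(vector_field, x0, n_steps=20):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with forward Euler."""
    x, h = np.array(x0, dtype=float), 1.0 / n_steps
    for n in range(n_steps):
        x = x + h * vector_field(x, n * h)
    return x

# Toy "oracle" field: when the data distribution is a point mass at x1
# (sigma_min = 0), the marginal field is u_t(x) = (x1 - x) / (1 - t).
x1 = np.array([3.0, -1.0, 2.0])
oracle = lambda x, t: (x1 - x) / (1.0 - t)
```

With uniform steps the last evaluation occurs at $t = 1 - h$, so the $1/(1-t)$ singularity at $t=1$ is never hit.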

Methodology

Data and Preprocessing

We use one month of ADS–B surveillance data from the OpenSky Network [Schäfer et al. 2014], restricted to flights above FL195 within the Swiss Free Route Airspace (FRA) and collected with the traffic library [Olive 2019]. All trajectories are resampled at 1 Hz.

From ADS–B state vectors, we derive a consistent kinematic representation. Latitude and longitude are projected to the Swiss projected grid (CH1903+/LV95; EPSG:2056), yielding planar coordinates $(x, y)$ with $x$ Easting and $y$ Northing; altitude is converted to meters $(z)$. Groundspeed $v$ and track angle $\theta$ (clockwise from North) define horizontal velocity components $(v_x = v\sin\theta,\; v_y = v\cos\theta)$, while the vertical rate provides $v_z$ (ft/min converted to m/s). A turn-rate proxy $\dot{\psi}$ is computed from the unwrapped angular change between successive horizontal velocity vectors, divided by the sampling interval (1 s), and clipped to $\pm 0.25\,\text{rad/s}$ to suppress outliers. Each trajectory point in the global frame is thus
$$(x,\, y,\, z,\, v_x,\, v_y,\, v_z,\, \dot{\psi}),$$
a 7-dimensional state encoding position and motion.
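A numpy sketch of this preprocessing; helper names and the unit-conversion constants are our assumptions about typical ADS–B units (knots, degrees, ft/min), not the paper's code.

```python
import numpy as np

KT_TO_MS = 0.514444      # knots -> m/s      (assumed ADS-B units)
FTMIN_TO_MS = 0.00508    # ft/min -> m/s

def velocity_components(gs_kt, track_deg, vrate_ftmin):
    """(v_x, v_y, v_z) from groundspeed, track (deg, clockwise from
    North) and vertical rate."""
    theta = np.deg2rad(track_deg)
    v = gs_kt * KT_TO_MS
    return v * np.sin(theta), v * np.cos(theta), vrate_ftmin * FTMIN_TO_MS

def turn_rate_proxy(vx, vy, dt=1.0, clip=0.25):
    """Unwrapped heading change between successive velocity vectors
    (rad/s), clipped to suppress outliers; the first value is duplicated
    to keep the series length."""
    heading = np.unwrap(np.arctan2(vx, vy))   # angle from North, clockwise
    psi_dot = np.diff(heading) / dt
    psi_dot = np.concatenate([psi_dot[:1], psi_dot])
    return np.clip(psi_dot, -clip, clip)
```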

Examples are constructed as sliding windows across flights. Each sample comprises 60 s of history sampled at 1 Hz and a 60 s prediction horizon; futures are down-sampled every 5 s, yielding 12 targets per window. Splits are performed at flight level to eliminate leakage between train, validation, and test sets. The training set contains 1,000,000 input–output pairs, while the validation and test sets each contain 200,000 samples.

To ensure exposure to maneuvering behavior, at least 30% of the training and validation samples contain a turn, defined as $\geq 3$ consecutive steps with $\dot{\psi} > 0.01\,\text{rad/s}$ occurring in the history and/or future portion of the window; the remaining samples are drawn uniformly to preserve overall traffic statistics. The test set is sampled uniformly without turn constraints.

To reduce variance and improve generalization, we transform each window into an aircraft-centric frame fixed by the last observed state. The last observed position defines the origin, and the last horizontal velocity vector defines the forward axis; the frame does not rotate over the prediction horizon. We denote aircraft-centric quantities with tildes. The per-timestep input sequence is
$$(\tilde{x},\, \tilde{y},\, \tilde{z},\, \tilde{v}_x,\, \tilde{v}_y,\, \tilde{v}_z,\, \dot{\psi}),$$
where $(\tilde{x}, \tilde{y}, \tilde{z})$ and $(\tilde{v}_x, \tilde{v}_y, \tilde{v}_z)$ are positions and velocities expressed in this fixed local frame, while the turn rate $\dot{\psi}$ is unchanged.
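The fixed aircraft-centric frame can be sketched as a translation by the last position followed by a rotation that aligns the last horizontal velocity with a forward (+y) axis. The exact axis convention below is our assumption; only the translate-then-rotate structure comes from the text.

```python
import numpy as np

def to_aircraft_frame(xy, v_xy, origin, v_last):
    """Translate by the last observed position and rotate so the last
    horizontal velocity points along +y (forward). Convention assumed
    here: theta measured clockwise from North, theta = atan2(vx, vy)."""
    theta = np.arctan2(v_last[0], v_last[1])
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])      # world -> local rotation
    return (xy - origin) @ R.T, v_xy @ R.T

def to_global_frame(xy_loc, origin, v_last):
    """Inverse mapping: rotate back and translate by the absolute
    reference, as done when denormalizing sampled futures."""
    theta = np.arctan2(v_last[0], v_last[1])
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    return xy_loc @ R + origin
```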

In addition, we provide an 8-dimensional context vector that captures the absolute reference state at the history endpoint:
$$\bigl(x_{\text{abs}},\, y_{\text{abs}},\, z_{\text{abs}},\, \cos\theta,\, \sin\theta,\, v_{\text{gs,last}},\, v_{z,\text{last}},\, \dot{\psi}_{\text{last}}\bigr),$$
where $(x_{\text{abs}}, y_{\text{abs}}, z_{\text{abs}})$ are absolute LV95 coordinates (m), $(\cos\theta, \sin\theta)$ encode the track angle, $v_{\text{gs,last}}$ is the ground speed, $v_{z,\text{last}}$ is the vertical speed, and $\dot{\psi}_{\text{last}}$ is the final turn rate. Thus, the model ingests (i) a 7-D aircraft-centric trajectory sequence capturing local dynamics and (ii) an 8-D global context anchoring the sequence in absolute space and orientation.

Both the sequence and the context are standardized using their own mean and variance, estimated on the training set and applied to all splits.

Model Architecture

Our predictor is a Transformer encoder–decoder that learns to map an observed flight history to a distribution of future trajectories. The model operates on three types of input:

  1. History sequence: the last 60 s of aircraft motion (7 features per timestep).

  2. Context vector: an 8-D descriptor of the aircraft’s absolute position and orientation at the end of the history.

  3. Noisy future: a sequence of future states during training (interpolated between Gaussian noise and target trajectory); during inference, this starts as pure Gaussian noise and gets progressively denoised to generate predictions.

History encoder:

The 60 history steps (7 features each) are first linearly projected to 512 dimensions and enriched with positional encodings. The 8-D global context vector is mapped to 512 dimensions and prepended as an extra token at the front of the sequence, so that the Transformer can jointly attend to context and history (akin to a [CLS] token [Devlin et al. 2019]). This combined sequence of 61 tokens (each 512-D) is processed by six Transformer encoder layers, producing a latent representation of the past trajectory that serves as memory for the decoder.

Time embedding:

The flow-matching process depends on a scalar time variable $t \in [0,1]$, which indicates how far we are between pure noise ($t=0$) and the true future ($t=1$). To make this information usable by the Transformer, $t$ is first expanded into sinusoidal features using 64 frequencies (yielding 128 features: sine and cosine). These are passed through a small MLP: a fully connected layer maps the 128 inputs to 256 hidden units with SiLU activation, followed by a second fully connected layer mapping 256 to 512 units. The resulting 512-D time embedding is added to both the noisy future tokens and the encoded history, so the model always knows “when” in the flow it is operating.
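A numpy sketch of this embedding. Only the 128 → 256 → 512 shapes and the SiLU activation come from the text; the geometric frequency schedule and the weight initialization are our assumptions.

```python
import numpy as np

def sinusoidal_features(t, n_freq=64):
    """Expand scalar flow-time t into 2*n_freq features (sin then cos).
    The geometric frequency schedule is an assumption of this sketch."""
    freqs = np.exp(np.linspace(0.0, np.log(1000.0), n_freq))
    return np.concatenate([np.sin(t * freqs), np.cos(t * freqs)])

def silu(x):
    return x / (1.0 + np.exp(-x))        # equivalent to x * sigmoid(x)

def time_embedding(t, w1, b1, w2, b2):
    """128 -> 256 (SiLU) -> 512 MLP applied to the sinusoidal features."""
    return silu(sinusoidal_features(t) @ w1 + b1) @ w2 + b2
```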

Future denoiser:

The noisy future sequence is projected into the latent space and processed by an eight-layer Transformer decoder with self-attention across future tokens and cross-attention to the encoded history. Finally, a linear layer maps the decoder output back to 7 physical features per step, representing the predicted vector field
$$v_\theta(x_t, t \mid \text{history}, \text{context}) \in \mathbb{R}^{12 \times 7}.$$

Architecture of the flow-matching Transformer model. The history encoder processes past trajectory data and context, the time embedding provides temporal conditioning, and the future denoiser generates denoised predictions.

In essence, the encoder compresses the past 60 s of motion into a latent memory, the time embedding guides how noise is transformed along the flow, and the decoder denoises 12 future steps conditioned on both history and context. The complete architecture is illustrated in Figure 2.

Training and Inference

Let the normalized future be $x_1 \in \mathbb{R}^{K \times 7}$ and partition the channel dimension as $x_1 = (x_1^{\text{pos}},\, x_1^{\text{vel}},\, x_1^{\psi})$ with shapes $K \times 3$, $K \times 3$, and $K \times 1$, respectively, where $K$ denotes the number of predicted time steps.

We instantiate the OT-style Gaussian conditional path of Section 2.3 with $\sigma_{\min}=0$. Sample $\boldsymbol{\varepsilon} \sim \mathcal{N}(0, I)$ and $t \sim U[0,1]$, and form
$$x_t = (1-t)\,\boldsymbol{\varepsilon} + t\,x_1, \qquad x_t \mid x_1 \sim \mathcal{N}\bigl(t\,x_1,\, (1-t)^2 I\bigr), \label{eq:fm_target_xt}$$

and define the conditional target vector field
$$u = u_t(x_t \mid x_1) = \frac{x_1 - x_t}{1-t} = x_1 - \boldsymbol{\varepsilon}. \label{eq:fm_target}$$

This corresponds to the OT-style Gaussian path with $\sigma_{\min}=0$ (so $x_t$ is a convex combination of noise and data), for which $u_t(x_t \mid x_1)$ is constant in $t$.

The network outputs $v_\theta(x_t, t)$ and we minimize a weighted MSE (i.e., $\mathbb{E}_{t,\varepsilon}\,\lVert v_\theta - u \rVert^2$) expressed in target-space components:
$$\hat{x}_1 = \boldsymbol{\varepsilon} + v_\theta(x_t, t), \qquad \mathcal{L} = \lambda_{\text{pos}}\,\lVert \hat{x}_1^{\text{pos}} - x_1^{\text{pos}} \rVert^2 + \lambda_{\text{vel}}\,\lVert \hat{x}_1^{\text{vel}} - x_1^{\text{vel}} \rVert^2 + \lambda_{\psi}\,\lVert \hat{x}_1^{\psi} - x_1^{\psi} \rVert^2,$$
with $\lambda_{\text{pos}}=1.0$, $\lambda_{\text{vel}}=0.5$, and $\lambda_{\psi}=0.05$. These weights were chosen empirically to balance the different scales and importance of position, velocity, and turn-rate errors during training. Here $\lVert\cdot\rVert^2$ denotes the sum of squared errors over tokens and channels.
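One training draw and the weighted loss can be sketched as follows, with the network output left as an argument; function names are illustrative, and the channel weights are those quoted above.

```python
import numpy as np

LAMBDA = {"pos": 1.0, "vel": 0.5, "psi": 0.05}   # channel weights

def weighted_cfm_loss(v_pred, x1, eps):
    """Weighted MSE in target space: x1_hat = eps + v_theta(x_t, t).
    Channels 0-2 = position, 3-5 = velocity, 6 = turn rate."""
    err2 = (eps + v_pred - x1) ** 2               # (K, 7) squared errors
    return (LAMBDA["pos"] * err2[:, 0:3].sum()
            + LAMBDA["vel"] * err2[:, 3:6].sum()
            + LAMBDA["psi"] * err2[:, 6:7].sum())

def training_example(x1, rng):
    """One CFM draw: t ~ U[0,1], eps ~ N(0,I), x_t on the sigma_min=0 path."""
    t = rng.uniform()
    eps = rng.normal(size=x1.shape)
    x_t = (1.0 - t) * eps + t * x1
    target = x1 - eps                             # u_t(x_t | x1), constant in t
    return x_t, t, eps, target
```

A perfect prediction ($v_\theta = x_1 - \boldsymbol{\varepsilon}$) yields zero loss, which makes for a quick sanity check.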

We train with AdamW, a warmup+cosine learning-rate schedule, dropout 0.1, and maintain an exponential moving average (EMA) of parameters for evaluation and checkpointing.

At test time, we sample from the learned flow by initializing $x_{t=0} \sim \mathcal{N}(0, I)$ and integrating the ODE
$$\frac{d x_t}{d t} = v_\theta\!\left(x_t, t \mid \text{history}, \text{context}\right), \label{eq:cfm_ode}$$

from $t=0$ to $t=1$ using a predictor–corrector scheme (Heun’s method; an explicit trapezoidal / second-order Runge–Kutta integrator) with a fixed number of steps. This yields a 12-token aircraft-centric future. Samples are then denormalized and mapped back to the global frame by (i) inverse-rotating $(\tilde{x}, \tilde{y})$ and $(\tilde{v}_x, \tilde{v}_y)$ using $(\cos\theta, \sin\theta)$ from the context (fixed frame), and (ii) translating by the absolute reference $(x_{\text{abs}}, y_{\text{abs}}, z_{\text{abs}})$. Repeating the sampling procedure produces ensembles of plausible 60 s futures conditioned on the observed 60 s history.
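A minimal sketch of the Heun predictor–corrector used for sampling; the trained network is abstracted as a callable, and the toy field in the test is linear in $t$, for which Heun is exact.

```python
import numpy as np

def heun_sample(v_theta, x0, n_steps=20):
    """Heun (explicit trapezoidal) integration of dx/dt = v_theta(x, t)
    from t=0 to t=1: predict with Euler, correct with the average slope."""
    x, h = np.array(x0, dtype=float), 1.0 / n_steps
    for n in range(n_steps):
        t = n * h
        k1 = v_theta(x, t)                   # slope at the current state
        k2 = v_theta(x + h * k1, t + h)      # slope at the Euler prediction
        x = x + 0.5 * h * (k1 + k2)
    return x
```

Compared with Euler, each step costs one extra network evaluation but is second-order accurate, which is why a modest fixed step count suffices.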

Accuracy and Probabilistic Calibration

We evaluate both point accuracy and probabilistic calibration of the proposed CFM forecaster on held-out test windows. Unless noted otherwise, we report results over $N$ windows (default $N=512$) with $K$ forecast steps (default $K=12$, i.e., a 5 s stride over a 60 s horizon).

Given a history $H$, a context $c$, and the learned vector field $v_\theta(\cdot, t \mid H, c)$, we draw $S$ forecast samples by integrating the ODE with $S$ distinct initial noises in the aircraft-centric normalized frame. Each trajectory is then denormalized and mapped back to the global LV95 frame using the inverse of the per-window normalization and the fixed-frame transformation. We denote global positions by $\hat{y}^{(s)}_{\tau} \in \mathbb{R}^3$ (Easting, Northing, altitude) and the ground truth by $y^{\star}_{\tau}$ for $\tau = 1, \dots, K$.

Deterministic Accuracy vs. Horizon

We evaluate geometric prediction error at each forecast horizon $\tau$ (in seconds) using two metrics: mean absolute error (MAE) and root mean square error (RMSE). We compare three predictors:

  1. Model (mean): Ensemble mean $\bar{y}_\tau = \frac{1}{S}\sum_{s=1}^S \hat{y}_\tau^{(s)}$.

  2. Model (best-of-$S$): A diagnostic lower bound selecting the sample closest to ground truth, $y^{\text{best}}_\tau = \arg\min_{s \le S} \bigl\lVert \hat{y}^{(s)}_\tau - y^\star_\tau \bigr\rVert_2$, which probes ensemble coverage.

  3. Constant-velocity baseline: Linear extrapolation in global coordinates using the last observed ground and vertical speeds.

For any predictor $y^{(\cdot)}_\tau$, errors over $N$ test windows are
$$\begin{aligned} \mathrm{MAE}_\tau &= \frac{1}{N}\sum_{n=1}^N \bigl\lVert y^{(\cdot,n)}_{\tau} - y^{\star(n)}_{\tau} \bigr\rVert_2, \label{eq:mae}\\ \mathrm{RMSE}_\tau &= \sqrt{\frac{1}{N}\sum_{n=1}^N \bigl\lVert y^{(\cdot,n)}_{\tau} - y^{\star(n)}_{\tau} \bigr\rVert_2^2}. \label{eq:rmse} \end{aligned}$$

We report $\mathrm{MAE}_\tau$ and $\mathrm{RMSE}_\tau$ for all three predictors as functions of the horizon $\tau \in \{5, 10, \ldots, 60\}$ s (default $\Delta\tau = 5$ s).
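These metrics reduce to a few numpy reductions; a sketch with our own helper names and an assumed `(N, K, 3)` array layout:

```python
import numpy as np

def horizon_errors(pred, truth):
    """MAE and RMSE of the 3D Euclidean error per forecast horizon.
    pred, truth: (N, K, 3) arrays -- N windows, K horizons, (x, y, z)."""
    d = np.linalg.norm(pred - truth, axis=-1)      # (N, K) Euclidean errors
    return d.mean(axis=0), np.sqrt((d ** 2).mean(axis=0))

def best_of_s(samples, truth):
    """Diagnostic lower bound: per window/horizon, distance of the
    closest ensemble member. samples: (N, S, K, 3); truth: (N, K, 3)."""
    d = np.linalg.norm(samples - truth[:, None], axis=-1)   # (N, S, K)
    return d.min(axis=1)                                    # (N, K)
```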

Probabilistic Calibration Diagnostics

To assess whether the model’s forecast distribution matches empirical frequencies, we use a Probability Integral Transform (PIT) diagnostic. For each coordinate $d \in \{\tilde{x}, \tilde{y}, z\}$ and each forecast step $(n, \tau)$, we compute a sample-based PIT value using the empirical rank:
$$\mathrm{PIT}_{n,\tau,d} = \frac{1}{S+1}\left(\sum_{s=1}^S \mathbb{I}\!\left\{\hat{y}^{(s,n)}_{\tau,d} \le y^{\star(n)}_{\tau,d}\right\} + U_{n,\tau,d}\right), \qquad U_{n,\tau,d} \sim U[0,1],$$
which avoids degenerate values under ties in finite ensembles. For a calibrated univariate predictive distribution, $\mathrm{PIT}$ should be uniformly distributed on $[0,1]$. We therefore aggregate $\mathrm{PIT}_{n,\tau,d}$ over $n$ and $\tau$ and report axis-wise histograms; deviations from uniformity indicate under-/over-dispersion or bias.
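The randomized-rank PIT is a short computation; a numpy sketch (names and array layout are ours):

```python
import numpy as np

def pit_values(samples, truth, rng):
    """Randomized-rank PIT. samples: (N, S) ensemble draws per case,
    truth: (N,) observed values. Returns (N,) values in [0, 1]."""
    S = samples.shape[1]
    ranks = (samples <= truth[:, None]).sum(axis=1)   # empirical ranks
    u = rng.uniform(size=truth.shape)                 # tie-breaking jitter
    return (ranks + u) / (S + 1)
```

When the ensemble and the observation come from the same distribution, the resulting values are approximately uniform, which is the calibration criterion used above.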

Application to a real-world conflict

To illustrate the operational use of the proposed approach, we analyze a real encounter extracted from ADS–B data (Figure 3). In this event, one aircraft was maintaining level flight at FL310 while the other was descending through FL310 on a converging path. The recorded data show that the minimum horizontal and vertical spacings fell below the prescribed separation minima (<5 nautical miles horizontal and <1,000 feet vertical), resulting in an actual LOS.

A pair of aircraft trajectories ending in a loss of separation.

Based on the observed histories of both aircraft, we generated $S$ stochastic future trajectories for each one using the CFM model. Every possible combination of one sampled future from each aircraft was resampled from 0.2 Hz (5 s resolution) to 1 Hz by linear interpolation and then examined to determine whether standard separation minima were breached or a MAC occurred at any point within the prediction horizon.

Throughout the forecast, we monitored the horizontal and vertical spacing between the two aircraft. A situation was classified as a LOS when, at any moment, the horizontal and vertical distances between the aircraft simultaneously fell below the separation minima. To capture more critical encounters, we also defined a mid-air collision (MAC) proxy, corresponding to predicted cases where the aircraft approached closer than 0.03 nautical miles horizontally and 55 feet vertically.

By counting the proportion of trajectory pairs that met either of these criteria, we obtained straightforward Monte Carlo estimates of the probabilities of a future LOS or MAC within the prediction window.

Results

Ensemble Forecasts on Representative Cases

Figure 4 shows ensemble forecasts for three representative flights. Each panel displays the 60 s observed history (black), the 60 s ground-truth future (red), 128 sampled futures from the CFM model (blue), and the ensemble mean trajectory (yellow).

In the left panel, most samples follow a curved path while some continue straight. In the middle panel, all samples form a narrow bundle along the observed flight direction. In the right panel, the true continuation is straight, and several samples deviate slightly toward a right-hand branch. Across the three examples, the ensemble spread increases with prediction horizon, and the ensemble mean remains near the center of the sampled futures.

Ensemble (“spaghetti”) forecasts for three representative test flights. Black: observed history; red: ground-truth future; blue: 128 sampled futures; yellow: ensemble mean.

Flow Evolution and Vector Field Visualization

Figure 5 illustrates the temporal evolution of the learned conditional flow for one example case. The three panels correspond to integration times $t=0$ (noise), $t=0.5$ (intermediate state), and $t=1$ (final prediction). Each map shows predicted positions (blue), sample vectors (orange), and the grid vector field (purple), together with the observed history (black) and ground-truth future (red).

The orange sample vectors visualize the model’s denoising dynamics in flow time $t$: they are finite-difference displacements of the predicted future tokens between two consecutive ODE integration steps (i.e., $\Delta\hat{y}/\Delta t$ in the $(x, y)$ plane), and should not be interpreted as physical aircraft velocities in trajectory time $\tau$. The purple grid vector field is obtained by evaluating the learned vector field $v_\theta(\cdot, t \mid H, c)$ on a spatial grid for a single synthetic future token (position channels set from the grid point, remaining channels set to zero), yielding a qualitative 2D slice of the full $12 \times 7$ vector field.

At $t=0$, sample vectors are randomly oriented. At $t=0.5$, the flow begins to align spatially along the future path. At $t=1$, the trajectories form a coherent pattern that overlaps with the true continuation. The grid vector field exhibits smooth directional changes between neighboring locations.

Evolution of the learned conditional flow for a single test case. Each map shows predicted positions (blue), sample vectors (orange), and the grid vector field (purple) at three integration times ($t=0$, $t=0.5$, $t=1$). Black: observed history; red: ground-truth future.

Forecast Error vs. Horizon

Figure 6 presents mean absolute error (MAE) and root-mean-square error (RMSE) as a function of prediction horizon. Metrics are computed over 512 test windows (prediction problems) for the 3D Euclidean $(x, y, z)$, horizontal $(x, y)$, and vertical $(z)$ components. Each plot compares three estimators: ensemble mean, best-of-$S$ sample, and constant-velocity (CV) extrapolation.

For all spatial components, errors increase monotonically with horizon. 3D and horizontal errors show similar growth patterns, while vertical errors remain smaller in magnitude.

The CV baseline yields consistently larger errors for both MAE and RMSE. At the 60 s horizon, the CFM model achieves a 3D MAE of about 220 m, compared with about 320 m for the CV baseline; for RMSE, the CFM model reaches approximately 500 m, whereas the CV baseline remains just below 800 m.

The best-of-$S$ curve remains consistently below the other curves across all horizons, with both MAE and RMSE staying below 50 m throughout the prediction window.

Mean absolute error (left) and root-mean-square error (right) versus prediction horizon, computed over 512 test samples. Rows correspond to 3D Euclidean, horizontal, and vertical components. Blue: model mean; green: best-of-$S$; red: constant-velocity baseline.

Probabilistic Calibration

Figure 7 shows the PIT histograms for the aircraft-centric longitudinal and lateral coordinates (x̃, ỹ) and for altitude z. Both x̃ and ỹ exhibit a central peak around 0.5 with lighter tails, indicating over-dispersion in the horizontal plane; the effect is more pronounced for the lateral component ỹ. This behavior is expected at short horizons because lateral motion is driven more strongly by turning intent than longitudinal motion, making it harder to infer from recent history alone. The z component is closer to uniform, suggesting better calibration in the vertical dimension.
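The PIT value of an observation is simply the ensemble's empirical CDF evaluated at that observation. The following sketch uses synthetic one-dimensional data (an assumption, not the paper's evaluation code), with an inflated-spread ensemble to reproduce the central peak that signals over-dispersion:

```python
import numpy as np

# Synthetic setup: N test windows, S ensemble samples per window, one coordinate.
rng = np.random.default_rng(1)
N, S = 512, 256
obs = rng.normal(size=N)                         # "true" outcomes per window

ens_calibrated = rng.normal(size=(N, S))         # same spread as the truth
ens_wide = rng.normal(scale=2.0, size=(N, S))    # over-dispersed ensemble

# PIT = fraction of ensemble samples falling below the observed value.
pit_cal = (ens_calibrated < obs[:, None]).mean(axis=1)
pit_wide = (ens_wide < obs[:, None]).mean(axis=1)

# A calibrated ensemble yields PIT values ~ Uniform(0, 1); over-dispersion
# piles probability mass near 0.5, producing the central peak seen in Figure 7.
hist_cal, _ = np.histogram(pit_cal, bins=10, range=(0.0, 1.0))
hist_wide, _ = np.histogram(pit_wide, bins=10, range=(0.0, 1.0))
```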

Probability Integral Transform (PIT) histograms for the aircraft-centric x̃ (longitudinal), ỹ (lateral), and z components, aggregated across 512 test windows with 256 samples each.

Results on the Real-World Conflict

Using the ADS–B histories of the two aircraft, we generated S = 100 stochastic futures for each trajectory and evaluated all 100×100 combinations of predicted paths. Among all paired samples, n_LOS = 8,345 combinations (approximately 83%) resulted in a predicted LOS, while n_COL = 36 (0.36%) met the stricter MAC threshold. These results indicate that the model assigns a realistic, non-negligible probability to a future LOS, consistent with the outcome observed in the actual flight data. The corresponding 95% Clopper–Pearson confidence interval for the collision probability is approximately [0.25%, 0.50%].
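The Monte Carlo estimate and its Clopper–Pearson interval can be sketched as follows, assuming NumPy and SciPy are available. The synthetic trajectories, the thresholds LOS_M and MAC_M, and the purely 3D separation test are hypothetical stand-ins for the paper's actual separation criteria:

```python
import numpy as np
from scipy.stats import beta

# Placeholder ensembles for two aircraft: S sampled futures of T positions each.
rng = np.random.default_rng(2)
S, T = 100, 60
traj_a = np.cumsum(rng.normal(scale=30.0, size=(S, T, 3)), axis=1)
traj_b = np.cumsum(rng.normal(scale=30.0, size=(S, T, 3)), axis=1)
traj_b += np.array([9000.0, 0.0, 0.0])      # offset the second aircraft laterally

LOS_M, MAC_M = 9260.0, 150.0                # ~5 NM proxy and a tight collision radius

# All S x S pairings: minimum 3D separation over the prediction horizon.
diff = traj_a[:, None] - traj_b[None, :]                  # (S, S, T, 3)
min_sep = np.linalg.norm(diff, axis=-1).min(axis=-1)      # closest approach per pairing
n_los = int((min_sep < LOS_M).sum())
n_col = int((min_sep < MAC_M).sum())

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided binomial confidence interval for k successes in n trials."""
    lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lo, hi

# With the counts reported above (36 collisions in 100 x 100 pairings),
# this recovers an interval of roughly [0.25%, 0.50%].
p_lo, p_hi = clopper_pearson(36, S * S)
```

Because the 100×100 pairings reuse the same 100 samples per aircraft, they are not fully independent trials; the Clopper–Pearson interval should therefore be read as a lower bound on the true sampling uncertainty.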

Ensemble forecasts for the conflicting aircraft pair. Blue and red lines correspond to sampled futures for each trajectory, while the black lines correspond to the last minute of observation for each flight.

Discussion

Given one minute of observed flight history, the proposed CFM model can generate multiple plausible trajectories for the following minute. Each sample represents a distinct but realistic continuation of the aircraft’s motion, allowing the forecast to capture both the expected evolution and the uncertainty surrounding it. This ensemble property distinguishes the approach from deterministic predictors: instead of committing to a single extrapolated path, it provides a distribution of possible futures consistent with recent behavior.

The ensemble samples reflect context-dependent variability in aircraft motion: tight, low-spread ensembles emerge during stable, steady flight, while wider and occasionally multi-modal spreads appear in maneuvering phases such as turns or climbs. This adaptive spread indicates that the model has learned to represent uncertainty in a meaningful way. When motion is predictable, the ensemble converges; when intent is ambiguous, the model expresses multiple likely continuations.

The flow visualization indicates that the model learns a consistent vector field that continuously transforms random noise into structured trajectories.

Quantitatively, the CFM predictor consistently achieves lower MAE and RMSE than constant-velocity extrapolation, reducing 3D RMSE by roughly 40% at 60 s horizons. Vertical predictions are particularly accurate, reflecting the slower dynamics of en-route flight. The best-of-S results confirm that the true trajectory is typically contained within the ensemble, suggesting that the generated variability captures the range of realistic futures.

We compare against constant-velocity extrapolation because it matches the linear-motion assumptions commonly used in short-term safety nets and in practical risk modeling. More advanced physics-based or learning-based baselines would be valuable, but are left for future work.

The PIT analysis shows that the CFM forecasts are over-dispersed, especially in the horizontal (x, y) components. The central peak and light tails in their PIT histograms indicate that ensemble spreads are wider than the true variability in the test data. In contrast, the z component is better calibrated, showing a more uniform distribution. Incorporating additional contextual features, such as flight plans or intent information, could help reduce over-dispersion and improve calibration. In operational settings, horizontal over-dispersion may be conservative for safety, but may also inflate uncertainty volumes and increase nuisance alerts; this motivates further work on calibration. In the aircraft-centric PIT analysis, the over-dispersion is more pronounced laterally (ỹ) than longitudinally (x̃), consistent with the stronger influence of turning intent on short-horizon lateral motion.

The real-world encounter case study further highlights the operational relevance of the approach. When applied to a pair of aircraft that actually experienced a LOS, the model predicted an LOS in approximately 83% of all sampled trajectory pairs and a MAC in 0.36%. By representing uncertainty through ensembles rather than deterministic paths, the model enables direct estimation of the likelihood and severity of potential conflicts—offering a data-driven complement to existing risk-assessment methods. Nevertheless, accurate quantitative risk evaluation still depends on good probabilistic calibration.

Despite these promising results, several limitations remain. A small fraction of generated samples show unrealistic oscillations or curvature, indicating that the learned flow occasionally violates physical motion constraints. Incorporating kinematic regularization or lightweight flight-dynamics priors could mitigate this issue. The fixed 60-second prediction horizon, although operationally meaningful for safety-net applications, also constrains performance and should be adapted to the intended use case. Finally, the model currently relies solely on motion-derived ADS–B features; integrating contextual data such as flight plans, weather fields, or nearby traffic would likely improve both accuracy and calibration.

Conclusion and Outlook

This study demonstrates that Conditional Flow Matching (CFM) provides an effective generative formulation for short-term, uncertainty-aware aircraft trajectory prediction. Given one minute of observed motion, the model learns a continuous vector field that transforms stochastic perturbations into future trajectories over the following minute. The resulting ensembles capture both the expected continuation of flight and the uncertainty associated with short-term intent, producing forecasts that are more accurate and informative than conventional constant-velocity extrapolation.

Beyond predictive accuracy, the ensemble formulation of the CFM model offers a direct route from probabilistic forecasting to operational decision support. By representing future motion as a distribution rather than a single trajectory, it becomes possible to compute interpretable risk measures such as the probability of loss of separation or mid-air collision. In the presented case study, these probabilities aligned closely with the actual outcome, demonstrating that the model can provide early and quantitative evidence of potential conflicts. Such probabilistic indicators could complement existing safety nets by replacing binary thresholding with continuous risk levels, supporting more nuanced prioritisation and review of air traffic situations. To ensure operational reliability, however, ensemble calibration must be demonstrated.

In the longer term, models of this kind could be used to complement existing safety nets such as the STCA or the Airborne Collision Avoidance System (ACAS). By producing probabilistic forecasts that explicitly quantify future risk, they could provide an additional layer of context to existing deterministic alerting systems, helping distinguish genuine conflicts from expected manoeuvres. Equally, the same generative framework can be applied retrospectively for the forensic analysis of historical encounters, allowing quantitative reconstruction of uncertainty and intent in recorded loss-of-separation or near-miss events.

From a modeling perspective, several avenues of research emerge. First, the physical fidelity of generated trajectories can be improved by incorporating lightweight kinematic regularization terms or flight-dynamics priors to suppress oscillatory samples while preserving diversity. Second, extending the conditioning inputs beyond motion-derived ADS–B features to include flight plans, meteorological conditions, or surrounding traffic is likely to enhance intent inference and reduce lateral over-dispersion. Third, extending the prediction to longer horizons would broaden the usability of the model.

In summary, CFM provides a principled foundation for probabilistic trajectory forecasting in Air Traffic Management. It unifies deterministic accuracy, calibrated uncertainty, and interpretability within a single generative framework, offering tangible benefits for both safety analysis and decision support. While further work on calibration, dynamics regularization, and contextual conditioning remains, the presented results suggest that flow-based generative modeling represents a promising and operationally relevant step toward uncertainty-aware prediction and risk estimation in next-generation air traffic management systems.

Author contributions

  • First Author: Conceptualization, Data Curation, Formal Analysis, Investigation, Methodology, Software, Validation, Visualization, Writing (Original Draft), Writing (Review and Editing)

  • Second Author: Writing (Review and Editing)

  • Third Author: Visualization, Writing (Original Draft), Writing (Review and Editing)

Funding statement

This research was funded by the Swiss Federal Office of Civil Aviation, grant number 2022-046.

Open data statement

All the data used in this study can be downloaded from the OpenSky Network.

Reproducibility statement

The source code used for model training, evaluation, and figure generation is publicly available at github.com/figuetbe/generative-flight-predictions.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 conference of the north american chapter of the association for computational linguistics: Human language technologies, volume 1 (long and short papers), 4171–4186.
EUROCONTROL guidelines for short term conflict alert, Part I: Concept and requirements. 2017. European Organisation for the Safety of Air Navigation (EUROCONTROL), Brussels, Belgium.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., et al. 2020. Generative adversarial networks. Communications of the ACM 63, 11, 139–144.
Ho, J., Jain, A., and Abbeel, P. 2020. Denoising diffusion probabilistic models. Advances in neural information processing systems 33, 6840–6851.
Holderrieth, P. and Erives, E. 2025. Introduction to flow matching and diffusion models. https://diffusion.csail.mit.edu/.
Jarry, G., Couellan, N., and Delahaye, D. 2019. On the use of generative adversarial networks for aircraft trajectory generation and atypical approach detection. ENRI international workshop on ATM/CNS, Springer, 227–243.
Kingma, D.P. and Welling, M. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
Krauth, T., Krummen, J., and Figuet, B. 2025. Multi-objective CNN-LSTM for aircraft trajectory prediction with spatio-temporal confidence areas.
Krauth, T., Morio, J., Olive, X., Figuet, B., and Monstein, R. 2021. Synthetic aircraft trajectories generated with multivariate density models. Eng. proc., 7.
Lipman, Y., Chen, R.T.Q., Ben-Hamu, H., Nickel, M., and Le, M. 2023. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747.
Liu, X., Gong, C., and Liu, Q. 2022. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003.
Liu, Y. and Hansen, M. 2018. Predicting aircraft trajectories: A deep generative convolutional recurrent neural networks approach. arXiv preprint arXiv:1812.11670.
Munoz, C., Narkawicz, A., and Chamberlain, J. 2013. A TCAS-II resolution advisory detection algorithm. AIAA guidance, navigation, and control (GNC) conference, 4622.
Olive, X. 2019. Traffic, a toolbox for processing and analysing air traffic data. Journal of Open Source Software 4, 1518.
Schäfer, M., Strohmeier, M., Lenders, V., Martinovic, I., and Wilhelm, M. 2014. Bringing up OpenSky: A large-scale ADS-B sensor network for research. IPSN-14 proceedings of the 13th international symposium on information processing in sensor networks, IEEE, 83–94.
Tong, A., Fatras, K., Malkin, N., et al. 2023. Improving and generalizing flow-based generative models with minibatch optimal transport. arXiv preprint arXiv:2302.00482.
Zeng, W., Chu, X., Xu, Z., Liu, Y., and Quan, Z. 2022. Aircraft 4D trajectory prediction in civil aviation: A review. Aerospace 9, 2, 91.