This paper investigates the use of reinforcement learning for a global path planning task in terminal airspace, specifically for training a policy that can generate paths from any given position in the airspace to the runway. To do this, the Soft Actor-Critic (SAC) algorithm is trained on a simplified version of the Dutch airspace and compared to the solutions generated by the Dijkstra algorithm for varying discretization resolutions. SAC, which uses a Gaussian distribution for the action policy, has previously been shown to be successful in other global planning tasks in continuous environments. However, evaluating the policy by following the mean of the learned distribution, which is the standard evaluation method, may yield suboptimal performance when dealing with complex cost functions that deviate from a normal distribution. To address this, the paper proposes and evaluates a sampling-based strategy, which generates an ensemble of paths by sampling from the learned policy distribution. These three methods (mean-based SAC, Dijkstra, and sampling-based SAC) are then tested on a bi-criterion cost function that includes both fuel and noise emissions in varying ratios. It was found that Dijkstra outperforms mean-based SAC for all cost ratios at the best discretization resolution, regardless of the neural network architectures used. However, sampling-based SAC results in consistently lower costs than both Dijkstra and mean-based SAC, particularly for the more complex cost functions that place a higher emphasis on noise mitigation. These findings highlight limitations of mean-based evaluation for distribution models and indicate potential performance benefits that can be obtained with better-tailored evaluation strategies.
For the future of air traffic management (ATM) and air traffic control, the European ATM Master Plan, developed by the SESAR Joint Undertaking, envisions a future characterized by advanced automation, with certain aspects of air traffic control potentially operating entirely autonomously without human intervention [SESAR Joint Undertaking 2016]. An advantage of these higher levels of automation is the potential broad implementation of free route airspace, where aircraft can fly individually optimized paths, tailored to their own operational requirements, such as origin and destination, or aircraft performance and airline preferences. Compared with the current airspace structure, which relies on predetermined routes such as the Standard Terminal Arrival Routes (STARs) and fixed waypoints, free routing can lead to a reduction in emissions and operational costs through utilization of the entire continuous airspace, rather than being constrained by discretized waypoints [Aneeka and Zhong 2016]. Achieving these higher levels of automation, as outlined in the European ATM Master Plan, could rely on the potential of machine learning (ML) and artificial intelligence (AI) for various ATM tasks. This in turn has led to an increase in studies that investigate applications of ML/AI in ATM [Zeng et al. 2022; Wang et al. 2022].
Reinforcement Learning (RL), a branch of ML that focuses on problems requiring sequential decision making, has been shown to be successful in various ATM-related challenges [Razzaghi et al. 2024]. For example, RL algorithms have been applied to the tasks of conflict detection and resolution, or maintaining safe separation between aircraft, demonstrating high levels of performance compared to analytical methods, especially at higher traffic densities [Groot et al. 2024]. For the conflict resolution task, most RL applications investigated have focused solely on the safety aspect of the task, as the complexity of multi-criterion cost functions can lead to policies that inadvertently prioritize efficiency over safety, which is unacceptable in ATM [Wang et al. 2022]. Because of this, RL for path planning in aviation with a focus on cost factors unrelated to safety has not been investigated extensively. However, to obtain the most benefits from RL, it is also necessary to study the application of RL to single-aircraft path planning over longer path horizons, with more complex cost functions. The findings in both areas can then eventually be combined, for example, through the use of hierarchical frameworks such as hierarchical RL [Pateria et al. 2021], ensuring both safe and cost-effective paths.
Additionally, RL for single-agent path planning has demonstrated success in various domains [Zhang et al. 2024], including that of minimally invasive surgery, surpassing traditional path planning methods for continuous environments, such as Rapidly-exploring Random Trees Star (RRT*) [Karaman and Frazzoli 2011]. Although these applications may appear unrelated, the task of path planning in a known terminal airspace shares various similarities with path planning on a brain map. In both scenarios, the starting point is variable, either the entry point in the airspace or the initial incision location in the skull, while the destination is fixed, such as an airport or a tumor. This requires the planning of multiple paths within a known environment, essentially memorizing the entire problem space. Furthermore, both problems involve regions with varying associated costs, leading to a trade-off between path efficiency and avoidance of sensitive areas, without direct obstacle avoidance. These similarities justify further investigation of RL as a method for path planning in the terminal airspace.
The set of RL algorithms that showed the best performance on the minimally invasive surgery tasks are distribution models, specifically Soft Actor-Critic (SAC) [Haarnoja et al. 2018] and Proximal Policy Optimization (PPO) [Schulman et al. 2017], which are RL algorithms that learn a stochastic policy by modeling actions as Gaussian distributions. During evaluation, the mean of the learned distribution is then used to determine the actions used for planning the path. Because the underlying cost distributions are complex and may not be accurately captured by a normal distribution, it is hypothesized that this mean-based evaluation may not always yield the optimal path.
This paper therefore compares the use of the mean of the learned action distribution with a sampling strategy that draws from the learned distribution to generate an ensemble of paths. These sampled paths are then compared with the paths generated by the mean of the learned distribution to identify whether potential performance improvements can be obtained. To do this, first, the impact of the neural network architecture used for the SAC RL algorithm on a path-planning task within the Dutch national airspace is evaluated. This is done to ensure that the base model used for evaluation sufficiently captures the environment. These models are then compared with benchmark paths generated with Dijkstra for various discretization resolutions, in order to verify that the initial paths generated by mean-based SAC resemble the optimal paths in discrete space. Finally, the ensemble of paths generated through sampling from the learned distribution is evaluated and compared with the results from both mean-based SAC and Dijkstra.
The remainder of this paper is structured as follows. Section 2 will define the path planning problem used for this study, which is a multi-criterion path planning problem that focuses on a combination of noise and fuel mitigation. Section 3 will describe the methodology, which consists of the associated cost functions and simulator used to determine the fitness of the generated paths, as well as the specifics and implementations of the SAC and Dijkstra algorithms. Section 4 outlines the conducted experiments and their results. Section 5 will discuss the observed results and finally Section 6 will conclude the paper and highlight potential future works.
To compare the performance of the two path planning methods, a case study is conducted for the Dutch national airspace. This section will describe the environment and formalize it into a path planning problem by providing the problem input, possible actions, constraints and finally the cost function used for determining the fitness of the different paths.
The environment used for this study is a simplified representation of the Dutch airspace. To simplify the environment, a circular border, with a radius of 277 km (150 NM), centered at Schiphol (lat, lon), is used instead of the actual border of the Dutch airspace. This airspace is also shown in Figure 3. In this figure, the red lines in the center of the environment represent the approach constraints, further explained in Section 2.2.
Within this airspace, the goal of the path planning algorithm is to generate paths from the border of the environment to the beginning of runway 27, located at the end of the red approach lines in the center of Figure 3, while optimizing the bi-objective cost function of fuel and noise emissions. This requires a trade-off between path efficiency and avoiding highly populous areas. Figure 2 shows the population data used for calculating the noise emissions, which is obtained from Eurostat [European Commission 2018].
The beginning of runway 27 was chosen as the terminal state of the environment as it requires aircraft to approach over land, increasing the relevance of the noise emission cost.
The path planning problem is solved as a 2D problem, as the vertical component is isolated and follows the optimized vertical profile for continuous descent operations.
The constraints used in the environment are illustrated by the red lines shown in Figure 3. These constraints are the circular border with a radius of 277 km centered at Schiphol (lat, lon), which ensures that agents do not exit the airspace, and two lines of 37 km (20 NM) originating at the end of runway 27, which ensure that aircraft approach the runway from an appropriate angle. A constraint is considered violated when the shortest path between two consecutive waypoints intersects these lines. In this study, waypoints are defined by any (lat, lon) coordinate and serve as the next, intermediate, target location.
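The violation check above reduces to a 2D segment-intersection test between a flown segment and each constraint line. A minimal sketch using the standard orientation (cross-product) test; the planar approximation and all function names are illustrative, not the paper's implementation:

```python
def _orient(p, q, r):
    # Sign of the cross product (q - p) x (r - p): +1 left turn, -1 right turn.
    v = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (v > 0) - (v < 0)

def segments_intersect(a, b, c, d):
    # True if segment a-b properly crosses segment c-d.
    # Collinear touching cases are ignored in this sketch.
    return (_orient(a, b, c) != _orient(a, b, d)
            and _orient(c, d, a) != _orient(c, d, b))

def violates_constraints(wp_from, wp_to, constraint_lines):
    # A constraint is violated when the straight segment between two
    # consecutive waypoints intersects any constraint line.
    return any(segments_intersect(wp_from, wp_to, p, q)
               for p, q in constraint_lines)
```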
For the cost function, a bi-objective problem of minimizing fuel cost and noise emissions is formulated. The general per-segment cost function is given in eq. [eq:costeq], where C_noise and C_fuel are the normalized noise and fuel costs associated with the segment flown between the waypoints wp_i and wp_i+1, and w is the weight attributed to the individual cost elements. The cost components are normalized to ensure that, for any random path, the contribution of the individual elements to the total cost is equal, given that w = 0.5. Note that this is not necessarily true for optimized paths, as it is easier to mitigate noise costs than fuel costs by avoiding highly populous areas.
The total cost of any path can then be calculated by summing all segments describing the path.
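The weighted per-segment cost and its summation over a path can be sketched as follows. The weighting convention (w on the noise term, 1 − w on the fuel term) is an assumption consistent with the description above:

```python
def segment_cost(noise_norm, fuel_norm, w):
    # Bi-objective segment cost: weight w on the normalized noise cost,
    # (1 - w) on the normalized fuel cost (assumed convention).
    return w * noise_norm + (1.0 - w) * fuel_norm

def path_cost(segments, w=0.5):
    # Total path cost: sum of all per-segment costs, where each segment
    # is a (normalized_noise, normalized_fuel) pair.
    return sum(segment_cost(n, f, w) for n, f in segments)
```

With w = 0.5 both cost components contribute equally, matching the normalization argument above.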
To determine the noise cost, C_noise, the noise emitted by the aircraft, N_AC, is calculated by combining the noise-power-distance database of Eurocontrol [Eurocontrol Experimental Center 2020] with the thrust of the aircraft as determined by the Open Aircraft Performance (and Emission) Model, OpenAP [Sun et al. 2020]. This emitted noise is then convoluted over the population density map from Eurostat (Figure 2) using eq. [eq:setnoise], which follows the inverse square law for noise dissipation, at a sampling rate of 1 Hz to obtain the noise pollution during one second, N_t. Here p_j and d_j refer to the population in, and distance to, cell 'j' in the population data respectively. This is also illustrated in Figure 4.
These noise costs, calculated at 1-second intervals, are then summed to obtain the total noise cost of a single segment.
Finally, the noise cost for a single segment is normalized by dividing it by the expected noise cost for flying over a random segment of the same length, as illustrated in eq. [eq:noisenorm], resulting in the normalized noise cost C_noise. This expected noise is calculated by assuming that the population is divided homogeneously over the environment; it therefore represents the average cost for any segment of that length, independent of the values of the waypoints wp_i and wp_i+1.
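The noise chain above (inverse-square convolution over the population grid at 1 Hz, per-segment summation, and normalization by the expected cost of a random segment of equal length) can be sketched as follows. The function names and the linear scaling of the expected noise with segment length are illustrative assumptions:

```python
def noise_per_second(n_ac, cells):
    # One 1 Hz sample: each population cell j with population p_j at
    # distance d_j contributes p_j * N_AC / d_j**2 (inverse square law).
    # cells: iterable of (p_j, d_j) pairs.
    return sum(p * n_ac / (d * d) for p, d in cells)

def segment_noise(per_second_samples):
    # Total (raw) noise cost of one segment: sum of its 1 Hz samples.
    return sum(per_second_samples)

def normalized_segment_noise(raw_noise, expected_noise_per_km, length_km):
    # Normalize by the expected noise of a random segment of equal length
    # under a homogeneous population (assumed to scale linearly with length).
    return raw_noise / (expected_noise_per_km * length_km)
```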
It is acknowledged that this method for calculating the noise cost does not consider human hearing and annoyance thresholds, nor does it use more sophisticated noise models. However, for the purpose of this research, the implementation is considered sufficient and easier to reproduce. The methods used are, moreover, independent of the chosen noise metric and can also be applied with different noise calculations.
For the fuel cost, C_fuel, OpenAP is used to obtain an estimate of the fuel flow based on the aircraft model, engine type, altitude and airspeed used. To get the fuel cost per flown segment, this fuel flow is summed over time, as shown in eq. [eq:fuelcons], where the simulation time-step defines the summation interval, starting at the time the aircraft departs the current waypoint and ending at the time it arrives at the next waypoint.
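The per-segment fuel cost is thus a discrete time integral of the fuel flow. A minimal sketch; in the actual pipeline the fuel-flow values would come from OpenAP, so the function signature here is only illustrative:

```python
def segment_fuel(fuel_flows, dt):
    # Discrete integration of fuel flow (kg/s) over one segment with
    # time-step dt (s), from departure at the current waypoint to
    # arrival at the next.
    return sum(ff * dt for ff in fuel_flows)
```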
For conducting the experiments and evaluating the paths, the BlueSky open-source Air Traffic Simulator is used [Hoekstra and Ellerbroek 2016]. This simulator allows the usage of OpenAP as the performance model of the aircraft specified, making it easy to evaluate the cost of the flown paths and ensuring that the dynamics of the aircraft are not ignored in the final paths. The code and plugins used for this research can be found in the 4TU repository [Groot 2023]. The most recent version of BlueSky can be found on GitHub [TUDelft-CNS-ATM 2023].
SAC is an RL algorithm developed by Haarnoja et al. [Haarnoja et al. 2018]. The main premise of the method is to reduce the brittleness of performance with respect to hyperparameters and explicitly incorporate exploration by including a maximum entropy term into the reward function. The SAC algorithm uses three neural networks, two of which are only used during the training phase of the method. These networks are called the critic, value and policy networks.
To incorporate the entropy component of the method, the policy network learns an action distribution, outputting both a mean and a standard deviation for a given input state. During the training phase, actions are sampled from this distribution, which aids in the exploration of unseen states when compared with deterministic RL methods. During the evaluation phase, normally the means are used directly instead, as it is assumed that for a converged model, these are the optimal actions. However, sampling can still be used to generate an ensemble of solutions, which is useful when new paths should be created due to detected conflicts in a multi-agent path planning setting. The impact of sampling is evaluated by analysing the performance differences over ensembles of paths in Section 4.7.
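The difference between mean-based and sampling-based evaluation can be sketched as follows. Here `policy`, `transition` and `cost` are illustrative stand-ins for the learned policy network (returning a Gaussian mean and standard deviation), the waypoint update, and the path cost evaluation; the best-of-ensemble selection is the sampling strategy evaluated later in the paper:

```python
import random

def rollout(policy, transition, state, horizon, sample=False):
    # Generate one path. With sample=False the distribution mean is
    # followed (standard evaluation); with sample=True actions are drawn
    # from the learned Gaussian.
    path = [state]
    for _ in range(horizon):
        mu, sigma = policy(path[-1])
        action = random.gauss(mu, sigma) if sample else mu
        path.append(transition(path[-1], action))
    return path

def best_of_ensemble(policy, transition, cost, state, horizon, n_samples):
    # Sampling-based evaluation: draw n_samples paths from the learned
    # distribution and keep the cheapest one under the cost function.
    paths = [rollout(policy, transition, state, horizon, sample=True)
             for _ in range(n_samples)]
    return min(paths, key=cost)
```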
For the input representation, the coordinates are translated to a Cartesian coordinate system with the origin corresponding to the end of runway 27 (lat, lon), such that the terminal state of the agent approaches (0, 0) in Cartesian coordinates. This is similar to the state representation often used in grid-world examples [Barto 2021]. To ensure that the input values are between -1 and +1, which aids stability during the initial phase of training, the input states are normalized by dividing them by the maximum distance from the environment centre. This state representation is shown graphically in Fig. 5, as defined by the axes values. The state representation used in this study is deliberately kept simple because this design choice allows the next state to be determined directly from the current state and the selected action, enabling recursive path generation without requiring intermediate simulation steps. While more complex representations may potentially offer improved performance, this simple state representation is considered sufficient for the purpose of this study.
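The normalization described above can be sketched as follows; using the 277 km airspace radius as the maximum distance from the environment centre is an assumption, and the function name is illustrative:

```python
R_MAX_KM = 277.0  # assumed maximum distance from the environment centre

def normalize_state(x_km, y_km, r_max=R_MAX_KM):
    # Runway-centred Cartesian coordinates scaled to roughly [-1, 1];
    # the terminal state (runway 27 threshold) maps to (0, 0).
    return (x_km / r_max, y_km / r_max)
```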
The action space of the model is defined as , as is shown in Fig. 5. An upper limit is given to this action space to limit the size of the action space, as it was found that a larger action space results in slower convergence of the model during training. Additionally, because any path can be defined by an infinite set of waypoints, optimizing this value was not considered part of the scope of this research, as it is hypothesized that it will not influence the quality of the paths. For future research, it can however still be worthwhile to investigate the validity of this hypothesis and the influence on convergence rate.
For the cost representation, or reward function, the same costs as described in Section 2.3 are used; they are, however, multiplied by -1 to change the problem from a minimization to a maximization problem. Additionally, the constraints mentioned in Section 2.2 are modelled as a cost of '-1', to ensure that any optimal solution, if it exists, does not violate the constraints, similar to the big M method used in operations research. Finally, a successful approach is rewarded with '+5' to stimulate early termination of the episodes during initial training. This alteration does not mathematically alter the optimal paths, as this reward is given to every valid path generated by the model. Both a successful approach and constraint violations result in terminal conditions for the path planning method. The entire reward function is given in eq. [eq:costfadapt].
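A sketch of this reward shaping; whether the constraint penalty and approach bonus are added to or replace the negated segment cost is an assumption in this sketch, as are the function and constant names:

```python
APPROACH_BONUS = 5.0       # reward for a successful approach (terminal)
CONSTRAINT_PENALTY = -1.0  # big-M-style cost for a constraint violation (terminal)

def reward(seg_cost, violated, reached_runway):
    # Negating the segment cost turns the minimization problem into a
    # maximization problem. Returns (reward, terminal).
    r = -seg_cost
    if violated:
        return r + CONSTRAINT_PENALTY, True
    if reached_runway:
        return r + APPROACH_BONUS, True
    return r, False
```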
To train the models, randomly generated scenarios are used and simulated in the BlueSky simulator. In these scenarios, the initial states are sampled uniformly at random from the entire state space. This is in contrast with the evaluation scenarios, where the initial states are located on the border of the environment. This strategy is employed to improve the exploration of the method over the entire state space. The model then interacts with the environment for a total of 15 actions or until a terminal condition is reached, after which a new initial state is generated. This process is repeated for a total of 200,000 actions, after which the current model is saved and used for evaluation. The algorithm used for training is shown in algorithm [alg:training].
An additional advantage of training in the simulator is that it ensures that the paths generated by the method can be flown given the dynamics of the aircraft model used. This means that for higher-fidelity simulators no performance constraints have to be specified for the problem, as is commonly done in aircraft-related path planning problems [Granberg et al. 2016; Chevalier et al. 2019].
```
while steps < total_steps:
    if new_episode:
        create aircraft at random initial state
        new_episode ← False
    obtain state s_t and action a_t from the policy
    obtain next state s_t+1, reward r_t and terminal condition from the simulator
    add (s_t, a_t, r_t, s_t+1) to the replay buffer
    update model
    steps += 1
    if terminal condition:
        delete_aircraft()
        new_episode ← True
```
The hyperparameters used for the SAC algorithm are given in Table 1. These hyperparameters are the same as those used in the original paper by Haarnoja et al., with four exceptions: it was found that a higher learning rate (3e-3 instead of 3e-4) and a higher discount factor (0.995 instead of 0.99) lead to faster convergence during training. Additionally, four different network architectures are tested, to analyse the impact of the number of neurons (256 & 1024) and layers (2 & 3) on the performance of the method (Section 4.4). This is done in analogy to the discretization resolution used for discrete methods such as Dijkstra.
| Parameter | Value |
|---|---|
| Optimizer | Adam |
| Learning rate | 3e-3 |
| Discount factor () | 0.995 |
| Memory buffer size | 10e6 |
| Sample size | 256 |
| Smoothing coefficient () | 5e-3 |
| Network update frequency | 1 |
| Number of layers | [2,3] |
| Neurons per layer | [256, 1024] |
| Nonlinearity | ReLU |
In this research, the Dijkstra algorithm will be used for generating optimal single-agent paths in discrete space. These paths function as a comparison for the paths generated by the SAC algorithm. For the algorithm, the Dijkstra implementation of the open-source Python package NetworkX is used. As Dijkstra requires the environment to be discrete, the following sections will first describe the discretization method applied to the problem. This is followed by an explanation of the post-processing used for enhancing the performance of the solutions in continuous space.
To use the Dijkstra algorithm for path planning, the environment is required to be discretized. To discretize the environment, a square grid is used, where for each cell the total population is summed and assumed to be distributed homogeneously throughout the cell. From any cell in this square grid (excluding border cells) the Dijkstra algorithm can select all neighbouring cells (including diagonally connected cells) as well as cells reachable via an "L"-shaped move analogous to a knight’s move in chess. These transitions are then used to define the edges of the graph to which the Dijkstra algorithm is applied.
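The neighbourhood structure (the 8 king moves plus the 8 knight-like moves) and the resulting shortest-path search can be sketched as follows. The paper uses the NetworkX Dijkstra implementation; this standard-library version with `heapq` only illustrates the graph construction, and the edge weight (the destination cell's cost scaled by the Euclidean move length) is an illustrative assumption, not the paper's exact edge-cost definition:

```python
import heapq

# Neighbourhood: 8 king moves (incl. diagonals) plus 8 knight moves.
MOVES = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]
MOVES += [(dx, dy) for dx in (-2, -1, 1, 2) for dy in (-2, -1, 1, 2)
          if abs(dx) != abs(dy)]

def dijkstra_grid(cell_cost, n, start, goal):
    # cell_cost[y][x]: cost of entering cell (x, y).
    # Edge weight: destination cell cost * Euclidean move length (assumed).
    dist = {start: 0.0}
    prev = {}
    pq = [(0.0, start)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == goal:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        x, y = u
        for dx, dy in MOVES:
            v = (x + dx, y + dy)
            if 0 <= v[0] < n and 0 <= v[1] < n:
                nd = d + cell_cost[v[1]][v[0]] * (dx * dx + dy * dy) ** 0.5
                if nd < dist.get(v, float("inf")):
                    dist[v], prev[v] = nd, u
                    heapq.heappush(pq, (nd, v))
    path, node = [goal], goal
    while node != start:
        node = prev[node]
        path.append(node)
    return path[::-1], dist[goal]
```

Note how the knight moves allow headings that a purely king-move grid cannot represent, which is exactly why they reduce the discretization penalty discussed above.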
The resolution used for the discretization of the problem is nontrivial. Lower resolutions might not sufficiently capture the entire solution space, leading to sub-optimal solutions, whereas higher resolutions might not capture the noise dissipation in continuous space and have more frequent heading changes, leading to deficiencies when evaluated in the continuous environment simulated in BlueSky. Therefore all discretization resolutions shown in Table 2 are used for initial evaluation. Section 4.5 will go over the experimental setup and results for these different resolutions.
| Cell size (km) | Grid-world size |
|---|---|
| 13 | 40 x 40 |
| 10 | 52 x 52 |
| 8 | 65 x 65 |
| 5 | 104 x 104 |
| 4 | 130 x 130 |
| 2 | 260 x 260 |
| 1 | 520 x 520 |
As a result of the discretization of the problem, the shortest path above zero-population areas (such as large bodies of water) is limited by the action resolution available to the Dijkstra method. Therefore a check is made to see which nodes cross zero-population areas, and if that is the case, these nodes are removed from the final output. This results in direct routing above the sea and other zero-population areas. The difference between the solutions with and without this post-processing is shown in Figure 8. A similar smoothing method cannot be used for areas with a non-zero population count, as it is impossible to guarantee that this will improve the solution, due to the bi-criterion nature of the cost function also relying on noise abatement.
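This post-processing can be sketched as a filter over the path nodes; `population_of` is an illustrative stand-in for a lookup in the discretized population grid:

```python
def smooth_zero_population(path, population_of):
    # Drop intermediate nodes lying in zero-population cells: with no noise
    # contribution there, the straight line between the remaining nodes is
    # at least as good, giving direct routing over water and similar areas.
    return [node for i, node in enumerate(path)
            if i == 0 or i == len(path) - 1 or population_of(node) > 0]
```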
To determine the efficacy of the different methods, four different experiments are conducted within the BlueSky Air Traffic Simulator. All experiments are evaluated using the same experimental scenario and control variables, described in Section 4.1 and Section 4.2 respectively. The dependent variables used to determine the quality of the paths are explained in Section 4.3.
The experiments conducted are:
The impact of neural network architecture for SAC. (Section 4.4)
The impact of discretization resolution for Dijkstra. (Section 4.5)
Performance differences between SAC and Dijkstra for various cost weight ratios. (Section 4.6)
Sampling-based path planning with the learned action distribution. (Section 4.7)
For the experimental scenario used for evaluating the paths, 36 initial states are generated on the border of the circular environment described in Section 2.1, with an equal spacing of 10 degrees to ensure complete coverage of the entire heading range. The paths generated from these initial states to the terminal state (beginning of runway 27) are then simulated in BlueSky and evaluated against the total noise and fuel cost specified in Section 2.3.
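Generating the 36 evenly spaced initial states on the circular border can be sketched as follows; the runway-centred Cartesian frame and the compass-bearing convention are assumptions of this sketch:

```python
import math

def border_initial_states(radius_km=277.0, spacing_deg=10):
    # Evenly spaced starting points on the circular border, covering the
    # full heading range (36 states for a 10-degree spacing). Bearings are
    # measured clockwise from north (assumed convention).
    states = []
    for bearing in range(0, 360, spacing_deg):
        theta = math.radians(bearing)
        states.append((radius_km * math.sin(theta),
                       radius_km * math.cos(theta)))
    return states
```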
In all experiments, the dynamics, noise and fuel calculations are based on an A320 with an assumed constant mass of 66,000 kg. The groundspeed of the aircraft is held constant at 463 km/h (250 kts). Finally, for the noise and fuel calculation, a fixed altitude of 4.57 km (15,000 ft) is used, as for the vertical component it is assumed that aircraft follow their optimal vertical profile for continuous descent operations after the horizontal path is planned. This does influence the optimality of the solutions closer to the airport, where the altitude is lower, as in practice the noise cost there should be higher than near the border of the environment. However, closer to the airport the freedom for path changes is lower due to the required approach angle of the runway; it is therefore assumed that this influence is limited for the purpose of this study, as all compared methods are subject to the same conditions. Future studies should, however, investigate and evaluate its impact.
The primary performance indicators used for evaluating the quality of the paths are the different cost components. These values are presented as normalized, unit-less values, which allows comparison on the same scale. The total cost is the main performance indicator, as it determines the fitness of the paths; however, the isolated noise and fuel costs can still provide additional insight into the generated paths, as it is possible to compensate for noise emissions with a shorter path, leading to different potential policies for the SAC-based method.
Although not the primary focus of this study, the computation time required for generating a single path is also considered an indicator of the feasibility of the methods. Especially for the sampling-based method, or multi-agent scenarios with re-planning, which both require multiple paths to be generated, computational requirements are an important factor. For SAC the computation time is based on a fully trained model, as the environment is considered static, and once learned can be used for any initial state in the environment. Hence training time is not included in this metric.
The last indicator used for evaluating the methods is the number of turns. The number of turns is an indicator of path complexity and is observed as an additional objective measure for comparing the different paths. The main purpose of this indicator is to compare the paths generated by the different network architectures for SAC in Section 4.4.
The number of turns is defined as the number of heading changes required to fly the path, regardless of the size of the heading change.
For the SAC algorithm, it is hypothesized that the path quality is related to the network architecture used for the model. This is because more complex network architectures allow for more fine-tuned decisions, similar to a higher resolution for the Dijkstra method. To test this hypothesis, the SAC algorithm is trained with four different network architectures underlying the model structure. For this experiment, w is set at 0.5, ensuring equal weight is given to both the fuel and noise cost. The network architectures evaluated in this experiment are given in Table 3 and are all trained according to the training procedure outlined in Section 3.2.4.
| number of layers | neurons per layer |
|---|---|
| 2 | 256 |
| 2 | 1024 |
| 3 | 256 |
| 3 | 1024 |
The path cost during training is given in Figure 9 and shows the evolution of performance for the four different models over the first 5000 episodes. It can be observed that all network architectures show a steep increase in performance between episodes 1000 and 2000; however, no clear distinction between the different network architectures can be observed during the early stages of training.
To better distinguish between the performance of the models after training, Table 4 shows the performance of the different network architectures after 200,000 actions. From this table, it can be observed that the network architecture of 3 layers with 256 neurons yields the highest performance. Interestingly, the paths generated by the 3-layer, 1024-neuron network have, on average, a higher noise cost, which contradicts the hypothesis that more complex network architectures can generate more complex paths, which are required for higher noise abatement. The results in the table also show an increase in computation time for network architectures of higher complexity, which is expected given the larger number of computations required by more complex networks.
| Architecture | Total Cost | Noise Cost | Fuel Cost | Computation time (s) | # Turns |
|---|---|---|---|---|---|
| 2 layers, 256 neurons | 42.1 | 32.0 | 52.1 | 11.0e-4 | 5.4 |
| 2 layers, 1024 neurons | 40.8 | 29.4 | 52.2 | 12.2e-4 | 5.6 |
| 3 layers, 256 neurons | 40.3 | 28.3 | 52.1 | 11.5e-4 | 5.3 |
| 3 layers, 1024 neurons | 40.6 | 29.7 | 51.5 | 14.5e-4 | 5.4 |
Finally, Figure 14 shows the paths generated by the different networks after training. In this figure, the main areas of difference are encircled in red, highlighting slight variations in the policies of the different methods. Figure 10 explains the higher noise cost associated with the 2-layer, 256-neuron model, as its paths are relatively direct and straight. This shows that for paths of higher quality, a slightly more complex network architecture, such as the 3-layer, 256-neuron model, is the preferred option, although the exact configuration likely depends on the underlying problem and environment.