AutoBots: Using Reinforcement Learning with Carla & DonkeySim

COGS 188 [Artificial Intelligence Algorithms] - Final Project

Group Members

Abstract

The goal of this project is to train a Reinforcement Learning (RL) agent for autonomous vehicle control. We use both CARLA and the DonkeyCar Simulator to navigate our vehicle: CARLA provides a complex urban driving environment, while the DonkeyCar simulator is used for simpler track-based navigation on its “Warren Field” circuit. We rely solely on a LiDAR sensor for data collection because of its robustness in capturing depth information and detecting obstacles regardless of lighting conditions, an advantage over typical computer-vision-based data collection. We implement two deep RL algorithms, Actor-Critic and Proximal Policy Optimization (PPO), both designed for continuous action spaces, since algorithms such as simple Q-learning are ineffective for problems with continuous action spaces. We use the gathered data to train agents to take optimal actions, such as steering, acceleration, and braking, based on the car's current position relative to the world. Performance is evaluated using key metrics such as cumulative reward, lap completion time, and distance traveled. By comparing these metrics across different models and training scenarios, we aim to determine which RL method provides the most robust and efficient control for autonomous driving in simulated environments.

Background

Autonomous driving has rapidly advanced due to improvements in computing power and machine learning, particularly reinforcement learning (RL), which enables autonomous agents to learn optimal control strategies through trial and error. Unlike traditional rule-based or supervised learning methods, RL-based approaches can dynamically adapt to new environments and uncertain conditions, making them well-suited for self-driving applications [1]. However, a significant challenge in RL-driven autonomous navigation is the simulation-to-reality gap, where models trained in virtual environments struggle to perform reliably in real-world settings due to differences in sensor noise, road textures, and unexpected obstacles [2]. Research on domain adaptation and transfer learning continues to address this issue by fine-tuning RL policies with real-world data [3].

Simulation environments play a crucial role in training RL-based autonomous driving models. DonkeyCar Simulator (DonkeySim) provides a lightweight platform for testing self-driving models on controlled tracks and is widely used due to its accessibility and ease of experimentation [4]. The simulator provides key sensory inputs such as camera feeds, speed readings, and steering angles, enabling the development of reinforcement learning pipelines without requiring physical hardware. For more complex urban driving scenarios, CARLA offers a high-fidelity simulation environment with dynamic traffic, weather conditions, and diverse road layouts [5]. CARLA allows for more extensive testing of RL models in realistic settings, making it a crucial tool for benchmarking autonomous navigation systems.

Recent advancements in policy optimization techniques have improved training stability and efficiency in reinforcement learning for autonomous driving. Proximal Policy Optimization (PPO) has emerged as a preferred method for continuous control tasks due to its ability to balance exploration and exploitation while preventing overly large policy updates [6]. Additionally, Actor-Critic methods provide an effective framework for reinforcement learning by combining value-based and policy-based learning, resulting in more stable and informed decision-making in autonomous navigation tasks [7].
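Concretely, the constraint on overly large policy updates comes from PPO's clipped surrogate objective (the standard formulation from the PPO paper, with clipping parameter ε typically around 0.1–0.2):

```latex
L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

Here the probability ratio r_t(θ) compares the new and old policies and Â_t is the advantage estimate; clipping the ratio keeps each update close to the previous policy, which is the stability property relied on throughout this project.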

Self-driving technology has the potential to improve road safety, efficiency, and accessibility, but achieving reliable autonomy remains a challenge. Reinforcement learning-based methods, combined with improvements in simulation environments, optimized reward functions, and safer real-world deployment strategies, can contribute to the refinement of autonomous navigation systems. Continued research on policy optimization, explainable AI, and real-world generalization will be essential for making self-driving technology both practical and ethically responsible.

Why is this Important?

Autonomous driving stands to improve road safety, increase mobility, and reduce congestion. However, it also introduces unique challenges in perception, planning, and control. Studying reinforcement learning in this domain is crucial for advancing algorithms that can handle high-dimensional state spaces and continuous action controls, ultimately bringing us closer to reliable self-driving cars.

Problem Statement

Autonomous navigation for industrial and factory environments requires precise and efficient vehicle control to ensure safe and timely transportation of goods. Traditional rule-based and vision-based approaches struggle with real-time adaptability and robustness in dynamic settings where numerous unexpected obstacles may arise due to minor mishaps. Through our project, we aim to develop a deep reinforcement learning (RL) model that enables autonomous vehicles to navigate factory environments using only LiDAR data as input. By leveraging reinforcement learning techniques, particularly Proximal Policy Optimization (PPO) and Actor-Critic methods, we aim to train a model capable of handling continuous action spaces while minimizing computational complexity. Our goal is to create an efficient, collision-free, and fast driving policy that enhances safety, accuracy, and cost-effectiveness in automated logistics and manufacturing operations.

Data

For this reinforcement learning project, we generated training data through interactions with the CARLA simulator, a high-fidelity environment designed for autonomous vehicle research. Unlike traditional datasets, our approach relies on real-time sensory inputs from the simulator, with LiDAR data serving as the cornerstone of our state representation.

LiDAR Data Collection for CARLA Implementation
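As a hedged illustration of this step (not the project's exact pipeline), the following sketch attaches a ray-cast LiDAR to a vehicle in CARLA 0.9.15 and collapses each sweep into 16 per-sector minimum distances, matching the 16 LiDAR features described in the Proposed Solution. The sensor attributes, vehicle blueprint, and helper names are assumptions.

```python
import numpy as np
import carla  # CARLA 0.9.15 Python API

NUM_SECTORS = 16   # matches the 16 LiDAR features in our state vector
MAX_RANGE = 50.0   # metres; assumed sensor range

def lidar_to_features(points, num_sectors=NUM_SECTORS, max_range=MAX_RANGE):
    """Collapse an (N, 4) point cloud (x, y, z, intensity) into per-sector
    minimum distances, normalized to [0, 1]."""
    angles = np.arctan2(points[:, 1], points[:, 0])   # angle of each point, [-pi, pi]
    dists = np.linalg.norm(points[:, :2], axis=1)     # planar distance to each point
    sectors = ((angles + np.pi) / (2 * np.pi) * num_sectors).astype(int) % num_sectors
    features = np.full(num_sectors, max_range, dtype=np.float32)
    np.minimum.at(features, sectors, dists)           # keep the closest return per sector
    return features / max_range

# Connect to a running CARLA server and spawn a vehicle.
client = carla.Client("localhost", 2000)
client.set_timeout(10.0)
world = client.get_world()
bp_lib = world.get_blueprint_library()

vehicle_bp = bp_lib.filter("vehicle.tesla.model3")[0]
vehicle = world.spawn_actor(vehicle_bp, world.get_map().get_spawn_points()[0])

# Attach a ray-cast LiDAR above the vehicle; attribute values are illustrative.
lidar_bp = bp_lib.find("sensor.lidar.ray_cast")
lidar_bp.set_attribute("range", str(MAX_RANGE))
lidar_bp.set_attribute("rotation_frequency", "20")
lidar_bp.set_attribute("channels", "32")
lidar = world.spawn_actor(
    lidar_bp, carla.Transform(carla.Location(z=2.0)), attach_to=vehicle
)

def on_lidar(measurement):
    # raw_data is a flat float32 buffer of (x, y, z, intensity) points.
    points = np.frombuffer(measurement.raw_data, dtype=np.float32).reshape(-1, 4)
    state_lidar = lidar_to_features(points)  # 16-dimensional LiDAR slice of the state
    # ...the 2 waypoint features are appended elsewhere in the environment wrapper.

lidar.listen(on_lidar)
```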

DonkeySim Implementation

Proposed Solution

  1. State Representation
    • The vehicle relies solely on LiDAR data, processed into an 18-dimensional state vector (16 LiDAR features + 2 waypoint features), as detailed in the Data section.
    • This design ensures a low computational footprint and real-time decision-making capability.
  2. Action Space
    • The action space is continuous, with two dimensions: throttle/brake ([-1, 1], where positive values are throttle and negative are brake) and steering ([-1, 1]).
    • This allows the agent to dynamically adjust speed and direction, learning the interplay between velocity and turning for smooth navigation.
  3. Reinforcement Learning Approach
    • For CARLA, we implemented Proximal Policy Optimization (PPO) using PyTorch Lightning and trained the agent in the CARLA simulator.
    • For DonkeySim, we used the PPO implementation from stable-baselines3.
    • PPO balances exploration and exploitation, making it ideal for our continuous control task.
  4. Neural Network Architecture

    The PPO agent relies on two distinct neural networks: the actor, which determines the policy (action selection), and the critic, which estimates the value function. These networks are designed as multi-layer perceptrons (MLPs) with the following structures (a code sketch of both networks follows this list):

    • Actor Network:
      • Structure: A four-layer MLP:
        • Input Layer: 18 neurons, corresponding to the state dimension (e.g., sensor data, velocity, etc.).
        • Hidden Layers: Three layers with 256, 256, and 128 neurons, respectively, each followed by Tanh activation functions.
        • Output Layer: 4 neurons, representing the mean and log standard deviation (log_std) for two continuous actions: throttle and steering (2 neurons per action).
      • Output Processing:
        • Throttle Mean: Passed through a sigmoid function to produce values in the range [0,1], biasing the agent toward forward movement.
        • Steering Mean: Passed through a tanh function to produce values in the range [-1,1], enabling smooth left and right turns.
        • Standard Deviation: The log_std outputs are exponentiated, clamped, and constrained to a minimum value to ensure sufficient exploration during training.
      • Initialization: Weights are initialized using Xavier uniform initialization, and biases are set to zero.
      • Purpose: The Tanh activations help stabilize policy updates, while the split output design accommodates the continuous action space required for driving control.
    • Critic Network:
      • Structure: A four-layer MLP:
        • Input Layer: 18 neurons, matching the state dimension.
        • Hidden Layers: Three layers with 256, 256, and 128 neurons, respectively, each followed by ReLU activation functions.
        • Output Layer: 1 neuron, providing the value estimate for the given state.
      • Initialization: Weights are initialized using Xavier uniform initialization, and biases are set to zero.
      • Purpose: The ReLU activations support effective value approximation, enabling the critic to provide stable and accurate estimates of the state's expected return.
  5. Reward Function Design

    The reward function evolved iteratively to guide the agent toward safe, efficient, and route-following behavior (a sketch of the final version follows this list):

    • Initial Reward Function:
      • Collision Avoidance: A penalty of -50 was applied for collisions to prioritize safety.
      • Speed Maintenance: Reward was proportional to distance traveled per step, encouraging forward movement.
      • This basic design promoted movement while avoiding obstacles but lacked route guidance.
    • Intermediate Reward Function:
      • Lane Discipline: Added a -1 penalty for lane invasions to keep the vehicle within track boundaries.
      • Speed Regulation: Introduced a target speed of 30 km/h, with a penalty of -0.1 * |speed - target| for deviations, and an additional -1 penalty for speeds below 5 km/h to prevent stalling.
      • Steering Smoothness: Penalized large steering actions (-0.5 * |steering|) when speed was below 5 km/h to reduce erratic behavior at low speeds.
      • This improved track adherence and consistency but didn't ensure progress along a specific path.
    • Final Reward Function:
      • Waypoint Following: Added a reward based on proximity to the next waypoint (max(0,5-distance/10)), encouraging route adherence.
      • Heading Alignment: Included a bonus (max(0,1-|angle_diff|/180)) for aligning the vehicle's heading with the waypoint direction, promoting smoother turns.
      • Progress Reward: Retained distance traveled as a base reward, augmented by waypoint incentives.
      • Safety Penalties: Kept collision (-50) and lane invasion (-1) penalties.
      • Stuck Detection: Penalized (-2) if the vehicle's position varied by less than 1 meter over 20 steps, preventing circular or stagnant behavior.
      • Implemented in CarlaEnvWrapper.step, this final version balances safety, efficiency, and navigation.
  6. Deployment and Applications

    Setup deployment details

    The complete setup details can be found in the code repo's README. Essentially, the following items must be set up:

    1. CARLA 0.9.15 must be set up on a GPU-based machine, along with the Python API for the same version, for full compatibility.
    2. Python must be set up with the required packages.
    3. The car is set up in the default map with the default settings. It is a realistic simulation, but with no NPCs, to reduce complexity given the project's scale.

    Potential Future applications

    • The trained PPO model can optimize logistics in factory settings, enabling autonomous vehicles to transport goods safely and efficiently along predefined routes.
    • This approach reduces costs and enhances precision in industrial automation.
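To make the architecture in item 4 concrete, the following PyTorch sketch reconstructs the actor and critic MLPs from the description above (18-dimensional input; 256/256/128 hidden units; Tanh vs. ReLU activations; Xavier initialization). It is a minimal reconstruction, not the project's exact code, and the clamp range on log_std is an assumption.

```python
import torch
import torch.nn as nn

STATE_DIM = 18                         # 16 LiDAR features + 2 waypoint features
LOG_STD_MIN, LOG_STD_MAX = -2.0, 0.5   # assumed clamp range to keep exploration bounded

def init_weights(module):
    """Xavier-uniform weights, zero biases, as described above."""
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        nn.init.zeros_(module.bias)

class Actor(nn.Module):
    def __init__(self, state_dim=STATE_DIM):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, 256), nn.Tanh(),
            nn.Linear(256, 256), nn.Tanh(),
            nn.Linear(256, 128), nn.Tanh(),
            nn.Linear(128, 4),  # [throttle_mean, steer_mean, throttle_log_std, steer_log_std]
        )
        self.apply(init_weights)

    def forward(self, state):
        out = self.body(state)
        throttle_mean = torch.sigmoid(out[..., 0:1])             # [0, 1]: bias toward forward motion
        steer_mean = torch.tanh(out[..., 1:2])                   # [-1, 1]: smooth left/right turns
        log_std = out[..., 2:4].clamp(LOG_STD_MIN, LOG_STD_MAX)  # bounded exploration noise
        mean = torch.cat([throttle_mean, steer_mean], dim=-1)
        return mean, log_std.exp()

class Critic(nn.Module):
    def __init__(self, state_dim=STATE_DIM):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 1),  # scalar state-value estimate
        )
        self.apply(init_weights)

    def forward(self, state):
        return self.body(state)

# Sampling an action from the Gaussian policy:
actor, critic = Actor(), Critic()
state = torch.zeros(1, STATE_DIM)
mean, std = actor(state)
action = torch.distributions.Normal(mean, std).sample()
value = critic(state)
```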
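Similarly, the final reward function from item 5 can be summarized by the sketch below. The actual logic lives in CarlaEnvWrapper.step; the variable names, reward arguments, and the stuck-detection bookkeeping here are assumptions reconstructed from the bullet points above.

```python
def compute_reward(distance_step, dist_to_waypoint, angle_diff_deg,
                   collided, lane_invasion, recent_positions):
    """Final reward: progress + waypoint shaping + safety penalties.

    distance_step     -- metres traveled since the last step
    dist_to_waypoint  -- metres to the next route waypoint
    angle_diff_deg    -- |heading - waypoint direction| in degrees
    recent_positions  -- last 20 (x, y) positions, for stuck detection
    """
    reward = distance_step                                 # progress term
    reward += max(0.0, 5.0 - dist_to_waypoint / 10.0)      # waypoint proximity
    reward += max(0.0, 1.0 - abs(angle_diff_deg) / 180.0)  # heading alignment

    if collided:
        reward -= 50.0   # collision penalty
    if lane_invasion:
        reward -= 1.0    # lane-invasion penalty

    # Stuck detection: barely moved over the last 20 steps.
    if len(recent_positions) >= 20:
        (x0, y0), (x1, y1) = recent_positions[0], recent_positions[-1]
        if ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5 < 1.0:
            reward -= 2.0

    return reward
```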

Evaluation Metrics

  1. Cumulative Reward
    • Definition: The total reward accumulated over an episode, calculated based on the reward function defined in our project.
    • Significance: This metric reflects the overall performance of the agent. Higher cumulative rewards indicate better navigation, fewer collisions, and more effective adherence to the intended route. It serves as a primary indicator of policy improvement during training.
  2. Collision Rate
    • Definition: The frequency of collisions with obstacles or boundaries during an episode.
    • Significance: A lower collision rate is desirable, as it demonstrates the agent's ability to navigate safely and avoid hazards. This metric is critical for evaluating the safety performance of the driving policy.
  3. Distance Traveled
    • Definition: The total distance covered by the vehicle over the course of an episode.
    • Significance: When paired with a low collision rate, a higher distance traveled suggests efficient and effective navigation. This metric highlights the agent's progress and ability to follow the desired path (a sketch of how these metrics can be logged per episode follows this list).
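As referenced above, the sketch below shows one way these metrics can be accumulated per evaluation episode. The Gymnasium-style interface and the info-dictionary keys are assumptions rather than the project's actual logging code.

```python
def evaluate_episode(env, policy, max_steps=2000):
    """Roll out one episode and return the three evaluation metrics
    (assumes a Gymnasium-style environment interface)."""
    state, _ = env.reset()
    cumulative_reward = 0.0
    distance_traveled = 0.0
    collisions = 0
    steps = 0

    for steps in range(1, max_steps + 1):
        action = policy(state)
        state, reward, terminated, truncated, info = env.step(action)
        cumulative_reward += reward
        distance_traveled += info.get("distance_step", 0.0)  # hypothetical info key
        collisions += int(info.get("collision", False))      # hypothetical info key
        if terminated or truncated:
            break

    return {
        "cumulative_reward": cumulative_reward,
        "distance_traveled": distance_traveled,
        "collision_rate": collisions / steps,  # collisions per step this episode
    }
```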

Results

Demonstration videos of the trained agents are available here: https://drive.google.com/drive/folders/1K7cDpq456Woh-fQq7A6EAID3Nz-YboTQ?usp=sharing

The primary objective of this study is to demonstrate that a well-designed reward function is essential for effective reinforcement learning (RL)-based autonomous navigation. Additional considerations include the utility of LiDAR data for state representation and the appropriateness of PPO for continuous control tasks. We evaluate the PPO agent's performance using two distinct reward models: a simpler reward-based model and an older, more complicated reward-based model. The analysis centers on key performance metrics, including episode rewards, distance traveled, episode lengths, and reward components, derived from the training progress at epoch 50 for the older model and the rewards vs. steps relationship for the simpler model.

Subsection 1: Performance of the Simpler Reward-Based Model

[Figure: Training progress at epoch 50]

The performance of the PPO agent with the simpler reward-based model is illustrated in a scatter plot titled "Episode Reward vs Steps (Colored by Epoch)," which tracks episode rewards against the number of steps across epochs 35 to 72.

Subsection 2: Performance of the Older, More Complicated Reward-Based Model (Training Progress at Epoch 50)

[Figure: Training progress at epoch 50]

The older, more complicated reward-based model's performance is assessed using four charts depicting training progress over 100 epochs. Here, we focus specifically on the agent's behavior at epoch 50.

Subsection 3: Comparative Analysis and Key Insights

Subsection 4: Impact of Reward Function on Agent Behavior

Subsection 5: Moving to a Simpler Domain - DonkeySim

Why DonkeySim?: While we were satisfied with our results from our CARLA simulations, we also wanted to test out another autonomous vehicle simulator: DonkeySim. Our goal was to investigate whether we would end up with similar results to those above if we used a similar model structure in the simpler DonkeySim domain instead of the highly complex and realistic CARLA domain.

Procedure & Setup: To get a good understanding of the relationship between map complexity and model performance, we decided to test a single model - stable-baselines3's PPO with a CNN policy - across three different DonkeySim maps: Waveshore, Warren Track, and Mountain Track. Waveshore is the smallest and simplest map, a simple loop with minimal obstacles or changes in elevation. We classified Mountain Track as a "Medium" map, since it has few obstacles; however, as we would find out later, this map was more complicated than hypothesized because its changes in elevation add a hurdle for the car to overcome. Lastly, the UCSD-based Warren Track was the most difficult map due to its abundance of obstacles, its twisting and turning layout, and the small width of the track itself. A sketch of this training setup appears below.
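As mentioned above, the training setup looks roughly like the following sketch; the environment ID, launch configuration, and hyperparameters are assumptions (the actual configuration is in the code repo), and the exact gym-donkeycar options may differ across versions.

```python
import gym
import gym_donkeycar  # noqa: F401 -- importing registers the DonkeySim environments
from stable_baselines3 import PPO

# One of the three tracks compared above; env ID and launch options are assumptions.
conf = {"exe_path": "/path/to/donkey_sim", "port": 9091}
env = gym.make("donkey-warren-track-v0", conf=conf)

# CnnPolicy learns directly from the simulator's camera frames.
model = PPO("CnnPolicy", env, verbose=1)
model.learn(total_timesteps=200_000)
model.save("ppo_donkeysim_warren")

# Quick evaluation rollout with the trained policy (older Gym API: 4-tuple step return).
obs = env.reset()
done = False
while not done:
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
env.close()
```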

Results:

Discussion

The results from the previous section highlight the importance of environment complexity to PPO's overall performance. As we saw, PPO was able to perform well in the simplest map, Waveshore, but struggled in the more complex maps, Mountain Track and Warren Track. This suggests that PPO may be better suited for simpler environments, or that more training is needed for PPO to perform well in complex environments.

The results from our CARLA experiments also highlight the importance of the reward function in PPO's overall performance. As we saw, PPO performed better with the simpler reward function than with the more complex one. This suggests that a simpler reward function may be better suited to PPO, or that more tuning is needed for a complex reward function to perform well with PPO.

Limitations

During our experiments, we encountered several limitations that affected our methodology and outcomes.

Future Work

To build upon the findings of this project, several avenues for future work can be explored:

Ethics & Privacy

The development and deployment of autonomous vehicles raise several ethical and privacy considerations that must be addressed:

Conclusion

This project explored the application of reinforcement learning for autonomous vehicle navigation in DonkeySim and CARLA. By implementing and evaluating Proximal Policy Optimization (PPO) and Actor-Critic methods, we demonstrated their potential in training self-driving agents within a simulated environment. Our results highlighted the importance of a well-designed reward function for effective learning but also revealed challenges in balancing safety, lane adherence, and progress incentives. Additionally, model performance remained highly sensitive to training conditions, hyperparameter choices, and the diversity of training environments.

While RL shows promise in autonomous driving, significant challenges remain in bridging the sim-to-real gap, ensuring safety, and addressing ethical concerns such as bias and transparency. Future work should focus on optimizing learning efficiency, deploying models on real-world robotic platforms, and integrating privacy safeguards to align with regulatory standards. As reinforcement learning continues to evolve, advancements in reward shaping, model-based RL, real-world validation, and ethical AI frameworks will be critical in bringing AI-driven autonomous systems closer to practical deployment.

Footnotes

  1. [1] Kiran et al. (2021). "Deep Reinforcement Learning for Autonomous Driving: A Survey." A comprehensive review of RL applications in self-driving cars. Link
  2. [2] Peng et al. (2018). "Sim-to-Real Transfer of Robotic Control with Dynamics Randomization." A study on addressing the simulation-to-reality gap in RL applications. Link
  3. [3] Taylor & Stone (2009). "Transfer Learning for Reinforcement Learning Domains: A Survey." A discussion on transfer learning techniques to improve RL generalization. Link
  4. [4] Donkey Simulator. Official Documentation. Link
  5. [5] CARLA Simulator. Official Documentation. Link
  6. [6] PPO. Stable-Baselines3 Documentation. A widely used reinforcement learning algorithm for continuous control. Link
  7. [7] SAC. Stable-Baselines3 Documentation. A reinforcement learning algorithm based on the Actor-Critic framework with entropy regularization. Link