AutoBots: Using Reinforcement Learning with Carla & DonkeySim

COGS 188 [Artificial Intelligence Algorithms] - Final Project

Group Members

Abstract

The goal of this project is to train a Reinforcement Learning (RL) agent for autonomous vehicle control. We use both CARLA and the DonkeyCar Simulator to navigate our vehicle: CARLA provides a complex urban driving environment, while the DonkeyCar simulator is used for simpler track-based navigation on its “Warren Field” circuit. We rely solely on a LiDAR sensor for data collection because of its robustness in capturing depth information and detecting obstacles regardless of lighting conditions, an advantage over typical computer-vision-based data collection. We implement two deep RL algorithms, Actor-Critic and Proximal Policy Optimization (PPO), both designed for continuous action spaces, since algorithms such as simple Q-learning are ineffective for problems with continuous action spaces. We use the gathered data to train agents to take optimal actions, such as steering, acceleration, and braking, based on the car's current position relative to the world. Performance is evaluated using key metrics such as cumulative reward, lap completion time, and distance traveled. By comparing these metrics across different models and training scenarios, we aim to determine which RL method provides the most robust and efficient control for autonomous driving in simulated environments.

Background

Autonomous driving has rapidly advanced due to improvements in computing power and machine learning, particularly reinforcement learning (RL), which enables autonomous agents to learn optimal control strategies through trial and error. Unlike traditional rule-based or supervised learning methods, RL-based approaches can dynamically adapt to new environments and uncertain conditions, making them well-suited for self-driving applications [1]. However, a significant challenge in RL-driven autonomous navigation is the simulation-to-reality gap, where models trained in virtual environments struggle to perform reliably in real-world settings due to differences in sensor noise, road textures, and unexpected obstacles [2]. Research on domain adaptation and transfer learning continues to address this issue by fine-tuning RL policies with real-world data [3].

Simulation environments play a crucial role in training RL-based autonomous driving models. DonkeyCar Simulator (DonkeySim) provides a lightweight platform for testing self-driving models on controlled tracks and is widely used due to its accessibility and ease of experimentation [4]. The simulator provides key sensory inputs such as camera feeds, speed readings, and steering angles, enabling the development of reinforcement learning pipelines without requiring physical hardware. For more complex urban driving scenarios, CARLA offers a high-fidelity simulation environment with dynamic traffic, weather conditions, and diverse road layouts [5]. CARLA allows for more extensive testing of RL models in realistic settings, making it a crucial tool for benchmarking autonomous navigation systems.

Recent advancements in policy optimization techniques have improved training stability and efficiency in reinforcement learning for autonomous driving. Proximal Policy Optimization (PPO) has emerged as a preferred method for continuous control tasks due to its ability to balance exploration and exploitation while preventing overly large policy updates [6]. Additionally, Actor-Critic methods provide an effective framework for reinforcement learning by combining value-based and policy-based learning, resulting in more stable and informed decision-making in autonomous navigation tasks [7].
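Concretely, the constraint on overly large policy updates comes from PPO's clipped surrogate objective (the standard formulation from the PPO paper, with clipping parameter ε typically around 0.1–0.2):

```latex
L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

Here the probability ratio r_t(θ) compares the new and old policies and Â_t is the advantage estimate; clipping the ratio keeps each update close to the previous policy, which is the stability property relied on throughout this project.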

Self-driving technology has the potential to improve road safety, efficiency, and accessibility, but achieving reliable autonomy remains a challenge. Reinforcement learning-based methods, combined with improvements in simulation environments, optimized reward functions, and safer real-world deployment strategies, can contribute to the refinement of autonomous navigation systems. Continued research on policy optimization, explainable AI, and real-world generalization will be essential for making self-driving technology both practical and ethically responsible.

Why is this Important?

Autonomous driving stands to improve road safety, increase mobility, and reduce congestion. However, it also introduces unique challenges in perception, planning, and control. Studying reinforcement learning in this domain is crucial for advancing algorithms that can handle high-dimensional state spaces and continuous action controls, ultimately bringing us closer to reliable self-driving cars.

Problem Statement

Autonomous navigation for industrial and factory environments requires precise and efficient vehicle control to ensure safe and timely transportation of goods. Traditional rule-based and vision-based approaches struggle with real-time adaptability and robustness in dynamic settings where numerous unexpected obstacles may arise due to minor mishaps. Through our project, we aim to develop a deep reinforcement learning (RL) model that enables autonomous vehicles to navigate factory environments using only LiDAR data as input. By leveraging reinforcement learning techniques, particularly Proximal Policy Optimization (PPO) and Actor-Critic methods, we aim to train a model capable of handling continuous action spaces while minimizing computational complexity. Our goal is to create an efficient, collision-free, and fast driving policy that enhances safety, accuracy, and cost-effectiveness in automated logistics and manufacturing operations.

Data

For this reinforcement learning project, we generated training data through interactions with the CARLA simulator, a high-fidelity environment designed for autonomous vehicle research. Unlike traditional datasets, our approach relies on real-time sensory inputs from the simulator, with LiDAR data serving as the cornerstone of our state representation.

LiDAR Data Collection for CARLA Implementation
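As a hedged illustration of this step (not the project's exact pipeline), the following sketch attaches a ray-cast LiDAR to a vehicle in CARLA 0.9.15 and collapses each sweep into 16 per-sector minimum distances, matching the 16 LiDAR features described in the Proposed Solution. The sensor attributes, vehicle blueprint, and helper names are assumptions.

```python
import numpy as np
import carla  # CARLA 0.9.15 Python API

NUM_SECTORS = 16   # matches the 16 LiDAR features in our state vector
MAX_RANGE = 50.0   # metres; assumed sensor range

def lidar_to_features(points, num_sectors=NUM_SECTORS, max_range=MAX_RANGE):
    """Collapse an (N, 4) point cloud (x, y, z, intensity) into per-sector
    minimum distances, normalized to [0, 1]."""
    angles = np.arctan2(points[:, 1], points[:, 0])   # angle of each point, [-pi, pi]
    dists = np.linalg.norm(points[:, :2], axis=1)     # planar distance to each point
    sectors = ((angles + np.pi) / (2 * np.pi) * num_sectors).astype(int) % num_sectors
    features = np.full(num_sectors, max_range, dtype=np.float32)
    np.minimum.at(features, sectors, dists)           # keep the closest return per sector
    return features / max_range

# Connect to a running CARLA server and spawn a vehicle.
client = carla.Client("localhost", 2000)
client.set_timeout(10.0)
world = client.get_world()
bp_lib = world.get_blueprint_library()

vehicle_bp = bp_lib.filter("vehicle.tesla.model3")[0]
vehicle = world.spawn_actor(vehicle_bp, world.get_map().get_spawn_points()[0])

# Attach a ray-cast LiDAR above the vehicle; attribute values are illustrative.
lidar_bp = bp_lib.find("sensor.lidar.ray_cast")
lidar_bp.set_attribute("range", str(MAX_RANGE))
lidar_bp.set_attribute("rotation_frequency", "20")
lidar_bp.set_attribute("channels", "32")
lidar = world.spawn_actor(
    lidar_bp, carla.Transform(carla.Location(z=2.0)), attach_to=vehicle
)

def on_lidar(measurement):
    # raw_data is a flat float32 buffer of (x, y, z, intensity) points.
    points = np.frombuffer(measurement.raw_data, dtype=np.float32).reshape(-1, 4)
    state_lidar = lidar_to_features(points)  # 16-dimensional LiDAR slice of the state
    # ...the 2 waypoint features are appended elsewhere in the environment wrapper.

lidar.listen(on_lidar)
```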

DonkeySim Implementation

Proposed Solution

  1. State Representation
    • The vehicle relies solely on LiDAR data, processed into an 18-dimensional state vector (16 LiDAR features + 2 waypoint features), as detailed in the Data section.
    • This design ensures a low computational footprint and real-time decision-making capability.
  2. Action Space
    • The action space is continuous, with two dimensions: throttle/brake ([-1, 1], where positive values are throttle and negative are brake) and steering ([-1, 1]).
    • This allows the agent to dynamically adjust speed and direction, learning the interplay between velocity and turning for smooth navigation.
  3. Reinforcement Learning Approach
    • For CARLA, we implemented Proximal Policy Optimization (PPO) using PyTorch Lightning and trained the agent in the CARLA simulator.
    • For DonkeySim, we used the PPO implementation from stable-baselines3.
    • PPO balances exploration and exploitation, making it ideal for our continuous control task.
  4. Neural Network Architecture

    The PPO agent relies on two distinct neural networks: the actor, which determines the policy (action selection), and the critic, which estimates the value function. These networks are designed as multi-layer perceptrons (MLPs) with the following structures (a code sketch of both networks follows this list):

    • Actor Network:
      • Structure: A four-layer MLP:
        • Input Layer: 18 neurons, corresponding to the state dimension (e.g., sensor data, velocity, etc.).
        • Hidden Layers: Three layers with 256, 256, and 128 neurons, respectively, each followed by Tanh activation functions.
        • Output Layer: 4 neurons, representing the mean and log standard deviation (log_std) for two continuous actions: throttle and steering (2 neurons per action).
      • Output Processing:
        • Throttle Mean: Passed through a sigmoid function to produce values in the range [0,1], biasing the agent toward forward movement.
        • Steering Mean: Passed through a tanh function to produce values in the range [-1,1], enabling smooth left and right turns.
        • Standard Deviation: The log_std outputs are exponentiated, clamped, and constrained to a minimum value to ensure sufficient exploration during training.
      • Initialization: Weights are initialized using Xavier uniform initialization, and biases are set to zero.
      • Purpose: The Tanh activations help stabilize policy updates, while the split output design accommodates the continuous action space required for driving control.
    • Critic Network:
      • Structure: A four-layer MLP:
        • Input Layer: 18 neurons, matching the state dimension.
        • Hidden Layers: Three layers with 256, 256, and 128 neurons, respectively, each followed by ReLU activation functions.
        • Output Layer: 1 neuron, providing the value estimate for the given state.
      • Initialization: Weights are initialized using Xavier uniform initialization, and biases are set to zero.
      • Purpose: The ReLU activations support effective value approximation, enabling the critic to provide stable and accurate estimates of the state's expected return.
  5. Reward Function Design

    The reward function evolved iteratively to guide the agent toward safe, efficient, and route-following behavior (a sketch of the final version follows this list):

    • Initial Reward Function:
      • Collision Avoidance: A penalty of -50 was applied for collisions to prioritize safety.
      • Speed Maintenance: Reward was proportional to distance traveled per step, encouraging forward movement.
      • This basic design promoted movement while avoiding obstacles but lacked route guidance.
    • Intermediate Reward Function:
      • Lane Discipline: Added a -1 penalty for lane invasions to keep the vehicle within track boundaries.
      • Speed Regulation: Introduced a target speed of 30 km/h, with a penalty of -0.1 * |speed - target| for deviations, and an additional -1 penalty for speeds below 5 km/h to prevent stalling.
      • Steering Smoothness: Penalized large steering actions (-0.5 * |steering|) when speed was below 5 km/h to reduce erratic behavior at low speeds.
      • This improved track adherence and consistency but didn't ensure progress along a specific path.
    • Final Reward Function:
      • Waypoint Following: Added a reward based on proximity to the next waypoint (max(0,5-distance/10)), encouraging route adherence.
      • Heading Alignment: Included a bonus (max(0,1-|angle_diff|/180)) for aligning the vehicle's heading with the waypoint direction, promoting smoother turns.
      • Progress Reward: Retained distance traveled as a base reward, augmented by waypoint incentives.
      • Safety Penalties: Kept collision (-50) and lane invasion (-1) penalties.
      • Stuck Detection: Penalized (-2) if the vehicle's position varied by less than 1 meter over 20 steps, preventing circular or stagnant behavior.
      • Implemented in CarlaEnvWrapper.step, this final version balances safety, efficiency, and navigation.
  6. Deployment and Applications

    Setup deployment details

    The complete setup details can be found in the code repo's README. Essentially, the following items must be set up:

    1. CARLA 0.9.15 must be set up on a GPU-based machine, along with the Python API for the same version, for full compatibility.
    2. Python must be set up with the required packages.
    3. The car is set up in the default map with the default settings. It is a realistic simulation, but with no NPCs, to reduce complexity given the project's scale.

    Potential Future applications

    • The trained PPO model can optimize logistics in factory settings, enabling autonomous vehicles to transport goods safely and efficiently along predefined routes.
    • This approach reduces costs and enhances precision in industrial automation.
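To make the architecture in item 4 concrete, the following PyTorch sketch reconstructs the actor and critic MLPs from the description above (18-dimensional input; 256/256/128 hidden units; Tanh vs. ReLU activations; Xavier initialization). It is a minimal reconstruction, not the project's exact code, and the clamp range on log_std is an assumption.

```python
import torch
import torch.nn as nn

STATE_DIM = 18                         # 16 LiDAR features + 2 waypoint features
LOG_STD_MIN, LOG_STD_MAX = -2.0, 0.5   # assumed clamp range to keep exploration bounded

def init_weights(module):
    """Xavier-uniform weights, zero biases, as described above."""
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        nn.init.zeros_(module.bias)

class Actor(nn.Module):
    def __init__(self, state_dim=STATE_DIM):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, 256), nn.Tanh(),
            nn.Linear(256, 256), nn.Tanh(),
            nn.Linear(256, 128), nn.Tanh(),
            nn.Linear(128, 4),  # [throttle_mean, steer_mean, throttle_log_std, steer_log_std]
        )
        self.apply(init_weights)

    def forward(self, state):
        out = self.body(state)
        throttle_mean = torch.sigmoid(out[..., 0:1])             # [0, 1]: bias toward forward motion
        steer_mean = torch.tanh(out[..., 1:2])                   # [-1, 1]: smooth left/right turns
        log_std = out[..., 2:4].clamp(LOG_STD_MIN, LOG_STD_MAX)  # bounded exploration noise
        mean = torch.cat([throttle_mean, steer_mean], dim=-1)
        return mean, log_std.exp()

class Critic(nn.Module):
    def __init__(self, state_dim=STATE_DIM):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 1),  # scalar state-value estimate
        )
        self.apply(init_weights)

    def forward(self, state):
        return self.body(state)

# Sampling an action from the Gaussian policy:
actor, critic = Actor(), Critic()
state = torch.zeros(1, STATE_DIM)
mean, std = actor(state)
action = torch.distributions.Normal(mean, std).sample()
value = critic(state)
```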
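Similarly, the final reward function from item 5 can be summarized by the sketch below. The actual logic lives in CarlaEnvWrapper.step; the variable names, reward arguments, and the stuck-detection bookkeeping here are assumptions reconstructed from the bullet points above.

```python
def compute_reward(distance_step, dist_to_waypoint, angle_diff_deg,
                   collided, lane_invasion, recent_positions):
    """Final reward: progress + waypoint shaping + safety penalties.

    distance_step     -- metres traveled since the last step
    dist_to_waypoint  -- metres to the next route waypoint
    angle_diff_deg    -- |heading - waypoint direction| in degrees
    recent_positions  -- last 20 (x, y) positions, for stuck detection
    """
    reward = distance_step                                 # progress term
    reward += max(0.0, 5.0 - dist_to_waypoint / 10.0)      # waypoint proximity
    reward += max(0.0, 1.0 - abs(angle_diff_deg) / 180.0)  # heading alignment

    if collided:
        reward -= 50.0   # collision penalty
    if lane_invasion:
        reward -= 1.0    # lane-invasion penalty

    # Stuck detection: barely moved over the last 20 steps.
    if len(recent_positions) >= 20:
        (x0, y0), (x1, y1) = recent_positions[0], recent_positions[-1]
        if ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5 < 1.0:
            reward -= 2.0

    return reward
```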

Evaluation Metrics

  1. Cumulative Reward
    • Definition: The total reward accumulated over an episode, calculated based on the reward function defined in our project.
    • Significance: This metric reflects the overall performance of the agent. Higher cumulative rewards indicate better navigation, fewer collisions, and more effective adherence to the intended route. It serves as a primary indicator of policy improvement during training.
  2. Collision Rate
    • Definition: The frequency of collisions with obstacles or boundaries during an episode.
    • Significance: A lower collision rate is desirable, as it demonstrates the agent's ability to navigate safely and avoid hazards. This metric is critical for evaluating the safety performance of the driving policy.
  3. Distance Traveled
    • Definition: The total distance covered by the vehicle over the course of an episode.
    • Significance: When paired with a low collision rate, a higher distance traveled suggests efficient and effective navigation. This metric highlights the agent's progress and ability to follow the desired path (a sketch of how these metrics can be logged per episode follows this list).
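As referenced above, the sketch below shows one way these metrics can be accumulated per evaluation episode. The Gymnasium-style interface and the info-dictionary keys are assumptions rather than the project's actual logging code.

```python
def evaluate_episode(env, policy, max_steps=2000):
    """Roll out one episode and return the three evaluation metrics
    (assumes a Gymnasium-style environment interface)."""
    state, _ = env.reset()
    cumulative_reward = 0.0
    distance_traveled = 0.0
    collisions = 0
    steps = 0

    for steps in range(1, max_steps + 1):
        action = policy(state)
        state, reward, terminated, truncated, info = env.step(action)
        cumulative_reward += reward
        distance_traveled += info.get("distance_step", 0.0)  # hypothetical info key
        collisions += int(info.get("collision", False))      # hypothetical info key
        if terminated or truncated:
            break

    return {
        "cumulative_reward": cumulative_reward,
        "distance_traveled": distance_traveled,
        "collision_rate": collisions / steps,  # collisions per step this episode
    }
```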

Results

Demonstration videos of the trained agents are available here: https://drive.google.com/drive/folders/1K7cDpq456Woh-fQq7A6EAID3Nz-YboTQ?usp=sharing

The primary objective of this study is to demonstrate that a well-designed reward function is essential for effective reinforcement learning (RL)-based autonomous navigation. Additional considerations include the utility of LiDAR data for state representation and the appropriateness of PPO for continuous control tasks. We evaluate the PPO agent's performance using two distinct reward models: a simpler reward-based model and an older, more complicated reward-based model. The analysis centers on key performance metrics, including episode rewards, distance traveled, episode lengths, and reward components, derived from the training progress at epoch 50 for the older model and the rewards vs. steps relationship for the simpler model.

Subsection 1: Performance of the Simpler Reward-Based Model

[Figure: Training progress at epoch 50]

The performance of the PPO agent with the simpler reward-based model is illustrated in a scatter plot titled "Episode Reward vs Steps (Colored by Epoch)," which tracks episode rewards against the number of steps across epochs 35 to 72.

Subsection 2: Performance of the Older, More Complicated Reward-Based Model (Training Progress at Epoch 50)

[Figure: Training progress at epoch 50]

The older, more complicated reward-based model's performance is assessed using four charts depicting training progress over 100 epochs. Here, we focus specifically on the agent's behavior at epoch 50.

Subsection 3: Comparative Analysis and Key Insights

Subsection 4: Impact of Reward Function on Agent Behavior

Subsection 5: Moving to a Simpler Domain - DonkeySim

Why DonkeySim?: While we were satisfied with our results from our CARLA simulations, we also wanted to test out another autonomous vehicle simulator: DonkeySim. Our goal was to investigate whether we would end up with similar results to those above if we used a similar model structure in the simpler DonkeySim domain instead of the highly complex and realistic CARLA domain.

Procedure & Setup: To get a good understanding of the relationship between map complexity and model performance, we decided to test a single model - stable-baselines3's PPO with a CNN policy - across three different DonkeySim maps: Waveshore, Warren Track, and Mountain Track. Waveshore is the smallest and simplest map, a simple loop with minimal obstacles or changes in elevation. We classified Mountain Track as a "Medium" map, since it has few obstacles; however, as we would find out later, this map was more complicated than hypothesized because its changes in elevation add a hurdle for the car to overcome. Lastly, the UCSD-based Warren Track was the most difficult map due to its abundance of obstacles, its twisting and turning layout, and the small width of the track itself. A sketch of this training setup appears below.
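As mentioned above, the training setup looks roughly like the following sketch; the environment ID, launch configuration, and hyperparameters are assumptions (the actual configuration is in the code repo), and the exact gym-donkeycar options may differ across versions.

```python
import gym
import gym_donkeycar  # noqa: F401 -- importing registers the DonkeySim environments
from stable_baselines3 import PPO

# One of the three tracks compared above; env ID and launch options are assumptions.
conf = {"exe_path": "/path/to/donkey_sim", "port": 9091}
env = gym.make("donkey-warren-track-v0", conf=conf)

# CnnPolicy learns directly from the simulator's camera frames.
model = PPO("CnnPolicy", env, verbose=1)
model.learn(total_timesteps=200_000)
model.save("ppo_donkeysim_warren")

# Quick evaluation rollout with the trained policy (older Gym API: 4-tuple step return).
obs = env.reset()
done = False
while not done:
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
env.close()
```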

Results:

Discussion

The results from the previous section highlight the importance of environment complexity to PPO's overall performance. As we saw, PPO was able to perform well in the simplest map, Waveshore, but struggled in the more complex maps, Mountain Track and Warren Track. This suggests that PPO may be better suited for simpler environments, or that more training is needed for PPO to perform well in complex environments.

The results from our CARLA experiments also highlight the importance of the reward function in PPO's overall performance. As we saw, PPO performed better with the simpler reward function than with the more complex one. This suggests that a simpler reward function may be better suited to PPO, or that more tuning is needed for a complex reward function to perform well with PPO.

Limitations

During our experiments, we encountered several limitations that affected our methodology and outcomes.

Future Work

To build upon the findings of this project, several avenues for future work can be explored:

Ethics & Privacy

The development and deployment of autonomous vehicles raise several ethical and privacy considerations that must be addressed:

Conclusion

This project explored the application of reinforcement learning for autonomous vehicle navigation in DonkeySim and CARLA. By implementing and evaluating Proximal Policy Optimization (PPO) and Actor-Critic methods, we demonstrated their potential in training self-driving agents within a simulated environment. Our results highlighted the importance of a well-designed reward function for effective learning but also revealed challenges in balancing safety, lane adherence, and progress incentives. Additionally, model performance remained highly sensitive to training conditions, hyperparameter choices, and the diversity of training environments.

While RL shows promise in autonomous driving, significant challenges remain in bridging the sim-to-real gap, ensuring safety, and addressing ethical concerns such as bias and transparency. Future work should focus on optimizing learning efficiency, deploying models on real-world robotic platforms, and integrating privacy safeguards to align with regulatory standards. As reinforcement learning continues to evolve, advancements in reward shaping, model-based RL, real-world validation, and ethical AI frameworks will be critical in bringing AI-driven autonomous systems closer to practical deployment.

Footnotes

  1. [1] Kiran et al. (2021). "Deep Reinforcement Learning for Autonomous Driving: A Survey." A comprehensive review of RL applications in self-driving cars. Link
  2. [2] Peng et al. (2018). "Sim-to-Real Transfer of Robotic Control with Dynamics Randomization." A study on addressing the simulation-to-reality gap in RL applications. Link
  3. [3] Taylor & Stone (2009). "Transfer Learning for Reinforcement Learning Domains: A Survey." A discussion on transfer learning techniques to improve RL generalization. Link
  4. [4] Donkey Simulator. Official Documentation. Link
  5. [5] CARLA Simulator. Official Documentation. Link
  6. [6] PPO. Stable-Baselines3 Documentation. A widely used reinforcement learning algorithm for continuous control. Link
  7. [7] SAC. Stable-Baselines3 Documentation. A reinforcement learning algorithm based on the Actor-Critic framework with entropy regularization. Link