Reinforcement Learning: Escaping a Maze with PPO

Related Post:

Implementing Autonomous Target Navigation in MuJoCo via the Right-Hand Rule

This post explores how to solve a maze using Proximal Policy Optimization (PPO) within a custom MuJoCo environment I built previously. Let's start with PPO itself. Its core mechanism is 'clipping': the ratio between the new and old policy's probability of an action is clipped to a narrow range around 1, which keeps each policy update within a safe range and prevents extreme changes.
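
To make the clipping idea concrete, here is a minimal sketch of the clipped surrogate objective for a single sample. The function name `clipped_surrogate` and the default `epsilon=0.2` are assumptions chosen for illustration; this is not the training code used in this project, and a real PPO implementation averages this quantity over a batch.

```python
def clipped_surrogate(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """Per-sample PPO clipped objective (illustrative sketch).

    ratio     : new_policy_prob / old_policy_prob for the sampled action
    advantage : estimated advantage of that action
    epsilon   : clip range (0.2 is an assumed, commonly used value)
    """
    # Clip the probability ratio to the interval [1 - epsilon, 1 + epsilon].
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Take the more pessimistic of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action (positive advantage): the gain is capped once the ratio exceeds 1 + epsilon.
capped_gain = clipped_surrogate(ratio=1.5, advantage=2.0)

# Bad action (negative advantage): the benefit of lowering its probability is also capped.
capped_drop = clipped_surrogate(ratio=0.5, advantage=-1.0)

# Bad action made MORE likely: the unclipped term is kept, so the full penalty applies.
full_penalty = clipped_surrogate(ratio=1.5, advantage=-1.0)
```

Because of the outer `min`, the policy gains nothing from pushing the ratio beyond the clip range in the favorable direction, while moves in the unfavorable direction are still fully penalized.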

The reward logic is straightforward and follows the Right-Hand Rule. If the car drifts too far from the right wall or fails to stay parallel to it, it is penalized. It is also penalized if it gets too close to the wall. Conversely, maintaining the correct distance and alignment yields a positive reward. To implement this, we use LiDAR sensors to continuously measure the distance to the wall. The code below demonstrates this logic:

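# ... rest of the code ...
# Unpack the four LiDAR distance readings.
forward, right_forward, right, right_back = obs

# Reward the mean of the two action values (presumably wheel speeds),
# plus a small constant bonus for every step.
reward += ((action[0] + action[1]) / 2) * 1.5
reward += 0.5

if right < 1.5:
    # Right wall in range: penalize misalignment, distance error, and steering effort.
    parallel_error = abs(right_back - right_forward)
    dist_error = abs(right - self.target_dist)
    steering_penalty = abs(action[0] - action[1])
    reward -= parallel_error * 3
    reward -= dist_error * 2
    reward -= steering_penalty * 0.5
else:
    # Right wall out of range: reward action[0] exceeding action[1],
    # which presumably steers the car back toward the right.
    if action[0] > action[1]:
        reward += (action[0] - action[1]) * 3.0

if forward < 1.5:
    # Obstacle ahead: penalty grows as the gap shrinks, and steering
    # the other way (action[1] > action[0]) is rewarded.
    reward -= (1.5 - forward) * 10
    if action[1] > action[0]:
        reward += (action[1] - action[0]) * 3.0

if right_forward < 1.5 and right_back < 1.5:
    # Both diagonal right readings are short: again reward action[1] > action[0].
    if action[1] > action[0]:
        reward += (action[1] - action[0]) * 3.0

if min(obs) < 0.3:
    # Any reading below 0.3 counts as a collision: large penalty, end the episode.
    reward -= 200.0
    terminated = True

info = {}
return obs, float(reward), terminated, truncated, info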
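
To get a feel for how these terms trade off, the wall-following branch can be pulled out into a small pure function and evaluated by hand. This is only a sketch: the function name `wall_following_reward` and the default `target_dist=0.5` are hypothetical (the actual value of `self.target_dist` is not shown above), and it covers only the case where the right wall is in range and no other branch fires.

```python
def wall_following_reward(right_forward, right, right_back, action, target_dist=0.5):
    """Reward for one step when only the `right < 1.5` branch applies.

    target_dist=0.5 is an assumed value for illustration; the real
    environment reads it from self.target_dist.
    """
    # Base terms: mean of the two action values plus the per-step bonus.
    reward = ((action[0] + action[1]) / 2) * 1.5 + 0.5
    # Shaping penalties, using the same weights as the code above.
    parallel_error = abs(right_back - right_forward)
    dist_error = abs(right - target_dist)
    steering_penalty = abs(action[0] - action[1])
    reward -= parallel_error * 3
    reward -= dist_error * 2
    reward -= steering_penalty * 0.5
    return reward


# Parallel to the wall, at the target distance, driving straight: no penalties apply.
aligned = wall_following_reward(0.6, 0.5, 0.6, action=(1.0, 1.0))

# Nose drifting away from the wall while steering: every penalty term is active.
drifting = wall_following_reward(0.9, 0.7, 0.5, action=(1.0, 0.6))
```

With these hypothetical readings, the aligned step earns 2.0, while the drifting step ends up slightly negative (about -0.1), because the parallel-error term alone costs 1.2.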

Through this project, I realized that reward design is the key to reinforcement learning. Simply increasing penalties or rewards does not solve the problem. In complex environments, excessive penalties often lead the agent to give up and do nothing, while excessive rewards invite reward hacking. Although the result is not flawless, it solved the core problem to a meaningful extent.
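
A rough back-of-the-envelope comparison shows why a large penalty can make inaction attractive. The sketch below uses only the constants from the reward code earlier in this post (the 1.5 speed weight, the 0.5 per-step bonus, and the 200 collision penalty); the episode length of 50 steps and the function name `episode_return` are assumptions for illustration, and all other shaping terms and discounting are ignored.

```python
SPEED_GAIN = 1.5       # weight on the mean of the two action values
STEP_BONUS = 0.5       # constant bonus added on every step
CRASH_PENALTY = 200.0  # penalty applied when the episode ends in a collision


def episode_return(mean_action: float, steps: int, crashes: bool) -> float:
    """Undiscounted return of a simplified episode (illustrative only)."""
    per_step = mean_action * SPEED_GAIN + STEP_BONUS
    total = per_step * steps
    if crashes:
        total -= CRASH_PENALTY
    return total


# An agent that stands still in open space for 50 steps still collects the bonus.
standing_still = episode_return(mean_action=0.0, steps=50, crashes=False)

# An agent that drives at full speed for 50 steps and then hits a wall.
risky_driving = episode_return(mean_action=1.0, steps=50, crashes=True)
```

Under these assumptions, standing still returns 25.0 while the risky run returns -100.0, so an agent that has not yet learned to avoid walls can be better off not moving at all.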

https://youtu.be/o87S0XbEy3A

You can watch the final result of the simulation in the video above. The complete source code for this project is available in my GitHub repository.

Maze Navigation with Reinforcement Learning
