Maze Navigation with Reinforcement Learning

This post explores how to solve a maze using Proximal Policy Optimization (PPO) within a custom MuJoCo environment I built previously. We'll start by understanding PPO. Its core mechanism is 'clipping': each update limits how far the new policy's action probabilities can move from the old policy's, which keeps policy updates within a safe range and prevents extreme changes.
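To make the clipping idea concrete, here is a minimal sketch of the clipped surrogate objective for a single sample. It is not taken from this project's training code: the function name `clipped_surrogate` is hypothetical, and `epsilon = 0.2` is only a commonly used value, assumed here for illustration.

```python
def clipped_surrogate(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """PPO clipped objective for one sample (illustrative sketch).

    ratio     -- new_policy_prob / old_policy_prob for the action taken
    advantage -- estimated advantage of that action
    epsilon   -- clip range (0.2 is an assumed, commonly used value)
    """
    # Keep the ratio inside [1 - epsilon, 1 + epsilon].
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Take the more pessimistic of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)


# A good action whose probability grew by 50%: the gain is capped at 1.2 * 2.0.
print(clipped_surrogate(ratio=1.5, advantage=2.0))
# A small change stays inside the clip range and passes through unchanged.
print(clipped_surrogate(ratio=1.1, advantage=2.0))
```

Because of the cap, pushing the ratio further outside the clip range earns the policy nothing extra, which is what discourages extreme updates.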

The reward logic is straightforward and based on the Right-Hand Rule: the car keeps the wall on its right and follows it. If the car drifts too far from the right wall or fails to stay parallel to it, it is penalized. Getting too close to a wall is penalized as well. Conversely, maintaining the correct distance and alignment yields a positive reward. To implement this, LiDAR sensors continuously measure the distance to the walls in four directions (forward, right-forward, right, and right-back). The code below demonstrates this logic:

# ... rest of the code ...
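# Unpack the four LiDAR distances: forward, right-forward, right, right-back.
forward, right_forward, right, right_back = obs

# Base reward: scale with the average of the two action values, plus a small per-step bonus.
reward += ((action[0] + action[1]) / 2) * 1.5
reward += 0.5

if right < 1.5:
    # A wall is in range on the right: penalize misalignment, distance error,
    # and any difference between the two action values.
    parallel_error = abs(right_back - right_forward)
    dist_error = abs(right - self.target_dist)
    steering_penalty = abs(action[0] - action[1])
    reward -= parallel_error * 3
    reward -= dist_error * 2
    reward -= steering_penalty * 0.5
else:
    # No wall in range on the right: reward action[0] exceeding action[1]
    # (presumably steering right to find the wall again).
    if action[0] > action[1]:
        reward += (action[0] - action[1]) * 3.0

if forward < 1.5:
    # Obstacle ahead: the penalty grows as it gets closer, and
    # action[1] exceeding action[0] (presumably steering left) is rewarded.
    reward -= (1.5 - forward) * 10
    if action[1] > action[0]:
        reward += (action[1] - action[0]) * 3.0

if right_forward < 1.5 and right_back < 1.5:
    # Both angled right-side rays see a wall: again reward action[1] exceeding action[0].
    if action[1] > action[0]:
        reward += (action[1] - action[0]) * 3.0

if min(obs) < 0.3:
    # Any reading below 0.3 ends the episode with a large penalty.
    reward -= 200.0
    terminated = True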

info = {}
return obs, float(reward), terminated, truncated, info
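As a worked example of the wall-following penalty above, the sketch below pulls the parallel and distance terms into a standalone function and evaluates them on two made-up sets of readings. The function name `wall_following_penalty`, the sample readings, and the `target_dist` value of 0.5 are all assumptions for illustration; the post does not state the actual target distance.

```python
def wall_following_penalty(right_forward: float, right: float,
                           right_back: float, target_dist: float) -> float:
    """Parallel and distance penalties, using the same weights as the reward code."""
    parallel_error = abs(right_back - right_forward)  # 0 when the car is parallel to the wall
    dist_error = abs(right - target_dist)             # 0 when the car is at the target distance
    return parallel_error * 3 + dist_error * 2


TARGET_DIST = 0.5  # hypothetical value, not taken from the project

# Parallel to the wall and at the target distance: no penalty.
aligned = wall_following_penalty(0.6, 0.5, 0.6, TARGET_DIST)

# Nose turned away from the wall and slightly too far out:
# about 0.3 * 3 + 0.2 * 2 = 1.3 is subtracted from the reward.
drifting = wall_following_penalty(0.9, 0.7, 0.6, TARGET_DIST)

print(aligned, drifting)
```

With these weights, a misalignment costs more than an equally large distance error, so the agent is pushed to straighten out first and correct its distance second.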

Through this project, I realized that reward design is the key to reinforcement learning. Simply increasing penalties or rewards is not a reliable fix. In complex environments, excessive penalties often lead the agent to give up and do nothing, while excessive rewards invite reward hacking, where the agent collects reward without actually solving the task. This project is not flawless, but it solved the core problem to a meaningful extent.

https://youtu.be/o87S0XbEy3A

You can watch the final result of the simulation in the video above. The complete source code for this project is available on my GitHub repository.