ActSafe: Active Exploration with Safety Constraints for Reinforcement Learning

Yarden As *

ETH Zurich

Bhavya Sukhija *

ETH Zurich

Lenart Treven

ETH Zurich

Stelian Coros

ETH Zurich

Andreas Krause

ETH Zurich

ICLR 2025

*Equal Contribution

Demo on Humanoid Robot
Demo on SafetyGym
Demo on Cartpole

Abstract

Reinforcement learning (RL) is ubiquitous in the development of modern AI systems. However, state-of-the-art RL agents require extensive, and potentially unsafe, interactions with their environments to learn effectively. These limitations confine RL agents to simulated environments, hindering their ability to learn directly in real-world settings. In this work, we present ActSafe, a novel model-based RL algorithm for safe and efficient exploration. ActSafe maintains a pessimistic set of safe policies and optimistically selects policies within this set that yield trajectories with the largest model epistemic uncertainty.

Key Idea

ActSafe learns a probabilistic model of the dynamics, including its epistemic uncertainty, and leverages it to collect trajectories that maximize the information gain about the dynamics. To ensure safety, ActSafe plans pessimistically w.r.t. its set of plausible models and thus implicitly maintains a (pessimistic) set of policies that are deemed to be safe with high probability.
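To make the pessimistic planning concrete, here is a minimal sketch (not the paper's implementation) in which a bootstrapped ensemble stands in for the set of plausible dynamics models: a candidate policy is kept only if its simulated constraint cost stays within budget under every ensemble member. The `rollout` and `cost_fn` helpers, the budget `d`, and the ensemble interface are all assumptions for illustration.

```python
def pessimistic_is_safe(policy, model_ensemble, rollout, cost_fn, d, horizon=100):
    """A policy is deemed safe only if it satisfies the constraint budget `d`
    under *every* plausible dynamics model (here: every ensemble member)."""
    worst_case_cost = float("-inf")
    for model in model_ensemble:                      # set of plausible models
        traj = rollout(policy, model, horizon)        # simulated (s, a) trajectory
        cost = sum(cost_fn(s, a) for s, a in traj)    # accumulated constraint cost
        worst_case_cost = max(worst_case_cost, cost)
    return worst_case_cost <= d

def pessimistic_safe_set(candidates, model_ensemble, rollout, cost_fn, d):
    """Implicit pessimistic safe set: all candidates passing the worst-case check."""
    return [pi for pi in candidates
            if pessimistic_is_safe(pi, model_ensemble, rollout, cost_fn, d)]
```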

Expansion Process
Schematic illustration of the expansion process. We expand the safe set at each iteration by reducing our uncertainty around policies at the boundary of the previous pessimistic safe set. The pale blue area depicts the reachable set after H expansions.
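As a toy illustration of this expansion (under the assumed convention that a policy is safe when its constraint value is non-negative), the sketch below grows a safe set over a discretized policy space: a neighbor of a currently safe policy is added once the pessimistic lower-confidence estimate of its constraint value clears zero. The `mean`/`std` arrays, the neighborhood structure, and `beta` are hypothetical placeholders, not quantities from the paper.

```python
def expand_once(safe, mean, std, neighbors, beta=2.0):
    """One expansion step: neighbors of safe policies whose pessimistic
    estimate mean[j] - beta * std[j] is non-negative become safe as well."""
    grown = set(safe)
    for i in safe:
        for j in neighbors[i]:
            if mean[j] - beta * std[j] >= 0.0:
                grown.add(j)
    return grown

def reachable_safe_set(seed_safe, mean, std, neighbors, expansions=5):
    """Iterating the step H times gives the H-step reachable safe set
    (the pale blue region in the schematic above)."""
    safe = set(seed_safe)
    for _ in range(expansions):
        safe = expand_once(safe, mean, std, neighbors)
    return safe
```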

Concretely, we want to solve

$$
\pi_n, f_n = \argmax_{\pi \in \mathcal{S}_n,\, f \in \mathcal{M}_n} \underbrace{\mathbb{E}_{\tau^{\pi, f}}\left[\sum_{t=0}^{T-1} \left\|\sigma_{n-1}(s_t, \hat{s}_t)\right\|\right]}_{:= J_{r_n}(\pi, f)},
$$

where $\sigma_{n-1}$ represents our epistemic uncertainty about the dynamics model. Intuitively, selecting a policy $\pi_n$ that “navigates” to states with high uncertainty allows us to collect information more efficiently, all while staying within the pessimistic safe set of policies $\mathcal{S}_n$.
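A rough sketch of how such an objective could be evaluated, using ensemble disagreement at state–action pairs as a stand-in for $\sigma_{n-1}$ and an exhaustive search over a small set of candidate safe policies and plausible models; the `predict`, `reset`, and `step` interfaces are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def epistemic_std(models, s, a):
    """Stand-in for sigma_{n-1}: disagreement of the ensemble's next-state predictions."""
    preds = np.stack([m.predict(s, a) for m in models])   # (n_models, state_dim)
    return float(np.linalg.norm(preds.std(axis=0)))

def intrinsic_return(policy, f, models, horizon):
    """J_{r_n}(pi, f): accumulated epistemic uncertainty along a rollout under model f."""
    s, total = f.reset(), 0.0
    for _ in range(horizon):
        a = policy(s)
        total += epistemic_std(models, s, a)
        s = f.step(s, a)
    return total

def select_policy_and_model(safe_policies, models, horizon=100):
    """Joint optimism over the safe set and the model set: pick the pair (pi, f)
    whose simulated rollout is most informative about the dynamics."""
    return max(((pi, f) for pi in safe_policies for f in models),
               key=lambda pf: intrinsic_return(pf[0], pf[1], models, horizon))
```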

Experiments

Pendulum

We evaluate ActSafe on the Pendulum environment and visualize the trajectories of ActSafe and its unsafe variant in the state space during exploration. Both algorithms cover the state space well; however, ActSafe remains within the safety boundary during learning, whereas its unsafe variant violates the constraints.

Pendulum safe exploration
Safe exploration in the PendulumSwingup task. Each plot above visualizes trajectories considered during exploration across all past learning episodes. The red box in the plot depicts the safety boundary in the state space. ActSafe maintains safety throughout learning.

Cartpole

We evaluate on CartpoleSwingupSparse from the RWRL benchmark, where the goal is to swing up the pendulum while keeping the cart at the center. We add a penalty for large actions to make exploration even more challenging. We compare ActSafe with three baselines; the results are shown in the figure below.

Cartpole Exploration
Hard exploration on Cartpole.
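The action penalty mentioned above can be as simple as subtracting a quadratic control cost from the (already sparse) task reward; the coefficient below is a hypothetical choice, not the one used in the benchmark.

```python
import numpy as np

def penalized_reward(task_reward, action, action_cost=0.1):
    """Sparse swing-up reward minus a quadratic penalty on large actions,
    which makes undirected exploration even less likely to reach the goal."""
    return task_reward - action_cost * float(np.sum(np.square(action)))
```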

Sparse-reward Navigation

In this experiment, we examine the influence of using an intrinsic reward in hard exploration tasks. To this end, we extend tasks from SafetyGym and introduce three new tasks with sparse rewards, i.e., without any reward shaping to guide the agent to the goal. The figure below provides more details about the rewards and compares ActSafe with a Greedy baseline that collects trajectories based only on the sparse extrinsic reward. As shown, ActSafe substantially outperforms Greedy in all tasks, while violating the constraint only once in the GotoGoal task.

Sparse-reward Navigation
Hard exploration in navigation tasks.
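The difference between Greedy and ActSafe in this experiment comes down to the learning signal that drives exploration. A minimal sketch, assuming an epistemic-uncertainty bonus `uncertainty(s, a)` (e.g., the ensemble disagreement from the sketch above) and a mixing weight `eta`, both of which are illustrative choices rather than the paper's exact formulation:

```python
def greedy_signal(s, a, extrinsic_reward):
    # Greedy baseline: only the sparse task reward drives data collection.
    return extrinsic_reward(s, a)

def intrinsic_signal(s, a, extrinsic_reward, uncertainty, eta=1.0):
    # ActSafe-style signal: the sparse task reward is augmented with an
    # epistemic-uncertainty bonus, so exploration keeps making progress
    # even while the extrinsic reward is still almost always zero.
    return extrinsic_reward(s, a) + eta * uncertainty(s, a)
```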

Cite

@misc{as2024actsafeactiveexplorationsafety,
  title={ActSafe: Active Exploration with Safety Constraints for Reinforcement Learning},
  author={Yarden As and Bhavya Sukhija and Lenart Treven and Carmelo Sferrazza and Stelian Coros and Andreas Krause},
  year={2024},
  eprint={2410.09486},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
}