Reinforcement learning (RL) is ubiquitous in the development of modern AI systems. However, state-of-the-art RL agents require extensive, and potentially unsafe, interactions with their environments to learn effectively. These limitations confine RL agents to simulated environments, hindering their ability to learn directly in real-world settings. In this work, we present ActSafe, a novel model-based RL algorithm for safe and efficient exploration. ActSafe maintains a pessimistic set of safe policies and optimistically selects policies within this set that yield trajectories with the largest model epistemic uncertainty.
ActSafe learns a probabilistic model of the dynamics, including its epistemic uncertainty, and leverages it to collect trajectories that maximize the information gain about the dynamics. To ensure safety, ActSafe plans pessimistically w.r.t. its set of plausible models and thus implicitly maintains a (pessimistic) set of policies that are deemed to be safe with high probability.
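As a concrete illustration, here is a minimal sketch, assuming a bootstrapped ensemble of neural dynamics models, of how epistemic uncertainty over the dynamics can be estimated from ensemble disagreement. This is a common recipe rather than necessarily the exact parameterization used by ActSafe; the class name, network sizes, and ensemble size are illustrative.

```python
# Minimal sketch: epistemic uncertainty of a learned dynamics model via
# ensemble disagreement. Architecture and hyperparameters are illustrative
# assumptions, not the paper's exact setup.
import torch
import torch.nn as nn

class DynamicsEnsemble(nn.Module):
    """Ensemble of MLPs predicting the next state from (state, action)."""

    def __init__(self, state_dim, action_dim, num_members=5, hidden=64):
        super().__init__()
        self.members = nn.ModuleList([
            nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, state_dim),
            )
            for _ in range(num_members)
        ])

    def forward(self, state, action):
        x = torch.cat([state, action], dim=-1)
        preds = torch.stack([m(x) for m in self.members])  # (members, batch, state_dim)
        mean = preds.mean(dim=0)       # nominal next-state prediction
        epistemic = preds.std(dim=0)   # disagreement serves as epistemic uncertainty
        return mean, epistemic

# Usage: the per-state disagreement can serve as an intrinsic exploration signal.
model = DynamicsEnsemble(state_dim=3, action_dim=1)
s, a = torch.zeros(8, 3), torch.zeros(8, 1)
next_mean, uncertainty = model(s, a)
```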
Concretely, at each episode $n$ we want to solve

$$\pi_n = \operatorname*{arg\,max}_{\pi \in \mathcal{S}_n} \; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T-1} \big\|\sigma_{n-1}\big(s_t, \pi(s_t)\big)\big\|\right],$$

where $\sigma_{n-1}$ represents our epistemic uncertainty over a model of the dynamics after $n-1$ episodes, and $\mathcal{S}_n$ is the pessimistic set of policies deemed safe. Intuitively, selecting a policy that “navigates” to states with high uncertainty allows us to collect information more efficiently, all while staying within the pessimistic safe set of policies $\mathcal{S}_n$.
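To illustrate the selection rule above, here is a hedged sketch of the outer loop: among candidate policies that pass a pessimistic safety check, pick the one whose rollouts accumulate the most model uncertainty. The toy dynamics, the safety check, and the linear policy parameterization are stand-ins; ActSafe obtains these quantities from the learned model and its set of plausible models.

```python
# Sketch of optimistic policy selection within a pessimistic safe set.
# sigma(), the safety check, and the dynamics are toy stand-ins.
import numpy as np

rng = np.random.default_rng(0)

def sigma(state, action):
    """Stand-in for the model's epistemic uncertainty at (s, a)."""
    return np.linalg.norm(np.sin(state) + action)

def is_pessimistically_safe(policy):
    """Stand-in for membership in the pessimistic safe policy set:
    here, a policy counts as 'safe' if its gains stay within bounds."""
    return np.all(np.abs(policy["gain"]) <= 1.0)

def uncertainty_return(policy, horizon=20):
    """Roll out the policy under a nominal model and sum sigma(s, a)."""
    state, total = np.zeros(2), 0.0
    for _ in range(horizon):
        action = policy["gain"] * state + policy["bias"]
        total += sigma(state, action)
        state = 0.9 * state + 0.1 * action   # toy dynamics stand-in
    return total

# Optimistic selection within the pessimistic safe set.
candidates = [{"gain": rng.uniform(-2, 2, 2), "bias": rng.uniform(-1, 1, 2)}
              for _ in range(100)]
safe = [p for p in candidates if is_pessimistically_safe(p)]
best = max(safe, key=uncertainty_return)
```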
We evaluate ActSafe on the Pendulum environment and visualize the trajectories of ActSafe and its unsafe variant in the state space during exploration. We observe that both algorithms cover the state space well; however, ActSafe remains within the safety boundary throughout learning, whereas its unsafe variant violates the constraints.
We evaluate on CartpoleSwingupSparse from the RWRL benchmark, where the goal is to swing up the pendulum while keeping the cart at the center. We add a penalty for large actions (sketched below) to make exploration even more challenging. We compare ActSafe with three baselines.
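For concreteness, an action penalty of this kind can be implemented as a simple reward wrapper, sketched below. The quadratic form and the coefficient are illustrative assumptions, not the exact penalty used in the experiments (which build on the RWRL / dm_control suite rather than Gymnasium).

```python
# Illustrative action-penalty wrapper: subtracts a quadratic action cost
# from the environment reward. Coefficient and form are assumptions.
import numpy as np
import gymnasium as gym

class ActionPenaltyWrapper(gym.Wrapper):
    """Penalizes large actions to make exploration harder."""

    def __init__(self, env, penalty_coef=0.1):
        super().__init__(env)
        self.penalty_coef = penalty_coef

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        reward -= self.penalty_coef * float(np.sum(np.square(action)))
        return obs, reward, terminated, truncated, info
```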
In this experiment, we examine the influence of using an intrinsic reward in hard-exploration tasks. To this end, we extend tasks from SafetyGym and introduce three new tasks with sparse rewards, i.e., without any reward shaping to guide the agent to the goal. The figure below gives more details about the rewards and compares ActSafe with a Greedy baseline that collects trajectories based only on the sparse extrinsic reward. As shown, ActSafe substantially outperforms Greedy in all tasks, while violating the constraint only once, in the GotoGoal task.
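One way to read this comparison in code: the exploring agent augments the sparse task reward with an uncertainty-based intrinsic bonus, while Greedy relies on the extrinsic term alone. The additive form and the weight `beta` below are assumptions for illustration, not the paper's exact objective.

```python
# Hedged sketch: sparse extrinsic reward plus an intrinsic uncertainty bonus.
import numpy as np

def exploration_reward(extrinsic, epistemic_std, beta=1.0):
    """Total reward = sparse task reward + beta * epistemic uncertainty."""
    intrinsic = np.linalg.norm(epistemic_std)  # e.g. ensemble disagreement at (s, a)
    return extrinsic + beta * intrinsic

# Greedy baseline: beta = 0, so only the sparse extrinsic reward drives data collection.
greedy_reward = exploration_reward(extrinsic=0.0,
                                   epistemic_std=np.array([0.2, 0.1]),
                                   beta=0.0)
```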
@misc{as2024actsafeactiveexplorationsafety,
title={ActSafe: Active Exploration with Safety Constraints for Reinforcement Learning},
author={Yarden As and Bhavya Sukhija and Lenart Treven and Carmelo Sferrazza and Stelian Coros and Andreas Krause},
year={2024},
eprint={2410.09486},
archivePrefix={arXiv},
primaryClass={cs.LG},
}