Safe Exploration via Policy Priors

Manuel Wendl *

ETH Zurich

Yarden As *

ETH Zurich

Anton Pollak

ETH Zurich

Stelian Coros

ETH Zurich

Andreas Krause

ETH Zurich

Preprint

*Equal Contribution

SOOPER: Safe Online Optimism for Pessimistic Expansion in RL

SOOPER (Safe Online Optimism for Pessimistic Expansion in RL) is a reinforcement learning algorithm designed to ensure safe exploration while learning in real-world environments. SOOPER uses prior policies that can be derived from scarce offline data or simulation under distribution shifts. Such policies are invoked pessimistically during online rollouts to maintain safety. These rollouts are collected optimistically so as to maximize information about a world model of the environment. Using this model, SOOPER constructs a simulated RL environment, used for planning and exploration, whose trajectories terminate once the prior policy is invoked. The key idea is that early terminations incentivize the agent to avoid the prior policy when trajectories with higher returns can be obtained safely. This key idea is illustrated below.

Safety Guarantee via Online Cost Tracking

Idea behind SOOPER.

In the example above, the agent’s goal is to reach the cross marker while avoiding obstacles (tires). The agent deploys an exploration policy $\pi_n$; however, at time $t$ it switches to the prior policy $\hat\pi$ to maintain the safety criterion $\Phi(a_t, s_t, c_{<t}, Q_{c,n}^{\hat\pi}) = \sum_{\tau=0}^{t-1} \gamma^\tau c(s_\tau, a_\tau) + Q_{c,n}^{\hat\pi}(s_t, a_t) < d$. The prior policy $\hat\pi$ ensures safety by following a conservative route, but sacrifices performance, resulting in a lower return $\underline{V}_r^{\hat\pi}(s_t)$. The trajectory of iteration $n$ is recorded to improve the models used in subsequent iterations. After $N$ iterations, as more data is gathered, the agent learns a more rewarding trajectory via model-generated rollouts that terminate with a terminal reward of $\underline{V}_r^{\hat\pi}(s_t)$, the expected accumulated future reward obtained by following the pessimistic prior policy.
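To make the switching rule concrete, here is a minimal sketch of how the online safety check could look on the real system. The interface (for instance `q_cost_prior` and an environment that returns a cost signal) is an illustrative assumption, not the authors' implementation.

```python
def rollout_with_safety_switch(env, policy, prior_policy, q_cost_prior,
                               gamma, d, horizon):
    """Roll out the exploration policy, handing control to the prior policy
    as soon as the safety criterion Phi would be violated.

    q_cost_prior(s, a): pessimistic estimate Q_c^{pi_hat}(s, a) of the
                        prior policy's discounted cost-to-go from (s, a).
    """
    s = env.reset()
    accumulated_cost = 0.0   # sum_{tau < t} gamma^tau * c(s_tau, a_tau)
    use_prior = False

    for t in range(horizon):
        if use_prior:
            a = prior_policy(s)
        else:
            a = policy(s)
            # Safety criterion Phi: cost accumulated so far plus the prior's
            # pessimistic cost-to-go must stay below the budget d.
            if accumulated_cost + q_cost_prior(s, a) >= d:
                use_prior = True          # switch to the safe prior policy
                a = prior_policy(s)

        s, reward, cost, done = env.step(a)
        accumulated_cost += (gamma ** t) * cost
        if done:
            break
```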

Expansion and Exploration-Exploitation

Discovering new optimal behaviors requires learners to balance exploring new, promising behaviors against exploiting behaviors that are already known to be rewarding. When safety is not required during learning, this “exploration-exploitation” dilemma is known to be solved efficiently by being “optimistic in the face of uncertainty”: intuitively, we assume that the things we do not know will yield positive outcomes; if we are proved wrong and the outcomes are negative, we have acquired new information and now know not to go there.

In contrast, in safe exploration we cannot explore freely, but may only try out behaviors (or, more formally, policies) that we are certain are safe. More explicitly, we can only explore-exploit within a set of policies that is known to be safe with high probability. Crucially, this set is likely not to include an optimal policy $\pi_c^*$, so an optimistic search for the optimal policy will not be enough to find it. Safe exploration techniques solve this challenge with a reward-free pure-exploration phase that effectively expands the set of feasible policies.

Exploration-exploitation vs. Expansion
Left: exploration-exploitation in constrained tasks may not find an optimal policy because the search is limited to a set of policies that are known to be safe (in purple). Right: expansion proactively enlarges the safe set and reaches the optimum.

While pure exploration allows the agent to sufficiently expand the set of feasible policies, performance during this phase can be arbitrarily bad because the learner completely ignores its objective, which is specified by the rewards. SOOPER addresses this via intrinsic rewards, namely rewards that guide the learner to explore systematically even in the absence of (extrinsic) rewards from the real environment. This can be done by solving, at each iteration $n$,

$$\pi_n = \arg\max_\pi \mathbb{E}_{\tilde{p}, \pi, s_0}\left[\sum_{t=0}^\infty \gamma^t\tilde{r}(s_t,a_t) + \left(\gamma^t\lambda_{\text{explore}}+\sqrt{\gamma^t}\,\lambda_{\text{expand}}\right)\|\sigma_n(s_t,a_t)\|\right]$$

using model-generated rollouts, where $\sigma_n$ denotes the learner’s uncertainty at a given state-action pair. The main takeaway is that intrinsic rewards allow us to do both exploration and expansion with a single objective, so there is no need for a separate reward-free pure-exploration phase.
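As an illustration, below is a minimal sketch of how a single imagined trajectory could be scored under this objective, together with the early-termination rule described above. The interface is assumed for exposition: `model.step` returns a predicted reward and an uncertainty estimate (for example, the disagreement of an ensemble of learned dynamics models), `prior_value_r` is the pessimistic return estimate of the prior policy, and `is_prior_invoked` evaluates the safety criterion.

```python
import numpy as np

def imagined_return(model, policy, prior_value_r, is_prior_invoked,
                    gamma, lambda_explore, lambda_expand, s0, horizon):
    """Discounted return of one model-generated rollout under the
    intrinsic-reward objective, with early termination once the prior
    policy would be invoked."""
    s, total = s0, 0.0
    for t in range(horizon):
        a = policy(s)
        if is_prior_invoked(s, a):
            # Early termination: the imagined trajectory ends and the agent
            # only collects the prior policy's (lower) value from here on.
            total += (gamma ** t) * prior_value_r(s)
            break
        s_next, reward, sigma = model.step(s, a)   # sigma ~ sigma_n(s_t, a_t)
        # Intrinsic bonus weights from the objective above.
        bonus = (gamma ** t) * lambda_explore + np.sqrt(gamma ** t) * lambda_expand
        total += (gamma ** t) * reward + bonus * np.linalg.norm(sigma)
        s = s_next
    return total
```

In practice, a policy optimizer would maximize the average of such returns over many imagined rollouts.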

But Does it Work?

Yes

We test SOOPER on five different simulated reinforcement learning environments, comparing it with state-of-the-art safe exploration algorithms. In all of our experiments, we first train a pessimistic prior policy $\hat\pi$ under shifted dynamics. This prior policy is known to be safe, but due to its pessimism it underperforms when deployed on the true dynamics.

Experiment results across five tasks. The upper plot shows the performance improvement over the prior policy. The lower plot shows the largest recorded accumulated cost during training. SOOPER satisfies the constraints in all tasks while being on par or outperforming the baselines.

Vision Control

We show that SOOPER can handle control tasks from image observations. Using a CartpoleSwingup setup, the method learns directly from visual embeddings and consistently meets the safety constraints while achieving near-optimal performance, even under changes in the environment. As shown, SOOPER (in purple) is the only method that satisfies the constraint throughout learning.

Performance of SOOPER compared to an unsafe baseline.
Visualization of the task.
The task is to swing up the pendulum while keeping the cart within the red region. The agent observes only three temporally-stacked grayscale images.

Hardware

We first train a policy using a first-principles model of a remote-control racing car that operates at 60 Hz. We then continue training this policy online with SOOPER on the real remote-control race car.

Visualization of the task.
Performance of SOOPER compared to an unsafe baseline.
Safe exploration on real hardware with SOOPER. We report the mean and standard error across five seeds of the objective and constraint measured on the real system. SOOPER learns to improve over the prior policy while satisfying the constraints throughout learning.

We also repeat this experiment with a prior policy trained on expert offline data collected from the real system.

Learn More

Check out our open-source implementation for more details. Stay tuned for upcoming paper submissions!

Cite

@inproceedings{wendl2025safe,
  title={Safe Exploration via Policy Priors},
  author={Manuel Wendl and Yarden As and Manish Prajapat and Anton Pollak and Stelian Coros and Andreas Krause},
  booktitle={NeurIPS 2025 Workshop: Second Workshop on Aligning Reinforcement Learning Experimentalists and Theorists},
  year={2025},
  url={https://openreview.net/forum?id=YKCMhLC4Vs}
}