2026-03-24 | agents, alignment, scaling, data

Safe Reinforcement Learning with Preference-based Constraint Inference

Chenglin Li, Grant Ruan, Hua Geng

Key claim

PbCRL improves safety constraint inference in reinforcement learning.

This study presents a new approach called Preference-based Constrained Reinforcement Learning (PbCRL) that effectively infers safety constraints from human preferences. A key result is that PbCRL achieves better alignment with true safety requirements while outperforming existing methods in both safety and reward metrics.

Novelty

8.0/10

The introduction of a dead zone mechanism and SNR loss in preference modeling is a meaningful extension to existing constraint inference methods.

Reliability

7.5/10

The methodology is solid, and the empirical results show PbCRL outperforming the baselines.

Deep reliability assessment

The methodology supports the claim that PbCRL can learn safety constraints from preferences and align them with true safety requirements. However, it relies on the assumption that human feedback is noise-free, which may not hold in practice and could mean the results are overstated.

Reproducibility

Yes. The paper mentions using code from RLSF and Safe RLHF, but it does not link to a dedicated repository for PbCRL.

Discussion questions

How does the assumption of noise-free human feedback affect the applicability of PbCRL in real-world scenarios?
What are the practical challenges in collecting preference data for training PbCRL in a new domain?
What specific conditions or scenarios would demonstrate that PbCRL fails to align learned constraints with true safety requirements?

Key figure

Figure 1 illustrates the difference between ground truth, BT model, and the proposed model's cost distributions, highlighting the heavy-tailed nature of true costs.
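For context on the BT baseline in the figure, here is a minimal sketch of a standard Bradley-Terry preference model applied to trajectory costs, assuming "BT" refers to Bradley-Terry and that a lower cost means a safer trajectory. The function names are hypothetical, and this is not PbCRL's dead zone or SNR loss formulation, which this summary does not specify.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def bt_preference_prob(cost_a, cost_b):
    """Bradley-Terry probability that trajectory A is preferred over B.

    Assumption: the preferred trajectory is the safer one, i.e. the one
    with lower cumulative cost, so P(A preferred) = sigmoid(cost_b - cost_a).
    """
    return sigmoid(cost_b - cost_a)


def bt_loss(cost_a, cost_b, label):
    """Negative log-likelihood of one preference label.

    label = 1 if the annotator preferred A, 0 if they preferred B.
    """
    eps = 1e-12  # guards against log(0)
    p = bt_preference_prob(cost_a, cost_b)
    return -(label * math.log(p + eps) + (1 - label) * math.log(1.0 - p + eps))


if __name__ == "__main__":
    # Hypothetical cumulative costs for two trajectories.
    print(bt_preference_prob(0.5, 2.0))
    print(bt_loss(0.5, 2.0, label=1))
```

Minimizing `bt_loss` over a learned cost model pushes preferred trajectories toward lower predicted cost. The figure's comparison concerns how the cost distribution induced by such a model differs from the ground truth.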

Benchmark results

Safety Gymnasium | Return: 23.41 | vs PPO-BT: +0.85 | SOTA

Blocked Road | Cost: 0.07 | vs PPO-BT: -0.05 | SOTA
