
Safety Distance: Move Slow and Don’t Break Things

James Giammona, Brad Neuberg

Problem

Provide a prior for RL agents so that they avoid dangerous parts of the environment that they have never seen before.

Achieve zero-shot generalization at test time to dangers beyond those the agent has seen at training time.

Proposed Solution

Add a heuristic term to the reward function that is only needed at test time (i.e., we don’t need to retrain a system to work with it).

This heuristic encodes the intuition that rapidly changing values in the environment could be dangerous (e.g., fast-moving cars, sharp changes in temperature, or a rapid change in altitude such as a pit).

Basically, a large first derivative of a state attribute (i.e., a value that changes rapidly) probably indicates a safety issue.
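A minimal sketch of how such a detector might look on a grid, assuming the state attribute is stored as a 2D array and that "rapid change" means a large difference between adjacent cells. The function name rapid_change_mask, the threshold, and the temperature values are all hypothetical and chosen only for illustration; this is not the project's actual code.

```python
import numpy as np

def rapid_change_mask(attribute, threshold):
    """Flag every cell that differs from a 4-neighbour by more than `threshold`.

    Assumption: the derivative is approximated by finite differences
    between adjacent grid cells.
    """
    attribute = np.asarray(attribute, dtype=float)
    mask = np.zeros(attribute.shape, dtype=bool)

    # Jumps between vertically adjacent cells; mark both sides of each jump.
    vertical = np.abs(np.diff(attribute, axis=0)) > threshold
    mask[:-1, :] |= vertical
    mask[1:, :] |= vertical

    # Jumps between horizontally adjacent cells; mark both sides of each jump.
    horizontal = np.abs(np.diff(attribute, axis=1)) > threshold
    mask[:, :-1] |= horizontal
    mask[:, 1:] |= horizontal
    return mask

# Hypothetical temperature map: one very hot cell in a cool room.
temperature = np.array([
    [20.0, 20.0, 20.0, 20.0],
    [20.0, 20.0, 900.0, 20.0],
    [20.0, 20.0, 20.0, 20.0],
])
mask = rapid_change_mask(temperature, threshold=100.0)
print(mask.astype(int))
```

Because both sides of each jump are flagged, the mask covers the hot cell and the cells bordering it, which is one way to mark the boundary of a region of rapid change.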

Inspired by DeepMind’s recent work creating toy gridworlds that encapsulate concrete problems in AI safety (“on-off switch”, “distributional shift”, “unexpected side effects”, etc.), we extended one of their gridworlds, in which lava is in a single location at training time but in different locations at test time.

User-uploaded image: gridworld.png

Gridworld adapted from DeepMind paper. Agent is in blue, goal is in green, lava is in red.

User-uploaded image: gridworld+w+dist.png

Detect gradient changes and calculate the distance to them.

At test time, we calculate our “safety distance”, which should have a high value when the agent is far away from areas of rapid change and a low value when near these areas. This “safety distance” then augments the standard reward.
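One way this calculation could be sketched, assuming the areas of rapid change are already available as a boolean grid and that distance is measured in grid steps (Manhattan distance). The names safety_distance and augmented_reward, the weight, and the cap are hypothetical; the writeup does not specify how the distance is combined with the reward, so the additive bonus here is only an illustration.

```python
import numpy as np

def safety_distance(danger_mask, position):
    """Manhattan distance from `position` to the nearest flagged cell.

    Returns infinity when no cell is flagged.
    """
    danger_cells = np.argwhere(danger_mask)  # shape (N, 2): row, column
    if danger_cells.size == 0:
        return float("inf")
    offsets = np.abs(danger_cells - np.asarray(position))
    return int(offsets.sum(axis=1).min())

def augmented_reward(reward, distance, weight=0.1, cap=5):
    """Add a bonus that grows with the safety distance, up to `cap` steps.

    Assumption: an additive, capped bonus; other combinations are possible.
    """
    return reward + weight * min(distance, cap)

# Hypothetical 4x4 grid with a single dangerous cell at row 1, column 2.
danger_mask = np.zeros((4, 4), dtype=bool)
danger_mask[1, 2] = True

far = safety_distance(danger_mask, (3, 0))
near = safety_distance(danger_mask, (1, 1))
print(far, augmented_reward(-1.0, far))
print(near, augmented_reward(-1.0, near))
```

The cap keeps the bonus bounded so that, in a large open area, staying far from danger does not outweigh the environment's own reward.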

We did not have time to tie in an RL algorithm that chooses actions using this modified reward, so for now we simply implemented a random walk and display the calculated safety distance.
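A random walk of this kind might look roughly like the sketch below. It assumes the safety distance has already been computed for every cell and stored in a grid; the distance values, the function name random_walk, and the choice to clamp moves at the grid edge are hypothetical, and the sketch does not model episode termination.

```python
import random
import numpy as np

# Up, down, left, right as (row, column) offsets.
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def random_walk(distance_map, start, steps, seed=0):
    """Take `steps` uniformly random moves, recording the safety distance.

    Moves that would leave the grid are clamped to the nearest edge cell.
    Returns a list of (position, safety_distance) pairs, start included.
    """
    rng = random.Random(seed)
    rows, cols = distance_map.shape
    position = start
    trace = [(position, int(distance_map[position]))]
    for _ in range(steps):
        d_row, d_col = rng.choice(MOVES)
        row = min(max(position[0] + d_row, 0), rows - 1)
        col = min(max(position[1] + d_col, 0), cols - 1)
        position = (row, col)
        trace.append((position, int(distance_map[position])))
    return trace

# Hypothetical precomputed safety distances; 0 marks the dangerous cell.
distance_map = np.array([
    [3, 2, 1, 2],
    [2, 1, 0, 1],
    [3, 2, 1, 2],
])
trace = random_walk(distance_map, start=(2, 0), steps=5)
for position, distance in trace:
    print(position, distance)
```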

Future Work

Actually tie in RL algorithms and see whether this reduces the number of deaths (i.e., episodes in which the agent terminates in a way that is irrevocable).

Rather than hard-coding a heuristic for the test-time reward as we have done, can we use a function approximator such as a deep net to emit this “safety distance” so that it becomes a learned value?