Designing Degraded Modes for AI-Enabled Robots

A robot does not become safe because it can stop.

It becomes safer when it knows what capability to keep, reduce, or remove before a stop becomes the only option.

That is the purpose of degraded modes. An AI-enabled robot should not have only two states: “autonomous” and “dead.” Real machines need intermediate operating modes where autonomy is reduced, speed is limited, sensors are reweighted, operators regain authority, and risky functions are disabled while the system preserves the safest useful behavior.

This matters even more when the robot uses LLMs, vision-language models, learned perception, local agents, or tool-calling interfaces. AI components can improve task reasoning and interaction, but they also add uncertainty: stale context, low-confidence detections, unavailable GPU memory, delayed inference, planner disagreement, ambiguous user intent, and tool outputs that look valid but are not physically trustworthy.

The engineering question is not:

How do we keep the robot running at all costs?

It is:

What is the smallest safe capability the robot can still provide when part of the stack becomes unreliable?

Key takeaways

A degraded mode is a defined operating state with reduced capability, reduced authority, stricter limits, and explicit recovery conditions.
Degraded modes should be designed before incidents, not invented by an LLM, operator, or exception handler at runtime.
AI-enabled robots need degraded modes for perception loss, localization drift, network loss, GPU overload, model uncertainty, stale state, actuator faults, and operator-control handoff.
The safest degraded mode is often not an immediate E-stop. It may be low-speed teleoperation, hold-position, local-only navigation, perception-limited inspection, or return-to-base under tighter constraints.
ROS 2 lifecycle state, diagnostics, watchdogs, command validation, and runtime assurance monitors should feed the mode manager.
Every degraded mode needs entry triggers, allowed commands, disallowed commands, timing limits, exit criteria, logging, and a fallback if recovery fails.

Citation-ready answer

A degraded mode for an AI-enabled robot is a predefined operating state that reduces autonomy, speed, workspace, command authority, or task scope when part of the robot stack becomes unreliable. Instead of letting a failing perception model, planner, network, GPU, sensor, or actuator continue with full authority, the supervisor moves the robot into a safer capability envelope such as low-speed operation, local-only control, hold-position, manual assist, return-to-base, or safe stop. Degraded modes are not error messages; they are executable safety architecture.

Degraded mode is not failure handling

Failure handling often means reacting after something breaks.

Degraded-mode design is more deliberate. It defines what the robot is still allowed to do when confidence has dropped but the system is not yet in an emergency.

That distinction matters:

Condition	Bad reaction	Better degraded-mode response
Camera confidence drops	Keep running nominal autonomy	Reduce speed, require LiDAR/encoder agreement, disable manipulation
Localization grows stale	Continue navigation goal	Stop path execution, rotate in place only if safe, request relocalization
LLM tool output is ambiguous	Ask the model to retry until it works	Reject command, ask for operator clarification, keep current safe state
GPU inference latency spikes	Queue more AI work	Shed noncritical models, keep watchdogs and local control alive
Motor driver reports thermal warning	Ignore until fault	Reduce torque/speed, shorten task horizon, return to service zone
Network link drops	Wait in blocking state	Switch to local autonomy or hold-position policy

The degraded mode is the state between “everything is fine” and “cut power.” It is where a lot of real robot reliability is won.

This expands on the safety stack I described in robot safety architecture: watchdogs and E-stops matter, but the system also needs designed behavior for partial loss of trust.

The mode ladder

For most AI-enabled robots, I would start with a mode ladder like this:

Mode	Autonomy	Speed/force	AI authority	Typical use
Normal	Full approved autonomy	Nominal limits	AI may propose bounded tasks	Healthy sensors, fresh state, stable compute
Cautious	Full task class, tighter envelope	Reduced limits	AI may propose, validator is stricter	Mild uncertainty or crowded workspace
Degraded	Reduced task class	Low limits	AI may explain, but cannot start risky actions	Sensor loss, latency spike, subsystem warning
Assisted	Operator-supervised execution	Low/manual limits	AI becomes advisory only	Recovery, inspection, maintenance handoff
Hold	No task progress	Zero motion or brake/hold	AI may summarize only	State stale, unclear scene, short recovery window
Safe stop	No autonomy	Hazardous energy removed or controlled	No physical authority	Emergency, interlock, unrecoverable fault

The exact names do not matter. The authority boundary does.

An LLM should not be able to move the robot from safe stop back to normal. A VLM should not be able to override low localization confidence. A remote operator should not be able to resume nominal autonomy until the required preconditions are true. Mode transitions are supervisory decisions, not chat interactions.

That is why degraded modes belong next to command validation for AI-enabled robots. Command validation decides whether a specific request is allowed. Mode management decides which command classes are allowed at all.
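One way to make the authority boundary concrete is an ordered enum plus a guard on who may request a transition. This is a minimal sketch, not a reference design: the names `Mode`, `Requester`, and `transition_allowed` are hypothetical, and treating the ladder as a strict ordering from most to least capable is an assumption that may not fit every robot.

```python
from enum import Enum, IntEnum


class Mode(IntEnum):
    """Ladder rungs, ordered from most to least capable (assumed ordering)."""
    NORMAL = 0
    CAUTIOUS = 1
    DEGRADED = 2
    ASSISTED = 3
    HOLD = 4
    SAFE_STOP = 5


class Requester(Enum):
    SUPERVISOR = "supervisor"  # deterministic mode manager
    OPERATOR = "operator"      # human with mode-request rights
    AI = "ai"                  # LLM, VLM, or agent


def transition_allowed(current: Mode, target: Mode, requester: Requester,
                       preconditions_met: bool) -> bool:
    """Return True if `requester` may move the robot from `current` to `target`."""
    if target == current:
        return False  # not a transition
    if requester is Requester.AI:
        return False  # AI may recommend a mode, but never switches it
    if target > current:
        return True   # tightening the envelope is always permitted
    # Relaxing the envelope requires verified preconditions, even for a human.
    return preconditions_met
```

With this guard, a chat request to resume nominal autonomy is rejected regardless of what the model says, and an operator request is rejected until the supervisor has confirmed the preconditions.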

What should trigger degradation

A degraded mode should be triggered by robot evidence, not vibes.

Good triggers are measurable:

Trigger family	Example signal	Degraded response
State freshness	`/odom` age exceeds 100 ms	Stop navigation commands, request fresh state
Perception confidence	Object detector confidence below threshold	Disable grasping, allow only observation
Sensor disagreement	IMU, wheel odometry, and visual odometry diverge	Reduce speed, relocalize, reject long path plans
Compute pressure	GPU memory or inference latency exceeds budget	Disable nonessential AI, keep safety monitors alive
Lifecycle state	Planner or controller node inactive	Pause autonomy, restart stack or require operator
Diagnostics	Driver warning, thermal warning, voltage drop	Reduce load, enter service-return policy
Communication loss	Base-to-MCU heartbeat timeout	Stop commands, let MCU enforce local hold/stop
Human proximity	Safety scanner sees person inside warning zone	Reduce speed or stop depending on zone
Model uncertainty	LLM command fails validation repeatedly	Lock command class, require human confirmation

The key is that each trigger must map to a deterministic mode transition. If the system only emits logs and hopes someone notices, it does not have degraded modes. It has observability without supervision.
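A trigger table like the one above can be expressed as data, so the mapping from evidence to mode is a pure function. The sketch below is illustrative only: the state keys, thresholds, and mode names are assumptions, and missing values are deliberately treated as unhealthy so the table fails closed.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Mapping, Tuple

# Hypothetical mode names, ordered from least to most restrictive.
SEVERITY = ["NORMAL", "CAUTIOUS", "DEGRADED", "HOLD", "SAFE_STOP"]


@dataclass(frozen=True)
class Trigger:
    name: str
    fired: Callable[[Mapping[str, Any]], bool]  # predicate over measured state
    target_mode: str


# Thresholds are placeholders, not recommendations. Defaults make a missing
# measurement count as a fired trigger.
TRIGGERS: List[Trigger] = [
    Trigger("estop_active", lambda s: bool(s.get("estop_active", True)), "SAFE_STOP"),
    Trigger("odom_stale", lambda s: s.get("odom_age_ms", float("inf")) > 100.0, "DEGRADED"),
    Trigger("detector_low_conf", lambda s: s.get("detector_conf", 0.0) < 0.5, "DEGRADED"),
    Trigger("mcu_heartbeat_lost",
            lambda s: s.get("mcu_heartbeat_age_ms", float("inf")) > 50.0, "HOLD"),
]


def required_mode(state: Mapping[str, Any]) -> Tuple[str, List[str]]:
    """Return the most restrictive mode demanded by any fired trigger,
    plus the names of all fired triggers for the audit log."""
    fired = [t for t in TRIGGERS if t.fired(state)]
    if not fired:
        return "NORMAL", []
    worst = max(fired, key=lambda t: SEVERITY.index(t.target_mode))
    return worst.target_mode, [t.name for t in fired]
```

Because the function has no hidden state, the same measurements always produce the same mode, which is what makes the transition testable.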

Runtime assurance as the mental model

Runtime assurance is a useful way to think about degraded modes.

NASA’s 2024 paper “A Verification Framework for Runtime Assurance of Autonomous UAS” describes a monitor that can hand control from an advanced controller to a trusted reversionary controller when a safety property is violated. That is a strong mental model for Physical AI systems: the advanced AI layer can be useful, but the robot needs a simpler trusted behavior ready when the advanced behavior leaves its safe envelope.

In robotics terms:

Runtime assurance concept	Robot implementation
Advanced controller	AI planner, learned perception, VLA policy, behavior tree
Runtime monitor	Supervisor checking freshness, confidence, timing, limits
Safety property	Speed, workspace, separation, thermal, localization, force limit
Reversionary controller	Hold, low-speed teleop, return-to-base, safe stop
Switch condition	Measurable trigger that removes authority from the advanced layer

The important word is “trusted.” A degraded mode should use simpler, better-understood behavior than the mode it replaces. If nominal autonomy depends on a VLM and dense semantic mapping, the degraded mode might use LiDAR-only obstacle stop, low-speed teleop, or a prevalidated route. It should not depend on the same failing perception chain.
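The switch between the advanced and reversionary controller can be sketched in a few lines. This is a simplified illustration of the idea, not the framework from the paper: the types, limits, and the choice of a zero-velocity fallback are all assumptions.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class VelocityCommand:
    linear_mps: float
    angular_rps: float


@dataclass(frozen=True)
class MonitorInput:
    odom_age_ms: float
    min_obstacle_m: float


# Placeholder limits for the safety property.
MAX_ODOM_AGE_MS = 100.0
MIN_CLEARANCE_M = 0.5
MAX_LINEAR_MPS = 0.5


def safety_property_holds(inp: MonitorInput, proposed: VelocityCommand) -> bool:
    """Runtime monitor: fresh state, enough clearance, speed inside the envelope."""
    return (inp.odom_age_ms <= MAX_ODOM_AGE_MS
            and inp.min_obstacle_m >= MIN_CLEARANCE_M
            and abs(proposed.linear_mps) <= MAX_LINEAR_MPS)


def reversionary_command() -> VelocityCommand:
    """Trusted fallback: stop. It does not depend on any AI component."""
    return VelocityCommand(0.0, 0.0)


def select_command(inp: MonitorInput,
                   proposed: VelocityCommand) -> Tuple[VelocityCommand, str]:
    """Pass the advanced controller's command through only while the
    safety property holds; otherwise hand control to the fallback."""
    if safety_property_holds(inp, proposed):
        return proposed, "advanced"
    return reversionary_command(), "reversionary"
```

The advanced controller never sees the switch. It keeps proposing commands, and the monitor decides whether they reach the actuators.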

ROS 2 lifecycle state belongs in the mode manager

ROS 2 lifecycle nodes are useful because they expose whether a component is unconfigured, inactive, active, or finalized, and whether it is in a transition state such as cleaning up, shutting down, or error processing. The ROS 2 managed-node design describes lifecycle nodes as components with a known state machine and supervisory transitions.

That state should feed degraded-mode logic.

For example:

Lifecycle observation	Mode decision
`planner_server` inactive	No new navigation goals
`controller_server` inactive	No path execution
Camera driver active but diagnostics warn	Allow navigation, disable visual grasping
Localization node in error processing	Hold, then relocalize or require operator
MCU bridge not active	Reject actuator commands, keep AI advisory only

Nav2’s Lifecycle Manager is a practical reference. Its documentation describes deterministic transition of ordered nodes and bond connections that can bring nodes down if a server crashes or becomes non-responsive. That is the kind of operational discipline a degraded-mode manager needs: ordered transitions, timeouts, recovery attempts, and explicit manual reactivation when automatic recovery is no longer safe.
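The lifecycle table above reduces to a dependency map: each command class is allowed only while every node it depends on is active. The sketch below uses plain strings for lifecycle state labels and leaves out how they are obtained from ROS 2. The node names other than `planner_server` and `controller_server`, and all command-class names, are hypothetical.

```python
from typing import Dict, Set

# Assumed mapping from node name to the command classes that depend on it.
DEPENDENTS: Dict[str, Set[str]] = {
    "planner_server": {"navigate_to_goal"},
    "controller_server": {"navigate_to_goal", "follow_path"},
    "mcu_bridge": {"navigate_to_goal", "follow_path", "arm_motion", "manual_jog"},
}

ALL_COMMANDS: Set[str] = set().union(*DEPENDENTS.values())


def allowed_commands(lifecycle_states: Dict[str, str]) -> Set[str]:
    """Return the command classes whose required nodes all report 'active'."""
    blocked: Set[str] = set()
    for node, commands in DEPENDENTS.items():
        # A missing or non-active state fails closed: its commands are blocked.
        if lifecycle_states.get(node) != "active":
            blocked |= commands
    return ALL_COMMANDS - blocked
```

For example, an inactive `planner_server` removes `navigate_to_goal` but leaves path following and manual jog available, while a missing MCU bridge state blocks every actuator command.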

Diagnostics should be mode inputs, not dashboards only

ROS diagnostics are often treated as something a developer checks in a UI after a problem.

For degraded modes, diagnostics should be machine inputs.

The ROS 2 diagnostic_updater package supports collecting and publishing diagnostic messages, and diagnosed publishers can report frequency behavior. That matters because degraded modes often depend on the difference between:

a node is alive,
a node is publishing at the expected rate,
a device is publishing stale data,
a device is publishing fresh data with warning status,
a device is fresh but physically unreliable.

A simple mode manager can subscribe to health topics, diagnostics, lifecycle state, watchdog state, and command-validation events. It should not scrape logs or ask an LLM to infer health from text.
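The first four distinctions in that list can be computed from timestamps, a measured rate, and the device's own warning flag. Here is a minimal sketch with hypothetical names and parameters. It does not use the diagnostic_updater API; it only shows the classification a mode manager might apply to whatever health data it receives.

```python
from enum import Enum
from typing import Optional


class Health(Enum):
    OK = "ok"
    WARN = "warn"        # fresh data, but the device reports a warning
    SLOW = "slow"        # publishing, but below the expected rate
    STALE = "stale"      # last message is older than allowed
    MISSING = "missing"  # nothing received yet


def classify(now_s: float, last_msg_s: Optional[float], measured_hz: float,
             min_hz: float, max_age_s: float, device_warning: bool) -> Health:
    """Classify one input stream, checking the most severe condition first."""
    if last_msg_s is None:
        return Health.MISSING
    if now_s - last_msg_s > max_age_s:
        return Health.STALE
    if measured_hz < min_hz:
        return Health.SLOW
    if device_warning:
        return Health.WARN
    return Health.OK
```

The fifth case, data that is fresh but physically unreliable, cannot be detected from timing alone. It needs a cross-check against another sensor, which is why sensor disagreement is its own trigger family.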

This connects directly to structuring ROS 2 logs and rosbags for AI-assisted debugging: every mode transition should be recorded with trigger, timestamp, robot state reference, previous mode, new mode, and recovery decision.
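A transition record with those fields might look like the following. The field names and the JSON-lines encoding are assumptions chosen for illustration. Any structured format that keeps the same fields would serve.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ModeTransition:
    stamp_ns: int            # time of the transition, in nanoseconds
    previous_mode: str
    new_mode: str
    trigger: str             # enum-like trigger name, not free text
    state_ref: str           # pointer to the recorded robot state
    recovery_decision: str   # what the supervisor chose to do next


def to_log_line(event: ModeTransition) -> str:
    """Serialize one transition as a single JSON line with stable key order."""
    return json.dumps(asdict(event), sort_keys=True)
```

Sorting the keys keeps the output byte-identical for identical events, which makes log diffs between two replays meaningful.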

A degraded-mode decision matrix

Here is a concrete matrix for a small mobile manipulator.

Fault or uncertainty	Normal capability removed	Remaining capability	Hard fallback
VLM confidence low	Semantic manipulation	Manual camera view, non-contact inspection	Hold
LLM command rejected repeatedly	AI task initiation	Status explanation only	Require operator
Visual odometry stale	Autonomous navigation	Low-speed base hold, wheel/IMU monitoring	Safe stop if drift grows
LiDAR unavailable	Navigation through shared space	Stationary diagnostics, local arm lockout	Safe stop
Arm joint encoder warning	Manipulation	Base return if arm stowed and locked	Brake/hold arm
MCU heartbeat lost	Actuation from ROS 2	AI advisory, diagnostics only	MCU local stop
GPU memory pressure	Heavy perception and LLM inference	Safety monitor, low-rate diagnostics	Stop AI tasks
Battery voltage sag	Long mission tasks	Return-to-base or park	Power-safe shutdown
Human in warning zone	Nominal speed	Low speed or pause	Stop if protection zone entered

This table is more useful than a generic “fallback plan” because it separates what is removed from what remains. That prevents a common mistake: treating degradation as a total system failure even when a safe reduced function is available.

Define mode contracts

Every degraded mode should have a contract.

Contract field	Example
Mode name	`DEGRADED_LOCALIZATION`
Entry triggers	Odometry age > 100 ms or localization confidence < 0.65
Allowed commands	Stop, hold, relocalize, low-speed manual jog
Blocked commands	Navigate to goal, autonomous docking, arm motion near people
Speed/force envelope	Base max 0.05 m/s, arm locked
Required monitors	Safety scanner, MCU heartbeat, battery state
Exit criteria	Localization confidence > 0.85 for 5 seconds and odom age < 50 ms
Timeout	30 seconds before hold or operator request
Logging	`mode_transition`, trigger enum, confidence values, command IDs
Human role	Operator can approve relocalization, cannot bypass E-stop

The exit criteria are as important as the entry criteria. A robot that bounces between normal and degraded mode every second is not safe or useful. Add hysteresis: enter degraded mode quickly, but require stable evidence before returning to normal.
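The hysteresis can be implemented as a small gate that only opens after the exit evidence has stayed healthy for the whole stable-exit window. The sketch below reuses the example thresholds from the contract table. The class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExitCriteria:
    min_confidence: float = 0.85
    max_odom_age_ms: float = 50.0
    stable_for_s: float = 5.0


class StableExit:
    """Tracks how long the exit evidence has been continuously healthy."""

    def __init__(self, criteria: ExitCriteria) -> None:
        self.criteria = criteria
        self._healthy_since: Optional[float] = None

    def update(self, now_s: float, confidence: float, odom_age_ms: float) -> bool:
        """Feed one sample; return True once the exit window is satisfied."""
        c = self.criteria
        healthy = confidence > c.min_confidence and odom_age_ms < c.max_odom_age_ms
        if not healthy:
            self._healthy_since = None  # any bad sample restarts the window
            return False
        if self._healthy_since is None:
            self._healthy_since = now_s
        return now_s - self._healthy_since >= c.stable_for_s
```

Entry has no such gate: a single sample past the entry threshold is enough to degrade, while a single bad sample during recovery resets the clock.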

AI authority should shrink as confidence falls

A clean degraded-mode design changes what the AI is allowed to do.

Mode	LLM/tool authority
Normal	May propose bounded tasks through command validation
Cautious	May propose only low-risk tasks; stricter confirmation rules
Degraded	May explain status and propose recovery checklists; no new risky actions
Assisted	May guide operator, summarize diagnostics, prepare commands for approval
Hold	May explain why motion is blocked
Safe stop	No physical authority; status explanation only

This is where degraded modes enforce the point from how to split authority between an LLM, ROS 2, and a microcontroller. As decisions get closer to actuation, timing, and safety, the AI should lose authority, not gain it.
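In code, the authority table can be a lookup that the tool-calling layer consults before it forwards anything from the model. The action names below are hypothetical labels for the rows above. An unknown mode or action is denied.

```python
from typing import Dict, FrozenSet

# Assumed action labels derived from the authority table.
AI_AUTHORITY: Dict[str, FrozenSet[str]] = {
    "NORMAL": frozenset({"propose_task", "explain_status"}),
    "CAUTIOUS": frozenset({"propose_low_risk_task", "explain_status"}),
    "DEGRADED": frozenset({"propose_recovery_checklist", "explain_status"}),
    "ASSISTED": frozenset({"prepare_command_for_approval",
                           "summarize_diagnostics", "explain_status"}),
    "HOLD": frozenset({"explain_status"}),
    "SAFE_STOP": frozenset({"explain_status"}),
}


def ai_action_permitted(mode: str, action: str) -> bool:
    """Return True only if the current mode explicitly grants this AI action."""
    return action in AI_AUTHORITY.get(mode, frozenset())
```

A permitted proposal still has to pass command validation. This lookup only decides whether the model is allowed to make the proposal at all.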

Timing matters

A degraded-mode transition has a timing budget.

If a safety scanner heartbeat is missing, you might have tens of milliseconds before the MCU or safety relay enforces stop behavior. If localization confidence is drifting, you may have hundreds of milliseconds or seconds to slow down, hold, and relocalize. If GPU memory pressure affects a noncritical object classifier, you may have seconds to shed load without affecting motion.

Useful timing fields:

Field	Why it matters
Detection deadline	How quickly the trigger must be detected
Decision deadline	How quickly the mode manager must decide
Actuation deadline	How quickly the controller must reduce or stop motion
Recovery window	How long automatic recovery may run
Escalation timeout	When to require operator or safe stop
Stable-exit window	How long evidence must remain healthy before resuming

If a degraded mode cannot meet its timing budget, it is not a degraded mode. It is a wish.
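A timing budget is only useful if it is checked against measurements. The sketch below compares measured stage times with a budget and names the stages that overran. The stage split follows the first three fields in the table, and the numbers are placeholders, not recommendations.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class StageTimes:
    """Milliseconds spent in each stage of one degraded-mode transition."""
    detection_ms: float
    decision_ms: float
    actuation_ms: float


def budget_violations(budget: StageTimes, measured: StageTimes) -> List[str]:
    """Return the names of the stages whose measured time exceeds the budget."""
    return [
        f.name
        for f in fields(StageTimes)
        if getattr(measured, f.name) > getattr(budget, f.name)
    ]


# Illustrative budget for a scanner-heartbeat transition (50 ms in total).
SCANNER_BUDGET = StageTimes(detection_ms=20.0, decision_ms=10.0, actuation_ms=20.0)
```

Running a check like this on every injected fault during testing shows which stage to fix when a transition is too slow, instead of only showing that the total was missed.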

This is why low-level timing-critical reactions should stay below the AI stack, as discussed in real-time Linux for robotics. The AI can help explain and plan recovery. It should not be in the critical stop path.

Implementation pattern

The simplest implementation is a ROS 2 mode manager node with a small state machine.

It consumes:

lifecycle state,
diagnostic status,
watchdog heartbeats,
command-validation decisions,
sensor freshness metrics,
confidence scores,
safety scanner zones,
controller fault codes,
operator mode requests.

It publishes:

current operating mode,
allowed command classes,
active speed/force/workspace envelope,
transition reason,
recovery status,
audit events.

The core loop is ordinary software:

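def evaluate_mode(state, current_mode, policy):
    # Helpers (collect_triggers, transition, stay, ...) are application-specific.
    # Triggers come from robot evidence only: freshness, diagnostics, heartbeats.
    triggers = collect_triggers(state, policy)

    # Checks run from most to least severe, so the strictest applicable mode wins.
    if triggers.has("estop_active") or triggers.has("safety_relay_open"):
        return transition("SAFE_STOP", "hard_safety_trigger")

    if triggers.has("mcu_heartbeat_lost"):
        return transition("HOLD", "mcu_heartbeat_lost")

    if triggers.has("localization_stale"):
        return transition("DEGRADED_LOCALIZATION", "localization_stale")

    if triggers.has("gpu_latency_high"):
        return transition("DEGRADED_COMPUTE", "gpu_latency_high")

    # No trigger is active: recover only when the exit evidence has been stable.
    if current_mode != "NORMAL" and recovery_conditions_are_stable(state, policy):
        return transition(next_recovery_mode(current_mode), "stable_recovery")

    # No trigger and no stable recovery evidence: keep the current mode.
    return stay(current_mode)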

The policy should be explicit enough to test without the LLM running. You should be able to replay a rosbag, feed the mode manager the same state, and get the same transitions.
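A replay check can be as small as the harness below. It is a sketch under stated assumptions: recorded states are plain dictionaries rather than messages read from a real rosbag, and `toy_step` is a stand-in for the mode manager that deliberately skips the stable-exit window to stay short.

```python
from typing import Callable, Dict, Iterable, List, Tuple

State = Dict[str, float]
StepFn = Callable[[State, str], str]


def replay(step: StepFn, recorded_states: Iterable[State],
           initial_mode: str = "NORMAL") -> List[Tuple[int, str, str]]:
    """Run `step` over recorded states and return every mode change as
    (sample_index, previous_mode, new_mode)."""
    mode = initial_mode
    transitions: List[Tuple[int, str, str]] = []
    for i, state in enumerate(recorded_states):
        new_mode = step(state, mode)
        if new_mode != mode:
            transitions.append((i, mode, new_mode))
            mode = new_mode
    return transitions


def toy_step(state: State, mode: str) -> str:
    """Stand-in mode manager: a pure function of its inputs."""
    if state["odom_age_ms"] > 100.0:
        return "DEGRADED_LOCALIZATION"
    if mode == "DEGRADED_LOCALIZATION" and state["odom_age_ms"] < 50.0:
        return "NORMAL"
    return mode
```

Replaying the same recording twice and comparing the two transition lists is the determinism test. If the lists differ, the mode manager depends on something outside its inputs, such as wall-clock time or model output.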

Test degraded modes before trusting them

A degraded mode that has never been tested is just documentation.

Minimum tests:

Kill a perception node while the robot is moving slowly.
Delay or drop odometry messages.
Inject stale TF.
Simulate GPU latency spikes.
Drop the MCU heartbeat.
Force low battery and thermal warning states.
Reject several LLM commands in a row.
Trigger lifecycle node error handling.
Replay the same bag and verify identical transitions.
Verify that the robot does not automatically return to normal without stable exit criteria.

The test output should include mode transition logs, command rejection logs, watchdog events, and final recovery state. If you cannot explain why the robot entered a mode, the mode manager is not observable enough.

Common mistakes

The first mistake is treating degraded mode as “slow mode.” Reduced speed is useful, but it is not enough. Some failures require removing entire command classes, not merely slowing them down.

Other common mistakes:

letting the LLM decide whether the robot is degraded,
returning to normal after one healthy sample,
using one generic degraded state for every fault,
failing open when diagnostics are unavailable,
allowing AI tool calls during safe stop,
logging only the final failure and not the first trigger,
depending on the same failing sensor in the fallback mode,
making teleoperation the fallback without checking communication quality,
giving the operator a bypass that is not logged,
designing mode transitions without timing budgets.

The hardest mistake to see in advance is shared dependency. If normal autonomy and degraded autonomy both require the same GPU-heavy perception model, then degraded mode may fail exactly when you need it.

FAQ

Is degraded mode the same as safe stop?

No. Safe stop removes or controls hazardous behavior when continued operation is not acceptable. A degraded mode preserves a smaller safe capability before safe stop is necessary.

Should an AI agent be allowed to choose degraded mode?

It may recommend or explain a mode, but the actual transition should come from deterministic supervisor logic based on robot state, diagnostics, and policy.

What is the first degraded mode to implement?

Start with stale localization or perception-confidence degradation. These failures are common, measurable, and directly tied to unsafe autonomy.

Does every robot need many degraded modes?

No. A small robot may need only normal, low-speed assisted, hold, and safe stop. A field robot, mobile manipulator, or human-adjacent system usually needs more specific modes.

How does this relate to command validation?

Command validation checks a proposed command. Mode management controls which command classes are allowed under the current robot condition. They should work together.

What should be logged?

Log the previous mode, new mode, trigger enum, relevant sensor ages/confidence values, command IDs blocked by the transition, watchdog state, operator ID if involved, and recovery result.

Final opinion

The real test of an AI-enabled robot is not how impressive it looks in normal mode.

It is what it does when confidence falls.

Good degraded modes make the robot less magical and more trustworthy. They remove authority from uncertain AI components, preserve the simplest safe capability, keep humans informed, and make every transition auditable. That is the kind of engineering boundary Physical AI needs before it should be trusted around real machines.