Designing Degraded Modes for AI-Enabled Robots

A robot does not become safe because it can stop.

It becomes safer when it knows what capability to keep, reduce, or remove before a stop becomes the only option.

That is the purpose of degraded modes. An AI-enabled robot should not have only two states: “autonomous” and “dead.” Real machines need intermediate operating modes where autonomy is reduced, speed is limited, sensors are reweighted, operators regain authority, and risky functions are disabled while the system preserves the safest useful behavior.

This matters even more when the robot uses LLMs, vision-language models, learned perception, local agents, or tool-calling interfaces. AI components can improve task reasoning and interaction, but they also add uncertainty: stale context, low-confidence detections, unavailable GPU memory, delayed inference, planner disagreement, ambiguous user intent, and tool outputs that look valid but are not physically trustworthy.

The engineering question is not:

How do we keep the robot running at all costs?

It is:

What is the smallest safe capability the robot can still provide when part of the stack becomes unreliable?

Key takeaways

  • A degraded mode is a defined operating state with reduced capability, reduced authority, stricter limits, and explicit recovery conditions.
  • Degraded modes should be designed before incidents, not invented by an LLM, operator, or exception handler at runtime.
  • AI-enabled robots need degraded modes for perception loss, localization drift, network loss, GPU overload, model uncertainty, stale state, actuator faults, and operator-control handoff.
  • The safest degraded mode is often not an immediate E-stop. It may be low-speed teleoperation, hold-position, local-only navigation, perception-limited inspection, or return-to-base under tighter constraints.
  • ROS 2 lifecycle state, diagnostics, watchdogs, command validation, and runtime assurance monitors should feed the mode manager.
  • Every degraded mode needs entry triggers, allowed commands, disallowed commands, timing limits, exit criteria, logging, and a fallback if recovery fails.

Citation-ready answer

A degraded mode for an AI-enabled robot is a predefined operating state that reduces autonomy, speed, workspace, command authority, or task scope when part of the robot stack becomes unreliable. Instead of letting a failing perception model, planner, network, GPU, sensor, or actuator continue with full authority, the supervisor moves the robot into a safer capability envelope such as low-speed operation, local-only control, hold-position, manual assist, return-to-base, or safe stop. Degraded modes are not error messages; they are executable safety architecture.

Degraded mode is not failure handling

Failure handling often means reacting after something breaks.

Degraded-mode design is more deliberate. It defines what the robot is still allowed to do when confidence has dropped but the system is not yet in an emergency.

That distinction matters:

Condition | Bad reaction | Better degraded-mode response
Camera confidence drops | Keep running nominal autonomy | Reduce speed, require LiDAR/encoder agreement, disable manipulation
Localization grows stale | Continue navigation goal | Stop path execution, rotate in place only if safe, request relocalization
LLM tool output is ambiguous | Ask the model to retry until it works | Reject command, ask for operator clarification, keep current safe state
GPU inference latency spikes | Queue more AI work | Shed noncritical models, keep watchdogs and local control alive
Motor driver reports thermal warning | Ignore until fault | Reduce torque/speed, shorten task horizon, return to service zone
Network link drops | Wait in blocking state | Switch to local autonomy or hold-position policy

The degraded mode is the state between “everything is fine” and “cut power.” It is where a lot of real robot reliability is won.

This expands the safety stack I described in robot safety architecture: watchdogs and E-stops matter, but the system also needs designed behavior for partial loss of trust.

The mode ladder

For most AI-enabled robots, I would start with a mode ladder like this:

Mode | Autonomy | Speed/force | AI authority | Typical use
Normal | Full approved autonomy | Nominal limits | AI may propose bounded tasks | Healthy sensors, fresh state, stable compute
Cautious | Full task class, tighter envelope | Reduced limits | AI may propose, validator is stricter | Mild uncertainty or crowded workspace
Degraded | Reduced task class | Low limits | AI may explain, but cannot start risky actions | Sensor loss, latency spike, subsystem warning
Assisted | Operator-supervised execution | Low/manual limits | AI becomes advisory only | Recovery, inspection, maintenance handoff
Hold | No task progress | Zero motion or brake/hold | AI may summarize only | State stale, unclear scene, short recovery window
Safe stop | No autonomy | Hazardous energy removed or controlled | No physical authority | Emergency, interlock, unrecoverable fault

The exact names do not matter. The authority boundary does.

An LLM should not be able to move the robot from safe stop back to normal. A VLM should not be able to override low localization confidence. A remote operator should not be able to resume nominal autonomy until the required preconditions are true. Mode transitions are supervisory decisions, not chat interactions.

That is why degraded modes belong next to command validation for AI-enabled robots. Command validation decides whether a specific request is allowed. Mode management decides which command classes are allowed at all.
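The authority boundary can be written down as a small gate on who may request which transition. The following is a minimal sketch, not a reference design: the requester names (`supervisor`, `operator`, `llm`), the ladder ordering, and the one-rung-at-a-time rule are all assumptions made for illustration.

```python
from enum import IntEnum


class Mode(IntEnum):
    # Ordered from most restrictive (0) to least restrictive (5).
    SAFE_STOP = 0
    HOLD = 1
    ASSISTED = 2
    DEGRADED = 3
    CAUTIOUS = 4
    NORMAL = 5


def transition_allowed(requester: str, current: Mode, target: Mode,
                       preconditions_met: bool) -> bool:
    """Return True if `requester` may move the robot from `current` to `target`."""
    if requester not in ("supervisor", "operator"):
        # The LLM (or any unknown requester) is advisory only.
        return False
    if target <= current:
        # Moving to an equal or more restrictive mode is always permitted.
        return True
    if not preconditions_met or target != current + 1:
        # Authority is regained one rung at a time, and only with evidence.
        return False
    if current == Mode.SAFE_STOP:
        # Leaving safe stop needs explicit manual reactivation (assumption).
        return requester == "operator"
    return True
```

With this gate, `transition_allowed("llm", Mode.SAFE_STOP, Mode.NORMAL, True)` is rejected regardless of what the model outputs, while a supervisor can still drop from normal to hold without waiting for any precondition.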

What should trigger degradation

A degraded mode should be triggered by robot evidence, not vibes.

Good triggers are measurable:

Trigger family | Example signal | Degraded response
State freshness | /odom age exceeds 100 ms | Stop navigation commands, request fresh state
Perception confidence | Object detector confidence below threshold | Disable grasping, allow only observation
Sensor disagreement | IMU, wheel odometry, and visual odometry diverge | Reduce speed, relocalize, reject long path plans
Compute pressure | GPU memory or inference latency exceeds budget | Disable nonessential AI, keep safety monitors alive
Lifecycle state | Planner or controller node inactive | Pause autonomy, restart stack or require operator
Diagnostics | Driver warning, thermal warning, voltage drop | Reduce load, enter service-return policy
Communication loss | Base to MCU heartbeat timeout | Stop commands, let MCU enforce local hold/stop
Human proximity | Safety scanner sees person inside warning zone | Reduce speed or stop depending on zone
Model uncertainty | LLM command fails validation repeatedly | Lock command class, require human confirmation

The key is that each trigger must map to a deterministic mode transition. If the system only emits logs and hopes someone notices, it does not have degraded modes. It has observability without supervision.
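One way to keep the trigger-to-transition mapping deterministic is a priority-ordered rule table. The sketch below assumes a flat dictionary of measured signals; the signal names, the mode names other than those used in this article, and every threshold except the 100 ms odometry age are placeholder assumptions. Missing signals are treated as worst case so the table fails closed.

```python
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, float]
Rule = Tuple[str, Callable[[State], bool], str]

INF = float("inf")

# Rules are checked top to bottom; the first match wins.
RULES: List[Rule] = [
    ("mcu_heartbeat_timeout",
     lambda s: s.get("mcu_heartbeat_age_ms", INF) > 50.0, "HOLD"),
    ("odom_stale",
     lambda s: s.get("odom_age_ms", INF) > 100.0, "DEGRADED_LOCALIZATION"),
    ("detector_low_confidence",
     lambda s: s.get("detector_confidence", 0.0) < 0.5, "DEGRADED_PERCEPTION"),
    ("inference_latency_high",
     lambda s: s.get("inference_latency_ms", INF) > 200.0, "DEGRADED_COMPUTE"),
]


def first_transition(state: State) -> Optional[Tuple[str, str]]:
    """Return (trigger, target_mode) for the highest-priority active rule."""
    for name, predicate, target in RULES:
        if predicate(state):
            return name, target
    return None  # no trigger active
```

Because the table is plain data plus pure predicates, the same state always yields the same answer, and an empty state dictionary resolves to `HOLD` rather than to nominal operation.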

Runtime assurance as the mental model

Runtime assurance is a useful way to think about degraded modes.

NASA’s 2024 paper “A Verification Framework for Runtime Assurance of Autonomous UAS” describes a monitor that can hand control from an advanced controller to a trusted reversionary controller when a safety property is violated. That is a strong mental model for Physical AI systems: the advanced AI layer can be useful, but the robot needs a simpler trusted behavior ready when the advanced behavior leaves its safe envelope.

In robotics terms:

Runtime assurance concept | Robot implementation
Advanced controller | AI planner, learned perception, VLA policy, behavior tree
Runtime monitor | Supervisor checking freshness, confidence, timing, limits
Safety property | Speed, workspace, separation, thermal, localization, force limit
Reversionary controller | Hold, low-speed teleop, return-to-base, safe stop
Switch condition | Measurable trigger that removes authority from the advanced layer

The important word is “trusted.” A degraded mode should use simpler, better-understood behavior than the mode it replaces. If nominal autonomy depends on a VLM and dense semantic mapping, the degraded mode might use lidar-only obstacle stop, low-speed teleop, or a prevalidated route. It should not depend on the same failing perception chain.
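The switch between the advanced and the reversionary controller can be sketched in a few lines. This is an illustrative sketch only: the `Command` type, the three safety properties, and their thresholds are assumptions, and the reversionary behavior here is simply "stop in place", which a real robot may or may not want.

```python
from dataclasses import dataclass


@dataclass
class Command:
    linear_mps: float
    source: str


def reversionary_command() -> Command:
    """Trusted fallback: stop in place. It uses no AI output at all."""
    return Command(linear_mps=0.0, source="reversionary")


def select_command(advanced: Command, odom_age_ms: float,
                   min_obstacle_m: float, max_speed_mps: float = 0.5) -> Command:
    """Pass the advanced command through only while every property holds."""
    safe = (
        odom_age_ms <= 100.0                          # state is fresh
        and min_obstacle_m >= 0.5                     # separation is kept
        and abs(advanced.linear_mps) <= max_speed_mps  # speed is in envelope
    )
    return advanced if safe else reversionary_command()
```

The monitor never tries to repair the advanced command. It either forwards it unchanged or replaces it with the trusted behavior, which keeps the switch condition easy to test.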

ROS 2 lifecycle state belongs in the mode manager

ROS 2 lifecycle nodes are useful because they expose whether a component is unconfigured, inactive, active, or finalized, and whether it is mid-transition: configuring, cleaning up, shutting down, or processing an error. The ROS 2 managed-node design describes lifecycle nodes as components with a known state machine and supervisory transitions.

That state should feed degraded-mode logic.

For example:

Lifecycle observation | Mode decision
planner_server inactive | No new navigation goals
controller_server inactive | No path execution
Camera driver active but diagnostics warn | Allow navigation, disable visual grasping
Localization node in error processing | Hold, then relocalize or require operator
MCU bridge not active | Reject actuator commands, keep AI advisory only

Nav2’s Lifecycle Manager is a practical reference. Its documentation describes deterministic transition of ordered nodes and bond connections that can bring nodes down if a server crashes or becomes non-responsive. That is the kind of operational discipline a degraded-mode manager needs: ordered transitions, timeouts, recovery attempts, and explicit manual reactivation when automatic recovery is no longer safe.
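The lifecycle-to-mode mapping above can be expressed as a pure function that the mode manager calls whenever a node's state changes. This sketch assumes the states arrive as lowercase label strings (for example gathered from each node's state query) and uses hypothetical node names `localization` and `mcu_bridge` and hypothetical command-class names; it deliberately does not call any ROS 2 API.

```python
from typing import Dict, Set


def blocked_command_classes(lifecycle: Dict[str, str]) -> Set[str]:
    """Map observed lifecycle states to the command classes to block.

    A node that is missing from `lifecycle` counts as not active,
    so unknown state blocks commands instead of permitting them.
    """
    blocked: Set[str] = set()
    if lifecycle.get("planner_server") != "active":
        blocked.add("new_navigation_goal")
    if lifecycle.get("controller_server") != "active":
        blocked.add("path_execution")
    if lifecycle.get("localization") != "active":
        # Without localization, neither planning nor execution is trusted.
        blocked.update({"new_navigation_goal", "path_execution"})
    if lifecycle.get("mcu_bridge") != "active":
        blocked.add("actuator_command")
    return blocked
```

Keeping this as a function of observed state, with no side effects, makes it straightforward to unit test each row of the table independently of a running robot.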

Diagnostics should be mode inputs, not dashboards only

ROS diagnostics are often treated as something a developer checks in a UI after a problem.

For degraded modes, diagnostics should be machine inputs.

The ROS 2 diagnostic_updater package supports collecting and publishing diagnostic messages, and diagnosed publishers can report frequency behavior. That matters because degraded modes often depend on the difference between:

  • a node is alive,
  • a node is publishing at the expected rate,
  • a device is publishing stale data,
  • a device is publishing fresh data with warning status,
  • a device is fresh but physically unreliable.

A simple mode manager can subscribe to health topics, diagnostics, lifecycle state, watchdog state, and command-validation events. It should not scrape logs or ask an LLM to infer health from text.
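The distinctions in the list above can be made explicit by classifying each diagnostic input before it reaches the mode logic. The sketch below mirrors the four level constants of `diagnostic_msgs/DiagnosticStatus` as plain integers; the `Health` categories, the `classify` function, and the 0.5 s freshness limit are assumptions for illustration.

```python
from enum import Enum
from typing import Optional

# Same numeric values as the DiagnosticStatus level constants.
OK, WARN, ERROR, STALE = 0, 1, 2, 3


class Health(Enum):
    HEALTHY = "healthy"
    FRESH_WITH_WARNING = "fresh_with_warning"
    STALE_DATA = "stale_data"
    FAULT = "fault"
    UNKNOWN = "unknown"


def classify(level: Optional[int], age_s: Optional[float],
             max_age_s: float = 0.5) -> Health:
    """Combine reported level and message age into one health category."""
    if level is None or age_s is None:
        return Health.UNKNOWN          # no diagnostics received at all
    if level == STALE or age_s > max_age_s:
        return Health.STALE_DATA       # old data outranks the reported level
    if level == ERROR:
        return Health.FAULT
    if level == WARN:
        return Health.FRESH_WITH_WARNING
    if level == OK:
        return Health.HEALTHY
    return Health.UNKNOWN              # unexpected level value


def permits_nominal(health: Health) -> bool:
    """Only a fresh OK status permits nominal operation (fail closed)."""
    return health is Health.HEALTHY
```

Note that "fresh but physically unreliable" cannot be detected from diagnostics alone; that case needs cross-checks against other sensors.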

This connects directly to structuring ROS 2 logs and rosbags for AI-assisted debugging: every mode transition should be recorded with trigger, timestamp, robot state reference, previous mode, new mode, and recovery decision.
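A transition record with those fields might look like the following. The field names and the JSON-lines encoding are assumptions chosen for illustration, not a format defined by ROS 2 or by any logging library.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class ModeTransitionEvent:
    stamp_ns: int                 # time of the transition decision
    previous_mode: str
    new_mode: str
    trigger: str                  # enum-like trigger name
    state_ref: str                # pointer to the recorded robot state
    recovery_decision: str
    evidence: Dict[str, float] = field(default_factory=dict)
    blocked_command_ids: List[str] = field(default_factory=list)


def to_log_line(event: ModeTransitionEvent) -> str:
    """Serialize one event as a single JSON line with stable key order."""
    return json.dumps(asdict(event), sort_keys=True)


event = ModeTransitionEvent(
    stamp_ns=1_700_000_000_000_000_000,
    previous_mode="NORMAL",
    new_mode="DEGRADED_LOCALIZATION",
    trigger="localization_stale",
    state_ref="bag:run_0042#offset=1834",   # hypothetical reference format
    recovery_decision="request_relocalization",
    evidence={"odom_age_ms": 152.0, "localization_confidence": 0.58},
)
line = to_log_line(event)
```

Sorting the keys keeps the output byte-identical for identical events, which helps when comparing logs from two replays of the same bag.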

A degraded-mode decision matrix

Here is a concrete matrix for a small mobile manipulator.

Fault or uncertainty | Normal capability removed | Remaining capability | Hard fallback
VLM confidence low | Semantic manipulation | Manual camera view, non-contact inspection | Hold
LLM command rejected repeatedly | AI task initiation | Status explanation only | Require operator
Visual odometry stale | Autonomous navigation | Low-speed base hold, wheel/IMU monitoring | Safe stop if drift grows
LiDAR unavailable | Navigation through shared space | Stationary diagnostics, local arm lockout | Safe stop
Arm joint encoder warning | Manipulation | Base return if arm stowed and locked | Brake/hold arm
MCU heartbeat lost | Actuation from ROS 2 | AI advisory, diagnostics only | MCU local stop
GPU memory pressure | Heavy perception and LLM inference | Safety monitor, low-rate diagnostics | Stop AI tasks
Battery voltage sag | Long mission tasks | Return-to-base or park | Power-safe shutdown
Human in warning zone | Nominal speed | Low speed or pause | Stop if protection zone entered

This table is more useful than a generic “fallback plan” because it separates what is removed from what remains. That prevents a common mistake: treating degradation as a total system failure even when a safe reduced function is available.

Define mode contracts

Every degraded mode should have a contract.

Contract field | Example
Mode name | DEGRADED_LOCALIZATION
Entry triggers | Odometry age > 100 ms, localization confidence < 0.65
Allowed commands | Stop, hold, relocalize, low-speed manual jog
Blocked commands | Navigate to goal, autonomous docking, arm motion near people
Speed/force envelope | Base max 0.05 m/s, arm locked
Required monitors | Safety scanner, MCU heartbeat, battery state
Exit criteria | Localization confidence > 0.85 for 5 seconds and odom age < 50 ms
Timeout | 30 seconds before hold or operator request
Logging | mode_transition, trigger enum, confidence values, command IDs
Human role | Operator can approve relocalization, cannot bypass E-stop

The exit criteria are as important as the entry criteria. A robot that bounces between normal and degraded mode every second is not safe or useful. Add hysteresis: enter degraded mode quickly, but require stable evidence before returning to normal.
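The hysteresis rule can be sketched as a small stateful check that uses the exit criteria from the example contract (confidence above 0.85 and odometry age below 50 ms, held for 5 seconds). The class name and its interface are assumptions; only the thresholds come from the contract table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExitHysteresis:
    """Allow exit only after a continuous window of healthy evidence."""
    stable_window_s: float = 5.0
    min_confidence: float = 0.85
    max_odom_age_ms: float = 50.0
    _healthy_since: Optional[float] = None

    def update(self, now_s: float, confidence: float, odom_age_ms: float) -> bool:
        """Feed one sample; return True once exit criteria have held long enough."""
        healthy = (confidence > self.min_confidence
                   and odom_age_ms < self.max_odom_age_ms)
        if not healthy:
            self._healthy_since = None   # any bad sample restarts the window
            return False
        if self._healthy_since is None:
            self._healthy_since = now_s  # first healthy sample of a new window
        return now_s - self._healthy_since >= self.stable_window_s
```

Entry stays immediate (a single bad sample is enough to degrade), while exit needs the full window, which is the asymmetry the paragraph above describes.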

AI authority should shrink as confidence falls

A clean degraded-mode design changes what the AI is allowed to do.

Mode | LLM/tool authority
Normal | May propose bounded tasks through command validation
Cautious | May propose only low-risk tasks; stricter confirmation rules
Degraded | May explain status and propose recovery checklists; no new risky actions
Assisted | May guide operator, summarize diagnostics, prepare commands for approval
Hold | May explain why motion is blocked
Safe stop | No physical authority; status explanation only

This is where degraded modes enforce the point from how to split authority between an LLM, ROS 2, and a microcontroller. As decisions get closer to actuation, timing, and safety, the AI should lose authority, not gain it.
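The authority table can be enforced as an allowlist that sits in front of every AI tool call. In this sketch the action names are hypothetical labels for the rows of the table, and an unknown mode or action receives no authority.

```python
from typing import Dict, FrozenSet

# Allowed AI action classes per mode; anything not listed is denied.
AI_AUTHORITY: Dict[str, FrozenSet[str]] = {
    "NORMAL": frozenset({"propose_bounded_task", "explain_status"}),
    "CAUTIOUS": frozenset({"propose_low_risk_task", "explain_status"}),
    "DEGRADED": frozenset({"propose_recovery_checklist", "explain_status"}),
    "ASSISTED": frozenset({"prepare_command_for_approval",
                           "summarize_diagnostics", "explain_status"}),
    "HOLD": frozenset({"explain_status"}),
    "SAFE_STOP": frozenset({"explain_status"}),
}


def ai_call_permitted(mode: str, action: str) -> bool:
    """Return True only if the action is explicitly allowed in this mode."""
    return action in AI_AUTHORITY.get(mode, frozenset())
```

A permitted proposal still has to pass command validation afterwards; this gate only decides whether the request is considered at all.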

Timing matters

A degraded-mode transition has a timing budget.

If a safety scanner heartbeat is missing, you might have tens of milliseconds before the MCU or safety relay enforces stop behavior. If localization confidence is drifting, you may have hundreds of milliseconds or seconds to slow down, hold, and relocalize. If GPU memory pressure affects a noncritical object classifier, you may have seconds to shed load without affecting motion.

Useful timing fields:

Field | Why it matters
Detection deadline | How quickly the trigger must be detected
Decision deadline | How quickly the mode manager must decide
Actuation deadline | How quickly the controller must reduce or stop motion
Recovery window | How long automatic recovery may run
Escalation timeout | When to require operator or safe stop
Stable-exit window | How long evidence must remain healthy before resuming

If a degraded mode cannot meet its timing budget, it is not a degraded mode. It is a wish.

This is why low-level timing-critical reactions should stay below the AI stack, as discussed in real-time Linux for robotics. The AI can help explain and plan recovery. It should not be in the critical stop path.
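A timing budget becomes testable once the first three deadlines are recorded next to measured values from a fault-injection run. The sketch below is illustrative only: the type name, the stage split, and the millisecond figures in the usage lines are assumptions, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageTimes:
    detection_ms: float
    decision_ms: float
    actuation_ms: float

    @property
    def total_ms(self) -> float:
        """End-to-end time from fault onset to reduced or stopped motion."""
        return self.detection_ms + self.decision_ms + self.actuation_ms


def meets_budget(budget: StageTimes, measured: StageTimes) -> bool:
    """True only if every measured stage is within its own deadline."""
    return (measured.detection_ms <= budget.detection_ms
            and measured.decision_ms <= budget.decision_ms
            and measured.actuation_ms <= budget.actuation_ms)


# Placeholder numbers for a localization-stale transition.
budget = StageTimes(detection_ms=100.0, decision_ms=20.0, actuation_ms=80.0)
measured = StageTimes(detection_ms=60.0, decision_ms=35.0, actuation_ms=40.0)
```

Checking each stage separately is a deliberate choice: in the placeholder run the total is under the overall budget, yet the decision stage overruns its deadline, and that is the stage to fix.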

Implementation pattern

The simplest implementation is a ROS 2 mode manager node with a small state machine.

It consumes:

  • lifecycle state,
  • diagnostic status,
  • watchdog heartbeats,
  • command-validation decisions,
  • sensor freshness metrics,
  • confidence scores,
  • safety scanner zones,
  • controller fault codes,
  • operator mode requests.

It publishes:

  • current operating mode,
  • allowed command classes,
  • active speed/force/workspace envelope,
  • transition reason,
  • recovery status,
  • audit events.

The core loop is ordinary software:

# collect_triggers, transition, stay, recovery_conditions_are_stable, and
# next_recovery_mode are helpers the mode manager has to provide.
def evaluate_mode(state, current_mode, policy):
    """Return the next mode decision; checks run from most to least severe."""
    triggers = collect_triggers(state, policy)

    # Hard safety triggers always win.
    if triggers.has("estop_active") or triggers.has("safety_relay_open"):
        return transition("SAFE_STOP", "hard_safety_trigger")

    # Without the MCU link, commands from ROS 2 cannot be trusted to arrive.
    if triggers.has("mcu_heartbeat_lost"):
        return transition("HOLD", "mcu_heartbeat_lost")

    if triggers.has("localization_stale"):
        return transition("DEGRADED_LOCALIZATION", "localization_stale")

    if triggers.has("gpu_latency_high"):
        return transition("DEGRADED_COMPUTE", "gpu_latency_high")

    # No active trigger: step back up only on stable recovery evidence.
    if current_mode != "NORMAL" and recovery_conditions_are_stable(state, policy):
        return transition(next_recovery_mode(current_mode), "stable_recovery")

    return stay(current_mode)

The policy should be explicit enough to test without the LLM running. You should be able to replay a rosbag, feed the mode manager the same state, and get the same transitions.
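The replay property can be checked with a small harness that feeds recorded states through the decision function and collects the transitions. In this sketch the recorded states are plain dictionaries standing in for data extracted from a rosbag, and `toy_evaluator` is a hypothetical stand-in for the real mode logic.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, float]
Evaluator = Callable[[State, str], str]


def toy_evaluator(state: State, current_mode: str) -> str:
    """Stand-in decision function: degrade on stale odometry, never auto-recover."""
    if state["odom_age_ms"] > 100.0:
        return "DEGRADED_LOCALIZATION"
    return current_mode


def replay(evaluate: Evaluator, states: List[State],
           start: str = "NORMAL") -> List[Tuple[int, str, str]]:
    """Return (sample index, previous mode, new mode) for every mode change."""
    mode = start
    transitions: List[Tuple[int, str, str]] = []
    for index, state in enumerate(states):
        new_mode = evaluate(state, mode)
        if new_mode != mode:
            transitions.append((index, mode, new_mode))
            mode = new_mode
    return transitions


recorded = [{"odom_age_ms": 20.0}, {"odom_age_ms": 150.0}, {"odom_age_ms": 30.0}]
first_run = replay(toy_evaluator, recorded)
second_run = replay(toy_evaluator, recorded)
```

If the two runs ever differ for the same input, the decision function depends on something outside the recorded state, such as wall-clock time or an LLM response, and that dependency should be removed or recorded.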

Test degraded modes before trusting them

A degraded mode that has never been tested is just documentation.

Minimum tests:

  • Kill a perception node while the robot is moving slowly.
  • Delay or drop odometry messages.
  • Inject stale TF.
  • Simulate GPU latency spikes.
  • Drop the MCU heartbeat.
  • Force low battery and thermal warning states.
  • Reject several LLM commands in a row.
  • Trigger lifecycle node error handling.
  • Replay the same bag and verify identical transitions.
  • Verify that the robot does not automatically return to normal without stable exit criteria.

The test output should include mode transition logs, command rejection logs, watchdog events, and final recovery state. If you cannot explain why the robot entered a mode, the mode manager is not observable enough.

Common mistakes

The first mistake is treating degraded mode as “slow mode.” Reduced speed is useful, but it is not enough. Some failures require removing entire command classes, not merely slowing them down.

Other common mistakes:

  • letting the LLM decide whether the robot is degraded,
  • returning to normal after one healthy sample,
  • using one generic degraded state for every fault,
  • failing open when diagnostics are unavailable,
  • allowing AI tool calls during safe stop,
  • logging only the final failure and not the first trigger,
  • depending on the same failing sensor in the fallback mode,
  • making teleoperation the fallback without checking communication quality,
  • giving the operator a bypass that is not logged,
  • designing mode transitions without timing budgets.

The hardest mistake to see in advance is shared dependency. If normal autonomy and degraded autonomy both require the same GPU-heavy perception model, then degraded mode may fail exactly when you need it.
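Shared dependencies can be caught at design time by listing what each mode needs and intersecting that with the failed components. The dependency sets and mode names below are invented for illustration; a real robot would derive them from its launch configuration or architecture documents.

```python
from typing import Dict, List, Set

# Hypothetical components each mode relies on.
MODE_DEPENDENCIES: Dict[str, Set[str]] = {
    "NORMAL": {"gpu", "vlm", "camera", "lidar", "wheel_odom", "mcu"},
    "DEGRADED_PERCEPTION": {"lidar", "wheel_odom", "mcu"},
    "LOW_SPEED_TELEOP": {"network", "wheel_odom", "mcu"},
    "HOLD": {"mcu"},
}


def usable_modes(failed: Set[str]) -> List[str]:
    """Modes whose dependencies do not include any failed component."""
    return sorted(mode for mode, deps in MODE_DEPENDENCIES.items()
                  if not deps & failed)
```

With these sets, a GPU failure still leaves three usable modes, while an MCU failure leaves none, which signals that the stop behavior for that fault has to live on the MCU itself.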

FAQ

Is degraded mode the same as safe stop?

No. Safe stop removes or controls hazardous behavior when continued operation is not acceptable. A degraded mode preserves a smaller safe capability before safe stop is necessary.

Should an AI agent be allowed to choose degraded mode?

It may recommend or explain a mode, but the actual transition should come from deterministic supervisor logic based on robot state, diagnostics, and policy.

What is the first degraded mode to implement?

Start with stale localization or perception-confidence degradation. These failures are common, measurable, and directly tied to unsafe autonomy.

Does every robot need many degraded modes?

No. A small robot may need only normal, low-speed assisted, hold, and safe stop. A field robot, mobile manipulator, or human-adjacent system usually needs more specific modes.

How does this relate to command validation?

Command validation checks a proposed command. Mode management controls which command classes are allowed under the current robot condition. They should work together.

What should be logged?

Log the previous mode, new mode, trigger enum, relevant sensor ages/confidence values, command IDs blocked by the transition, watchdog state, operator ID if involved, and recovery result.

Final opinion

The real test of an AI-enabled robot is not how impressive it looks in normal mode.

It is what it does when confidence falls.

Good degraded modes make the robot less magical and more trustworthy. They remove authority from uncertain AI components, preserve the simplest safe capability, keep humans informed, and make every transition auditable. That is the kind of engineering boundary Physical AI needs before it should be trusted around real machines.