What VLA Models Still Cannot Do Safely in Robotics

Vision-language-action models are one of the most important ideas in Physical AI.

They connect perception, language, and robot actions in one learned policy. That is a real step forward. A robot that can look at a scene, understand an instruction, and produce an action sequence is a different class of system from a scripted state machine with a few perception nodes bolted on.

But a VLA model is not a safety architecture.

The useful question for robotics engineers, AI engineers, founders, and CTOs is not “will VLA models work?”

They already work in constrained demonstrations and are improving quickly.

The useful question is:

What should a VLA model still not be allowed to do by itself on a real robot?

My answer: a VLA model should not own command authority, safety limits, actuator timing, workspace admission, recovery behavior, or final responsibility for physical motion. It can propose skills, select goals, interpret scenes, and help generalize behavior, but a separate control and safety stack must validate, supervise, bound, and sometimes reject its outputs before anything reaches actuators.

This article takes the safety boundary argued in why LLMs should not control motors and robots and extends it into the VLA era. It also connects to three related pieces: the broader Physical AI framing in what Physical AI really means, the older VLA overview in how VLA models are changing robotics, and the authority model in splitting responsibility between an LLM, ROS 2, and a microcontroller.

Key takeaways

  • VLA models are promising robot policies, but they should be treated as proposal engines unless the deployment has strict safety envelopes, validation gates, and runtime supervision.
  • The main safety gap is not language understanding. It is command authority under uncertainty: partial observability, distribution shift, contact dynamics, timing, tool use, and recovery.
  • A VLA should not directly own E-stops, safety-rated limits, actuator control loops, collision envelopes, or degraded-mode transitions.
  • Safe deployment needs a layered architecture: VLA inference, skill registry, command validator, planner/controller, runtime supervisor, watchdogs, and hardware-level failsafes.
  • Benchmarking a VLA for robotics should include negative tests: unseen objects, occlusion, ambiguous instructions, stale perception, forbidden zones, moving humans, tool misuse, and recovery from failed grasps.
  • The durable artifact is a VLA safety gate: a matrix that defines what the model may propose, what deterministic software must validate, and what hardware or real-time control must own.

Citation-ready answer

Vision-language-action models should not be trusted as complete robot safety systems. A VLA can map visual observations and language instructions into robot actions, but safe robotics deployment still requires deterministic command validation, workspace limits, collision checking, runtime supervision, watchdogs, degraded modes, real-time control, and hardware failsafes outside the model. The practical rule is simple: let the VLA propose goals or skills; let the robot safety and control stack decide what is physically admissible.

What a VLA actually changes

A VLA model tries to unify three things:

vision observation
-> language/task context
-> action representation

The key research insight is that robot actions can be represented in a format that a large model can learn alongside vision and language. The RT-2 paper describes this direction clearly: actions are expressed as tokens so a vision-language model can be co-trained on web-scale vision-language data and robot trajectory data.

That matters because it gives the robot policy more semantic grounding. The model may generalize better to new objects, instructions, and scene concepts than a narrow imitation policy trained only on one task.

The Open X-Embodiment work pushes in the same direction from the data side: more robots, more embodiments, more tasks, more demonstrations. OpenVLA then makes the open-source question more practical by training a 7B-parameter VLA on a large collection of real-world robot demonstrations and releasing code and checkpoints.

This is important progress.

It still does not make the learned policy the safety boundary.

The safety problem is authority under uncertainty

Robots fail differently from chatbots.

A bad text answer can be corrected. A bad robot command can hit a fixture, break a gripper, pinch a cable, drop a tool, damage a part, or enter a human workspace.

The hard part is not only “does the VLA understand the instruction?”

The hard part is:

Does this command remain safe under the current physical state,
sensor uncertainty, timing delay, contact condition, actuator limit,
workspace boundary, and recovery path?

A VLA is not naturally good at owning that entire question. It may infer intent from vision and language, but physical admissibility depends on state estimation, calibration, collision geometry, control-loop timing, force limits, tool constraints, safety zones, and hardware health.

Those are system responsibilities.

The VLA safety boundary

Use this as the minimum architecture for a robot that uses a VLA near real hardware:

camera / sensors
-> perception preprocessor
-> VLA model
-> proposed goal / skill / action tokens
-> command validator
     -> schema check
     -> workspace check
     -> collision check
     -> speed / force / torque limits
     -> tool and object constraints
     -> freshness and confidence checks
-> planner / controller
-> runtime supervisor
     -> watchdogs
     -> safety envelope
     -> degraded mode manager
     -> stop / hold / retreat policy
-> robot hardware

The VLA is in the proposal path.

It is not the final authority.

That distinction is the same pattern I use in evaluating a local LLM for robotics tool use: the model may select or parameterize a tool, but a deterministic layer must decide whether the request is admissible.
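
To make the proposal path concrete, here is a minimal sketch of a deterministic command validator. It is an illustration under stated assumptions, not a reference implementation: the `Proposal` fields, the skill names, and every numeric limit are hypothetical, and collision checking is left out because it needs a real geometry model.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Every name and limit here is hypothetical and chosen only for illustration.
ALLOWED_SKILLS = {"pick", "place", "retreat"}
WORKSPACE_MIN = (0.0, -0.4, 0.0)   # metres, robot base frame
WORKSPACE_MAX = (0.8, 0.4, 0.6)
MAX_SPEED_MPS = 0.25
MAX_FRAME_AGE_S = 0.2


@dataclass(frozen=True)
class Proposal:
    skill: str
    target_xyz: Tuple[float, float, float]
    speed_mps: float
    observed_at: float  # timestamp of the camera frame the model used, seconds


def validate(p: Proposal, now: float) -> Optional[str]:
    """Return None if the proposal is admissible, else the rejection reason."""
    if p.skill not in ALLOWED_SKILLS:                 # schema check
        return "unknown skill"
    if len(p.target_xyz) != 3:
        return "malformed target"
    inside = all(
        lo <= v <= hi
        for v, lo, hi in zip(p.target_xyz, WORKSPACE_MIN, WORKSPACE_MAX)
    )
    if not inside:                                    # workspace check
        return "target outside workspace"
    if not (0.0 < p.speed_mps <= MAX_SPEED_MPS):      # speed limit
        return "speed outside limit"
    # Freshness check; also rejects future-dated or NaN timestamps.
    if not (0.0 <= now - p.observed_at <= MAX_FRAME_AGE_S):
        return "stale observation"
    return None
```

The validator never calls the model and never asks it to explain itself. It returns a reason string so that each rejection can be logged and replayed later.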

What VLA models should not own

Responsibility | Why the VLA should not own it | Safer owner
Emergency stop | Must work independently of model behavior and compute health | Hardware safety circuit or safety PLC
Joint-level control loop | Requires deterministic timing and stability | Real-time controller or microcontroller
Collision envelope | Needs geometry, limits, and conservative checks | Motion planner and safety supervisor
Workspace admission | Must block forbidden zones even if language is persuasive | Command validator
Tool authority | Tool misuse can create physical risk | Skill registry plus policy gate
Degraded-mode transition | Depends on sensor confidence, health, and fault state | Runtime supervisor
Recovery after contact | Requires force, timing, and local state | Controller and supervisor
Human proximity rule | Must be conservative and sensor-driven | Safety-rated or supervised perception layer
Audit and incident trace | Needs replayable system events | Observability and safety log stack

This is not anti-VLA.

It is pro-robot.

A VLA becomes more useful when it is surrounded by systems that let it be creative where creativity helps and conservative where physics matters.

Failure modes that still matter

A VLA safety review should start with failure modes, not demo videos.

Failure mode | Example | Required control
Ambiguous instruction | “Put that over there” with two possible targets | Ask for clarification or restrict to low-risk motion
Object confusion | Similar tools, labels, colors, or partially occluded objects | Perception confidence threshold and target verification
Distribution shift | New lighting, camera angle, gripper, surface, or fixture | Scenario evals and fallback to manual/known skill
Unsafe affordance | Model sees a handle but the object is hot, sharp, powered, or fragile | Object risk metadata and tool constraints
Workspace violation | Proposed path crosses a human zone or keep-out volume | Motion planning and geofence rejection
Timing staleness | Camera frame or TF transform is too old | Freshness gate before command execution
Contact surprise | Grasp slips, object moves, tool catches | Force/torque monitoring and abort policy
Overconfident recovery | Model keeps trying after repeated failure | Retry budget and degraded-mode transition
Hidden state | Door latched, cable attached, fixture locked | State-check skill or human confirmation
Instruction injection | Visual marker or user text requests unsafe behavior | Instruction authority separation

The important pattern is that many failures are not solved by a larger model alone. They require external state, physical constraints, conservative admission rules, and recovery policies.

Command granularity decides risk

Not all VLA outputs have the same risk.

There is a large safety difference between these three outputs:

high-level goal: "sort the blue blocks into the left tray"
skill call: pick_and_place(object=blue_block, target=left_tray)
actuator stream: joint velocities at 100 Hz

The closer the VLA gets to actuator-level output, the more it owns timing, stability, and contact behavior. That is the dangerous direction for most teams.

For practical robotics projects, I prefer this boundary:

VLA output level | Production recommendation
Scene description | Usually safe if not used directly as authority
Goal proposal | Useful with validation and human/operator override
Skill selection | Useful if skills are typed, bounded, and logged
Skill parameterization | Acceptable only with strict schema and physical checks
Trajectory proposal | Requires planner validation, collision checking, and simulation
Direct actuator command | Avoid outside controlled research or very narrow certified stacks

ROS 2 actions are often a good boundary for long-running robot skills because they support goals, feedback, results, and cancellation. The ROS 2 action design explicitly models an action server that may accept or reject goals, provide feedback, and handle cancel requests. That is the kind of interface a VLA should face, not a raw motor bus.

This design also separates the requester from the executor. Let the VLA request a bounded action. Let the action server, planner, and supervisor decide whether that goal can execute.
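
The accept-or-reject decision can be sketched without any ROS dependency. The example below is a plain-Python stand-in, assuming a hypothetical skill registry: `GoalResponse` here is a local enum and not the rclpy type, and the skill name, parameter values, risk tier scale, and mode names are invented for illustration. In a real ROS 2 system this logic would typically sit in the action server's goal callback.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, FrozenSet, Tuple


class GoalResponse(Enum):  # local stand-in, not the rclpy type
    REJECT = "reject"
    ACCEPT = "accept"


@dataclass(frozen=True)
class SkillSpec:
    name: str
    params: Dict[str, Tuple[str, ...]]  # parameter name -> allowed values
    risk_tier: int                      # hypothetical scale: 1 = low, 3 = high
    allowed_modes: FrozenSet[str]


# Hypothetical registry with a single typed, bounded skill.
REGISTRY = {
    "pick_and_place": SkillSpec(
        name="pick_and_place",
        params={
            "object": ("blue_block", "red_block"),
            "target": ("left_tray", "right_tray"),
        },
        risk_tier=1,
        allowed_modes=frozenset({"supervised_autonomy", "autonomous_bounded_task"}),
    ),
}


def handle_goal(skill: str, args: Dict[str, str], mode: str) -> GoalResponse:
    """Decide whether a model-requested skill call may become a goal."""
    spec = REGISTRY.get(skill)
    if spec is None or mode not in spec.allowed_modes:
        return GoalResponse.REJECT
    if set(args) != set(spec.params):   # no missing and no extra parameters
        return GoalResponse.REJECT
    if any(args[k] not in allowed for k, allowed in spec.params.items()):
        return GoalResponse.REJECT
    return GoalResponse.ACCEPT
```

With this shape, `pick_and_place(object=blue_block, target=left_tray)` is accepted in a supervised mode, while an unknown skill, an unlisted object, or an extra parameter is rejected before any planner sees it.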

Lifecycle state belongs outside the model

A robot that uses a VLA needs explicit operating modes:

  • disabled
  • calibration
  • manual
  • supervised autonomy
  • autonomous bounded task
  • degraded mode
  • fault hold
  • emergency stop

The model should not silently move the system between these states.

The ROS 2 managed node lifecycle gives a useful software pattern: nodes can be unconfigured, inactive, active, or finalized, with transitions exposed to a supervisory process. The same idea applies at the robot architecture level. VLA inference can be active while robot motion remains inactive. A perception stack can be healthy while an actuator path is fault-held. A planner can accept goals while the supervisor blocks execution.

That separation is not bureaucracy. It is how you avoid one fluent model output becoming system-wide authority.
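
One way to keep mode changes out of the model's hands is an explicit transition table that only named authorities may drive. The sketch below assumes a hypothetical table and hypothetical requester names ("supervisor", "operator", "vla"). It mirrors state in software only; the physical E-stop must still act in hardware regardless of what this code does.

```python
from enum import Enum


class Mode(Enum):
    DISABLED = "disabled"
    CALIBRATION = "calibration"
    MANUAL = "manual"
    SUPERVISED = "supervised autonomy"
    AUTONOMOUS = "autonomous bounded task"
    DEGRADED = "degraded mode"
    FAULT_HOLD = "fault hold"
    ESTOP = "emergency stop"


# Hypothetical transition table; a real one comes from the risk assessment.
ALLOWED = {
    Mode.DISABLED: {Mode.CALIBRATION, Mode.MANUAL},
    Mode.CALIBRATION: {Mode.DISABLED, Mode.MANUAL},
    Mode.MANUAL: {Mode.DISABLED, Mode.SUPERVISED},
    Mode.SUPERVISED: {Mode.MANUAL, Mode.AUTONOMOUS, Mode.DEGRADED},
    Mode.AUTONOMOUS: {Mode.SUPERVISED, Mode.DEGRADED},
    Mode.DEGRADED: {Mode.MANUAL, Mode.FAULT_HOLD},
    Mode.FAULT_HOLD: {Mode.DISABLED},
    Mode.ESTOP: {Mode.DISABLED},
}
TRANSITION_AUTHORITIES = {"supervisor", "operator"}  # the model is not listed


class ModeManager:
    def __init__(self) -> None:
        self.mode = Mode.DISABLED

    def request(self, target: Mode, requester: str) -> bool:
        """Apply a transition if the requester and the table both allow it."""
        if requester not in TRANSITION_AUTHORITIES:
            return False
        # Stop states are reachable from every mode.
        always_reachable = target in (Mode.FAULT_HOLD, Mode.ESTOP)
        if always_reachable or target in ALLOWED[self.mode]:
            self.mode = target
            return True
        return False
```

A request from "vla" is refused in every state, so model output can influence which skill is proposed but never which mode the robot is in.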

A practical VLA deployment matrix

Before deploying a VLA near a real robot, fill this matrix.

Layer | Allowed VLA influence | Hard boundary
Perception | Interpret scene, describe objects, suggest target candidates | Cannot override sensor validity or calibration state
Task planning | Propose task sequence or skill choice | Cannot bypass skill registry
Skill parameters | Suggest object, pose, tray, speed class, tool | Must pass schema, workspace, object, and freshness checks
Motion planning | Provide goal constraints, not raw motion authority | Planner owns collision and kinematic feasibility
Control | No direct control-loop authority | Controller owns timing, stability, limits
Safety | No ownership of E-stop, stop category, or safety-rated boundary | Safety system owns stop and interlock
Recovery | Suggest next step after failure | Supervisor owns retry budget and degraded mode
Audit | Provide explanation | Logs own evidence and replay

If a row says “the VLA owns this alone,” the design is probably too optimistic.
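
The matrix can also live in the repository as data, so a design review can check it mechanically. This is a small sketch under that assumption; the `Row` type and the `too_optimistic` helper are hypothetical names, and the owner strings are shortened from the table above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Row:
    layer: str
    vla_influence: str
    owner: str  # component that holds the hard boundary


MATRIX = [
    Row("Perception", "suggest target candidates", "sensor validity and calibration checks"),
    Row("Task planning", "propose skill choice", "skill registry"),
    Row("Skill parameters", "suggest object, pose, speed class", "command validator"),
    Row("Motion planning", "provide goal constraints", "planner"),
    Row("Control", "none", "controller"),
    Row("Safety", "none", "safety system"),
    Row("Recovery", "suggest next step", "supervisor"),
    Row("Audit", "provide explanation", "logs"),
]

# Owner values that mean the model holds the boundary by itself.
MODEL_OWNERS = {"", "vla", "model"}


def too_optimistic(rows: List[Row]) -> List[str]:
    """Return the layers where no component outside the model owns the boundary."""
    return [r.layer for r in rows if r.owner.strip().lower() in MODEL_OWNERS]
```

An empty result does not prove the design is safe. It only confirms that every layer names an owner other than the model.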

The benchmark plan that matters

Most VLA demos test success on intended tasks.

Production readiness needs tests for wrongness.

Run at least these test classes:

Test class | What to measure | Pass condition
Known task success | Can the model complete the intended skill? | Meets task-specific success rate
Ambiguity handling | Does it ask or choose safely when the request is unclear? | No unsafe default action
Forbidden-zone rejection | Does the system reject goals outside the workspace? | 100% rejection
Stale perception | Does it execute with old frames or transforms? | No execution past freshness limit
Occlusion and clutter | Does it confuse targets under partial visibility? | Conservative failure or clarification
Human intrusion | Does motion stop or hold when a person enters the zone? | Stop/hold within defined limit
Repeated failure | Does it keep trying after bad grasps or blocked motion? | Retry budget enforced
Tool misuse | Does it call the wrong skill or unsafe tool? | Skill registry blocks invalid calls
Recovery | Does it retreat, hold, or degrade safely? | Supervisor selects safe state
Audit replay | Can you reconstruct the decision chain? | Logs include observation, proposal, validation, execution result

The benchmark should include adversarial and boring failures. The boring failures are the ones that happen in real deployments: lighting changes, loose calibration, an operator moving a part, a cable in the way, a gripper pad wearing out, or a fixture moved by 20 millimeters.
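
Scoring negative tests is mostly bookkeeping, which is a good reason to automate it. The sketch below assumes a hypothetical `Trial` record with a ground-truth flag for whether the request should have been refused. The test class labels are illustrative, and the default threshold of 1.0 matches the 100% rejection condition in the table.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Trial:
    test_class: str
    should_reject: bool  # ground truth: the request was inadmissible
    executed: bool       # what the full stack actually did


def rejection_rate(trials: List[Trial], test_class: str) -> float:
    """Fraction of inadmissible requests in a test class that were not executed."""
    relevant = [t for t in trials if t.test_class == test_class and t.should_reject]
    if not relevant:
        raise ValueError(f"no negative trials recorded for {test_class!r}")
    rejected = sum(1 for t in relevant if not t.executed)
    return rejected / len(relevant)


def passes(trials: List[Trial], test_class: str, required: float = 1.0) -> bool:
    """True if the rejection rate meets the required threshold."""
    return rejection_rate(trials, test_class) >= required
```

Raising an error when a class has no negative trials is deliberate. A test class that was never exercised should not count as a pass.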

Runtime supervision is not optional

The runtime supervisor is the layer that asks:

Given the robot's current state, is this command still safe now?

It should check:

  • command age
  • sensor freshness
  • TF transform freshness
  • robot mode
  • actuator health
  • collision state
  • speed and force limits
  • workspace boundary
  • retry count
  • human proximity
  • stop input state

This is close to the architecture in designing degraded modes for AI-enabled robots. A robot should not have only “working” and “failed.” It should have explicit reduced-capability modes where risky functions are disabled while safe diagnostic or manual operations remain available.

For VLA systems, degraded mode is especially important because the AI layer can fail softly. It may still produce plausible outputs when perception is stale, confidence is low, or the task has drifted outside training distribution.

Plausible is not safe.
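
The supervisor checks listed above can be sketched as one pure function over a state snapshot. Everything named here is an assumption for illustration: the `Snapshot` fields, the age limits, the retry limit, and the mode names. A real supervisor would read these values from the robot's own state sources.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical limits and mode names; real values come from the risk assessment.
MAX_AGE_S = {"command": 0.5, "sensor": 0.2, "tf": 0.2}
MAX_RETRIES = 3
MOTION_MODES = {"supervised autonomy", "autonomous bounded task"}


@dataclass(frozen=True)
class Snapshot:
    now: float
    command_stamp: float
    sensor_stamp: float
    tf_stamp: float
    mode: str
    actuators_healthy: bool
    in_collision: bool
    within_limits: bool     # speed and force inside configured limits
    inside_workspace: bool
    retry_count: int
    human_in_zone: bool
    stop_input_active: bool


def blocking_reasons(s: Snapshot) -> List[str]:
    """Return every reason motion must not proceed; empty means admissible now."""
    reasons = []
    stamps = {"command": s.command_stamp, "sensor": s.sensor_stamp, "tf": s.tf_stamp}
    for name, stamp in stamps.items():
        # Rejects old, future-dated, and NaN timestamps.
        if not (0.0 <= s.now - stamp <= MAX_AGE_S[name]):
            reasons.append(f"{name} stale")
    if s.mode not in MOTION_MODES:
        reasons.append("mode forbids motion")
    if not s.actuators_healthy:
        reasons.append("actuator fault")
    if s.in_collision:
        reasons.append("collision state")
    if not s.within_limits:
        reasons.append("speed or force limit")
    if not s.inside_workspace:
        reasons.append("workspace boundary")
    if s.retry_count >= MAX_RETRIES:
        reasons.append("retry budget exhausted")
    if s.human_in_zone:
        reasons.append("human in zone")
    if s.stop_input_active:
        reasons.append("stop input active")
    return reasons
```

Returning all reasons instead of the first one makes the log more useful after an incident, and none of the checks depends on how plausible the model's output looked.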

What good looks like

A VLA-enabled robot is much closer to production when these statements are true:

  • The VLA can only propose goals, skills, or bounded parameters.
  • Every skill has a declared schema, owner, risk tier, and allowed mode.
  • The command validator can reject unsafe outputs without asking the model.
  • The planner checks kinematics, collision, and workspace boundaries.
  • The controller owns real-time timing and actuator limits.
  • The supervisor owns operating mode, degraded mode, retries, stop, and recovery.
  • The hardware safety path still works if the model, GPU, network, ROS graph, or host computer fails.
  • Logs can reconstruct perception input, VLA proposal, validation decision, action execution, and stop reason.
  • Benchmarks include unsafe, ambiguous, stale, occluded, and out-of-distribution scenarios.

That is the difference between a VLA demo and a VLA robot architecture.

FAQ

Are VLA models unsafe by definition?

No. They become unsafe when treated as complete control and safety systems. A VLA can be valuable when it is constrained to propose goals or skills and when deterministic systems validate physical admissibility.

Can a VLA directly output robot actions?

In research, yes. In production, direct actuator authority is usually the wrong boundary. A safer architecture converts VLA outputs into bounded goals or skill requests, then uses planners, controllers, supervisors, and hardware safety systems to decide what can actually move.

What is the biggest gap between VLA demos and deployed robots?

The biggest gap is not object recognition. It is robust behavior under physical uncertainty: partial observability, calibration drift, contact dynamics, humans entering the workspace, stale sensor data, tool misuse, and recovery from failed actions.

Should VLA systems use ROS 2 actions?

Often, yes. ROS 2 actions are a good interface for long-running robot skills because they support goal submission, feedback, result handling, and cancellation. They are a better boundary for AI-generated skill requests than raw topics carrying motor-level commands.

How should a team start testing a VLA safely?

Start offline with logged scenes and simulated commands. Then use a shadow mode where the VLA proposes actions but the robot does not execute them. Compare proposals against human/operator decisions and deterministic validators. Only then allow execution of low-risk, bounded skills with hard workspace limits and immediate stop paths.
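
Shadow mode produces a log that is easy to summarize. As a minimal sketch, assuming a hypothetical `ShadowRecord` with the skill the VLA proposed, the skill the operator chose, and the validator's verdict, two numbers already say a lot:

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ShadowRecord:
    vla_skill: str        # what the model proposed (never executed)
    operator_skill: str   # what the human operator actually chose
    validator_ok: bool    # whether the deterministic validator admitted the proposal


def shadow_summary(records: List[ShadowRecord]) -> Dict[str, float]:
    """Summarize agreement with the operator and the validator rejection rate."""
    n = len(records)
    if n == 0:
        raise ValueError("no shadow records to summarize")
    agree = sum(1 for r in records if r.vla_skill == r.operator_skill)
    rejected = sum(1 for r in records if not r.validator_ok)
    return {"agreement": agree / n, "validator_rejections": rejected / n}
```

What counts as good enough on either number is a decision for the team and the risk assessment. The sketch only makes the comparison repeatable.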

What should stay outside the model forever?

Emergency stop, safety-rated interlocks, actuator control loops, workspace hard limits, watchdogs, and final stop authority should stay outside the VLA. These are robot safety responsibilities, not language-model responsibilities.