What Latency Budget Really Means in a Voice-Controlled Robot

A voice-controlled robot does not fail only when speech recognition is wrong.

It fails when the whole command loop takes longer than the physical situation can tolerate.

That loop includes wake word detection, voice activity detection, audio buffering, speech-to-text, intent parsing, safety validation, ROS 2 goal dispatch, actuator admission, and user feedback. Each stage can be individually “fast enough” while the combined system still feels sluggish, unsafe, or impossible to debug.

The engineering question is not:

Can the robot transcribe speech locally?

It is:

Can the robot turn a spoken command into a validated physical action before the command becomes stale, ambiguous, or unsafe?

That is what a latency budget is for. It is a contract that says how much time each layer may spend before the command is rejected, downgraded, clarified, or executed.

This expands the local voice stack I covered in running whisper.cpp with CUDA on Jetson Orin Nano Super, running Piper TTS on Jetson, and wake word setup for Whisper-based voice interfaces. Those pieces are useful, but a robot needs an end-to-end timing model before voice becomes a trustworthy control surface.

Key takeaways

A voice robot latency budget is an end-to-end timing contract from microphone input to accepted, rejected, or executed robot action.
The budget must separate interaction latency from command authority latency. A spoken reply can be slightly late; unsafe motion admission cannot.
Wake word, VAD, ASR, LLM parsing, safety validation, ROS 2 action dispatch, and actuator control should each have a deadline and a stale-state policy.
The LLM should not own the timing-critical path. It can parse, explain, and propose intent, but the robot supervisor must validate freshness, authority, mode, and safety envelope.
Use ROS 2 actions for long-running robot tasks, QoS deadlines/liveliness for freshness-sensitive streams, and a local controller or microcontroller for hard real-time actuation.
A good voice interface says “I did not execute that because…” quickly instead of silently queuing old commands.

Citation-ready answer

A latency budget for a voice-controlled robot is the maximum allowed time for each stage between audio capture and physical action: wake word, voice activity detection, ASR, intent parsing, command validation, ROS 2 dispatch, controller admission, and feedback. The budget prevents stale spoken commands from reaching actuators after the scene, robot mode, or operator intent has changed. In a safe architecture, the LLM may interpret language, but a deterministic supervisor validates deadlines, confidence, robot state, safety envelope, and actuator authority before any motion goal is accepted.

Voice control is not one latency number

People often ask whether a local voice robot is “real time.”

That is the wrong question.

A voice interface contains several timing domains:

Timing domain	Typical concern	Failure if ignored
Human interaction	Does the robot feel responsive?	The operator repeats commands or loses trust
Speech segmentation	Did VAD cut audio at the right time?	Missing words, late ASR, accidental wakeups
Language inference	Did ASR and the LLM finish quickly enough?	Intent arrives after context changed
Robot supervision	Is the command still safe in the current mode?	Stale or unauthorized action reaches ROS 2
Motion execution	Can controllers meet loop timing?	Jitter, overshoot, missed watchdogs
Feedback	Did the user hear what happened?	The operator assumes execution when it was rejected

Only one of those is about the user “feeling” speed. The rest are about freshness, authority, and physical safety.

This is why voice robots should not be designed as a single pipeline:

microphone -> speech-to-text -> LLM -> robot command -> motor

That architecture hides the important timing contracts. A safer model is:

microphone
  -> wake/VAD gate
  -> ASR transcript with timestamp and confidence
  -> intent parser
  -> command validator
  -> ROS 2 action or service boundary
  -> supervisor / mode manager
  -> controller / microcontroller
  -> actuator

The transcript is not the command. The command is not the action. The action is not the actuator output.

Each boundary needs a deadline.

A practical end-to-end latency budget

The numbers below are not universal. They are a starting point for a small mobile robot, arm demo, inspection robot, or local Jetson-based Physical AI prototype where speech is a human interface, not the final control loop.

Stage	Budget target	Timeout policy	Notes
Wake word or push-to-talk gate	50-200 ms	Ignore if uncertain	Push-to-talk is safer for early prototypes
Voice activity detection	100-300 ms after speech end	Ask user to repeat	Too much hangover makes the whole system feel late
Audio chunking / preprocessing	20-80 ms	Drop incomplete chunk	Keep sample rate and buffer size explicit
Local ASR	300-1500 ms	Return “not understood”	Model size and GPU load dominate here
Transcript normalization	10-50 ms	Reject malformed transcript	Remove filler, preserve original text for logs
Intent parsing / LLM call	200-2000 ms	Ask clarification or reject	The LLM output must be structured, not free text
Command validation	5-50 ms	Reject by default	Check mode, authority, workspace, freshness, rate limits
ROS 2 goal dispatch	10-100 ms	Fail closed	Actions fit long-running goals better than topics
Supervisor admission	5-50 ms	Hold, degrade, or reject	The supervisor owns robot mode and safety envelope
Low-level control	1-10 ms loop period	Watchdog stop	This belongs below the LLM and usually below Linux, on a local controller or microcontroller
User feedback	100-800 ms	Say accepted, rejected, or waiting	Feedback is part of safety, not decoration

The important column is not the target. It is the timeout policy.

Every stage should answer this question:

If this stage is late, what is the robot still allowed to do?
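One way to make that question concrete is to store each stage's budget next to its timeout policy. The sketch below copies the ranges and policies from the table above for the command path only; the `StageBudget` class, the stage identifiers, and the policy strings are illustrative names, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageBudget:
    name: str
    min_ms: int      # optimistic end of the budget range
    max_ms: int      # deadline: past this, the timeout policy applies
    on_timeout: str  # what the robot is still allowed to do


# Ranges and policies copied from the table above (command path only;
# low-level control and user feedback run on their own clocks).
COMMAND_PATH = [
    StageBudget("wake_gate", 50, 200, "ignore"),
    StageBudget("vad", 100, 300, "ask_repeat"),
    StageBudget("audio_preprocessing", 20, 80, "drop_chunk"),
    StageBudget("asr", 300, 1500, "not_understood"),
    StageBudget("normalization", 10, 50, "reject"),
    StageBudget("intent_parsing", 200, 2000, "clarify_or_reject"),
    StageBudget("validation", 5, 50, "reject"),
    StageBudget("ros2_dispatch", 10, 100, "fail_closed"),
    StageBudget("supervisor_admission", 5, 50, "hold_degrade_or_reject"),
]


def outcome(stage: StageBudget, elapsed_ms: float) -> str:
    """Return 'continue' if the stage met its deadline, else its timeout policy."""
    return "continue" if elapsed_ms <= stage.max_ms else stage.on_timeout


best_case_ms = sum(s.min_ms for s in COMMAND_PATH)
worst_case_ms = sum(s.max_ms for s in COMMAND_PATH)
print(best_case_ms, worst_case_ms)  # 700 4330
```

Summing the table's ranges gives roughly 0.7 s in the best case and 4.3 s if every stage uses its full budget. Per-stage deadlines alone therefore do not guarantee a fresh command; the end-to-end age still has to be checked before dispatch.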

For a voice command like “come here,” a 1.5 second ASR delay may be acceptable if the robot is stationary and the scene is static. The same delay is unacceptable if the robot is already moving near people, because the command may refer to a human position that changed.

For a command like “stop,” the budget is different. Emergency stop should not depend on ASR, an LLM, Wi-Fi, or TTS. It needs a direct input path and a safety-rated or at least deterministic stop path. Voice can request a normal software stop, but it should not be the only stop mechanism. I made this distinction in robot safety architecture.

Separate conversation latency from authority latency

A common design mistake is to optimize the wrong latency.

For example, a robot might speak back in 400 ms, which feels good, while the motion command is still waiting behind ASR cleanup, an LLM retry, a ROS 2 service timeout, and a safety validator. The operator hears confidence, but the robot has not actually accepted the action.

The opposite is also dangerous. The robot might execute quickly but delay the spoken confirmation, leaving the operator unsure whether the command was accepted, rejected, or queued.

I would split the system into two clocks:

Clock	What it measures	Hard rule
Interaction clock	Time until the user receives useful feedback	Reply quickly with state: listening, thinking, accepted, rejected, clarifying
Authority clock	Time until a physical command is admitted or rejected	Never execute if the command, scene, mode, or authorization is stale

The interaction clock can stream partial feedback:

“Listening.”
“I heard: inspect the left wheel.”
“Checking if that is safe.”
“Rejected: the robot is in assisted mode.”

The authority clock should be stricter:

The transcript timestamp must be fresh.
The robot state estimate must be fresh.
The operator identity or session must still be valid.
The safety envelope must allow the requested action.
The robot mode must allow that authority level.
The action goal must be cancelable or bounded.

That is the same authority separation I described in how to split authority between an LLM, ROS 2, and a microcontroller. Voice input does not change the rule. It only makes stale intent easier to hide.

Where ROS 2 fits in the budget

ROS 2 is not a speech framework. It is the robot communication and execution boundary.

That matters because a spoken command should usually become one of these:

Spoken intent	ROS 2 interface	Why
“What is your battery level?”	Service or topic read	Short query, no long-running work
“Start inspection route A”	Action	Long-running task with feedback and cancellation
“Stop navigation”	Action cancel or supervisor command	Needs preemption, not a queued topic
“Set speed to slow”	Parameter/service plus supervisor validation	Changes operating envelope
“Move forward a little”	Bounded action, not raw velocity	Needs duration, distance, frame, speed limit
“Follow me”	Mode transition plus action	Changes authority and perception requirements

The ROS 2 action model fits these commands. The official ROS 2 documentation describes actions as intended for long-running tasks that provide feedback and can be canceled or preempted, which maps well to voice commands such as navigation, inspection, docking, and manipulation.
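The mapping in the table can be kept as an explicit allow-list so that an unknown intent never reaches the ROS graph. This is a minimal sketch: the intent names and the `Interface` enum are hypothetical, and a real system would bind each entry to a concrete action, service, or parameter client.

```python
from enum import Enum
from typing import Optional


class Interface(Enum):
    SERVICE_OR_TOPIC_READ = "service_or_topic_read"
    ACTION = "action"
    ACTION_CANCEL = "action_cancel"
    PARAMETER_SERVICE = "parameter_service"
    MODE_TRANSITION_PLUS_ACTION = "mode_transition_plus_action"


# Hypothetical intent names; the interface choices follow the table above.
INTENT_ROUTES = {
    "query_battery": Interface.SERVICE_OR_TOPIC_READ,
    "start_inspection_route": Interface.ACTION,
    "stop_navigation": Interface.ACTION_CANCEL,
    "set_speed_profile": Interface.PARAMETER_SERVICE,
    "move_relative": Interface.ACTION,  # bounded goal, never raw velocity
    "follow_operator": Interface.MODE_TRANSITION_PLUS_ACTION,
}


def route(intent: str) -> Optional[Interface]:
    """Return the interface for a known intent, or None so the caller rejects."""
    return INTENT_ROUTES.get(intent)
```

Returning `None` for anything outside the table keeps the default behavior at "reject" rather than "guess an interface".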

QoS also belongs in the budget. ROS 2 QoS policies include reliability, deadline, lifespan, liveliness, and lease duration. For a voice robot, those are not abstract middleware knobs:

`deadline` helps detect when a freshness-sensitive stream is late.
`lifespan` prevents old messages from being treated as current.
`liveliness` helps detect that a publisher or node is no longer healthy.
Sensor topics may prefer timely best-effort delivery over a reliable backlog.
Command and safety state topics usually need stricter delivery and explicit state.
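To show what lifespan and deadline mean for a consumer, here is a plain-Python model of the two ideas. It is deliberately not the rclpy API: in a real node these would be set on the QoS profile and enforced by the middleware. The `FreshnessGate` class and the numbers in the usage lines are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    value: str
    published_at: float  # seconds on a monotonic clock


class FreshnessGate:
    """Plain-Python model of lifespan and deadline; this is not the rclpy API."""

    def __init__(self, lifespan_s: float, deadline_s: float):
        self.lifespan_s = lifespan_s
        self.deadline_s = deadline_s
        self.last_received_at: Optional[float] = None

    def receive(self, sample: Sample, now: float) -> Optional[str]:
        """Lifespan: return the value only if the sample is still young enough."""
        # Arrival is recorded even for expired samples, because the deadline
        # check below tracks cadence, not the age of the data.
        self.last_received_at = now
        if now - sample.published_at > self.lifespan_s:
            return None  # expired: never treat it as current
        return sample.value

    def deadline_missed(self, now: float) -> bool:
        """Deadline: True if no sample arrived within the expected period."""
        if self.last_received_at is None:
            return True
        return now - self.last_received_at > self.deadline_s


gate = FreshnessGate(lifespan_s=1.2, deadline_s=0.2)
fresh = gate.receive(Sample("robot_state", published_at=10.0), now=10.1)
late = gate.deadline_missed(now=10.5)
```

Here `fresh` holds the value because the sample is 0.1 s old, and `late` is true because nothing arrived in the following 0.4 s.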

This is why ROS 2 architecture patterns that scale matter for voice robots. A voice interface is only safe if the command enters the ROS graph through the right semantic boundary.

Do not put hard real-time control behind speech

Speech is a human interface. It is not a control loop.

ASR and LLM inference have variable latency. GPU load, thermal limits, model size, audio length, memory pressure, container scheduling, and background perception can all change response time. On Jetson-class devices, NVIDIA’s power and performance documentation is a reminder that power modes, clocks, and thermal behavior are part of deployment reliability, not just benchmark tuning.

The controller should not wait for language.

For a mobile robot, the fast loop should live in the controller stack or microcontroller:

Layer	Suitable loop timing	Should voice/LLM be involved?
Motor current / commutation	microseconds to sub-millisecond	No
Joint velocity / wheel control	1-10 ms	No
Base velocity control	10-50 ms	No direct authority
Navigation goal supervision	100 ms to seconds	Voice may request, supervisor decides
Task planning	seconds	Voice and LLM can help
Explanation / maintenance copilot	seconds to minutes	Yes, advisory

ROS 2 has real-time design guidance, but the practical rule is simple: do not put non-deterministic language inference on the path that must meet a hard control deadline. Use voice to request intent, not to close the loop.
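The "watchdog stop" idea can be sketched in a few lines. On a real robot this logic would normally run on the controller or microcontroller, often in C; Python is used here only to show the shape. The `VelocityWatchdog` class and the 100 ms timeout are hypothetical.

```python
class VelocityWatchdog:
    """Zero the setpoint when upstream commands stop arriving in time."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self._setpoint = 0.0
        self._stamp = None  # monotonic time of the last admitted setpoint

    def update(self, setpoint: float, now: float) -> None:
        """Called by the supervisor whenever it admits a new bounded setpoint."""
        self._setpoint = setpoint
        self._stamp = now

    def output(self, now: float) -> float:
        """Called every control tick; never waits on ASR, the LLM, or TTS."""
        if self._stamp is None or now - self._stamp > self.timeout_s:
            return 0.0  # watchdog stop
        return self._setpoint


watchdog = VelocityWatchdog(timeout_s=0.1)
watchdog.update(0.3, now=1.0)
running = watchdog.output(now=1.05)  # setpoint still fresh
stopped = watchdog.output(now=1.5)   # upstream went quiet
```

The control tick only reads local state, so a slow or crashed voice pipeline degrades to a stop instead of a stale motion command.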

The command freshness contract

Every voice command should carry metadata. Without metadata, the validator cannot know whether the command is still safe.

I would require at least:

Field	Example	Why it matters
`utterance_id`	`voice-2026-05-29T09:31:12.420Z`	Correlates ASR, LLM output, validation, action, logs
`heard_at`	monotonic timestamp	Detects stale transcript
`operator_session`	authenticated or local session ID	Prevents orphaned commands
`raw_transcript`	“come here slowly”	Preserves evidence
`normalized_intent`	`navigate_relative`	Structured command class
`target_frame`	`base_link`, `map`, `operator_pose`	Prevents frame ambiguity
`max_age_ms`	`1200`	Reject if too old
`confidence`	ASR and parser confidence	Gates clarification
`requested_authority`	advisory, goal, mode change	Prevents authority escalation
`safety_envelope`	low speed, no manipulation	Binds action to limits

A minimal validator should then check:

if transcript_age_ms > max_age_ms: reject_stale()
elif robot_state_age_ms > state_max_age_ms: reject_state_stale()
elif parser_confidence < threshold: ask_clarification()
elif command_not_allowed_in_mode: reject_authority()
elif target_frame_missing_or_old: reject_frame()
elif safety_envelope_missing: reject_unsafe()
elif action_server_unavailable: reject_unavailable()
else: dispatch_bounded_action()
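The checks above can be sketched as runnable Python. The field names follow the metadata table; the `RobotSnapshot` type, the 200 ms state age limit, and the 0.8 confidence threshold are assumptions chosen for illustration, not recommended values.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Set


class Decision(Enum):
    DISPATCH = "dispatch_bounded_action"
    REJECT_STALE = "reject_stale"
    REJECT_STATE_STALE = "reject_state_stale"
    ASK_CLARIFICATION = "ask_clarification"
    REJECT_AUTHORITY = "reject_authority"
    REJECT_FRAME = "reject_frame"
    REJECT_UNSAFE = "reject_unsafe"
    REJECT_UNAVAILABLE = "reject_unavailable"


@dataclass
class VoiceCommand:
    utterance_id: str
    heard_at: float                  # monotonic seconds
    normalized_intent: str
    target_frame: Optional[str]
    max_age_ms: int
    confidence: float
    requested_authority: str
    safety_envelope: Optional[dict]


@dataclass
class RobotSnapshot:                 # hypothetical view of supervisor state
    state_stamp: float               # monotonic seconds of last state estimate
    allowed_authority: Set[str]      # authority levels the current mode permits
    fresh_frames: Set[str]           # frames with a recent transform
    action_server_ready: bool


def validate(cmd: VoiceCommand, robot: RobotSnapshot, now: float,
             state_max_age_ms: float = 200.0,
             min_confidence: float = 0.8) -> Decision:
    """Return the first failing check; dispatch only if every check passes."""
    if (now - cmd.heard_at) * 1000.0 > cmd.max_age_ms:
        return Decision.REJECT_STALE
    if (now - robot.state_stamp) * 1000.0 > state_max_age_ms:
        return Decision.REJECT_STATE_STALE
    if cmd.confidence < min_confidence:
        return Decision.ASK_CLARIFICATION
    if cmd.requested_authority not in robot.allowed_authority:
        return Decision.REJECT_AUTHORITY
    if cmd.target_frame is None or cmd.target_frame not in robot.fresh_frames:
        return Decision.REJECT_FRAME
    if not cmd.safety_envelope:
        return Decision.REJECT_UNSAFE
    if not robot.action_server_ready:
        return Decision.REJECT_UNAVAILABLE
    return Decision.DISPATCH
```

Early returns make the order of checks explicit and guarantee that exactly one decision is produced per utterance, which is also what gets logged.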

This is not bureaucracy. It is how you prevent “turn left” from being executed after the robot has already rotated, moved rooms, lost localization, or switched into a restricted mode.

The same evidence should be logged. If the robot does something surprising, the debugging bundle should show the audio timestamp, transcript, parsed command, validation decision, ROS 2 goal, safety state, and actuator feedback. That is the structure I recommend in ROS 2 logs and rosbags for AI-assisted robot debugging.

Failure modes that look like latency bugs

Latency failures are rarely just “too slow.” They usually show up as authority, freshness, or observability failures.

Symptom	Likely cause	Better design response
Robot executes an old command	Missing max-age check	Stamp every utterance and reject stale intents
Robot says “OK” but does nothing	Feedback path not tied to action state	Confirm accepted, rejected, executing, or completed separately
Robot moves after clarification	Old intent reused after user correction	New utterance ID for every clarification turn
Stop command feels slow	Stop path depends on ASR/LLM	Provide hardware or direct software stop path
Robot ignores user during motion	Action server lacks cancel/preempt path	Use cancelable ROS 2 actions for long-running tasks
Speech works in lab but not under load	GPU/CPU contention with perception	Reserve compute, measure p95/p99, shed noncritical tasks
ASR accuracy drops near motors	Acoustic noise and VAD thresholds	Add microphone placement, noise tests, push-to-talk fallback
Robot acts in wrong frame	Missing frame freshness check	Require target frame and TF freshness before dispatch
Logs cannot explain delay	No stage-level timing	Log per-stage timestamps and decision codes

If you do not log stage timestamps, you cannot tell whether the failure was ASR, LLM parsing, ROS 2 dispatch, action acceptance, controller admission, or user feedback.
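A minimal sketch of stage-level timing, assuming each pipeline stage calls `mark()` when it finishes. The `StageLog` class and the decision strings are hypothetical; the clock is injectable so the log can be tested without sleeping.

```python
import json
import time


class StageLog:
    """Record per-stage elapsed time and a decision code for one utterance."""

    def __init__(self, utterance_id, clock=time.monotonic):
        self.utterance_id = utterance_id
        self._clock = clock
        self._last = clock()  # start of the first stage
        self.stages = []

    def mark(self, stage, decision="ok"):
        """Close the current stage and start timing the next one."""
        now = self._clock()
        self.stages.append({
            "stage": stage,
            "elapsed_ms": round((now - self._last) * 1000.0, 1),
            "decision": decision,
        })
        self._last = now

    def to_json(self):
        """Serialize one record that ties every stage to the utterance ID."""
        return json.dumps({"utterance_id": self.utterance_id,
                           "stages": self.stages})
```

In use, one log is created when the utterance is captured, each stage calls something like `log.mark("asr")`, and the validator passes its decision code, for example `log.mark("validation", "reject_stale")`.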

Benchmark the budget, not the demo

A useful benchmark plan is boring and repeatable:

Test	Metric	Pass condition
Quiet room command	p50/p95 end-to-end latency	Meets nominal budget
Noisy motor command	ASR confidence and rejection rate	Rejects uncertainty instead of guessing
GPU under perception load	ASR and LLM p95/p99	Sheds optional work or rejects late command
Moving robot stale command	Validator decision	Rejects command after max age
Action cancel	Time to cancel or preempt	Cancels within defined safety window
Network loss	Mode transition	Holds, degrades, or stays local
Long silence	Session state	Returns to wake/push-to-talk gate
Ambiguous target	Clarification behavior	Asks before motion

Run the tests with the robot doing real work. A voice stack that looks fast on an idle desk can become unpredictable when perception, SLAM, logging, containers, and TTS share the same edge device.

For Jetson deployments, I would record at least:

ASR model and quantization.
Audio chunk size and sample rate.
GPU memory pressure.
CPU governor and Jetson power mode.
Container limits.
ROS 2 executor configuration.
p50, p95, and p99 latency per stage.
Number of rejected commands and why.
Action acceptance and cancel latency.
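Per-stage percentiles need nothing beyond the standard library. The sketch below uses the nearest-rank definition with integer arithmetic; the sample latencies are invented for illustration and are not measurements.

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, clamped to at least rank 1.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize(samples):
    """Return the p50, p95, and p99 of one stage's latency samples."""
    return {f"p{p}": percentile(samples, p) for p in (50, 95, 99)}


# Invented ASR latencies in milliseconds, with one slow outlier.
asr_ms = [320, 410, 380, 1450, 390]
print(summarize(asr_ms))  # {'p50': 390, 'p95': 1450, 'p99': 1450}
```

The median hides the 1450 ms outlier while p95 and p99 expose it, which is why the tail percentiles are the ones to compare against the freshness budget.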

The goal is not to make every command instant. The goal is to make the timing behavior explicit enough that the robot never surprises you.

Reference architecture

For a local voice-controlled robot, I would build this stack:

Component	Responsibility	Authority
Wake/VAD service	Decide when to capture speech	No robot authority
ASR service	Produce timestamped transcript	No robot authority
Intent parser	Produce structured command proposal	Advisory
Command validator	Check mode, freshness, confidence, frame, limits	Admission authority
ROS 2 action client	Dispatch bounded goals	Limited command authority
Supervisor / mode manager	Own robot mode and safety envelope	High authority
Controller / MCU	Enforce low-level timing and watchdogs	Actuator authority
TTS service	Report state to operator	No actuator authority
Logger / rosbag recorder	Preserve evidence	Observability only

The LLM can be useful in the intent parser and explanation layers. It can turn “go inspect the left wheel slowly” into a structured proposal:

{
  "intent": "inspect_component",
  "target": "left_wheel",
  "speed_profile": "slow",
  "requires_motion": true,
  "needs_clarification": false
}

But it should not publish raw velocity, choose its own safety limits, override robot mode, or silently retry physical commands. That authority belongs to deterministic layers that can be tested.
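One deterministic layer worth sketching is the check that sits between the LLM and the validator: it accepts only proposals with exactly the expected shape. The field set mirrors the JSON above; the allow-lists and reason strings are hypothetical.

```python
import json

# Hypothetical allow-lists; a real robot would load these from configuration.
ALLOWED_INTENTS = {"inspect_component", "navigate_relative"}
ALLOWED_SPEED_PROFILES = {"slow", "normal"}
REQUIRED_FIELDS = {
    "intent": str,
    "target": str,
    "speed_profile": str,
    "requires_motion": bool,
    "needs_clarification": bool,
}


def parse_proposal(llm_output: str):
    """Return (proposal, None) if valid, else (None, reason)."""
    try:
        data = json.loads(llm_output)
    except json.JSONDecodeError:
        return None, "not_json"
    # Exact field match: extra keys such as a raw velocity are rejected.
    if not isinstance(data, dict) or set(data) != set(REQUIRED_FIELDS):
        return None, "unexpected_fields"
    for key, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data[key], expected_type):
            return None, f"bad_type:{key}"
    if data["intent"] not in ALLOWED_INTENTS:
        return None, "unknown_intent"
    if data["speed_profile"] not in ALLOWED_SPEED_PROFILES:
        return None, "unknown_speed_profile"
    return data, None
```

A free-text reply such as "Sure, moving now" fails at the first step, and a proposal that smuggles in an extra `velocity` field fails the exact-field check, so neither reaches the command validator.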

FAQ

What is a good latency target for a voice-controlled robot?

For conversational feedback, aim for a sub-second response. Many robot commands can tolerate one to two seconds if the robot is stationary and the command is not safety-critical. For physical authority, use freshness deadlines instead of a single UX target. A stale command should be rejected even if the user interface feels responsive.

Should a voice command become a ROS 2 topic, service, or action?

Use topics for continuous state, services for short bounded queries, and actions for long-running robot goals that need feedback and cancellation. Most navigation, inspection, docking, and manipulation requests should become bounded ROS 2 actions after validation.

Can an LLM control a robot by voice?

An LLM can parse intent, ask clarifying questions, explain robot state, and propose structured commands. It should not directly control motors or bypass safety validation. The command validator, supervisor, and controller should own physical authority.

Is local ASR on Jetson fast enough?

It can be, depending on model size, quantization, GPU load, audio chunking, and power mode. The real question is whether p95 and p99 latency remain inside your command freshness budget while perception, logging, TTS, and ROS 2 nodes are also running.

What should happen when the voice pipeline is late?

Reject or downgrade the command. A late transcript should not be executed just because it eventually arrived. The robot can say, “That command expired; please repeat it,” then stay in its current safe mode.

Does voice control replace an E-stop?

No. Voice can request a normal software stop, but an emergency stop should not depend on ASR, an LLM, cloud access, or a speech session. A real robot still needs a direct stop path appropriate for its risk level.
