What Latency Budget Really Means in a Voice-Controlled Robot

What Latency Budget Really Means in a Voice-Controlled Robot

A voice-controlled robot does not fail only when speech recognition is wrong.

It fails when the whole command loop takes longer than the physical situation can tolerate.

That loop includes wake word detection, voice activity detection, audio buffering, speech-to-text, intent parsing, safety validation, ROS 2 goal dispatch, actuator admission, and user feedback. Each stage can be individually “fast enough” while the combined system still feels sluggish, unsafe, or impossible to debug.

The engineering question is not:

Can the robot transcribe speech locally?

It is:

Can the robot turn a spoken command into a validated physical action before the command becomes stale, ambiguous, or unsafe?

That is what a latency budget is for. It is a contract that says how much time each layer may spend before the command is rejected, downgraded, clarified, or executed.

This expands the local voice stack I covered in running whisper.cpp with CUDA on Jetson Orin Nano Super, running Piper TTS on Jetson, and wake word setup for Whisper-based voice interfaces. Those pieces are useful, but a robot needs an end-to-end timing model before voice becomes a trustworthy control surface.

Key takeaways

  • A voice robot latency budget is an end-to-end timing contract from microphone input to accepted, rejected, or executed robot action.
  • The budget must separate interaction latency from command authority latency. A spoken reply can be slightly late; unsafe motion admission cannot.
  • Wake word, VAD, ASR, LLM parsing, safety validation, ROS 2 action dispatch, and actuator control should each have a deadline and a stale-state policy.
  • The LLM should not own the timing-critical path. It can parse, explain, and propose intent, but the robot supervisor must validate freshness, authority, mode, and safety envelope.
  • Use ROS 2 actions for long-running robot tasks, QoS deadlines/liveliness for freshness-sensitive streams, and a local controller or microcontroller for hard real-time actuation.
  • A good voice interface says “I did not execute that because…” quickly instead of silently queuing old commands.

Citation-ready answer

A latency budget for a voice-controlled robot is the maximum allowed time for each stage between audio capture and physical action: wake word, voice activity detection, ASR, intent parsing, command validation, ROS 2 dispatch, controller admission, and feedback. The budget prevents stale spoken commands from reaching actuators after the scene, robot mode, or operator intent has changed. In a safe architecture, the LLM may interpret language, but a deterministic supervisor validates deadlines, confidence, robot state, safety envelope, and actuator authority before any motion goal is accepted.

Voice control is not one latency number

People often ask whether a local voice robot is “real time.”

That is the wrong question.

A voice interface contains several timing domains:

Timing domainTypical concernFailure if ignored
Human interactionDoes the robot feel responsive?The operator repeats commands or loses trust
Speech segmentationDid VAD cut audio at the right time?Missing words, late ASR, accidental wakeups
Language inferenceDid ASR and the LLM finish quickly enough?Intent arrives after context changed
Robot supervisionIs the command still safe in the current mode?Stale or unauthorized action reaches ROS 2
Motion executionCan controllers meet loop timing?Jitter, overshoot, missed watchdogs
FeedbackDid the user hear what happened?The operator assumes execution when it was rejected

Only one of those is about the user “feeling” speed. The rest are about freshness, authority, and physical safety.

This is why voice robots should not be designed as a single pipeline:

1
microphone -> speech-to-text -> LLM -> robot command -> motor

That architecture hides the important timing contracts. A safer model is:

1
2
3
4
5
6
7
8
9
microphone
-> wake/VAD gate
-> ASR transcript with timestamp and confidence
-> intent parser
-> command validator
-> ROS 2 action or service boundary
-> supervisor / mode manager
-> controller / microcontroller
-> actuator

The transcript is not the command. The command is not the action. The action is not the actuator output.

Each boundary needs a deadline.

A practical end-to-end latency budget

The numbers below are not universal. They are a starting point for a small mobile robot, arm demo, inspection robot, or local Jetson-based Physical AI prototype where speech is a human interface, not the final control loop.

StageBudget targetTimeout policyNotes
Wake word or push-to-talk gate50-200 msIgnore if uncertainPush-to-talk is safer for early prototypes
Voice activity detection100-300 ms after speech endAsk user to repeatToo much hangover makes the whole system feel late
Audio chunking / preprocessing20-80 msDrop incomplete chunkKeep sample rate and buffer size explicit
Local ASR300-1500 msReturn “not understood”Model size and GPU load dominate here
Transcript normalization10-50 msReject malformed transcriptRemove filler, preserve original text for logs
Intent parsing / LLM call200-2000 msAsk clarification or rejectThe LLM output must be structured, not free text
Command validation5-50 msReject by defaultCheck mode, authority, workspace, freshness, rate limits
ROS 2 goal dispatch10-100 msFail closedActions fit long-running goals better than topics
Supervisor admission5-50 msHold, degrade, or rejectThe supervisor owns robot mode and safety envelope
Low-level control1-10 ms loop periodWatchdog stopThis belongs below the LLM and usually below Linux
User feedback100-800 msSay accepted, rejected, or waitingFeedback is part of safety, not decoration

The important column is not the target. It is the timeout policy.

Every stage should answer this question:

If this stage is late, what is the robot still allowed to do?

For a voice command like “come here,” a 1.5 second ASR delay may be acceptable if the robot is stationary and the scene is static. The same delay is unacceptable if the robot is already moving near people, because the command may refer to a human position that changed.

For a command like “stop,” the budget is different. Emergency stop should not depend on ASR, an LLM, Wi-Fi, or TTS. It needs a direct input path and a safety-rated or at least deterministic stop path. Voice can request a normal software stop, but it should not be the only stop mechanism. I made this distinction in robot safety architecture.

Separate conversation latency from authority latency

A common design mistake is to optimize the wrong latency.

For example, a robot might speak back in 400 ms, which feels good, while the motion command is still waiting behind ASR cleanup, an LLM retry, a ROS 2 service timeout, and a safety validator. The operator hears confidence, but the robot has not actually accepted the action.

The opposite is also dangerous. The robot might execute quickly but delay the spoken confirmation, leaving the operator unsure whether the command was accepted, rejected, or queued.

I would split the system into two clocks:

ClockWhat it measuresHard rule
Interaction clockTime until the user receives useful feedbackReply quickly with state: listening, thinking, accepted, rejected, clarifying
Authority clockTime until a physical command is admitted or rejectedNever execute if the command, scene, mode, or authorization is stale

The interaction clock can stream partial feedback:

  • “Listening.”
  • “I heard: inspect the left wheel.”
  • “Checking if that is safe.”
  • “Rejected: the robot is in assisted mode.”

The authority clock should be stricter:

  • The transcript timestamp must be fresh.
  • The robot state estimate must be fresh.
  • The operator identity or session must still be valid.
  • The safety envelope must allow the requested action.
  • The robot mode must allow that authority level.
  • The action goal must be cancelable or bounded.

That is the same authority separation I described in how to split authority between an LLM, ROS 2, and a microcontroller. Voice input does not change the rule. It only makes stale intent easier to hide.

Where ROS 2 fits in the budget

ROS 2 is not a speech framework. It is the robot communication and execution boundary.

That matters because a spoken command should usually become one of these:

Spoken intentROS 2 interfaceWhy
“What is your battery level?”Service or topic readShort query, no long-running work
“Start inspection route A”ActionLong-running task with feedback and cancelation
“Stop navigation”Action cancel or supervisor commandNeeds preemption, not a queued topic
“Set speed to slow”Parameter/service plus supervisor validationChanges operating envelope
“Move forward a little”Bounded action, not raw velocityNeeds duration, distance, frame, speed limit
“Follow me”Mode transition plus actionChanges authority and perception requirements

The ROS 2 action model is useful because actions are designed for long-running goals with feedback and cancelation. The official ROS 2 documentation describes actions as goals that can provide feedback and be canceled or preempted. That maps well to voice commands such as navigation, inspection, docking, and manipulation.

QoS also belongs in the budget. ROS 2 QoS policies include reliability, deadline, lifespan, liveliness, and lease duration. For a voice robot, those are not abstract middleware knobs:

  • deadline helps detect when a freshness-sensitive stream is late.
  • lifespan prevents old messages from being treated as current.
  • liveliness helps detect that a publisher or node is no longer healthy.
  • sensor topics may prefer timely best-effort delivery over reliable backlog.
  • command and safety state topics usually need stricter delivery and explicit state.

This is why ROS 2 architecture patterns that scale matter for voice robots. A voice interface is only safe if the command enters the ROS graph through the right semantic boundary.

Do not put hard real-time control behind speech

Speech is a human interface. It is not a control loop.

ASR and LLM inference have variable latency. GPU load, thermal limits, model size, audio length, memory pressure, container scheduling, and background perception can all change response time. On Jetson-class devices, NVIDIA’s power and performance documentation is a reminder that power modes, clocks, and thermal behavior are part of deployment reliability, not just benchmark tuning.

The controller should not wait for language.

For a mobile robot, the fast loop should live in the controller stack or microcontroller:

LayerSuitable loop timingShould voice/LLM be involved?
Motor current / commutationmicroseconds to sub-millisecondNo
Joint velocity / wheel control1-10 msNo
Base velocity control10-50 msNo direct authority
Navigation goal supervision100 ms to secondsVoice may request, supervisor decides
Task planningsecondsVoice and LLM can help
Explanation / maintenance copilotseconds to minutesYes, advisory

ROS 2 has real-time design guidance, but the practical rule is simple: do not put non-deterministic language inference on the path that must meet a hard control deadline. Use voice to request intent, not to close the loop.

The command freshness contract

Every voice command should carry metadata. Without metadata, the validator cannot know whether the command is still safe.

I would require at least:

FieldExampleWhy it matters
utterance_idvoice-2026-05-29T09:31:12.420ZCorrelates ASR, LLM output, validation, action, logs
heard_atmonotonic timestampDetects stale transcript
operator_sessionauthenticated or local session IDPrevents orphaned commands
raw_transcript“come here slowly”Preserves evidence
normalized_intentnavigate_relativeStructured command class
target_framebase_link, map, operator_posePrevents frame ambiguity
max_age_ms1200Reject if too old
confidenceASR and parser confidenceGates clarification
requested_authorityadvisory, goal, mode changePrevents authority escalation
safety_envelopelow speed, no manipulationBinds action to limits

A minimal validator should then check:

1
2
3
4
5
6
7
8
if transcript_age > max_age_ms: reject_stale()
if robot_state_age > state_max_age_ms: reject_state_stale()
if parser_confidence < threshold: ask_clarification()
if command_not_allowed_in_mode: reject_authority()
if target_frame_missing_or_old: reject_frame()
if safety_envelope_missing: reject_unsafe()
if action_server_unavailable: reject_unavailable()
else: dispatch_bounded_action()

This is not bureaucracy. It is how you prevent “turn left” from being executed after the robot has already rotated, moved rooms, lost localization, or switched into a restricted mode.

The same evidence should be logged. If the robot does something surprising, the debugging bundle should show the audio timestamp, transcript, parsed command, validation decision, ROS 2 goal, safety state, and actuator feedback. That is the structure I recommend in ROS 2 logs and rosbags for AI-assisted robot debugging.

Failure modes that look like latency bugs

Latency failures are rarely just “too slow.” They usually show up as authority, freshness, or observability failures.

SymptomLikely causeBetter design response
Robot executes an old commandMissing max-age checkStamp every utterance and reject stale intents
Robot says “OK” but does nothingFeedback path not tied to action stateConfirm accepted, rejected, executing, or completed separately
Robot moves after clarificationOld intent reused after user correctionNew utterance ID for every clarification turn
Stop command feels slowStop path depends on ASR/LLMProvide hardware or direct software stop path
Robot ignores user during motionAction server lacks cancel/preempt pathUse cancelable ROS 2 actions for long-running tasks
Speech works in lab but not under loadGPU/CPU contention with perceptionReserve compute, measure p95/p99, shed noncritical tasks
ASR accuracy drops near motorsAcoustic noise and VAD thresholdsAdd microphone placement, noise tests, push-to-talk fallback
Robot acts in wrong frameMissing frame freshness checkRequire target frame and TF freshness before dispatch
Logs cannot explain delayNo stage-level timingLog per-stage timestamps and decision codes

If you do not log stage timestamps, you cannot tell whether the failure was ASR, LLM parsing, ROS 2 dispatch, action acceptance, controller admission, or user feedback.

Benchmark the budget, not the demo

A useful benchmark plan is boring and repeatable:

TestMetricPass condition
Quiet room commandp50/p95 end-to-end latencyMeets nominal budget
Noisy motor commandASR confidence and rejection rateRejects uncertainty instead of guessing
GPU under perception loadASR and LLM p95/p99Sheds optional work or rejects late command
Moving robot stale commandValidator decisionRejects command after max age
Action cancelTime to cancel or preemptCancels within defined safety window
Network lossMode transitionHolds, degrades, or stays local
Long silenceSession stateReturns to wake/push-to-talk gate
Ambiguous targetClarification behaviorAsks before motion

Run the tests with the robot doing real work. A voice stack that looks fast on an idle desk can become unpredictable when perception, SLAM, logging, containers, and TTS share the same edge device.

For Jetson deployments, I would record at least:

  • ASR model and quantization.
  • Audio chunk size and sample rate.
  • GPU memory pressure.
  • CPU governor and Jetson power mode.
  • Container limits.
  • ROS 2 executor configuration.
  • p50, p95, and p99 latency per stage.
  • number of rejected commands and why.
  • action acceptance and cancel latency.

The goal is not to make every command instant. The goal is to make the timing behavior explicit enough that the robot never surprises you.

Reference architecture

For a local voice-controlled robot, I would build this stack:

ComponentResponsibilityAuthority
Wake/VAD serviceDecide when to capture speechNo robot authority
ASR serviceProduce timestamped transcriptNo robot authority
Intent parserProduce structured command proposalAdvisory
Command validatorCheck mode, freshness, confidence, frame, limitsAdmission authority
ROS 2 action clientDispatch bounded goalsLimited command authority
Supervisor / mode managerOwn robot mode and safety envelopeHigh authority
Controller / MCUEnforce low-level timing and watchdogsActuator authority
TTS serviceReport state to operatorNo actuator authority
Logger / rosbag recorderPreserve evidenceObservability only

The LLM can be useful in the intent parser and explanation layers. It can turn “go inspect the left wheel slowly” into a structured proposal:

1
2
3
4
5
6
7
{
"intent": "inspect_component",
"target": "left_wheel",
"speed_profile": "slow",
"requires_motion": true,
"needs_clarification": false
}

But it should not publish raw velocity, choose its own safety limits, override robot mode, or silently retry physical commands. That authority belongs to deterministic layers that can be tested.

FAQ

What is a good latency target for a voice-controlled robot?

For conversational feedback, sub-second response feels much better, but many robot commands can tolerate one to two seconds if the robot is stationary and the command is not safety-critical. For physical authority, use freshness deadlines instead of a single UX target. A stale command should be rejected even if the user interface feels responsive.

Should a voice command become a ROS 2 topic, service, or action?

Use topics for continuous state, services for short bounded queries, and actions for long-running robot goals that need feedback and cancellation. Most navigation, inspection, docking, and manipulation requests should become bounded ROS 2 actions after validation.

Can an LLM control a robot by voice?

An LLM can parse intent, ask clarifying questions, explain robot state, and propose structured commands. It should not directly control motors or bypass safety validation. The command validator, supervisor, and controller should own physical authority.

Is local ASR on Jetson fast enough?

It can be, depending on model size, quantization, GPU load, audio chunking, and power mode. The real question is whether p95 and p99 latency remain inside your command freshness budget while perception, logging, TTS, and ROS 2 nodes are also running.

What should happen when the voice pipeline is late?

Reject or downgrade the command. A late transcript should not be executed just because it eventually arrived. The robot can say, “That command expired; please repeat it,” then stay in its current safe mode.

Does voice control replace an E-stop?

No. Voice can request a normal software stop, but an emergency stop should not depend on ASR, an LLM, cloud access, or a speech session. A real robot still needs a direct stop path appropriate for its risk level.

Sources