
A voice-controlled robot does not fail only when speech recognition is wrong.
It fails when the whole command loop takes longer than the physical situation can tolerate.
That loop includes wake word detection, voice activity detection, audio buffering, speech-to-text, intent parsing, safety validation, ROS 2 goal dispatch, actuator admission, and user feedback. Each stage can be individually “fast enough” while the combined system still feels sluggish, unsafe, or impossible to debug.
The engineering question is not:
Can the robot transcribe speech locally?
It is:
Can the robot turn a spoken command into a validated physical action before the command becomes stale, ambiguous, or unsafe?
That is what a latency budget is for. It is a contract that says how much time each layer may spend before the command is rejected, downgraded, clarified, or executed.
This expands the local voice stack I covered in running whisper.cpp with CUDA on Jetson Orin Nano Super, running Piper TTS on Jetson, and wake word setup for Whisper-based voice interfaces. Those pieces are useful, but a robot needs an end-to-end timing model before voice becomes a trustworthy control surface.
Key takeaways
- A voice robot latency budget is an end-to-end timing contract from microphone input to accepted, rejected, or executed robot action.
- The budget must separate interaction latency from command authority latency. A spoken reply can be slightly late; unsafe motion admission cannot.
- Wake word, VAD, ASR, LLM parsing, safety validation, ROS 2 action dispatch, and actuator control should each have a deadline and a stale-state policy.
- The LLM should not own the timing-critical path. It can parse, explain, and propose intent, but the robot supervisor must validate freshness, authority, mode, and safety envelope.
- Use ROS 2 actions for long-running robot tasks, QoS deadlines/liveliness for freshness-sensitive streams, and a local controller or microcontroller for hard real-time actuation.
- A good voice interface says “I did not execute that because…” quickly instead of silently queuing old commands.
Citation-ready answer
A latency budget for a voice-controlled robot is the maximum allowed time for each stage between audio capture and physical action: wake word, voice activity detection, ASR, intent parsing, command validation, ROS 2 dispatch, controller admission, and feedback. The budget prevents stale spoken commands from reaching actuators after the scene, robot mode, or operator intent has changed. In a safe architecture, the LLM may interpret language, but a deterministic supervisor validates deadlines, confidence, robot state, safety envelope, and actuator authority before any motion goal is accepted.
Voice control is not one latency number
People often ask whether a local voice robot is “real time.”
That is the wrong question.
A voice interface contains several timing domains:
| Timing domain | Typical concern | Failure if ignored |
|---|---|---|
| Human interaction | Does the robot feel responsive? | The operator repeats commands or loses trust |
| Speech segmentation | Did VAD cut audio at the right time? | Missing words, late ASR, accidental wakeups |
| Language inference | Did ASR and the LLM finish quickly enough? | Intent arrives after context changed |
| Robot supervision | Is the command still safe in the current mode? | Stale or unauthorized action reaches ROS 2 |
| Motion execution | Can controllers meet loop timing? | Jitter, overshoot, missed watchdogs |
| Feedback | Did the user hear what happened? | The operator assumes execution when it was rejected |
Only one of those is about the user “feeling” speed. The rest are about freshness, authority, and physical safety.
This is why voice robots should not be designed as a single pipeline:
1 | microphone -> speech-to-text -> LLM -> robot command -> motor |
That architecture hides the important timing contracts. A safer model is:
1 | microphone |
The transcript is not the command. The command is not the action. The action is not the actuator output.
Each boundary needs a deadline.
A practical end-to-end latency budget
The numbers below are not universal. They are a starting point for a small mobile robot, arm demo, inspection robot, or local Jetson-based Physical AI prototype where speech is a human interface, not the final control loop.
| Stage | Budget target | Timeout policy | Notes |
|---|---|---|---|
| Wake word or push-to-talk gate | 50-200 ms | Ignore if uncertain | Push-to-talk is safer for early prototypes |
| Voice activity detection | 100-300 ms after speech end | Ask user to repeat | Too much hangover makes the whole system feel late |
| Audio chunking / preprocessing | 20-80 ms | Drop incomplete chunk | Keep sample rate and buffer size explicit |
| Local ASR | 300-1500 ms | Return “not understood” | Model size and GPU load dominate here |
| Transcript normalization | 10-50 ms | Reject malformed transcript | Remove filler, preserve original text for logs |
| Intent parsing / LLM call | 200-2000 ms | Ask clarification or reject | The LLM output must be structured, not free text |
| Command validation | 5-50 ms | Reject by default | Check mode, authority, workspace, freshness, rate limits |
| ROS 2 goal dispatch | 10-100 ms | Fail closed | Actions fit long-running goals better than topics |
| Supervisor admission | 5-50 ms | Hold, degrade, or reject | The supervisor owns robot mode and safety envelope |
| Low-level control | 1-10 ms loop period | Watchdog stop | This belongs below the LLM and usually below Linux |
| User feedback | 100-800 ms | Say accepted, rejected, or waiting | Feedback is part of safety, not decoration |
The important column is not the target. It is the timeout policy.
Every stage should answer this question:
If this stage is late, what is the robot still allowed to do?
For a voice command like “come here,” a 1.5 second ASR delay may be acceptable if the robot is stationary and the scene is static. The same delay is unacceptable if the robot is already moving near people, because the command may refer to a human position that changed.
For a command like “stop,” the budget is different. Emergency stop should not depend on ASR, an LLM, Wi-Fi, or TTS. It needs a direct input path and a safety-rated or at least deterministic stop path. Voice can request a normal software stop, but it should not be the only stop mechanism. I made this distinction in robot safety architecture.
Separate conversation latency from authority latency
A common design mistake is to optimize the wrong latency.
For example, a robot might speak back in 400 ms, which feels good, while the motion command is still waiting behind ASR cleanup, an LLM retry, a ROS 2 service timeout, and a safety validator. The operator hears confidence, but the robot has not actually accepted the action.
The opposite is also dangerous. The robot might execute quickly but delay the spoken confirmation, leaving the operator unsure whether the command was accepted, rejected, or queued.
I would split the system into two clocks:
| Clock | What it measures | Hard rule |
|---|---|---|
| Interaction clock | Time until the user receives useful feedback | Reply quickly with state: listening, thinking, accepted, rejected, clarifying |
| Authority clock | Time until a physical command is admitted or rejected | Never execute if the command, scene, mode, or authorization is stale |
The interaction clock can stream partial feedback:
- “Listening.”
- “I heard: inspect the left wheel.”
- “Checking if that is safe.”
- “Rejected: the robot is in assisted mode.”
The authority clock should be stricter:
- The transcript timestamp must be fresh.
- The robot state estimate must be fresh.
- The operator identity or session must still be valid.
- The safety envelope must allow the requested action.
- The robot mode must allow that authority level.
- The action goal must be cancelable or bounded.
That is the same authority separation I described in how to split authority between an LLM, ROS 2, and a microcontroller. Voice input does not change the rule. It only makes stale intent easier to hide.
Where ROS 2 fits in the budget
ROS 2 is not a speech framework. It is the robot communication and execution boundary.
That matters because a spoken command should usually become one of these:
| Spoken intent | ROS 2 interface | Why |
|---|---|---|
| “What is your battery level?” | Service or topic read | Short query, no long-running work |
| “Start inspection route A” | Action | Long-running task with feedback and cancelation |
| “Stop navigation” | Action cancel or supervisor command | Needs preemption, not a queued topic |
| “Set speed to slow” | Parameter/service plus supervisor validation | Changes operating envelope |
| “Move forward a little” | Bounded action, not raw velocity | Needs duration, distance, frame, speed limit |
| “Follow me” | Mode transition plus action | Changes authority and perception requirements |
The ROS 2 action model is useful because actions are designed for long-running goals with feedback and cancelation. The official ROS 2 documentation describes actions as goals that can provide feedback and be canceled or preempted. That maps well to voice commands such as navigation, inspection, docking, and manipulation.
QoS also belongs in the budget. ROS 2 QoS policies include reliability, deadline, lifespan, liveliness, and lease duration. For a voice robot, those are not abstract middleware knobs:
deadlinehelps detect when a freshness-sensitive stream is late.lifespanprevents old messages from being treated as current.livelinesshelps detect that a publisher or node is no longer healthy.- sensor topics may prefer timely best-effort delivery over reliable backlog.
- command and safety state topics usually need stricter delivery and explicit state.
This is why ROS 2 architecture patterns that scale matter for voice robots. A voice interface is only safe if the command enters the ROS graph through the right semantic boundary.
Do not put hard real-time control behind speech
Speech is a human interface. It is not a control loop.
ASR and LLM inference have variable latency. GPU load, thermal limits, model size, audio length, memory pressure, container scheduling, and background perception can all change response time. On Jetson-class devices, NVIDIA’s power and performance documentation is a reminder that power modes, clocks, and thermal behavior are part of deployment reliability, not just benchmark tuning.
The controller should not wait for language.
For a mobile robot, the fast loop should live in the controller stack or microcontroller:
| Layer | Suitable loop timing | Should voice/LLM be involved? |
|---|---|---|
| Motor current / commutation | microseconds to sub-millisecond | No |
| Joint velocity / wheel control | 1-10 ms | No |
| Base velocity control | 10-50 ms | No direct authority |
| Navigation goal supervision | 100 ms to seconds | Voice may request, supervisor decides |
| Task planning | seconds | Voice and LLM can help |
| Explanation / maintenance copilot | seconds to minutes | Yes, advisory |
ROS 2 has real-time design guidance, but the practical rule is simple: do not put non-deterministic language inference on the path that must meet a hard control deadline. Use voice to request intent, not to close the loop.
The command freshness contract
Every voice command should carry metadata. Without metadata, the validator cannot know whether the command is still safe.
I would require at least:
| Field | Example | Why it matters |
|---|---|---|
utterance_id | voice-2026-05-29T09:31:12.420Z | Correlates ASR, LLM output, validation, action, logs |
heard_at | monotonic timestamp | Detects stale transcript |
operator_session | authenticated or local session ID | Prevents orphaned commands |
raw_transcript | “come here slowly” | Preserves evidence |
normalized_intent | navigate_relative | Structured command class |
target_frame | base_link, map, operator_pose | Prevents frame ambiguity |
max_age_ms | 1200 | Reject if too old |
confidence | ASR and parser confidence | Gates clarification |
requested_authority | advisory, goal, mode change | Prevents authority escalation |
safety_envelope | low speed, no manipulation | Binds action to limits |
A minimal validator should then check:
1 | if transcript_age > max_age_ms: reject_stale() |
This is not bureaucracy. It is how you prevent “turn left” from being executed after the robot has already rotated, moved rooms, lost localization, or switched into a restricted mode.
The same evidence should be logged. If the robot does something surprising, the debugging bundle should show the audio timestamp, transcript, parsed command, validation decision, ROS 2 goal, safety state, and actuator feedback. That is the structure I recommend in ROS 2 logs and rosbags for AI-assisted robot debugging.
Failure modes that look like latency bugs
Latency failures are rarely just “too slow.” They usually show up as authority, freshness, or observability failures.
| Symptom | Likely cause | Better design response |
|---|---|---|
| Robot executes an old command | Missing max-age check | Stamp every utterance and reject stale intents |
| Robot says “OK” but does nothing | Feedback path not tied to action state | Confirm accepted, rejected, executing, or completed separately |
| Robot moves after clarification | Old intent reused after user correction | New utterance ID for every clarification turn |
| Stop command feels slow | Stop path depends on ASR/LLM | Provide hardware or direct software stop path |
| Robot ignores user during motion | Action server lacks cancel/preempt path | Use cancelable ROS 2 actions for long-running tasks |
| Speech works in lab but not under load | GPU/CPU contention with perception | Reserve compute, measure p95/p99, shed noncritical tasks |
| ASR accuracy drops near motors | Acoustic noise and VAD thresholds | Add microphone placement, noise tests, push-to-talk fallback |
| Robot acts in wrong frame | Missing frame freshness check | Require target frame and TF freshness before dispatch |
| Logs cannot explain delay | No stage-level timing | Log per-stage timestamps and decision codes |
If you do not log stage timestamps, you cannot tell whether the failure was ASR, LLM parsing, ROS 2 dispatch, action acceptance, controller admission, or user feedback.
Benchmark the budget, not the demo
A useful benchmark plan is boring and repeatable:
| Test | Metric | Pass condition |
|---|---|---|
| Quiet room command | p50/p95 end-to-end latency | Meets nominal budget |
| Noisy motor command | ASR confidence and rejection rate | Rejects uncertainty instead of guessing |
| GPU under perception load | ASR and LLM p95/p99 | Sheds optional work or rejects late command |
| Moving robot stale command | Validator decision | Rejects command after max age |
| Action cancel | Time to cancel or preempt | Cancels within defined safety window |
| Network loss | Mode transition | Holds, degrades, or stays local |
| Long silence | Session state | Returns to wake/push-to-talk gate |
| Ambiguous target | Clarification behavior | Asks before motion |
Run the tests with the robot doing real work. A voice stack that looks fast on an idle desk can become unpredictable when perception, SLAM, logging, containers, and TTS share the same edge device.
For Jetson deployments, I would record at least:
- ASR model and quantization.
- Audio chunk size and sample rate.
- GPU memory pressure.
- CPU governor and Jetson power mode.
- Container limits.
- ROS 2 executor configuration.
- p50, p95, and p99 latency per stage.
- number of rejected commands and why.
- action acceptance and cancel latency.
The goal is not to make every command instant. The goal is to make the timing behavior explicit enough that the robot never surprises you.
Reference architecture
For a local voice-controlled robot, I would build this stack:
| Component | Responsibility | Authority |
|---|---|---|
| Wake/VAD service | Decide when to capture speech | No robot authority |
| ASR service | Produce timestamped transcript | No robot authority |
| Intent parser | Produce structured command proposal | Advisory |
| Command validator | Check mode, freshness, confidence, frame, limits | Admission authority |
| ROS 2 action client | Dispatch bounded goals | Limited command authority |
| Supervisor / mode manager | Own robot mode and safety envelope | High authority |
| Controller / MCU | Enforce low-level timing and watchdogs | Actuator authority |
| TTS service | Report state to operator | No actuator authority |
| Logger / rosbag recorder | Preserve evidence | Observability only |
The LLM can be useful in the intent parser and explanation layers. It can turn “go inspect the left wheel slowly” into a structured proposal:
1 | { |
But it should not publish raw velocity, choose its own safety limits, override robot mode, or silently retry physical commands. That authority belongs to deterministic layers that can be tested.
FAQ
What is a good latency target for a voice-controlled robot?
For conversational feedback, sub-second response feels much better, but many robot commands can tolerate one to two seconds if the robot is stationary and the command is not safety-critical. For physical authority, use freshness deadlines instead of a single UX target. A stale command should be rejected even if the user interface feels responsive.
Should a voice command become a ROS 2 topic, service, or action?
Use topics for continuous state, services for short bounded queries, and actions for long-running robot goals that need feedback and cancellation. Most navigation, inspection, docking, and manipulation requests should become bounded ROS 2 actions after validation.
Can an LLM control a robot by voice?
An LLM can parse intent, ask clarifying questions, explain robot state, and propose structured commands. It should not directly control motors or bypass safety validation. The command validator, supervisor, and controller should own physical authority.
Is local ASR on Jetson fast enough?
It can be, depending on model size, quantization, GPU load, audio chunking, and power mode. The real question is whether p95 and p99 latency remain inside your command freshness budget while perception, logging, TTS, and ROS 2 nodes are also running.
What should happen when the voice pipeline is late?
Reject or downgrade the command. A late transcript should not be executed just because it eventually arrived. The robot can say, “That command expired; please repeat it,” then stay in its current safe mode.
Does voice control replace an E-stop?
No. Voice can request a normal software stop, but an emergency stop should not depend on ASR, an LLM, cloud access, or a speech session. A real robot still needs a direct stop path appropriate for its risk level.