Real-Time Speech-to-Text for Robots: Running whisper.cpp with CUDA on the Jetson Orin Nano Super

A practical journey from high latency to usable real-time speech-to-text

Real-time speech-to-text is transforming how humans interact with machines, especially in robotics and automation. Because it processes audio streams as they arrive, it enables seamless spoken communication and control, and it pairs naturally with the broader push toward AI-driven autonomy in robots.

The timing is right: the global market value of industrial robot installations has reached an all-time high of US$ 16.7 billion, and humanoid robotics, built for environments designed around humans, is expanding rapidly. Both trends raise the bar for natural, low-latency spoken interaction.

1. Why the Jetson Orin Nano Super?

The NVIDIA Jetson Orin Nano Super is a compact edge-AI board designed for robotics, embedded AI, and real-time perception workloads. It ships as the Jetson Orin Nano Super Developer Kit, and keeping its software stack (JetPack) up to date is essential for performance and compatibility.

  • Advanced, interdisciplinary engineering (mechanical, electrical, materials) shapes what the Jetson Orin Nano Super can do in robotics applications.

The board also sits at the meeting point of Information Technology (IT) and Operational Technology (OT): real-time data exchange with physical devices is what enables automation, and the ongoing IT/OT convergence is accelerating demand for versatile robots.

Key characteristics (why it matters for STT)

  • CPU: 6-core ARM Cortex-A78AE

  • GPU: NVIDIA Ampere architecture, 1024 CUDA cores

  • AI performance: up to 67 TOPS (INT8, sparse) in the Super power profile

  • Memory: 8 GB unified LPDDR5

  • Power envelope: ~7–25 W depending on the selected power mode

  • Power supply: a stable, adequately rated supply is required for reliable operation in robotics deployments

This makes the Orin Nano Super extremely attractive for on-device speech-to-text (STT):

  • No cloud dependency

  • Low latency potential

  • Tight integration with robotics stacks (ROS 2, GPIO, sensors)

However, getting usable real-time STT is not automatic. The rest of this article walks through the trial-and-error path we took to reach a practical solution.


2. First attempt: Whisper on CPU (Tensor / PyTorch)

What is Whisper?

Whisper is an advanced open-source speech recognition system developed by OpenAI. It leverages deep learning models trained on a vast amount of multilingual and multitask supervised data, enabling it to transcribe spoken language into text with high accuracy across many languages and accents. Whisper is designed to be robust to background noise and varied audio quality, making it particularly suitable for real-time speech-to-text applications in dynamic environments like robotics. Its versatility and performance have made it a popular choice for developers seeking reliable, on-device speech recognition without relying on cloud services.

The obvious first step was to run Whisper using CPU inference, either:

  • via PyTorch

  • or via generic tensor backends on ARM

Whichever backend you pick, the front of the pipeline is the same: a microphone captures sound waves, which are digitized at a fixed sample rate and fed to the model for recognition.
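
As a point of reference, a minimal CPU baseline with the openai-whisper Python package looks like the sketch below (representative, not the exact code we ran; sample.wav is a placeholder file name):

# pip3 install openai-whisper  (pulls in PyTorch)
import whisper

# Load a small English-only model on the CPU
model = whisper.load_model("base.en", device="cpu")

# Blocking call: on the Orin Nano CPU, long clips can take tens of seconds
result = model.transcribe("sample.wav")
print(result["text"])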

What worked

  • Transcription quality was good

  • Setup was relatively simple

This first attempt also gave us a useful baseline for judging CPU-bound speech-to-text performance.

What failed

  • Massive latency

  • On the Orin Nano Super, long utterances incurred a 20–30 second delay

  • CPU usage went to 100%, starving the rest of the system

Example symptoms

User finishes speaking
… 10 s …
… 20 s …
Transcription finally appears

This approach is not viable for:

  • conversational robots

  • interactive assistants

  • interruption-based dialogue systems

Conclusion: CPU-only Whisper on Orin Nano is fine for offline batch transcription, not for interaction.


3. Second attempt: NVIDIA Riva

Next, we evaluated NVIDIA Riva, NVIDIA’s official speech AI stack.

NVIDIA Riva provides extensive resources, including comprehensive documentation, tutorials, and community support, to help developers implement real-time speech-to-text solutions.

On paper, Riva is ideal

  • GPU-accelerated ASR

  • Optimized for NVIDIA hardware

  • Production-grade models

Riva’s platform is also designed to support generative AI workloads, enabling advanced speech and language applications.

In practice, on Jetson

  • Heavy container stack

  • Large memory footprint

  • Complex configuration (Triton, gRPC, model repos)

  • Overkill for a single embedded robot

Practical issues encountered

  • Long setup time

  • Difficult debugging

  • Tight coupling with NVIDIA’s ecosystem

  • Hard to customize for experimental pipelines

Conclusion:
Riva is powerful, but too complex and heavyweight for an embedded robotics use case where you want:

  • transparency

  • control

  • fast iteration


4. The turning point: whisper.cpp with CUDA

The real breakthrough came from whisper.cpp.

Because it is plain C/C++, whisper.cpp can also be embedded directly alongside robot controllers to enable speech-driven control.

Why whisper.cpp?

  • Written in C/C++

  • Minimal dependencies

  • Designed for performance

  • Supports CUDA acceleration

  • Excellent fit for embedded systems

The performance of whisper.cpp with CUDA comes from pairing software optimization with hardware acceleration: compact C/C++ inference code on one side, CUDA/cuBLAS kernels on the Jetson GPU on the other.

Unlike Python-based stacks, whisper.cpp gives you:

  • deterministic behavior

  • low overhead

  • full control over threading and memory
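
As a small illustration of that control, once the binary is built (next section), capping the CPU thread count is a single flag (a sketch; the model and sample paths assume the stock files shipped with the repo):

# Cap inference at 4 CPU threads, leaving cores free for the rest of the robot stack
./build/bin/whisper-cli -m models/ggml-base.en.bin -f samples/jfk.wav -t 4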

5. Compiling whisper.cpp with CUDA on Jetson

Prerequisites

  • JetPack installed (CUDA, cuBLAS available)

  • Enough swap space (important! a recipe follows this list)

  • Native build on the Jetson
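
If the board is short on free RAM, adding swap before building helps avoid out-of-memory failures during compilation. A standard Linux recipe (the 8 GB size is an arbitrary example):

# Create and enable an 8 GB swap file (size is an arbitrary example)
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Verify that the swap is active
free -h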

Clone the repository

git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp

Configure the build

cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_BUILD_TYPE=Release

Compile

cmake --build build -- -j2

⚠️ Tip:
On Orin Nano, avoid -j6. Memory pressure can kill the build.
-j2 is slow but stable.

Successful build output (excerpt)

[ 96%] Built target whisper-cli
[ 99%] Built target vad-speech-segments
[100%] Built target whisper-server

At this point, CUDA support is enabled and verified.
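
A quick sanity check, assuming the default shared-library build, is to confirm that the binary links against the CUDA runtime:

# libcudart / libcublas should appear among the dependencies
ldd build/bin/whisper-cli | grep -iE "cuda|cublas"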


6. Running Whisper with GPU acceleration

Download a model

# Fetches models/ggml-base.en.bin into the models/ directory
bash ./models/download-ggml-model.sh base.en

Basic GPU-accelerated transcription

./build/bin/whisper-cli \
  -m models/ggml-base.en.bin \
  -f samples/jfk.wav

In a CUDA build the GPU is used automatically; pass -ng (--no-gpu) to force CPU execution for comparison.

Observed improvements

  • GPU usage immediately visible via tegrastats (see the one-liner after this list)

  • CPU usage drops significantly

  • Latency reduced from tens of seconds to a few seconds

  • Much more stable under continuous usage
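
To watch this live, JetPack ships the tegrastats utility; run it in a second terminal while whisper-cli is transcribing:

# Print GPU/CPU/memory utilization once per second
sudo tegrastats --interval 1000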


7. Low-Latency Speech-to-Text with whisper-server and Python

Compiling whisper.cpp with CUDA is only half of the story.
To actually achieve low-latency speech-to-text on Jetson, the recommended architecture is to run whisper-server as a persistent process and stream short audio chunks to it from a lightweight Python client.

This avoids:

  • model reload on every transcription

  • Python ↔ GPU overhead

  • unnecessary memory allocations

High-level architecture

Microphone → Python (audio capture)
          → HTTP request
          → whisper-server (CUDA)
          → Transcription result 

1. Installing Python dependencies

On Jetson (Ubuntu 20.04 / 22.04), start by installing system dependencies:

sudo apt update
sudo apt install -y \
  python3 \
  python3-pip \
  portaudio19-dev \
  ffmpeg

Then install Python packages:

pip3 install --user \
  sounddevice \
  numpy \
  requests 

These libraries are intentionally minimal:

  • sounddevice: real-time microphone capture

  • numpy: audio buffering

  • requests: HTTP communication with whisper-server
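
Before wiring everything together, it is worth checking that sounddevice can see the microphone (a one-liner using the library's standard device query):

# List audio devices; your microphone should appear as an input device
python3 -c "import sounddevice as sd; print(sd.query_devices())"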


2. Starting whisper-server with CUDA

From the whisper.cpp directory:

./build/bin/whisper-server \
  -m models/ggml-base.en.bin \
  --port 8080

As with whisper-cli, a CUDA build uses the GPU automatically.

At this point:

  • the model is loaded once

  • CUDA kernels stay warm

  • the server is ready to accept audio chunks
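
A quick way to smoke-test the endpoint before involving Python is curl (the /inference route and the file field match what the client below uses):

# Send a sample WAV and print the JSON transcription
curl http://127.0.0.1:8080/inference -F file=@samples/jfk.wav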


3. Python client script (low latency)

Below is a minimal Python script that:

  • records short audio chunks

  • sends them to whisper-server

  • prints the transcription result


#!/usr/bin/env python3
"""
Low-latency Whisper client for whisper-server.

This script:
- captures short audio chunks from the microphone
- sends them to whisper-server via HTTP
- prints the transcription result

It is designed to keep latency low by:
- using small audio buffers
- avoiding any model loading in Python
"""

import sounddevice as sd
import numpy as np
import requests
import tempfile
import subprocess
import os

# ----------------------------
# Audio configuration
# ----------------------------

SAMPLE_RATE = 16000          # Whisper expects 16 kHz audio
CHANNELS = 1                # Mono audio
RECORD_SECONDS = 3          # Short chunk for low latency

# whisper-server endpoint
WHISPER_SERVER_URL = "http://127.0.0.1:8080/inference"


def record_audio():
    """
    Records audio from the default microphone and returns
    a NumPy float32 array.
    """
    print("Recording...")
    audio = sd.rec(
        int(RECORD_SECONDS * SAMPLE_RATE),
        samplerate=SAMPLE_RATE,
        channels=CHANNELS,
        dtype="float32"
    )
    sd.wait()
    print("Recording finished")
    return audio.squeeze()


def save_wav(audio, path):
    """
    Saves raw float32 audio to a WAV file using ffmpeg.
    This avoids Python WAV encoding overhead.
    """
    subprocess.run(
        [
            "ffmpeg",
            "-y",
            "-f", "f32le",
            "-ar", str(SAMPLE_RATE),
            "-ac", str(CHANNELS),
            "-i", "pipe:0",
            path
        ],
        input=audio.tobytes(),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL
    )


def transcribe(audio):
    """
    Sends audio to whisper-server and returns the transcription text.
    """
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        wav_path = tmp.name

    try:
        save_wav(audio, wav_path)

        with open(wav_path, "rb") as f:
            response = requests.post(
                WHISPER_SERVER_URL,
                files={"file": f}
            )

        response.raise_for_status()
        return response.json()["text"]

    finally:
        os.remove(wav_path)


def main():
    """
    Main loop:
    - record
    - transcribe
    - print result
    """
    while True:
        audio = record_audio()
        text = transcribe(audio)
        print(f"> {text}")


if __name__ == "__main__":
    main()
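
Assuming the script above is saved as whisper_client.py (the file name is arbitrary), run it in a second terminal while the server is up:

# Terminal 1: whisper-server (from the whisper.cpp directory)
./build/bin/whisper-server -m models/ggml-base.en.bin --port 8080

# Terminal 2: the client
python3 whisper_client.py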


4. Why this approach keeps latency low

This design works well on Jetson because:

  • whisper-server keeps the model resident in GPU memory

  • Python only handles audio I/O

  • Short chunks reduce end-to-end delay

  • CUDA execution stays saturated without blocking the CPU

In practice, this setup delivers:

  • seconds-level latency instead of tens of seconds

  • stable performance even under continuous speech

  • a clean separation between audio capture and inference
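
To put numbers on "seconds-level", a small probe can time a full round trip against the running server (a sketch; it reuses the same /inference endpoint and assumes a prerecorded 16 kHz WAV):

#!/usr/bin/env python3
"""Rough end-to-end latency probe for whisper-server."""
import time
import requests

URL = "http://127.0.0.1:8080/inference"

# Any 16 kHz mono WAV will do; samples/jfk.wav ships with whisper.cpp
with open("samples/jfk.wav", "rb") as f:
    t0 = time.perf_counter()
    response = requests.post(URL, files={"file": f})
    elapsed = time.perf_counter() - t0

response.raise_for_status()
print(f"{elapsed:.2f}s -> {response.json()['text'].strip()}")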


In the next article, this Python client will be extended with:

  • voice activity handling

  • interruption support

  • wake word logic

But even in its current form, this setup is already usable for real embedded interaction on Jetson.

8. AI Performance Optimization for Real-Time STT on Jetson

The integration of artificial intelligence into cyber-physical systems (CPS) has changed how robots interact with the physical environment and with humans. In robots designed for human interaction, humanoid platforms in particular, real-time speech-to-text (STT) is a cornerstone capability: it lets a robot understand spoken commands, respond naturally, and adapt to dynamic physical processes, whether in a smart factory, a hospital, or civil infrastructure.

Optimizing STT performance on a platform like the Jetson Orin Nano Super is what makes speech a reliable part of the robot's control and decision loop. CPS tightly couple computational elements with physical processes and control systems, so the speech pipeline has to coexist with sensor processing and control algorithms running in real time.

The practical challenge is balancing the computational demands of the model against the constraints of an embedded system. On Jetson, GPU acceleration plus a lean component like whisper.cpp keeps recognition latency low and reliability high even while the robot handles path planning, sensor fusion, and process control. Careful resource allocation, model selection, and system-level integration are the key levers; one concrete, Jetson-specific example follows.
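
As a sketch (mode numbers and names vary across JetPack releases), the board's power mode caps how much compute the GPU can draw, and the standard nvpmodel / jetson_clocks tools control it:

# Show the currently selected power mode
sudo nvpmodel -q

# Switch to the highest-performance mode (often mode 0; check /etc/nvpmodel.conf)
sudo nvpmodel -m 0

# Pin clocks at maximum for consistent latency measurements
sudo jetson_clocks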

Ultimately, optimizing AI for real-time speech recognition is not just about speed: it is about letting robots and engineered systems interact intelligently and safely with the physical world. As CPS technologies evolve, responsive, accurate, and robust speech-driven control will remain a key differentiator for next-generation robotic systems.

9. Why this approach works on Jetson

Aspect            | CPU Whisper  | Riva         | whisper.cpp + CUDA
Latency           | ❌ High      | ✅ Low       | ✅ Low
Complexity        | ✅ Low       | ❌ High      | ✅ Moderate
Control           | ❌ Limited   | ❌ Limited   | ✅ Full
Embedded-friendly | ❌ CPU-bound | ❌ Too heavy | ✅ Yes
Debuggability     | ✅ Simple    | ❌ Difficult | ✅ Transparent

whisper.cpp with CUDA hits the sweet spot:

  • fast enough

  • simple enough

  • transparent enough


10. Key takeaways

  • The Jetson Orin Nano Super can handle real-time STT

  • CPU-only Whisper is not suitable for interaction

  • NVIDIA Riva is powerful but too heavy for many robotics projects

  • whisper.cpp compiled locally with CUDA is the most practical solution

This setup becomes a solid foundation for:

  • conversational robots

  • offline voice assistants

  • speech-driven control systems

  • adoption of real-time speech-to-text across industries, including automotive, for flexibility, safety, and efficiency

  • collaborative robots, where safety and security standards are crucial for effective human-robot collaboration

  • robotics and automation as a strategy for employers facing labor shortages

  • AI-driven robots that work independently, with safety and security in focus as they operate alongside humans

  • new forms of human-robot interaction enabled by AI in robotics

What’s next?

In the next article, we will build on this foundation and cover:

  • voice activity handling

  • interruption-friendly dialogue

  • wake word integration

Future articles will zoom out to the broader CPS picture, including the integration of additional devices such as sensors and actuators. Cyber-physical systems couple computational elements with physical processes to enable intelligent control: they operate in real time, respond to changes in the physical environment with minimal delay, and rely on sensor data to make informed decisions. In smart factories, AI-driven robotics can anticipate failures before they occur, and the ongoing IT/OT convergence keeps making robots more versatile.

But that’s a story for another deep dive.


If you are building robotics or CPS systems on Jetson, mastering low-level tools like whisper.cpp is often the difference between a demo and a deployable system.