Running Piper TTS in ROS 2 on NVIDIA Jetson Orin Nano with Very Low Latency

A practical, low-level guide from theory to production on Jetson

Text-to-Speech (TTS) looks simple on paper: you provide text, you get audio.

In practice, especially on embedded hardware, inside containers, and integrated into a ROS 2 system, achieving low latency, stable audio, and reproducible behavior is surprisingly difficult.

In this article, I explain how we successfully integrated Piper TTS into ROS 2 with sub-second perceived latency, running on an NVIDIA Jetson Orin Nano, inside an Isaac ROS container, using CUDA acceleration and PulseAudio.

This is not just a recipe.

We will:

  • explain what “low-latency TTS” really means,
  • define the key technical terms,
  • walk through the trial-and-error process,
  • explain why naïve solutions fail,
  • provide complete commands and parameters,
  • end with a robust, reboot-safe solution.

A future article will dive deeper into the full ROS audio architecture. Here, we focus on TTS as a standalone, production-grade ROS node.


Introduction to ROS (Robot Operating System)

The Robot Operating System (ROS) is a powerful meta-operating system designed specifically for building and managing autonomous systems that interact with the physical world. As a comprehensive software framework, ROS provides a vast collection of software libraries, tools, and conventions that simplify the development of complex robot applications. Whether you’re working on self-driving cars, other autonomous vehicles, or industrial robots, ROS enables your systems to sense their environment, process data, and respond intelligently to real-world situations.

One of the key strengths of ROS is its modularity and extensive community support, which accelerates development and fosters innovation. Developers can leverage a wide range of pre-built packages and drivers, allowing them to focus on the unique aspects of their robotics project rather than reinventing the wheel. This collaborative ecosystem makes ROS the backbone of many cutting-edge robot applications, empowering systems to perceive, interact with, and respond to the dynamic physical world around them.

By integrating ROS into your development workflow, you gain access to a robust platform that supports the creation of scalable, maintainable, and high-performance autonomous systems, enabling your robots to operate effectively in real-world environments.


Overview of Piper TTS

Piper TTS is an advanced text-to-speech system designed to generate natural, expressive speech from written text, supporting multiple languages and a variety of voices. Its sophisticated speech synthesis capabilities make it an ideal choice for developers aiming to enhance the accessibility and user experience of their autonomous systems. Piper TTS allows for fine-tuning of speech parameters such as speaking rate and voice variability, and can export audio files for offline playback or sharing, making it versatile for a range of applications.

By integrating Piper TTS, autonomous systems can communicate more effectively with humans, providing spoken feedback, delivering instructions, or responding to queries in real time. This not only improves accessibility for users but also enables more natural and intuitive interactions between humans and machines. With support for multiple languages and voices, Piper TTS empowers developers to build systems that can operate in diverse environments and cater to a global audience, further expanding the capabilities of modern autonomous systems.


Physical AI Model Integration

Integrating physical AI models with real-world sensor data is a critical step in building autonomous systems that can truly perceive, understand, and interact with the physical world. By combining AI models with sensor data, such as images, videos, and other multimodal inputs, developers enable their systems to process complex information from their environment and make informed decisions in real time.

This approach is essential for applications operating in complex environments, from self-driving cars navigating busy streets to industrial robots adapting to dynamic factory floors. Physical AI model integration allows autonomous systems to interpret sensor data, recognize spatial relationships, and respond appropriately to new challenges. Leveraging ROS and other software libraries, developers can create flexible and powerful systems that not only process data efficiently but also adapt to changing conditions in the physical world.

By harnessing the synergy between physical AI, real-world sensor data, and robust software frameworks, developers can build next-generation autonomous systems that operate, interact, and respond intelligently in a wide range of real-world scenarios.


Setting up the Environment

Establishing a robust environment for physical AI development is crucial for building and testing autonomous systems that interact with the physical world. This process begins with configuring essential software components such as ROS, NVIDIA Omniverse, and other powerful developer tools. Installing the right software libraries ensures that your system can handle the processing and analysis of sensor data efficiently.

A well-structured setup includes preparing the simulation environment, which allows for safe and repeatable testing before deploying to real hardware. Developers must also configure the sensor data processing pipeline, taking into account the computational power required for real-time operation. By carefully planning the environment, considering factors like data throughput, hardware compatibility, and the specific needs of the autonomous system, developers can streamline the development process and accelerate the path from prototype to deployment.

Leveraging simulation tools and robust software frameworks not only enhances development efficiency but also ensures that your physical AI system is ready to operate reliably in real-world environments.


Integrating Piper TTS with Other Components

Integrating Piper TTS with other components, such as ROS and physical AI models, unlocks new possibilities for creating autonomous systems that can communicate and interact naturally with humans. This integration involves configuring Piper TTS to work seamlessly with the broader software stack, ensuring smooth data exchange and real-time responsiveness across all modules.

By combining Piper TTS with physical AI and ROS, developers can build systems that provide spoken feedback, respond to voice commands, and interact with users in a more intuitive way. This is especially valuable in applications like self-driving cars, robotics, and industrial automation, where effective human-machine interaction is essential for safety and efficiency. The ability to process data from the physical world, interpret it using AI models, and deliver clear, natural speech responses enables autonomous systems to operate more intelligently and adaptively.

Through thoughtful integration, Piper TTS becomes a key component in building autonomous systems that not only perceive and act in the physical world but also engage with humans in meaningful, accessible ways.

1. The Goal: What “Low-Latency TTS” Really Means

When people say “low-latency TTS”, they often mean:

  • “The model is fast”
  • “The audio file is generated quickly”

For robotics, this definition is wrong.

What we actually care about is:

The time between publishing a ROS message and hearing the first audible sound.

That single metric defines perceived responsiveness.

Why this matters in robotics

  • A 3–5 second delay feels broken to users.
  • Humans are extremely sensitive to response timing.
  • Variable latency is worse than consistently slow latency.
  • Voice interaction must feel interruptible and alive.

Our goal is not maximum throughput, but minimal and predictable time-to-first-audio.


2. Context: Hardware & Software Environment

Understanding the constraints matters as much as writing code.

Hardware

  • NVIDIA Jetson Orin Nano
  • ARM64 CPU
  • Integrated NVIDIA GPU
  • Limited power and thermal budget

Software stack

  • Ubuntu (JetPack 6.x)
  • Isaac ROS Dev Container
  • ROS 2 Humble
  • Piper TTS (ONNX)
  • PulseAudio
  • CUDA via onnxruntime-gpu

Why this is a hard combination

  • Containers complicate audio routing
  • Jetson requires ARM64-specific CUDA wheels
  • CUDA initialization is lazy and expensive
  • ROS nodes must behave deterministically
  • USB audio devices are not stable across reboots

3. Piper TTS: What It Is and How It Works

3.1 What Piper Actually Does

Piper is a neural text-to-speech engine built on ONNX models.

At runtime, Piper performs:

  1. Text normalization and tokenization

  2. Neural inference (encoder + vocoder)

  3. Generation of raw PCM audio samples

Important detail:

Piper does not play audio.

It outputs raw PCM data (e.g. 16 kHz, signed 16-bit, mono) to stdout.

You must handle:

  • audio playback,

  • buffering,

  • device routing,

  • latency control.

This separation is powerful, but dangerous if misused.


3.2 Piper CLI vs “Server Mode”

Unlike projects such as whisper.cpp, Piper does not ship with:

  • an HTTP server,

  • a gRPC daemon,

  • a persistent inference service.

The canonical usage model is:

text → stdin → piper → stdout (raw audio)

This means:

  • every invocation is stateless,

  • process lifecycle is your responsibility,

  • latency depends heavily on how you integrate it.
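
In practice, the whole canonical pipeline fits in a single shell command. This is only an illustration: the model path is a placeholder, and pacat (covered in section 7) is just one possible playback client; adjust the rate to your model (16 kHz for the low-quality voices used here).

echo 'Hello from Piper.' | \
  piper -m /path/to/en_US-amy-low.onnx --output-raw | \
  pacat --raw --format=s16le --channels=1 --rate=16000

Every such invocation pays process startup and model loading again, which is exactly the latency problem discussed next.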


4. Where Latency Comes From (Theory)

Before optimizing anything, we must understand where time is actually lost.

4.1 The Main Sources of Latency

1. Process startup

Starting a Python or native process is slow, especially inside containers.

2. Model loading

ONNX models are large and expensive to initialize.

3. CUDA lazy initialization

The first inference:

  • loads CUDA kernels,

  • allocates GPU memory,

  • builds execution graphs.

This can take seconds, even if later inferences are fast.

4. Audio buffering

PulseAudio defaults prioritize stability, not latency.

5. ROS scheduling

Callbacks, queues, and thread contention add jitter.
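
To see where the time actually goes on your system, a rough cold-start measurement helps. This is a minimal sketch, assuming Piper is installed and using a placeholder model path:

import subprocess
import time

t0 = time.time()
piper = subprocess.Popen(
    ["piper", "-m", "/path/to/en_US-amy-low.onnx", "--output-raw"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
)
piper.stdin.write(b"Measuring time to first audio.\n")
piper.stdin.close()
first = piper.stdout.read(4096)   # blocks until synthesis actually produces audio
print(f"time to first PCM bytes: {time.time() - t0:.2f}s ({len(first)} bytes)")
piper.stdout.read()               # drain the rest so the process can exit cleanly
piper.wait()

Run it twice in a row and the number barely improves, because every run pays process startup, model loading, and (with --cuda) CUDA initialization again.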


4.2 The Worst (But Very Common) Design

Many first implementations look like this:

ROS message
   ↓
start piper process
   ↓
load model
   ↓
initialize CUDA
   ↓
generate full audio
   ↓
start audio player
   ↓
play

This design pays all costs every time.

Result:

  • 4–10 seconds of latency

  • massive jitter

  • unusable interaction


5. Key Design Decision: Persistent Processes

The single most important decision we made:

Start Piper once. Start the audio player once. Never restart them unless absolutely necessary.

Why this changes everything

  • CUDA initializes once

  • Model stays resident in GPU memory

  • Audio pipeline stays hot

  • Latency becomes predictable

This principle dominates all others.
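
A minimal sketch of the idea, stripped of everything else (paths and audio parameters are placeholders; the real node adds warmup, chunking, interrupts, and error handling):

import subprocess

# Started once, at node startup -- never restarted per sentence.
piper = subprocess.Popen(
    ["piper", "-m", "/path/to/en_US-amy-low.onnx", "--output-raw"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
)
pacat = subprocess.Popen(
    ["pacat", "--raw", "--format=s16le", "--channels=1", "--rate=16000"],
    stdin=piper.stdout,
)

def say(text: str) -> None:
    # Each utterance is a single write; the model and audio pipeline stay hot.
    piper.stdin.write((text.strip() + "\n").encode("utf-8"))
    piper.stdin.flush()

say("First sentence.")
say("Second sentence, with no cold start.")

The production version of the node keeps exactly this structure and adds warmup, chunking, and interrupt handling on top of it.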


6. High-Level Architecture Overview

Conceptually, the final architecture looks like this:

ROS topic (/tts/say)
        ↓
   TTS ROS Node
        ↓
  (persistent)
     Piper
        ↓  raw PCM stream
  (persistent)
     pacat
        ↓
   PulseAudio
        ↓
   Speaker / USB headset

Key properties:

  • No temporary files

  • No process restarts

  • Streaming audio, not batch audio

  • Clear ownership of each layer


7. Audio Output: Why PulseAudio + pacat

7.1 Why Not ALSA Directly?

ALSA works, but:

  • device handling in containers is painful,

  • USB devices change indices,

  • routing is brittle.

PulseAudio gives us:

  • dynamic device management,

  • per-sink routing,

  • compatibility with desktop environments.


7.2 Why pacat?

pacat is PulseAudio’s raw audio client.

It:

  • reads raw PCM from stdin,

  • supports explicit buffering control,

  • is lightweight and predictable.

We explicitly configure:

  • sample rate (16000 Hz)

  • format (s16le)

  • channel count (1)

  • latency and process time

This is mandatory to avoid pops, noise, and excessive buffering.
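
A typical invocation, using the buffering values from the launch command in section 13 (the sink name is a placeholder, selected through the PULSE_SINK environment variable just like the node does):

PULSE_SINK=<your-sink-name> pacat --raw \
  --format=s16le --channels=1 --rate=16000 \
  --latency-msec=60 --process-time-msec=30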


8. Warmup: The Trick That Changed Everything

8.1 The Problem

Even with persistent processes, the first real sentence still took ~4 seconds.

Why?

  • CUDA kernels are initialized lazily

  • ONNX Runtime does heavy work on first inference


8.2 The Solution: Silent, Blocking Warmup

At node startup:

  1. Send a warmup sentence

  2. Let Piper perform a real inference

  3. Wait for first PCM bytes

  4. Discard the audio

  5. Mark the node as ready

Conceptually:

startup
  ↓
send warmup text
  ↓
wait for first PCM bytes
  ↓
discard audio
  ↓
READY

Result:

  • warmup cost is paid once,

  • user never hears it,

  • first real sentence is fast.


9. Interrupt Handling Without Killing Latency

Robots must speak while things happen.

We want:

  • new messages to interrupt old ones,

  • no pops or glitches,

  • no re-cold-start.

What failed

  • restarting Piper on every interrupt
  • flushing too much audio
  • aggressive queue clearing

Final strategy

  • Piper stays alive
  • pacat stays alive
  • only drop already-queued audio
  • small, bounded drop window (~150–300 ms)

This preserves responsiveness and stability.
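
Conceptually, the drop logic lives in the loop that forwards PCM from the persistent Piper process to the persistent pacat process. A simplified sketch, not the production node; interrupt_at is assumed to be set by the ROS callback when a new message arrives:

import time

DROP_WINDOW_S = 0.25   # ~150-300 ms is enough to flush stale audio
interrupt_at = None    # set to time.monotonic() by the interrupt handler

def forward_pcm(piper_stdout, pacat_stdin, chunk_bytes=2048):
    while True:
        data = piper_stdout.read(chunk_bytes)
        if not data:
            break
        # Inside the drop window: discard already-synthesized audio instead of playing it.
        if interrupt_at is not None and (time.monotonic() - interrupt_at) < DROP_WINDOW_S:
            continue
        # Outside the window (including the new sentence), audio flows normally.
        pacat_stdin.write(data)
        pacat_stdin.flush()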


10. CUDA Acceleration: What Actually Works on Jetson

10.1 The Common Trap

Installing ONNX Runtime via pip:

pip install onnxruntime

gives you only:

CPUExecutionProvider

even if CUDA is present.

The --cuda flag then silently does nothing.


10.2 The Correct Solution on Jetson (ARM64)

We must:

  1. Remove CPU-only ONNX Runtime

  2. Install a prebuilt aarch64 CUDA wheel

  3. Verify providers explicitly

Example:

python3 -c "import onnxruntime as ort; print(ort.get_available_providers())"

Expected output:

['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']

Only then does Piper actually use the GPU.
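
If you want to verify this at the session level rather than just the list of available providers, you can open the model with ONNX Runtime directly (the model path is a placeholder):

import onnxruntime as ort

sess = ort.InferenceSession(
    "/path/to/en_US-amy-low.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(sess.get_providers())   # CUDAExecutionProvider should appear first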


11. Installation: Commands That Actually Work

11.1 System dependencies

sudo apt-get update || true
sudo apt-get install -y \
  pulseaudio-utils \
  alsa-utils

11.2 Install Piper

python3 -m pip install --user piper-tts==1.3.0 pathvalidate

11.3 Install ONNX Runtime GPU (Jetson)

python3 -m pip uninstall -y onnxruntime onnxruntime-gpu

python3 -m pip install \
  https://github.com/ultralytics/assets/releases/download/v0.0.0/onnxruntime_gpu-1.23.0-cp310-cp310-linux_aarch64.whl

Verify:

python3 - << 'PY'
import onnxruntime as ort
print("ORT:", ort.__version__)
print("providers:", ort.get_available_providers())
PY

12. Making It Survive Reboots and Containers

12.1 Audio sinks are not stable

After reboot:

  • default sink may change
  • USB headsets may disappear

Always pass pulse_sink explicitly.
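
To find the exact sink name on your system:

pactl list short sinks

Copy the full name of the device you want (for USB audio it usually starts with alsa_output.usb-) into the pulse_sink parameter.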


12.2 Containers are ephemeral

Everything must be:

  • installed by a bootstrap script,
  • idempotent,
  • verified at startup.

That includes:

  • Piper
  • onnxruntime-gpu
  • audio utilities
  • environment variables
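
A minimal idempotent bootstrap might look like this. It is only a sketch reusing the commands from section 11; adapt the wheel URL, versions, and checks to your image:

#!/usr/bin/env bash
set -e

# Audio clients (pacat, pactl, aplay)
command -v pacat >/dev/null || sudo apt-get install -y pulseaudio-utils alsa-utils

# Piper CLI
python3 -c "import piper" 2>/dev/null || \
  python3 -m pip install --user piper-tts==1.3.0 pathvalidate

# CUDA-enabled ONNX Runtime (aarch64 wheel, see section 11)
python3 -c "import onnxruntime as ort; assert 'CUDAExecutionProvider' in ort.get_available_providers()" 2>/dev/null || {
  python3 -m pip uninstall -y onnxruntime onnxruntime-gpu
  python3 -m pip install \
    https://github.com/ultralytics/assets/releases/download/v0.0.0/onnxruntime_gpu-1.23.0-cp310-cp310-linux_aarch64.whl
}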

13. Running the ROS Node (Final Parameters)

Here is the commented baseline version of the Python node. The low-latency production version keeps Piper and pacat persistent and adds a few streaming parameters (read_chunk_bytes, inter_utterance_silence_ms, drop_audio_after_interrupt_ms, restart_piper_on_interrupt), which appear in the launch command and the parameter descriptions below:

#!/usr/bin/env python3
"""
Piper TTS ROS 2 Node (baseline version)

This node subscribes to a ROS 2 topic (/tts/say) and speaks incoming text using:
  - Piper CLI (TTS inference) producing RAW PCM audio on stdout
  - pacat (PulseAudio client) playing RAW PCM audio to the configured sink

Important design note:
  This "baseline" implementation spawns new processes (piper + pacat) for each chunk.
  It is simple and easy to reason about, but it can be slow at the beginning because:
    - Starting processes costs time
    - Loading the model costs time
    - Initializing CUDA (if enabled) can cost seconds on first use (lazy init)
  In the low-latency "production" version, we keep processes persistent to avoid repeated startup cost.
"""

import os
import re
import shutil
import subprocess
import threading
import time
from dataclasses import dataclass
from queue import Queue, Empty
from typing import Optional, List

import rclpy
from rclpy.node import Node
from std_msgs.msg import String


# -----------------------------
# Utility helpers
# -----------------------------

def _now_ms() -> int:
    """Return current time in milliseconds (useful for latency measurements)."""
    return int(time.time() * 1000)


def _which_or_raise(cmd: str) -> str:
    """
    Resolve an executable in PATH (like the Unix `which` command).
    We fail early if a required command is not installed (piper or pacat).
    """
    p = shutil.which(cmd)
    if not p:
        raise FileNotFoundError(f"Command not found in PATH: {cmd}")
    return p


def _chunk_text(text: str, max_chars: int) -> List[str]:
    """
    Split long text into smaller chunks for better responsiveness.

    Why chunking helps:
      - Many TTS engines (including Piper) begin synthesis only after receiving a full sentence.
      - Very long messages can delay the first spoken words.
      - Chunking the message allows you to begin speaking sooner (perceived latency).

    Implementation:
      - Normalize whitespace.
      - Prefer splitting on sentence punctuation (. ! ? : ;)
      - If chunks are still too large, hard-split to max_chars.
    """
    text = re.sub(r"\s+", " ", text).strip()
    if not text:
        return []

    if len(text) <= max_chars:
        return [text]

    # Split by punctuation boundaries where possible
    parts = re.split(r"(?<=[\.\!\?\:\;])\s+", text)

    chunks: List[str] = []
    buf = ""
    for p in parts:
        p = p.strip()
        if not p:
            continue

        if not buf:
            buf = p
        elif len(buf) + 1 + len(p) <= max_chars:
            buf = buf + " " + p
        else:
            chunks.append(buf)
            buf = p

    if buf:
        chunks.append(buf)

    # If any chunk is still too large, split hard by max_chars
    out: List[str] = []
    for c in chunks:
        if len(c) <= max_chars:
            out.append(c)
        else:
            for i in range(0, len(c), max_chars):
                out.append(c[i:i + max_chars].strip())

    return [c for c in out if c]


# -----------------------------
# Job definition for the worker queue
# -----------------------------

@dataclass
class Job:
    """
    Represents one TTS chunk to speak.

    enqueue_ms is useful for measuring queueing delay (e.g. how long messages wait before speaking).
    """
    text: str
    enqueue_ms: int


# -----------------------------
# ROS 2 Node: PiperTTS
# -----------------------------

class PiperTTS(Node):
    """
    ROS 2 node that listens on /tts/say (std_msgs/String) and speaks text using Piper.

    This baseline version is intentionally simple:
      - Each chunk causes a new Piper process + a new pacat process.
      - Piper produces raw PCM to stdout.
      - pacat consumes that raw PCM and plays it via PulseAudio.

    Downsides:
      - Process creation overhead every time
      - Potentially large "first sentence" latency due to model/CUDA initialization
    """

    def __init__(self):
        super().__init__("piper_tts")

        # ---- ROS Parameters ----
        #
        # We expose most knobs as ROS parameters so we can tune them at runtime
        # without changing code.
        #
        # model_path:
        #   Path to the .onnx Piper model file.
        #
        # piper_cmd:
        #   Name/path of the Piper executable (defaults to "piper" in PATH).
        #
        # use_cuda_flag:
        #   If true, we pass --cuda to Piper (requires onnxruntime-gpu with CUDA provider).
        #
        # max_chars:
        #   Maximum size of each chunk after splitting.
        #
        self.declare_parameter("model_path", "")
        self.declare_parameter("piper_cmd", "piper")
        self.declare_parameter("use_cuda_flag", False)
        self.declare_parameter("max_chars", 120)

        # interrupt:
        #   If true, when a new message arrives we stop any current playback immediately.
        #
        # drop_old:
        #   If true, when a new message arrives we drop any queued chunks that weren't spoken yet.
        #
        self.declare_parameter("interrupt", True)
        self.declare_parameter("drop_old", True)

        # Piper CLI knobs (speech characteristics):
        # - length_scale: speaking rate (smaller -> faster, larger -> slower)
        # - noise_scale / noise_w_scale: voice variability (affects naturalness)
        # - sentence_silence: silence between sentences (in seconds)
        # - no_normalize: disable normalization (can be faster, depends on use case)
        self.declare_parameter("length_scale", 1.0)
        self.declare_parameter("noise_scale", 0.667)
        self.declare_parameter("noise_w_scale", 0.8)
        self.declare_parameter("sentence_silence", 0.0)
        self.declare_parameter("no_normalize", True)

        # PulseAudio / pacat parameters:
        # - pulse_sink: name of PulseAudio sink (output device) to use
        # - pulse_server: pulse server path, e.g. unix:/run/user/1000/pulse/native (common in containers)
        # - pacat_latency_msec, pacat_process_time_msec: buffering knobs (lower -> less latency but more risk of glitches)
        self.declare_parameter("pulse_sink", "")
        self.declare_parameter("pulse_server", "")
        self.declare_parameter("pacat_latency_msec", 20)
        self.declare_parameter("pacat_process_time_msec", 10)

        # Warmup:
        # Many systems pay initialization cost on first run.
        # Warmup runs a short synthesis once at startup so later speech starts faster.
        self.declare_parameter("warmup", True)
        self.declare_parameter("warmup_text", "hello.")

        # ---- Read parameters from ROS 2 ----
        self.model_path = self.get_parameter("model_path").get_parameter_value().string_value
        if not self.model_path:
            raise ValueError("model_path parameter is required")

        self.piper_cmd = self.get_parameter("piper_cmd").get_parameter_value().string_value
        self.piper_bin = _which_or_raise(self.piper_cmd)

        self.use_cuda_flag = self.get_parameter("use_cuda_flag").get_parameter_value().bool_value
        self.max_chars = int(self.get_parameter("max_chars").get_parameter_value().integer_value)

        self.interrupt = self.get_parameter("interrupt").get_parameter_value().bool_value
        self.drop_old = self.get_parameter("drop_old").get_parameter_value().bool_value

        self.length_scale = float(self.get_parameter("length_scale").get_parameter_value().double_value)
        self.noise_scale = float(self.get_parameter("noise_scale").get_parameter_value().double_value)
        self.noise_w_scale = float(self.get_parameter("noise_w_scale").get_parameter_value().double_value)
        self.sentence_silence = float(self.get_parameter("sentence_silence").get_parameter_value().double_value)
        self.no_normalize = self.get_parameter("no_normalize").get_parameter_value().bool_value

        self.pulse_sink = self.get_parameter("pulse_sink").get_parameter_value().string_value
        self.pulse_server = self.get_parameter("pulse_server").get_parameter_value().string_value
        self.pacat_latency = int(self.get_parameter("pacat_latency_msec").get_parameter_value().integer_value)
        self.pacat_proc = int(self.get_parameter("pacat_process_time_msec").get_parameter_value().integer_value)

        self.warmup = self.get_parameter("warmup").get_parameter_value().bool_value
        self.warmup_text = self.get_parameter("warmup_text").get_parameter_value().string_value or "hello."

        # We require pacat for playback
        self.pacat_bin = _which_or_raise("pacat")

        # ---- Worker infrastructure ----
        #
        # We decouple ROS callbacks (which should be fast) from TTS synthesis/playback (which is slow)
        # using a background worker thread + a queue.
        #
        self._q: "Queue[Job]" = Queue()
        self._stop = threading.Event()

        # Track current playback process so we can interrupt it
        self._current_lock = threading.Lock()
        self._current_proc: Optional[subprocess.Popen] = None

        # Subscribe to the input text topic
        self.create_subscription(String, "/tts/say", self.on_say, 10)

        # Log configuration to make debugging easier
        self.get_logger().info(f"Piper model: {self.model_path}")
        self.get_logger().info(
            "Audio out: pacat "
            f"(sink={self.pulse_sink or 'default'}) "
            f"rate=16000Hz ch=1 fmt=s16le "
            f"lat={self.pacat_latency}ms proc={self.pacat_proc}ms"
        )
        self.get_logger().info(
            "Synth: "
            f"length_scale={self.length_scale} "
            f"noise_scale={self.noise_scale} "
            f"noise_w_scale={self.noise_w_scale} "
            f"sentence_silence={self.sentence_silence} "
            f"no_normalize={self.no_normalize}"
        )

        # Start the worker thread
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

        # Optional warmup step
        if self.warmup:
            try:
                t0 = _now_ms()
                self._warmup()
                self.get_logger().info(f"Warmup ok ({_now_ms() - t0}ms)")
            except Exception as e:
                self.get_logger().warn(f"Warmup failed: {repr(e)}")

        self.get_logger().info("PiperTTS ready: subscribe /tts/say")

    def destroy_node(self):
        """
        Called when the node is shutting down.
        We signal the worker to stop and terminate any current playback process.
        """
        self._stop.set()
        try:
            self._interrupt_current()
        except Exception:
            pass
        super().destroy_node()

    def on_say(self, msg: String):
        """
        ROS subscription callback: triggered whenever a message arrives on /tts/say.

        Important:
          Keep this callback fast. Don't do heavy work here.
          We only:
            - optionally drop queued messages
            - optionally interrupt current playback
            - chunk the text
            - enqueue jobs for the background worker
        """
        text = (msg.data or "").strip()
        if not text:
            return

        # Drop any queued but not spoken items if requested
        if self.drop_old:
            while True:
                try:
                    self._q.get_nowait()
                except Empty:
                    break

        # Interrupt current playback if requested
        if self.interrupt:
            self._interrupt_current()

        # Split into chunks for quicker perceived responsiveness
        chunks = _chunk_text(text, self.max_chars)

        # Enqueue chunks for worker
        now = _now_ms()
        for c in chunks:
            self._q.put(Job(text=c, enqueue_ms=now))

    def _interrupt_current(self):
        """
        Stop whatever is currently playing by terminating the pacat process.

        Note:
          In this baseline design, pacat only exists during playback of a chunk.
          Terminating pacat effectively stops audio output quickly.
        """
        with self._current_lock:
            if self._current_proc and self._current_proc.poll() is None:
                try:
                    self._current_proc.terminate()
                except Exception:
                    pass
            self._current_proc = None

    def _env_audio(self) -> dict:
        """
        Build environment variables for subprocesses.

        In container setups, PulseAudio is often accessed through a unix socket,
        so we pass PULSE_SERVER explicitly.

        PULSE_SINK selects a specific output device (e.g., USB headset) instead of "default".
        """
        env = os.environ.copy()
        if self.pulse_server:
            env["PULSE_SERVER"] = self.pulse_server
        if self.pulse_sink:
            env["PULSE_SINK"] = self.pulse_sink
        return env

    def _piper_args(self) -> List[str]:
        """
        Construct the Piper CLI arguments.

        --output-raw:
          Piper writes raw PCM (s16le mono 16kHz by default for many models).
          Raw output is ideal for piping into pacat.

        --cuda (optional):
          Requests CUDAExecutionProvider (requires correct onnxruntime-gpu build).
        """
        args = [
            self.piper_bin,
            "-m", self.model_path,
            "--output-raw",
            "--length-scale", str(self.length_scale),
            "--noise-scale", str(self.noise_scale),
            "--noise-w-scale", str(self.noise_w_scale),
            "--sentence-silence", str(self.sentence_silence),
        ]
        if self.no_normalize:
            args.append("--no-normalize")
        if self.use_cuda_flag:
            args.append("--cuda")
        return args

    def _pacat_args(self) -> List[str]:
        """
        Construct the pacat arguments.

        We explicitly tell pacat:
          --raw: input is raw PCM
          --format=s16le: signed 16-bit little endian
          --channels=1: mono
          --rate=16000: 16kHz sample rate

        latency and process-time tune PulseAudio buffering:
          - Lower values => lower latency but higher risk of underruns/glitches.
          - Higher values => more stable but increased latency.
        """
        return [
            self.pacat_bin,
            "--raw",
            "--format=s16le",
            "--channels=1",
            "--rate=16000",
            f"--latency-msec={self.pacat_latency}",
            f"--process-time-msec={self.pacat_proc}",
        ]

    def _warmup(self):
        """
        Run a quick synthesis at startup.

        WARNING:
          Piper stdout is raw audio bytes, not UTF-8 text.
          Never decode stdout as text.

        We write warmup_text to Piper stdin, then read a small amount of stdout
        to ensure synthesis started.
        """
        piper = subprocess.Popen(
            self._piper_args(),
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            env=self._env_audio(),
        )
        assert piper.stdin and piper.stdout

        piper.stdin.write((self.warmup_text.strip() + "\n").encode("utf-8"))
        piper.stdin.close()

        # Read some bytes to trigger model execution (warmup effect)
        _ = piper.stdout.read(8192)

        # Drain the remaining audio so Piper cannot block on a full stdout pipe,
        # then wait for a clean exit
        _ = piper.stdout.read()
        piper.wait(timeout=60)

        if piper.returncode != 0:
            err = b""
            try:
                err = piper.stderr.read() if piper.stderr else b""
            except Exception:
                pass
            raise RuntimeError(
                f"warmup piper rc={piper.returncode}: {err.decode('utf-8', errors='ignore')[-800:]}"
            )

    def _speak_stream(self, text: str) -> int:
        """
        Speak one chunk by streaming Piper output directly into pacat.

        Pipeline:
          piper stdout (raw PCM) -> pacat stdin -> PulseAudio -> speaker

        Returns:
          first_audio_ms (approx) = time to start both processes.
          Note: this baseline value is optimistic; it doesn't measure actual "first audible sample".
        """
        t0 = _now_ms()
        env = self._env_audio()

        # Start Piper (TTS)
        piper = subprocess.Popen(
            self._piper_args(),
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            env=env,
        )
        assert piper.stdin and piper.stdout

        # Start pacat (playback) and feed it Piper's stdout
        pacat = subprocess.Popen(
            self._pacat_args(),
            stdin=piper.stdout,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.PIPE,
            env=env,
        )

        # Track current playback process so we can interrupt it
        with self._current_lock:
            self._current_proc = pacat

        # Send the text to Piper via stdin
        piper.stdin.write((text.strip() + "\n").encode("utf-8"))
        piper.stdin.close()

        # Approx: pipeline started once processes launched
        first_audio_ms = _now_ms() - t0

        # Wait for playback to complete
        pacat_rc = pacat.wait()
        piper_rc = piper.wait()

        # Error handling: if piper fails, show stderr tail
        if piper_rc != 0:
            err = b""
            try:
                err = piper.stderr.read() if piper.stderr else b""
            except Exception:
                pass
            raise RuntimeError(
                f"piper failed rc={piper_rc}: {err.decode('utf-8', errors='ignore')[-800:]}"
            )

        # If pacat fails, show its stderr tail
        if pacat_rc != 0:
            perr = b""
            try:
                perr = pacat.stderr.read() if pacat.stderr else b""
            except Exception:
                pass
            raise RuntimeError(
                f"pacat failed rc={pacat_rc}: {perr.decode('utf-8', errors='ignore')[-800:]}"
            )

        return first_audio_ms

    def _run(self):
        """
        Worker thread main loop.

        The worker waits for queued TTS jobs and speaks them sequentially.
        Using a worker thread avoids blocking ROS callbacks.
        """
        while not self._stop.is_set():
            try:
                job = self._q.get(timeout=0.1)
            except Empty:
                continue

            total_t0 = _now_ms()
            try:
                first_audio = self._speak_stream(job.text)
                total = _now_ms() - total_t0
                self.get_logger().info(
                    f"TTS chunk done | first_audio={first_audio}ms total={total}ms | "
                    f"'{job.text[:60]}{'...' if len(job.text) > 60 else ''}'"
                )
            except Exception as e:
                self.get_logger().error(f"TTS worker error: {repr(e)}")
            finally:
                with self._current_lock:
                    self._current_proc = None


def main():
    """
    Standard ROS 2 Python entrypoint.
    We init ROS, create the node, spin, and clean up on shutdown.
    """
    rclpy.init()
    node = None
    try:
        node = PiperTTS()
        rclpy.spin(node)
    except KeyboardInterrupt:
        pass
    finally:
        if node is not None:
            try:
                node.destroy_node()
            except Exception:
                pass
        try:
            rclpy.shutdown()
        except Exception:
            pass


if __name__ == "__main__":
    main()

The command below launches the ROS 2 node with the right parameters; adapt model_path and pulse_sink to your setup:

ros2 run robot_assistant piper_tts --ros-args \
  -p model_path:=/home/admin/ros2_ws/models/piper/en_US-amy-low.onnx \
  -p pulse_server:=unix:/run/user/1000/pulse/native \
  -p pulse_sink:=alsa_output.usb-0b0e_Jabra_SPEAK_410_USB_08C8C2AE4A84x011200-00.analog-stereo \
  -p use_cuda_flag:=true \
  -p max_chars:=120 \
  -p interrupt:=true \
  -p drop_old:=true \
  -p warmup:=true \
  -p warmup_text:="This is a warmup sentence that should never be heard." \
  -p pacat_latency_msec:=60 \
  -p pacat_process_time_msec:=30 \
  -p read_chunk_bytes:=2048 \
  -p inter_utterance_silence_ms:=40 \
  -p drop_audio_after_interrupt_ms:=250 \
  -p restart_piper_on_interrupt:=false

Here is a description of each parameter, so you can understand the optimizations applied to reduce latency:

Model and Inference Parameters

model_path

Path to the Piper ONNX model used for speech synthesis.

  • This file defines the voice, language, and quality level.

  • “Low” models (e.g. *-low.onnx) are smaller and typically faster, making them well suited for real-time robotics.

Impact on latency: Very high. Model size directly affects inference speed and memory usage.


use_cuda_flag

Enables GPU acceleration by passing the --cuda flag to Piper.

  • Requires onnxruntime-gpu with a working CUDAExecutionProvider.

  • Without the correct runtime, this flag is silently ignored.

Impact on performance:
Can significantly reduce inference time, but does not eliminate first-run latency (CUDA initialization is lazy).


Audio Routing (PulseAudio)

pulse_server

Explicitly specifies the PulseAudio server to connect to.

  • In containerized environments, PulseAudio usually runs on the host.

  • The container accesses it through a UNIX socket.

Example:

unix:/run/user/1000/pulse/native

Why this matters:
Without this, audio clients inside the container may fail to connect, resulting in no sound without errors.


pulse_sink

Selects the exact PulseAudio output device (sink).

  • Example: a USB headset or external speaker.

  • Avoids relying on the “default” sink, which may change after reboot or USB reconnection.

Impact:
Critical for reproducibility and “works-after-reboot” behavior.


Text Handling and Responsiveness

max_chars

Maximum number of characters per speech chunk.

  • Long messages are split into smaller chunks before synthesis.

  • Smaller chunks allow speech to start earlier.

Trade-off:

  • Smaller value → lower perceived latency, more interruptions

  • Larger value → smoother prosody, but slower start on long messages

Typical range: 80–200 characters
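
For example, with the _chunk_text helper from the node above and an artificially small limit for illustration:

text = "Obstacle detected ahead. Replanning the route. Please stand by."
print(_chunk_text(text, max_chars=40))
# ['Obstacle detected ahead.', 'Replanning the route. Please stand by.']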


interrupt

If enabled, any new incoming message interrupts the current speech.

  • Allows “barge-in” behavior.

  • Essential for interactive robots.

If disabled:
The robot always finishes speaking before responding to new input.


drop_old

If enabled, queued but not yet spoken messages are discarded when a new message arrives.

  • Prevents outdated speech from being played.

  • Ensures the robot always speaks the most recent intent.

Recommended for robots: true


Warmup (Cold-Start Mitigation)

warmup

Runs a warmup synthesis when the node starts.

  • Forces model loading and CUDA initialization early.

  • Prevents a long delay on the first real sentence.

If disabled:
The first spoken sentence may take several seconds to start.


warmup_text

Text used during the warmup phase.

  • Should be long enough to trigger real inference.

  • Audio should be discarded or inaudible.

Purpose:
Pay the expensive initialization cost once, before the robot interacts with users.


Audio Buffering and Latency Control

pacat_latency_msec

Controls the target latency buffer in pacat / PulseAudio.

  • Lower values → lower latency, higher risk of underruns

  • Higher values → more stable audio, increased latency

Typical values: 40–80 ms
The value used here (60 ms) is a balanced choice for Jetson + USB audio.


pacat_process_time_msec

Controls how often PulseAudio processes audio chunks.

  • Smaller values → more responsive, higher scheduling pressure

  • Larger values → more stable, more buffering

Usually tuned together with pacat_latency_msec.


Streaming and Chunk Timing

read_chunk_bytes

Size of audio chunks read from Piper and forwarded to pacat.

  • Smaller chunks can reduce time-to-first-audio.

  • Larger chunks reduce CPU overhead.

Typical values: 1024–4096 bytes


inter_utterance_silence_ms

Artificial silence inserted between chunks or sentences.

  • Prevents clicks and harsh transitions.

  • Improves intelligibility when aggressive chunking is used.

Trade-off:

  • Smaller → faster, but risk of audio artifacts

  • Larger → smoother, slightly less responsive


Interrupt Cleanup and Stability

drop_audio_after_interrupt_ms

Defines how long audio is discarded after an interrupt.

  • Prevents hearing leftover buffered audio from the previous sentence.

  • Acts as a “flush window” for PulseAudio buffers.

Too small: old audio leaks through
Too large: next sentence may be clipped


restart_piper_on_interrupt

Controls whether Piper is fully restarted on interruption.

  • true → clean state, but very high latency

  • false → keep Piper alive, CUDA stays initialized

For low-latency systems: false is strongly recommended.


Summary: Which Parameters Matter Most
  • Startup latency: warmup, warmup_text, use_cuda_flag

  • Perceived responsiveness: max_chars, interrupt

  • Audio stability: pacat_latency_msec, pacat_process_time_msec

  • Glitch prevention: inter_utterance_silence_ms, drop_audio_after_interrupt_ms

  • Reboot robustness: pulse_server, pulse_sink


Testing

ros2 topic pub /tts/say std_msgs/String "{data: 'Hello from ROS two.'}"
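
To check the interrupt and drop_old behavior, publish two messages back to back; the second should cut off the first almost immediately:

ros2 topic pub --once /tts/say std_msgs/String "{data: 'This long sentence should be interrupted right away.'}"
ros2 topic pub --once /tts/say std_msgs/String "{data: 'Interrupting message.'}"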


14. Results

After all iterations:

  • warmup happens once, silently,

  • first audible sound ≈ 800–900 ms,

  • no pops, no noise,

  • short sentences are never dropped,

  • works after reboot and container restart.

Subjectively:

The robot finally feels responsive.


15. Conclusion

Achieving low-latency TTS in ROS 2 is not about choosing the “fastest” model.
It is about systems thinking:

  • understanding process lifecycles

  • respecting audio pipelines

  • controlling initialization costs

  • designing for persistence

  • and measuring what humans actually perceive

Piper is an excellent TTS engine, but only when integrated carefully.
With the right architecture, it becomes a solid, production-ready component for robotic systems.

In the next article, we will zoom out and explore how this TTS node fits into a full ROS audio architecture alongside ASR, state machines, and interaction logic.

But none of that matters unless the robot can speak quickly, reliably, and naturally.

That foundation is now in place.