Real-Time Linux for Robotics - Latency, Jitter, Scheduling, and What Actually Breaks

If you build robots long enough, you eventually stop asking “is it fast?” and start asking “is it predictable?” That is the real question.

A perception stack can tolerate occasional spikes. A web backend can absorb them. A speech pipeline can hide some of them behind buffering. A torque loop, a fieldbus cycle, or a hardware interface running at 1 kHz cannot. In robotics, what hurts you is rarely average latency. It is tail latency, wakeup latency, IRQ latency, page faults, cache disruption, and all the little scheduling decisions that turn a nominal 1 ms loop into a 1.8 ms loop once every few seconds. That is how a stable controller becomes noisy, how a robot starts “feeling soft,” and how integration tests pass while production robots still miss deadlines. Real-time computing is about meeting timing constraints deterministically, not merely being fast on average. ROS 2’s own real-time material makes exactly that distinction: a real-time system is defined by deadline guarantees and bounded jitter, not just low latency.

This is why “real-time Linux” matters. It is also why the topic is badly misunderstood. Many teams think installing a PREEMPT_RT kernel magically makes the machine deterministic. It does not. PREEMPT_RT changes the kernel’s behavior in important ways: it makes much more of the kernel preemptible, turns most interrupts into schedulable threads, and replaces many spinning locks with priority-inheritance-aware mutexes. That dramatically reduces the time a high-priority task can be blocked by non-preemptible kernel paths. But it does not erase hardware latency, firmware behavior, driver quirks, memory faults, thermal throttling, or bad application design. A real-time kernel gives you a fighting chance. It does not give you absolution.

Hard real-time, soft real-time, and the uncomfortable middle

The clean definition is simple. Hard real-time means missing a deadline is a system failure. Soft real-time means deadlines matter, but occasional misses degrade quality rather than constituting immediate failure. ROS 2 design documents also call out firm real-time, where late results are simply useless even if the system does not “fail” in the safety sense. In robotics, most deployed systems are not purely one or the other; they are mixed-criticality systems. A motor current loop, safety monitor, or time-sensitive fieldbus task may be hard or firm real-time. A trajectory follower may be firm real-time. Local planning, teleoperation UI, logging, speech, and vision are usually soft real-time.

That distinction matters because Linux with PREEMPT_RT is extremely useful for a wide range of robotic control workloads, but “hard real-time” still depends on the full stack, not just the scheduler. The kernel documentation is very careful here: PREEMPT_RT minimizes non-preemptible regions, but some paths remain non-preemptible, including scheduler internals, low-level interrupt handling, and entry code. On top of that, your timing budget can still be violated by firmware, DMA behavior, cache misses, device drivers, or vendor BSP code that was never designed for determinism. So the intellectually honest position is this: PREEMPT_RT can support genuinely real-time robotics workloads, but whether your system is hard real-time in the engineering sense depends on the whole platform and on measurement, not branding.

The four numbers that actually matter

When teams say “latency,” they usually compress four different problems into one word.

First, there is release latency: how long after an event occurs can the relevant task begin running? That includes interrupt delivery, IRQ thread wakeup, scheduler decision time, and CPU availability.

Second, there is execution time: once scheduled, how long does the control step itself take? That includes reading hardware, DDS or middleware overhead, controller math, logging side effects, cache locality, and any locking contention.

Third, there is response latency: the total time from event to completed action, including both wakeup and execution.

Fourth, there is jitter: the variation of any of those from cycle to cycle.

In control, jitter is often more damaging than a modest fixed delay. A fixed delay can often be modeled and compensated. Variable delay acts like time-varying phase lag. It changes your effective loop timing, erodes phase margin, distorts observers, and injects noise into the derivative and feedforward terms. A controller designed around a 1 kHz update rate does not just need “fast enough” execution. It needs bounded release time and bounded computation time. That is why ROS 2’s real-time docs emphasize periodic loops, maximum allowable jitter, and the need to avoid nondeterministic operations in the real-time path.

A good mental model is this: if your loop period is T, your controller was designed for T, and your system intermittently behaves as if the sample time were T + ΔT, then your effective plant-plus-controller timing keeps changing under you. That shows up as degraded tracking first, then overshoot, then oscillation, then watchdog trips, then hardware faults. The robot often “works” before it visibly fails. That is the dangerous phase.

What PREEMPT_RT actually changes

On a standard kernel, plenty of kernel execution happens in contexts where a higher-priority task cannot immediately preempt what is currently running. PREEMPT_RT changes that balance. The kernel documentation describes the central mechanisms clearly: most spinlock_t locks become preemptible, rtmutex-based locks with priority inheritance, and interrupts are force-threaded so that much more kernel work is brought under normal scheduling control. The result is that a high-priority real-time task can preempt more of the system and sees lower worst-case scheduling latency.

That “threaded interrupts” point is especially important in robotics. Sensor interrupts, network activity, and device interrupts stop being mysterious black boxes that run in hard IRQ context for unpredictable amounts of time. On PREEMPT_RT, most of them become schedulable kernel threads, which means you can reason about them, assign priorities, observe contention, and debug priority inversions more sanely. But exceptions remain. Some IRQ paths still use raw_spinlock_t, some interrupts are not threaded, and some drivers are simply not written with real-time constraints in mind. PREEMPT_RT improves the ceiling; it does not eliminate the floor.

A subtle but important 2026-era detail is that, as sched(7) now documents, real-time capability is controlled by the CONFIG_PREEMPT_RT build option, while the general preemption model is configured separately. In other words, "RT" is no longer an out-of-tree patch bolted onto one preemption setting in an older mental model of the kernel; since PREEMPT_RT landed in mainline, it is a first-class configuration choice. That matters when you reason about contemporary kernels, distro builds, and BSPs.

Scheduling policies: SCHED_FIFO, SCHED_RR, and SCHED_DEADLINE

The scheduling policy is where many robotics systems go wrong.

SCHED_FIFO is the classic default for user-space control loops. It is simple and powerful: a runnable FIFO thread immediately preempts any lower-priority thread, and a running FIFO thread continues until it blocks, yields, or is preempted by an even higher-priority real-time thread. There is no timeslice. That is exactly why it is useful for deterministic loops, and exactly why it is dangerous if you misuse it: a runaway FIFO thread can starve everything below it.

SCHED_RR is basically SCHED_FIFO with a time quantum among equal-priority peers. It is less common in tight robotics control, but it can be useful when you truly want round-robin behavior among tasks at the same priority.

SCHED_DEADLINE is conceptually the most elegant policy for periodic control jobs. Linux implements it using global earliest-deadline-first scheduling (GEDF) combined with the constant bandwidth server (CBS), with explicit runtime, deadline, and period parameters. The kernel performs admission control and throttles overruns so that one deadline task does not silently consume everyone else’s reserved bandwidth. In theory, this is much closer to how control engineers think: “this task needs X microseconds every Y microseconds, with deadline Z.” In practice, many robotics stacks still default to FIFO because the ecosystem, tooling, and habits are more mature there. But if you are building a serious cyclic executive or mixed real-time workload, SCHED_DEADLINE deserves more attention than it usually gets.

There is another trap here: priority is architecture, not decoration. If your DDS receive threads, EtherCAT master, hardware interface, controller manager, and logging threads all inherit random defaults, then your “real-time” system is just a collection of wishful thinking. ros2_control’s controller manager documentation is revealing: it explicitly tries to set its main thread to SCHED_FIFO priority 50, and its determinism notes repeatedly warn that jitter in the main control loop matters directly. It also notes that long controller updates and controller mode switches introduce jitter, and that asynchronous controllers can miss cycles when their own update time exceeds the configured schedule. That is the real system speaking.

Why latency breaks control loops

At a high level, delay hurts control loops in three ways.

The first is phase lag. Every extra millisecond between sensing and actuation pushes the control action further into the future relative to the plant dynamics. For fast plants—arms with torque control, balancing robots, drones, dynamic locomotion—that lost phase margin is expensive.

The second is sample-time variability. Most controllers are tuned assuming a known update period. Jitter means your effective discretization keeps changing. Even if the average period is right, the instantaneous timing is not. State estimation, velocity differentiation, anti-windup, and contact handling all suffer.

The third is deadline bunching. When a loop is delayed, work often piles into the next cycle. You then see a bad pattern: late read, stale state, overlong compute, late write, then a compressed next cycle with less slack, which increases contention and produces another miss. One bad wakeup can become a burst of bad cycles.

This is why “it only spikes once in a while” is not reassuring. In robotics, rare timing faults are the ones that evade testing and show up under thermal load, after long uptimes, or while multiple subsystems synchronize badly. ROS 2’s real-time docs explicitly call out page faults, dynamic allocation, and indefinite blocking in the real-time path as primary nondeterministic hazards for exactly this reason.

CPU isolation: necessary, useful, and often half-done

If you want deterministic behavior, stop forcing your control loop to share a CPU with the rest of the operating system.

Linux provides several building blocks here. sched_setaffinity() constrains a task to specific CPUs, and cpusets extend that with cgroup-based control over allowed CPUs and memory nodes. The kernel’s housekeeping documentation explains the deeper part: once you isolate CPUs, the system still needs housekeeping CPUs to run unbound workqueues, timers, vmstat updates, watchdog activity, remote scheduler tick work, and other kernel noise. Isolation is not just “pin task to core 3.” It is “create a place where the robot’s real-time work runs, and a different place where the rest of the machine is allowed to be noisy.”

This is also where people misuse isolcpus. The boot parameters isolcpus, nohz_full, rcu_nocbs, IRQ affinity, cpuset partitions, and user-space affinity all solve different parts of the problem. A pinned thread on a supposedly isolated CPU can still be disturbed by managed IRQs, kernel ticks, RCU callback work, or unbound kernel tasks if the housekeeping side is not configured coherently. The kernel docs explicitly describe how isolcpus=domain, nohz_full, and managed IRQ isolation move different classes of work away from isolated CPUs.
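To make the coherent version concrete, here is an illustrative boot command line for a four-core machine that reserves CPU 3 for real-time work and keeps CPUs 0-2 as housekeeping. The CPU numbers are assumptions, and the exact parameter spelling should be checked against your kernel version's documentation:

```
isolcpus=domain,managed_irq,3 nohz_full=3 rcu_nocbs=3 irqaffinity=0-2
```

This only relocates kernel-side noise. You still have to pin your own threads, and the IRQ threads that feed them, onto the correct side of the split.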

The practical lesson is that CPU isolation is a system design exercise, not a single command. Pin the control loop, pin or prioritize IRQ threads that feed it, dedicate at least one housekeeping CPU, and verify with tracing that noise is actually gone instead of merely renamed.

Docker: not the enemy, but not free

Containers do not change physics. They do, however, change how resource control, capabilities, cgroups, and CPU topology are applied. That can absolutely affect real-time behavior.

The good news is that Docker can run real-time scheduled workloads. Docker’s documentation is explicit: the host kernel must support RT group scheduling, the daemon needs a configured --cpu-rt-runtime, and the container needs CAP_SYS_NICE, appropriate rtprio limits, and its own --cpu-rt-runtime if you want to use real-time scheduling inside the container. Compose also supports cpu_rt_runtime, cpu_rt_period, and cpuset. In other words, containerization does not inherently forbid real-time scheduling.

The bad news is that containers make it easier to hide mistakes. A loop inside a container can look pinned while the host still schedules IRQs and housekeeping work on the same CPUs. CPU quotas meant for fairness can create throttling artifacts. Namespace and cgroup defaults can obscure where runtime is actually going. And teams often grant CAP_SYS_NICE without also designing host-level CPU isolation, which means the application gets a higher priority but not a quieter machine.

My rule of thumb is simple: containers are fine for robotics real-time workloads if the host is designed for determinism first. Containerization is an operational boundary, not a timing boundary. Do not use Docker to paper over a scheduling design you have not solved on bare metal.
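As a concrete sketch of the container side, a Compose fragment along these lines grants the pieces Docker's documentation describes, assuming the daemon was started with a matching --cpu-rt-runtime budget and the host has an isolated CPU 3. The service name, image, and all numeric values are placeholders:

```yaml
services:
  control:
    image: my-robot-control:latest   # placeholder image name
    cap_add:
      - SYS_NICE                     # required for RT scheduling in-container
    ulimits:
      rtprio: 99                     # allow FIFO/RR priorities up to 99
    cpu_rt_runtime: '950000'         # microseconds of RT budget per period
    cpu_rt_period: '1000000'         # microseconds
    cpuset: '3'                      # pin to the host's isolated CPU
```

Note what this does and does not do: it grants the container permission to use a quiet CPU, but only the host-level isolation work actually makes that CPU quiet.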

Memory, page faults, and the things that make your loop nondeterministic

The most common self-inflicted real-time bug in user space is still dynamic memory on the hot path.

ROS 2’s real-time documentation could not be clearer: to keep a periodic loop deterministic, you must avoid page faults, dynamic allocation and deallocation, and synchronization primitives that may block indefinitely. Their demo explicitly uses mlockall to keep memory resident and explains that page faults and heap activity are unacceptable inside the real-time path.

That advice is easy to underestimate because modern CPUs are fast enough to hide the average cost. But average cost is not the issue. A single major fault, allocator path, or contended mutex at the wrong moment is enough to blow the cycle budget. The same applies to logging, exceptions, string formatting, lazy initialization, dynamic plugin loading, and “convenient” abstractions that allocate on first use. In a real-time loop, the correct question is not “is this usually cheap?” It is “can I bound it?”

Measurement: stop benchmarking throughput and start measuring noise

A real-time system should be profiled like a real-time system.

Recent Linux kernels provide RTLA—the Real-Time Linux Analysis toolkit—with tools like rtla osnoise and rtla timerlat. The kernel documentation describes rtla timerlat as a way to measure timer IRQ and thread latency, and rtla osnoise as a way to quantify operating-system noise while preemption, IRQs, and softirqs are enabled. timerlat even exposes options to bound idle exit latency via /dev/cpu_dma_latency, which is a good reminder that C-state exit time is itself part of your control latency budget.

This is the mature workflow: measure baseline wakeup latency, isolate CPUs, pin IRQs, re-measure, introduce your middleware, re-measure, introduce your container boundary, re-measure, introduce your actual controller and hardware traffic, re-measure. If you cannot show a histogram of worst-case latency under representative load, you do not yet know whether your system is real-time enough.
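As a starting point, the two baseline measurements look roughly like this on the command line. The flags follow the rtla man pages and should be verified against the rtla version your kernel ships; CPU 3 is an assumed isolated core:

```
# Timer IRQ and wakeup latency on CPU 3, measurement thread at FIFO 95
rtla timerlat top -c 3 -d 1m -P f:95

# Operating-system noise on the same CPU
rtla osnoise top -c 3 -d 1m
```

Run them before and after each change in the workflow above; the diff between histograms is the evidence that an isolation step actually worked.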

Jetson, ROS 2, and what breaks in the real world

Jetson is a fascinating robotics platform because it combines compute density with Linux flexibility, and that makes it extremely attractive for integrated perception-plus-control machines. NVIDIA provides a real-time kernel path for Jetson Orin devices, but their own documentation is cautious: on Jetson Linux r36.5, the RT kernel support for Orin AGX, Orin NX, and Orin Nano is explicitly described as Developer Preview quality. You can install it via OTA packages or build it manually, and you can switch between the generic and RT kernels from the boot configuration. That is useful, but it is not the same as saying the entire BSP and driver stack is fully validated for hard real-time control.

NVIDIA’s own real-time-related guidance also exposes where embedded reality bites. Their Jetson documentation notes that, with the RT kernel, CPU DVFS is disabled by default to meet latency requirements, and that GPU DVFS is also disabled in that context due to latency concerns. They further note that GPU kernel driver functionality and performance are not guaranteed with PREEMPT_RT in that documentation stream. In the Jetson Linux r38.4 release notes, NVIDIA also calls out a specific source of latency spikes: TPM-based hardware random number generation can contribute multiple hundreds of microseconds to real-time latency. That is exactly the kind of issue that explains why “Linux is fast” is not a serious real-time statement.

For ROS 2 on Jetson, the practical pattern is therefore obvious. Use PREEMPT_RT where the hardware stack supports it. Prefer steady-time-based control loops. Keep ros2_control’s real-time path lean. Push perception, logging, and other soft-real-time work away from the isolated control CPUs. Be deeply suspicious of anything that wakes the GPU or changes power state in the middle of a supposedly deterministic control cycle. And measure everything on the actual BSP you intend to ship, not on a desktop Linux machine that merely “resembles” your embedded target.

Where Linux real-time is enough, and where it is not

For a very large share of robotics systems, real-time Linux is the right answer. It is especially compelling when you need one machine to do control, ROS 2 orchestration, networking, observability, fieldbus integration, and some level of perception or local intelligence. It is a strong fit for arms, mobile robots, industrial interfaces, teleoperation, voice and multimodal systems with local actuation, and many cyber-physical systems where bounded millisecond or sub-millisecond behavior matters but dedicated RTOS separation would be too restrictive or too operationally expensive. The Linux Foundation real-time project continues to maintain PREEMPT_RT patch lines, and the current documentation and kernel tooling are far better than they were a few years ago.

But Linux is not always the whole answer. If your safety case depends on strict certification, if your worst-case budget is so tight that firmware and vendor drivers are unacceptable sources of uncertainty, or if you need deeply bounded behavior across devices that do not have strong RT support, then a co-kernel approach, dedicated MCU, drive-local control, or a proper RTOS partition may still be the better system design. Real robotics platforms often end up hybrid: Linux for orchestration and high-level control, dedicated real-time controllers for the innermost loops.

That is not a weakness. It is just engineering.

The practical conclusion

The real lesson of real-time Linux is not “use PREEMPT_RT.” It is “design for determinism.”

Use PREEMPT_RT because it moves far more of the machine under schedulable control. Use SCHED_FIFO or SCHED_DEADLINE intentionally, not decoratively. Isolate CPUs properly and leave housekeeping somewhere else. Lock memory. Remove allocations from the hot path. Treat Docker as an operational tool, not a timing guarantee. Keep ROS 2 control paths brutally simple. Measure with RTLA. And assume that what breaks your loop will not be the obvious thing—it will be the one timer, one IRQ, one allocator path, one power-management feature, or one driver path you forgot was still on the critical path.

That is what actually breaks control loops.

And that is also why real-time Linux is such a differentiator for robotics: once you understand latency and jitter as first-class design constraints rather than performance footnotes, you stop building software that is merely fast—and start building systems that stay stable when the robot, the OS, and the real world all become busy at the same time.