RK1 compute module benchmarks measure how the RK3588 behaves under sustained real-world load, not short burst tests. CPU throughput, memory bandwidth, thermal behavior, power draw, and AI inference were tested on RK1 modules in a Turing Pi 2.5 cluster under continuous workloads until the system reached steady-state performance.

Quick Overview: RK1 Compute Module Benchmarks

  • What this article covers: CPU throughput, memory bandwidth, thermal behavior, power draw, and AI inference on the RK3588 under sustained real-world load
  • AI inference: 5-8 t/s for 4B models, 3-7 t/s for 7B, 2-4 t/s for 14B (llama.cpp, CPU-only, 30-minute sustained run)
  • CPU performance: ~13.6-13.9k events/sec sustained (sysbench, 8 threads), stable with adequate cooling
  • Memory bandwidth: 17-22 GB/s sequential, drops sharply under concurrent memory-bound workloads
  • Thermal behavior: Steady-state 68-74°C under full load with passive cooling, throttle onset above ~80°C
  • Power efficiency: 4-5W idle per node, 10-12W under full load, full 4-node cluster peaks around 80W

Part 1: CPU Performance

Spec baseline: 4× Cortex-A76 @ 2.4 GHz + 4× Cortex-A55 @ 1.8 GHz, 8nm process. Cache layout per RK3588 datasheet:

| Core | L1-I | L1-D | L2 | L3 (shared) |
|------|------|------|----|-------------|
| Cortex-A76 (×4) | 64 KB | 64 KB | 512 KB/core | 3 MB |
| Cortex-A55 (×4) | 32 KB | 32 KB | 128 KB/core | 3 MB |

Total L2: 2 MB for the A76 cluster, 512 KB for the A55 cluster. Both clusters share a unified 3 MB system-level L3 cache. Under eight simultaneous threads, that 3 MB fills quickly; cache misses then push traffic directly onto the LPDDR4X bus, compounding the bandwidth constraints covered in Part 2.

Sustained CPU load (sysbench, 8 threads):

  • Throughput: ~13.6-13.9k events/sec (stable across sustained load)

Across sustained runs (10 minutes), throughput remains stable, indicating that with adequate cooling the RK3588 maintains near-peak CPU performance without significant thermal throttling.
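
As a rough illustration of that methodology, the sketch below repeats short sysbench runs and parses the reported events/sec, which is one way to watch for thermal fade over a sustained window. The 60-second interval, run count, and thread count are illustrative defaults rather than the exact parameters behind the published figures, and it assumes sysbench is installed.

```python
#!/usr/bin/env python3
"""Repeat short sysbench cpu runs and log events/sec to spot thermal fade."""
import re
import subprocess

RUNS = 10       # 10 x 60 s roughly matches the 10-minute sustained window
THREADS = 8

for i in range(RUNS):
    out = subprocess.run(
        ["sysbench", "cpu", f"--threads={THREADS}", "--time=60", "run"],
        capture_output=True, text=True, check=True,
    ).stdout
    match = re.search(r"events per second:\s*([\d.]+)", out)
    eps = float(match.group(1)) if match else float("nan")
    print(f"minute {i + 1:2d}: {eps:,.0f} events/sec")
```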

Performance is consistent across 8 GB, 16 GB, and 32 GB configurations, confirming that CPU-bound workloads are not limited by memory capacity. Thread-level variance remains high due to big.LITTLE scheduling. Some threads shift onto A55 cores under load, introducing uneven per-thread performance without significantly impacting total throughput.

The A55 cores run at significantly lower IPC than the A76 cores (often around half, depending on workload). Under sustained load, the scheduler distributes work across both clusters, balancing performance and efficiency rather than keeping all threads pinned to A76 cores. Short burst workloads (CI jobs, cron tasks) remain mostly on A76 cores, while sustained parallel workloads increasingly involve A55 participation.
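
To see the per-cluster gap directly, one option is to pin the same single-threaded busy loop to each cluster and compare wall-clock time. The sketch below assumes the usual RK3588 core numbering (cpu0-3 = A55, cpu4-7 = A76); that mapping is an assumption here, so verify it with lscpu on your board.

```python
#!/usr/bin/env python3
"""Pin one busy loop to the A55 cluster, then the A76 cluster, and compare."""
import os
import time

def spin(iterations: int = 20_000_000) -> float:
    """Single-threaded arithmetic loop; returns elapsed seconds."""
    start = time.perf_counter()
    x = 0
    for i in range(iterations):
        x += i * i
    return time.perf_counter() - start

# Assumed RK3588 layout: little cores first, big cores second.
for name, cpus in (("A55 cluster", {0, 1, 2, 3}), ("A76 cluster", {4, 5, 6, 7})):
    os.sched_setaffinity(0, cpus)   # restrict this process to one cluster
    print(f"{name}: {spin():.2f} s")
```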

Large working-set workloads show increased latency variance under sustained load due to higher cache and TLB pressure. This is visible in perf stat output; the impact is modest but consistent.
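
A minimal way to capture that perf stat view is to wrap the workload in a perf invocation, as sketched below. The generic cache and dTLB event aliases may show up as <not supported> depending on the kernel's PMU support on this SoC, and sysbench memory stands in here as a placeholder memory-heavy workload.

```python
#!/usr/bin/env python3
"""Wrap a memory-heavy workload with perf stat to inspect cache/TLB pressure."""
import subprocess

EVENTS = "cache-references,cache-misses,dTLB-load-misses"

subprocess.run(
    ["perf", "stat", "-e", EVENTS, "--",
     "sysbench", "memory", "--threads=8", "run"],   # placeholder target workload
    check=True,
)
```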

Kernel compile benchmark (Linux 6.1, make -j8):

A single RK1 node (eMMC) completes a Linux kernel build in ~28–32 minutes.

Compile time is strongly influenced by storage performance. On eMMC, builds are noticeably slower due to filesystem and IO overhead. NVMe-backed setups reduce compile times significantly, often by 20-30%.

This highlights a broader constraint on RK3588 systems: for IO-heavy workloads like compilation, storage can become the limiting factor before CPU throughput. In practice, upgrading to NVMe has a larger impact on build times than increasing CPU parallelism on a single node.
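
A simple way to quantify that storage effect is to time the same build from checkouts on eMMC and on NVMe. The sketch below does exactly that; the two directory paths are hypothetical placeholders, and it assumes a kernel tree and toolchain are already in place on both filesystems.

```python
#!/usr/bin/env python3
"""Time identical kernel builds on eMMC- and NVMe-backed directories."""
import subprocess
import time

BUILD_DIRS = {
    "eMMC": "/home/user/linux-emmc",   # hypothetical checkout on eMMC
    "NVMe": "/mnt/nvme/linux-nvme",    # hypothetical checkout on NVMe
}

for label, path in BUILD_DIRS.items():
    subprocess.run(["make", "-C", path, "mrproper"], check=True)
    subprocess.run(["make", "-C", path, "defconfig"], check=True)
    start = time.monotonic()
    subprocess.run(["make", "-C", path, "-j8"], check=True)
    print(f"{label}: {(time.monotonic() - start) / 60:.1f} min")
```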


Part 2: Memory Bandwidth: The Real Performance Ceiling

Measured memory bandwidth:

| Test | Result |
|------|--------|
| STREAM (ideal sequential) | ~21-22 GB/s |
| mbw (MCBLOCK) | ~17-19 GB/s |
| mbw (MEMCPY) | ~8-9 GB/s |

The RK3588 uses a 64-bit LPDDR4X interface with a theoretical peak of ~34 GB/s. In practice, achievable bandwidth varies significantly depending on access patterns. STREAM represents an upper bound under near-perfect sequential access. Optimized block copies (MCBLOCK) approach this limit, while general-purpose memory operations (MEMCPY) operate at roughly half that throughput.
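
For a quick sanity check of the general-purpose copy case, a large NumPy buffer copy approximates the MEMCPY-style numbers (it will not reach the tuned STREAM bound). The buffer size and repeat count below are arbitrary choices, and the result counts copied bytes the way mbw reports them.

```python
#!/usr/bin/env python3
"""Rough memcpy-style bandwidth probe using a large NumPy buffer copy."""
import time
import numpy as np

SIZE_MB = 512
REPS = 10

src = np.ones(SIZE_MB * 1024 * 1024, dtype=np.uint8)
dst = np.empty_like(src)
np.copyto(dst, src)                  # warm-up: fault in the destination pages

start = time.perf_counter()
for _ in range(REPS):
    np.copyto(dst, src)              # large sequential copy (memcpy underneath)
elapsed = time.perf_counter() - start

print(f"~{SIZE_MB * REPS / 1024 / elapsed:.1f} GB/s (copied bytes / elapsed)")
```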

Memory bandwidth remains consistent across 8 GB, 16 GB, and 32 GB configurations, as it is constrained by the SoC’s memory interface rather than capacity.

Concurrency cliff:

| Parallel memory-bound sessions | Throughput per session |
|--------------------------------|------------------------|
| 1 | 100% (~21-22 GB/s) |
| 2 | ~60-65% (~13-14 GB/s) |
| 3 | ~40-50% (~8-10 GB/s) |

Running multiple memory-bound workloads concurrently shows a clear drop in per-workload throughput as they compete for the shared memory bus. This behavior is consistent across sequential benchmarks and real-world workloads.

As concurrency increases, total system throughput does not scale linearly. Instead, workloads begin to interfere with each other due to bandwidth contention.
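
The concurrency cliff can be reproduced with a handful of processes running the same copy loop at once; per-session throughput drops as they contend for the shared bus. The sketch below is a simplified stand-in for the benchmark sessions above, with arbitrary buffer sizes.

```python
#!/usr/bin/env python3
"""Run N concurrent copy loops and report average per-session bandwidth."""
import multiprocessing as mp
import time
import numpy as np

SIZE_MB = 256
REPS = 20

def copy_worker(q):
    """One memory-bound session: repeated large copies, result in GB/s."""
    src = np.ones(SIZE_MB * 1024 * 1024, dtype=np.uint8)
    dst = np.empty_like(src)
    start = time.perf_counter()
    for _ in range(REPS):
        np.copyto(dst, src)
    q.put(SIZE_MB * REPS / 1024 / (time.perf_counter() - start))

if __name__ == "__main__":
    for sessions in (1, 2, 3):
        q = mp.Queue()
        procs = [mp.Process(target=copy_worker, args=(q,)) for _ in range(sessions)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        per_session = sum(q.get() for _ in range(sessions)) / sessions
        print(f"{sessions} session(s): ~{per_session:.1f} GB/s per session")
```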


Part 3: Thermal Behavior and Power Draw

Test methodology: 25 minutes continuous full-CPU load, passive heatsink only.

Temperature curve:

| Phase | Window | Temperature |
|-------|--------|-------------|
| Idle | n/a | 38-42°C |
| Ramp-up | 0-8 min | 42-65°C |
| Steady-state | 8-25 min | 66-74°C |
| Throttle onset | n/a | >80°C (frequency cap engaged) |

Under sustained load, A76 cores maintain ~2.2-2.3 GHz without frequency capping, confirming stable thermal behavior in this configuration. No throttling was observed below ~75°C based on cpufreq telemetry.

The Linux scheduler distributes work across A55 cores under sustained load, but this has minimal impact on aggregate throughput in practice. Hardware-level throttling typically requires poor airflow or enclosed setups; with even modest airflow, temperatures remain below throttling thresholds.
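
The temperature and frequency figures above come from sysfs-style telemetry; a minimal logger along those lines is sketched below, intended to run alongside a load generator such as stress-ng --cpu 8. The thermal zone and cpufreq policy paths reflect a typical RK3588 layout and are assumptions; verify them under /sys on your own image.

```python
#!/usr/bin/env python3
"""Sample SoC temperature and per-cluster CPU frequency once per second."""
import time
from pathlib import Path

THERMAL = Path("/sys/class/thermal/thermal_zone0/temp")   # SoC zone (assumed)
FREQS = {   # typical RK3588 policies: 0 = A55, 4 and 6 = the two A76 pairs
    "A55": Path("/sys/devices/system/cpu/cpufreq/policy0/scaling_cur_freq"),
    "A76a": Path("/sys/devices/system/cpu/cpufreq/policy4/scaling_cur_freq"),
    "A76b": Path("/sys/devices/system/cpu/cpufreq/policy6/scaling_cur_freq"),
}

while True:   # Ctrl-C to stop
    temp_c = int(THERMAL.read_text()) / 1000
    freqs = {k: int(p.read_text()) / 1000 for k, p in FREQS.items()}  # kHz -> MHz
    print(f"{time.strftime('%H:%M:%S')}  {temp_c:5.1f}°C  "
          + "  ".join(f"{k} {v:.0f} MHz" for k, v in freqs.items()))
    time.sleep(1)
```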

Power draw:

| State (per RK1 node) | Draw |
|----------------------|------|
| Idle | ~4-5W |
| Mixed workloads | ~6-9W |
| Full CPU + memory load | ~10-12W |

| Cluster (4× RK1 on Turing Pi 2.5) | Draw |
|-----------------------------------|------|
| Idle | ~16-18W |
| Normal to high load | ~25-45W |
| Maximum (with peripherals) | up to ~80W |

Part 4: RK1 AI Inference Performance

llama.cpp (CPU-only, GGUF, 30-minute sustained run):

| Model | Tokens/sec |
|-------|------------|
| 4B | 5-8 t/s |
| 7B | 3-7 t/s |
| 14B | 2-4 t/s |
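
The published figures come from the llama.cpp CLI; for a scripted equivalent, a sustained tokens/sec loop can be driven through the llama-cpp-python bindings as sketched below. The model path, prompt, and context size are placeholders.

```python
#!/usr/bin/env python3
"""Sustained CPU-only tokens/sec loop via the llama-cpp-python bindings."""
import time
from llama_cpp import Llama

llm = Llama(model_path="models/model-q4_k_m.gguf",   # placeholder GGUF path
            n_threads=8, n_ctx=2048)

total_tokens = 0
start = time.monotonic()
while time.monotonic() - start < 30 * 60:             # 30-minute sustained run
    out = llm("Summarise the RK3588 memory hierarchy.", max_tokens=128)
    total_tokens += out["usage"]["completion_tokens"]
    rate = total_tokens / (time.monotonic() - start)
    print(f"running average: {rate:.2f} t/s")
```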

For a detailed breakdown of local LLM performance, quantization, and real-world inference behavior on RK3588, see our dedicated guide on running AI locally.


Part 5: Workload Fit

RK1 handles well:

  • CI/CD and build pipelines: burst workloads fit within A76 headroom
  • Container orchestration agent nodes
  • Single-session LLM inference (≤7B models)
  • Always-on edge services with intermittent compute demand
  • Low-power storage, networking, and utility pods

For anyone building the platform from scratch, the complete Turing Pi 2.5 setup guide covers hardware, BMC configuration, and K3s bootstrapping end to end. For a full picture of what workload types suit this hardware across different use cases, the Turing Pi 2.5 use cases guide covers practical deployment scenarios in detail.


Part 6: Previous Articles in This Series