RK1 compute module benchmarks measure how the RK3588 behaves under sustained real-world load, not short burst tests. CPU throughput, memory bandwidth, thermal behavior, power draw, and AI inference were tested on RK1 modules in a Turing Pi 2.5 cluster under continuous workloads until the system reached steady-state performance.

Quick Overview: RK1 Compute Module Benchmarks

  • What this article covers: CPU throughput, memory bandwidth, thermal behavior, power draw, and AI inference on the RK3588 under sustained real-world load
  • AI inference: 5-8 t/s for 4B models, 3-7 t/s for 7B, 2-4 t/s for 14B (llama.cpp, CPU-only, 30-minute sustained run)
  • CPU performance: ~13.6-13.9k events/sec sustained (sysbench, 8 threads), stable with adequate cooling
  • Memory bandwidth: 17-22 GB/s sequential, drops sharply under concurrent memory-bound workloads
  • Thermal behavior: Steady-state 68-74°C under full load with passive cooling, throttle onset above ~80°C
  • Power efficiency: 4-5W idle per node, 10-12W under full load, full 4-node cluster peaks around 80W

Part 1: CPU Performance

Spec baseline: 4× Cortex-A76 @ 2.4 GHz + 4× Cortex-A55 @ 1.8 GHz, 8nm process. Cache layout per RK3588 datasheet:

| Core | L1-I | L1-D | L2 | L3 (shared) |
|------|------|------|----|-------------|
| Cortex-A76 (×4) | 64 KB | 64 KB | 512 KB/core | 3 MB |
| Cortex-A55 (×4) | 32 KB | 32 KB | 128 KB/core | 3 MB |

Total L2: 2 MB for the A76 cluster, 512 KB for the A55 cluster. Both clusters share a unified 3 MB system-level L3 cache. Under eight simultaneous threads, that 3 MB fills quickly; cache misses then push traffic directly onto the LPDDR4X bus, compounding the bandwidth constraints covered in Part 2.

Sustained CPU load (sysbench, 8 threads):

  • Throughput: ~13.6-13.9k events/sec (stable across sustained load)

Across sustained runs (10 minutes), throughput remains stable, indicating that with adequate cooling the RK3588 maintains near-peak CPU performance without significant thermal throttling.
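
As a rough illustration of that methodology, the sketch below repeats short sysbench runs and parses the reported events/sec, which is one way to watch for thermal fade over a sustained window. The 60-second interval, run count, and thread count are illustrative defaults rather than the exact parameters behind the published figures, and it assumes sysbench is installed.

```python
#!/usr/bin/env python3
"""Repeat short sysbench cpu runs and log events/sec to spot thermal fade."""
import re
import subprocess

RUNS = 10       # 10 x 60 s roughly matches the 10-minute sustained window
THREADS = 8

for i in range(RUNS):
    out = subprocess.run(
        ["sysbench", "cpu", f"--threads={THREADS}", "--time=60", "run"],
        capture_output=True, text=True, check=True,
    ).stdout
    match = re.search(r"events per second:\s*([\d.]+)", out)
    eps = float(match.group(1)) if match else float("nan")
    print(f"minute {i + 1:2d}: {eps:,.0f} events/sec")
```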

Performance is consistent across 8 GB, 16 GB, and 32 GB configurations, confirming that CPU-bound workloads are not limited by memory capacity. Thread-level variance remains high due to big.LITTLE scheduling. Some threads shift onto A55 cores under load, introducing uneven per-thread performance without significantly impacting total throughput.

The A55 cores run at significantly lower IPC than the A76 cores (often around half, depending on workload). Under sustained load, the scheduler distributes work across both clusters, balancing performance and efficiency rather than keeping all threads pinned to A76 cores. Short burst workloads (CI jobs, cron tasks) remain mostly on A76 cores, while sustained parallel workloads increasingly involve A55 participation.
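
To see the per-cluster gap directly, one option is to pin the same single-threaded busy loop to each cluster and compare wall-clock time. The sketch below assumes the usual RK3588 core numbering (cpu0-3 = A55, cpu4-7 = A76); that mapping is an assumption here, so verify it with lscpu on your board.

```python
#!/usr/bin/env python3
"""Pin one busy loop to the A55 cluster, then the A76 cluster, and compare."""
import os
import time

def spin(iterations: int = 20_000_000) -> float:
    """Single-threaded arithmetic loop; returns elapsed seconds."""
    start = time.perf_counter()
    x = 0
    for i in range(iterations):
        x += i * i
    return time.perf_counter() - start

# Assumed RK3588 layout: little cores first, big cores second.
for name, cpus in (("A55 cluster", {0, 1, 2, 3}), ("A76 cluster", {4, 5, 6, 7})):
    os.sched_setaffinity(0, cpus)   # restrict this process to one cluster
    print(f"{name}: {spin():.2f} s")
```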

Large working-set workloads show increased latency variance under sustained load due to higher cache and TLB pressure. This is visible in perf stat output; the impact is modest but consistent.
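
A minimal way to capture that perf stat view is to wrap the workload in a perf invocation, as sketched below. The generic cache and dTLB event aliases may show up as <not supported> depending on the kernel's PMU support on this SoC, and sysbench memory stands in here as a placeholder memory-heavy workload.

```python
#!/usr/bin/env python3
"""Wrap a memory-heavy workload with perf stat to inspect cache/TLB pressure."""
import subprocess

EVENTS = "cache-references,cache-misses,dTLB-load-misses"

subprocess.run(
    ["perf", "stat", "-e", EVENTS, "--",
     "sysbench", "memory", "--threads=8", "run"],   # placeholder target workload
    check=True,
)
```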

Kernel compile benchmark (Linux 6.1, make -j8):

A single RK1 node (eMMC) completes a Linux kernel build in ~28–32 minutes.

Compile time is strongly influenced by storage performance. On eMMC, builds are noticeably slower due to filesystem and IO overhead. NVMe-backed setups reduce compile times significantly, often by 20-30%.

This highlights a broader constraint on RK3588 systems: for IO-heavy workloads like compilation, storage can become the limiting factor before CPU throughput. In practice, upgrading to NVMe has a larger impact on build times than increasing CPU parallelism on a single node.
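
A simple way to quantify that storage effect is to time the same build from checkouts on eMMC and on NVMe. The sketch below does exactly that; the two directory paths are hypothetical placeholders, and it assumes a kernel tree and toolchain are already in place on both filesystems.

```python
#!/usr/bin/env python3
"""Time identical kernel builds on eMMC- and NVMe-backed directories."""
import subprocess
import time

BUILD_DIRS = {
    "eMMC": "/home/user/linux-emmc",   # hypothetical checkout on eMMC
    "NVMe": "/mnt/nvme/linux-nvme",    # hypothetical checkout on NVMe
}

for label, path in BUILD_DIRS.items():
    subprocess.run(["make", "-C", path, "mrproper"], check=True)
    subprocess.run(["make", "-C", path, "defconfig"], check=True)
    start = time.monotonic()
    subprocess.run(["make", "-C", path, "-j8"], check=True)
    print(f"{label}: {(time.monotonic() - start) / 60:.1f} min")
```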


Part 2: Memory Bandwidth: The Real Performance Ceiling

Measured memory bandwidth:

| Test | Result |
|------|--------|
| STREAM (ideal sequential) | ~21-22 GB/s |
| mbw (MCBLOCK) | ~17-19 GB/s |
| mbw (MEMCPY) | ~8-9 GB/s |

The RK3588 uses a 64-bit LPDDR4X interface with a theoretical peak of ~34 GB/s. In practice, achievable bandwidth varies significantly depending on access patterns. STREAM represents an upper bound under near-perfect sequential access. Optimized block copies (MCBLOCK) approach this limit, while general-purpose memory operations (MEMCPY) operate at roughly half that throughput.
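
For a quick sanity check of the general-purpose copy case, a large NumPy buffer copy approximates the MEMCPY-style numbers (it will not reach the tuned STREAM bound). The buffer size and repeat count below are arbitrary choices, and the result counts copied bytes the way mbw reports them.

```python
#!/usr/bin/env python3
"""Rough memcpy-style bandwidth probe using a large NumPy buffer copy."""
import time
import numpy as np

SIZE_MB = 512
REPS = 10

src = np.ones(SIZE_MB * 1024 * 1024, dtype=np.uint8)
dst = np.empty_like(src)
np.copyto(dst, src)                  # warm-up: fault in the destination pages

start = time.perf_counter()
for _ in range(REPS):
    np.copyto(dst, src)              # large sequential copy (memcpy underneath)
elapsed = time.perf_counter() - start

print(f"~{SIZE_MB * REPS / 1024 / elapsed:.1f} GB/s (copied bytes / elapsed)")
```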

Memory bandwidth remains consistent across 8 GB, 16 GB, and 32 GB configurations, as it is constrained by the SoC’s memory interface rather than capacity.

Concurrency cliff:

| Parallel memory-bound sessions | Throughput per session |
|--------------------------------|------------------------|
| 1 | 100% (~21-22 GB/s) |
| 2 | ~60-65% (~13-14 GB/s) |
| 3 | ~40-50% (~8-10 GB/s) |

Running multiple memory-bound workloads concurrently shows a clear drop in per-workload throughput as they compete for the shared memory bus. This behavior is consistent across sequential benchmarks and real-world workloads.

As concurrency increases, total system throughput does not scale linearly. Instead, workloads begin to interfere with each other due to bandwidth contention.
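
The concurrency cliff can be reproduced with a handful of processes running the same copy loop at once; per-session throughput drops as they contend for the shared bus. The sketch below is a simplified stand-in for the benchmark sessions above, with arbitrary buffer sizes.

```python
#!/usr/bin/env python3
"""Run N concurrent copy loops and report average per-session bandwidth."""
import multiprocessing as mp
import time
import numpy as np

SIZE_MB = 256
REPS = 20

def copy_worker(q):
    """One memory-bound session: repeated large copies, result in GB/s."""
    src = np.ones(SIZE_MB * 1024 * 1024, dtype=np.uint8)
    dst = np.empty_like(src)
    start = time.perf_counter()
    for _ in range(REPS):
        np.copyto(dst, src)
    q.put(SIZE_MB * REPS / 1024 / (time.perf_counter() - start))

if __name__ == "__main__":
    for sessions in (1, 2, 3):
        q = mp.Queue()
        procs = [mp.Process(target=copy_worker, args=(q,)) for _ in range(sessions)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        per_session = sum(q.get() for _ in range(sessions)) / sessions
        print(f"{sessions} session(s): ~{per_session:.1f} GB/s per session")
```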


Part 3: Thermal Behavior and Power Draw

Test methodology: 25 minutes continuous full-CPU load, passive heatsink only.

Temperature curve:

| Phase | Window | Temperature |
|-------|--------|-------------|
| Idle | n/a | 38-42°C |
| Ramp-up | 0-8 min | 42-65°C |
| Steady-state | 8-25 min | 66-74°C |
| Throttle onset | n/a | >80°C (frequency cap engaged) |

Under sustained load, A76 cores maintain ~2.2-2.3 GHz without frequency capping, confirming stable thermal behavior in this configuration. No throttling was observed below ~75°C based on cpufreq telemetry.

The Linux scheduler distributes work across A55 cores under sustained load, but this has minimal impact on aggregate throughput in practice. Hardware-level throttling typically requires poor airflow or enclosed setups; with even modest airflow, temperatures remain below throttling thresholds.
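
The temperature and frequency figures above come from sysfs-style telemetry; a minimal logger along those lines is sketched below, intended to run alongside a load generator such as stress-ng --cpu 8. The thermal zone and cpufreq policy paths reflect a typical RK3588 layout and are assumptions; verify them under /sys on your own image.

```python
#!/usr/bin/env python3
"""Sample SoC temperature and per-cluster CPU frequency once per second."""
import time
from pathlib import Path

THERMAL = Path("/sys/class/thermal/thermal_zone0/temp")   # SoC zone (assumed)
FREQS = {   # typical RK3588 policies: 0 = A55, 4 and 6 = the two A76 pairs
    "A55": Path("/sys/devices/system/cpu/cpufreq/policy0/scaling_cur_freq"),
    "A76a": Path("/sys/devices/system/cpu/cpufreq/policy4/scaling_cur_freq"),
    "A76b": Path("/sys/devices/system/cpu/cpufreq/policy6/scaling_cur_freq"),
}

while True:   # Ctrl-C to stop
    temp_c = int(THERMAL.read_text()) / 1000
    freqs = {k: int(p.read_text()) / 1000 for k, p in FREQS.items()}  # kHz -> MHz
    print(f"{time.strftime('%H:%M:%S')}  {temp_c:5.1f}°C  "
          + "  ".join(f"{k} {v:.0f} MHz" for k, v in freqs.items()))
    time.sleep(1)
```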

Power draw:

| State (per RK1 node) | Draw |
|----------------------|------|
| Idle | ~4-5W |
| Mixed workloads | ~6-9W |
| Full CPU + memory load | ~10-12W |

| Cluster (4× RK1 on Turing Pi 2.5) | Draw |
|-----------------------------------|------|
| Idle | ~16-18W |
| Normal to high load | ~25-45W |
| Maximum (with peripherals) | up to ~80W |

Part 4: RK1 AI Inference Performance

llama.cpp (CPU-only, GGUF, 30-minute sustained run):

| Model | Tokens/sec |
|-------|------------|
| 4B | 5-8 t/s |
| 7B | 3-7 t/s |
| 14B | 2-4 t/s |
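
The published figures come from the llama.cpp CLI; for a scripted equivalent, a sustained tokens/sec loop can be driven through the llama-cpp-python bindings as sketched below. The model path, prompt, and context size are placeholders.

```python
#!/usr/bin/env python3
"""Sustained CPU-only tokens/sec loop via the llama-cpp-python bindings."""
import time
from llama_cpp import Llama

llm = Llama(model_path="models/model-q4_k_m.gguf",   # placeholder GGUF path
            n_threads=8, n_ctx=2048)

total_tokens = 0
start = time.monotonic()
while time.monotonic() - start < 30 * 60:             # 30-minute sustained run
    out = llm("Summarise the RK3588 memory hierarchy.", max_tokens=128)
    total_tokens += out["usage"]["completion_tokens"]
    rate = total_tokens / (time.monotonic() - start)
    print(f"running average: {rate:.2f} t/s")
```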

For a detailed breakdown of local LLM performance, quantization, and real-world inference behavior on RK3588, see our dedicated guide on running AI locally.


Part 5: Workload Fit

RK1 handles well:

  • CI/CD and build pipelines: burst workloads fit within A76 headroom
  • Container orchestration agent nodes
  • Single-session LLM inference (≤7B models)
  • Always-on edge services with intermittent compute demand
  • Low-power storage, networking, and utility pods

For anyone building the platform from scratch, the complete Turing Pi 2.5 setup guide covers hardware, BMC configuration, and K3s bootstrapping end to end. For a full picture of what workload types suit this hardware across different use cases, the Turing Pi 2.5 use cases guide covers practical deployment scenarios in detail.


Part 6: Previous Articles in This Series