Every time you install a package on one node, you SSH into the next and repeat the same steps manually. Rebuilding a failed node means retracing commands from memory, while configurations slowly drift between machines until no two nodes are exactly alike. That is the operational reality of managing a multi-node homelab cluster without automation.

Ansible on Turing Pi 2.5 changes that model. Instead of relying on imperative SSH sessions and manual recovery steps, you define the desired state of the cluster once and apply it consistently across every RK1 node. This guide picks up directly after the initial cluster setup and turns a manually managed RK1 cluster into reproducible infrastructure-as-code.

Quick Overview: Ansible Automation for Turing Pi 2.5

This guide builds a complete Ansible workflow for a 3-node RK1 cluster:

  • Control machine: Arch Linux with Ansible installed via pacman
  • Cluster: Turing Pi 2.5 with three RK1 nodes running Ubuntu 22.04 ARM64
  • What you build: inventory, base provisioning, node-specific playbooks, rolling updates, Vault secrets, and a live recovery demo

No extra hardware required. Nodes run entirely on eMMC. For context on what the cluster can run, the use cases overview and the full build guide are worth reviewing before going deeper into automation.

Prerequisites:

  • SSH key access to all three nodes (covered in the cluster setup article)
  • Static IPs already configured
  • Ubuntu 22.04 ARM64 on all nodes

Part 1: Installing Ansible and Verifying SSH Access

Ansible runs from a central control machine and communicates with cluster nodes over SSH. Run the commands in this guide from your laptop, desktop, or other host acting as the control machine, not from the RK1 nodes themselves. No Ansible packages need to be installed on the cluster nodes; they only need SSH access and Python 3, which Ubuntu 22.04 ships by default.

Install Ansible on the control machine using the package manager for your Linux distribution or operating system. The official Ansible installation documentation covers supported platforms and installation methods for Arch Linux, Ubuntu, Debian, Fedora, and other environments.

After installation, verify SSH connectivity to one of the cluster nodes and confirm the system is running ARM64:

ssh <username>@<node-ip> "hostname && uname -m"

Example, using the ubuntu user from the inventory in Part 2:

ssh ubuntu@<node1-ip> "hostname && uname -m"

Example output:

rk1-node
aarch64

Replace the username and IP address with values matching your environment, then repeat the check against the remaining cluster nodes before continuing.

If SSH prompts for a password, configure key-based authentication before moving forward. Repeated password prompts become impractical once playbooks begin targeting multiple nodes. The initial cluster setup guide covers SSH key setup in detail.

Part 2: Ansible Inventory File for Turing Pi 2.5

The inventory file is the foundation of every Ansible workflow. It defines how Ansible connects to cluster nodes and allows playbooks to target systems by role instead of repeating IP addresses and SSH settings throughout multiple files.

Create an inventory.ini file on the control machine:

# inventory.ini

[primary]
node1 ansible_host=<node1-ip> ansible_user=ubuntu

[workers]
node2 ansible_host=<node2-ip> ansible_user=ubuntu
node3 ansible_host=<node3-ip> ansible_user=ubuntu

[inference_node]
node3 ansible_host=<node3-ip> ansible_user=ubuntu

[services_node]
node2 ansible_host=<node2-ip> ansible_user=ubuntu

[cluster:children]
primary
workers

[all:vars]
ansible_ssh_private_key_file=~/.ssh/id_rsa
ansible_python_interpreter=/usr/bin/python3

Replace <node1-ip>, <node2-ip>, and <node3-ip> with the static IPs assigned to each RK1 node. These should match whatever you configured during the initial cluster setup. Every subsequent command in this guide that references a node IP assumes those same values. The names node1, node2, and node3 are inventory aliases used by Ansible and do not need to match the actual Linux hostnames on the RK1 nodes.

The cluster group targets all nodes with a single directive. inference_node and services_node let role-specific playbooks hit exactly the right machine without duplicating logic or hardcoding IPs across multiple files.

Confirm that Ansible can reach every node with the ping module:

ansible all -i inventory.ini -m ping

This guide uses a 3-node setup matching the Turing Pi 2.5 with three RK1 modules. If your cluster has two or four nodes, adjust the inventory accordingly. Add or remove host entries under the relevant groups and update the IP addresses to match. The playbooks themselves do not need to change.
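
Ansible also accepts inventories written in YAML, which can be easier to read once groups nest. The sketch below is an assumed equivalent of inventory.ini, not a file this guide otherwise uses. The inventory.yml filename is hypothetical, and the connection settings are hoisted into all.vars so each host is declared only once.

```yaml
# inventory.yml (hypothetical YAML equivalent of inventory.ini)
all:
  vars:
    ansible_user: ubuntu
    ansible_ssh_private_key_file: "~/.ssh/id_rsa"
    ansible_python_interpreter: /usr/bin/python3
  children:
    cluster:
      children:
        primary:
          hosts:
            node1:
              ansible_host: "<node1-ip>"
        workers:
          hosts:
            node2:
              ansible_host: "<node2-ip>"
            node3:
              ansible_host: "<node3-ip>"
    # Role groups reuse hosts already defined above
    inference_node:
      hosts:
        node3:
    services_node:
      hosts:
        node2:
```

It would be passed the same way, with -i inventory.yml in place of -i inventory.ini.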

Troubleshooting: SSH Key Not Found

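The most common failure at this step is Ansible reporting UNREACHABLE with this message:

Failed to connect to the host via ssh

This usually means Ansible could not authenticate, most often because the keypair referenced by ansible_ssh_private_key_file does not exist on the control machine yet. If ~/.ssh/id_rsa is missing, generate it: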

ssh-keygen -t rsa -b 4096 -C "turing-pi-ansible"

Press Enter at each prompt to accept the default path and skip the passphrase. Then copy the public key to each node:

ssh-copy-id ubuntu@<node1-ip>
ssh-copy-id ubuntu@<node2-ip>
ssh-copy-id ubuntu@<node3-ip>

Each command prompts for the node’s ubuntu password once. After that, re-run the ping command from above.

Expected output:

node1 | SUCCESS => {
    "changed": false,
    "ping": "pong"
}

node2 | SUCCESS => {
    "changed": false,
    "ping": "pong"
}

node3 | SUCCESS => {
    "changed": false,
    "ping": "pong"
}

Every node should return pong. Resolve any SSH or authentication issues before continuing to the provisioning playbooks.

Part 3: Base Provisioning Playbook

The base provisioning playbook establishes a consistent baseline across the cluster: standardized hostnames, timezone configuration, updated packages, and a working Docker environment. Instead of manually preparing nodes one at a time, the same configuration is applied reproducibly across every RK1 system.

Create a base.yml playbook with nano on the control machine:

---
- name: Base provisioning for all cluster nodes
  hosts: cluster
  become: true

  vars:
    cluster_timezone: "UTC"

  tasks:
    - name: Set hostname
      hostname:
        name: "{{ inventory_hostname }}"

    - name: Set timezone
      timezone:
        name: "{{ cluster_timezone }}"

    - name: Update apt cache and upgrade packages
      apt:
        update_cache: true
        upgrade: safe
        cache_valid_time: 3600
        update_cache_retries: 3

    - name: Install base packages
      apt:
        name:
          - curl
          - git
          - htop
          - ca-certificates
          - docker-compose
          - docker.io
        state: present
      register: docker_pkg_install

    - name: Reset Docker failed state after fresh install
      shell: systemctl reset-failed docker.service || true
      when: docker_pkg_install.changed

    - name: Enable and start Docker socket
      systemd:
        name: docker.socket
        enabled: true
        state: started

    - name: Wait for Docker daemon to become available
      shell: docker info
      register: docker_ready
      retries: 10
      delay: 5
      until: docker_ready.rc == 0
      changed_when: false

    - name: Add ubuntu user to docker group
      user:
        name: ubuntu
        groups: docker
        append: true

The hostname task aligns the Linux hostnames with the inventory aliases defined earlier, keeping node naming consistent across Ansible output, SSH sessions, and cluster management workflows.

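Run the playbook from the control machine. The --ask-become-pass flag prompts for the sudo password, matching the other commands in this guide:

ansible-playbook -i inventory.ini base.yml --ask-become-pass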

Idempotency and Why It Matters

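Nearly every task in this playbook is idempotent. Running the playbook a second time should produce mostly ok results with few or no changed states. The apt module skips packages that are already installed, the hostname module leaves correctly named systems alone, and the systemd task makes no change when docker.socket is already enabled and running. The main exception is the upgrade task, which reports changed whenever new package versions have been published since the last run.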

This matters operationally because the playbook becomes safe to rerun at any time. When new nodes are added later, the same playbook can provision them into a known-good state without introducing unintended changes across the rest of the cluster.

Infrastructure that can be safely rerun is infrastructure you can trust.
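
To make the distinction concrete, here is a small contrast between a task that is not idempotent and one that is. It is an illustrative sketch only and not part of base.yml; the /etc/ansible-marker path is a hypothetical file chosen for the example.

```yaml
# idempotency-demo.yml (illustrative only; /etc/ansible-marker is a made-up path)
---
- name: Idempotency contrast
  hosts: cluster
  become: true

  tasks:
    # Not idempotent: appends another copy of the line and
    # reports "changed" on every single run.
    - name: Append marker with shell
      shell: echo "managed-by-ansible" >> /etc/ansible-marker

    # Idempotent: adds the line only when it is missing,
    # then reports "ok" on later runs.
    - name: Ensure marker line is present
      lineinfile:
        path: /etc/ansible-marker
        line: managed-by-ansible
        create: true
        mode: "0644"
```

Running it twice would leave the shell task reporting changed both times, while the lineinfile task settles to ok once the line exists.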

Part 4: Node-Specific Playbooks

Base provisioning establishes a consistent baseline across the cluster, but infrastructure roles are rarely identical. Role-specific workloads should be separated into dedicated playbooks that target the appropriate inventory groups. This keeps automation modular and avoids unnecessary conditional logic inside a single monolithic playbook.

Inference Node (32 GB RK1)

The inference node is responsible for local LLM workloads using Ollama. The RK3588 inference workflow covers model loading, quantization, and performance tuning in detail. This playbook focuses only on reproducible deployment and configuration.

Create an inference.yml playbook on the control machine:

# inference.yml
---
- name: Deploy Ollama on inference node
  hosts: inference_node
  become: true

  tasks:
    - name: Install Ollama
      shell: curl -fsSL https://ollama.com/install.sh | sh
      args:
        creates: /usr/local/bin/ollama

    - name: Create Ollama systemd override directory
      file:
        path: /etc/systemd/system/ollama.service.d
        state: directory
        mode: '0755'

    - name: Configure Ollama parallelism
      copy:
        dest: /etc/systemd/system/ollama.service.d/override.conf
        content: |
          [Service]
          Environment="OLLAMA_NUM_PARALLEL=4"
      when: ansible_facts['memtotal_mb'] > 28000
      notify:
        - Reload systemd
        - Restart Ollama

    - name: Enable and start Ollama service
      systemd:
        name: ollama
        enabled: true
        state: started

  handlers:
    - name: Reload systemd
      systemd:
        daemon_reload: true

    - name: Restart Ollama
      systemd:
        name: ollama
        state: restarted

The creates argument keeps the installation task idempotent by skipping the Ollama installer if the binary already exists. This is a practical pattern when automating third-party installation scripts that do not provide native Ansible modules.

The when: ansible_facts['memtotal_mb'] > 28000 condition demonstrates where hardware-aware automation becomes useful. Larger memory nodes can safely handle more parallel inference workloads, while smaller systems continue using the default configuration. Inventory groups handle permanent role separation, while when conditions are better suited for hardware-specific tuning within those groups.
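
Before relying on a fact-based condition, it can help to see what each node actually reports. The following is a minimal sketch of a helper playbook, assuming the same inventory; the check-memory.yml filename is hypothetical. It only prints values and changes nothing on the nodes.

```yaml
# check-memory.yml (hypothetical helper playbook)
---
- name: Show the memory fact used by the when condition
  hosts: cluster
  gather_facts: true

  tasks:
    - name: Print total memory and whether the override would apply
      debug:
        msg: >-
          {{ inventory_hostname }} reports
          {{ ansible_facts['memtotal_mb'] }} MB;
          override applies: {{ ansible_facts['memtotal_mb'] > 28000 }}
```

If a node reports less memory than expected, adjust the 28000 threshold rather than assuming the nominal module size.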

Run the playbook from the control machine:

ansible-playbook -i inventory.ini inference.yml --ask-become-pass

Services Node (16 GB RK1)

The services node runs Pi-hole for network-level DNS filtering. The Pi-hole and Tailscale setup explains the networking workflow in detail. This playbook focuses only on automated deployment and recovery.

Create a services.yml playbook on the control machine:

# services.yml
---
- name: Deploy Pi-hole on services node
  hosts: services_node
  become: true

  vars:
    pihole_password: "change-this-password"

  tasks:
    - name: Create Pi-hole directory
      file:
        path: /opt/pihole
        state: directory
        owner: ubuntu
        group: ubuntu
        mode: '0755'

    - name: Write Pi-hole Docker Compose file
      copy:
        dest: /opt/pihole/docker-compose.yml
        owner: ubuntu
        group: ubuntu
        mode: '0644'
        content: |
          version: "3"

          services:
            pihole:
              image: pihole/pihole:latest
              container_name: pihole
              environment:
                TZ: "UTC"
                WEBPASSWORD: "{{ pihole_password }}"
              volumes:
                - ./etc-pihole:/etc/pihole
                - ./etc-dnsmasq.d:/etc/dnsmasq.d
              ports:
                - "53:53/tcp"
                - "53:53/udp"
                - "80:80/tcp"
              restart: unless-stopped

    - name: Start Pi-hole
      shell: docker-compose up -d
      args:
        chdir: /opt/pihole
      register: compose_output
      # docker-compose may print progress to stderr, so check both streams
      changed_when: >-
        (compose_output.stdout ~ compose_output.stderr)
        is search('Creating|Recreating|Starting')

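The Pi-hole deployment uses a dedicated inventory group so the service is deployed only to the intended node without additional host filtering logic. Separating workloads by role keeps playbooks predictable as infrastructure grows and additional services are added later.

Note that pihole/pihole:latest is an unpinned tag, and the WEBPASSWORD variable applies to Pi-hole images before v6. Pi-hole v6 and later read the admin password from FTLCONF_webserver_api_password instead, so pin the image tag or adjust the variable name to match the version you deploy.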
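
A deployment step is easier to trust when the play also confirms that the service answers. As a hedged sketch, a task like the one below could be appended to the tasks list in services.yml; it is an addition for illustration, not part of the original playbook. It runs on the services node and waits until something accepts TCP connections on port 53.

```yaml
    # Optional check, appended after "Start Pi-hole" (illustrative addition)
    - name: Wait for Pi-hole to accept DNS connections
      wait_for:
        host: 127.0.0.1
        port: 53
        timeout: 60
```

If the port never opens, the task fails after 60 seconds and the play stops, which surfaces a broken deployment immediately instead of at the next DNS lookup.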

Run the playbook from the control machine:

ansible-playbook -i inventory.ini services.yml --ask-become-pass

Troubleshooting: Port 53 Already in Use

If the playbook fails with this error:

failed to bind host port 0.0.0.0:53/tcp: address already in use

another service is already occupying port 53 on the target node. One common cause on Ubuntu is systemd-resolved.

Fix it by SSHing into the node and disabling the DNS stub listener:

ssh ubuntu@<node2-ip>
sudo sed -i 's/#DNSStubListener=yes/DNSStubListener=no/' /etc/systemd/resolved.conf
sudo systemctl restart systemd-resolved
exit

Then rerun the playbook from the control machine:

ansible-playbook -i inventory.ini services.yml --ask-become-pass
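
The same fix can be expressed as Ansible tasks so it survives a reprovision instead of living in shell history. This is a sketch under the assumption that systemd-resolved is the service holding port 53; the free-port-53.yml filename is hypothetical, and the tasks could equally be placed at the top of services.yml.

```yaml
# free-port-53.yml (hypothetical; assumes systemd-resolved holds port 53)
---
- name: Free port 53 for Pi-hole
  hosts: services_node
  become: true

  tasks:
    - name: Disable the systemd-resolved stub listener
      lineinfile:
        path: /etc/systemd/resolved.conf
        # Matches the commented default as well as an existing setting
        regexp: '^#?DNSStubListener='
        line: DNSStubListener=no
      notify: Restart systemd-resolved

  handlers:
    - name: Restart systemd-resolved
      systemd:
        name: systemd-resolved
        state: restarted
```

Because lineinfile only changes the file when the line differs, the handler restarts systemd-resolved on the first run and does nothing afterwards.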

The inline pihole_password variable is acceptable for initial testing, but long-running environments should move secrets into Ansible Vault.

Part 5: Rolling Updates Across the Cluster

Package updates are the most common maintenance task on any cluster. Running apt upgrade manually across three nodes means three SSH sessions, no guarantee of consistency, and no safety net if an update breaks something mid-way. A rolling update playbook handles this in a single command while keeping at least one node available throughout the process.

Create an update.yml playbook on the control machine:

# update.yml
---
- name: Rolling package updates across cluster
  hosts: cluster
  become: true
  serial: 1
  max_fail_percentage: 0

  tasks:
    - name: Update apt cache
      apt:
        update_cache: true
        cache_valid_time: 0

    - name: Upgrade packages
      apt:
        upgrade: safe
      register: upgrade_result

    - name: Check if reboot is required
      stat:
        path: /var/run/reboot-required
      register: reboot_required

    - name: Reboot if required
      reboot:
        reboot_timeout: 120
        msg: "Reboot triggered by Ansible after package upgrade"
      when: reboot_required.stat.exists

    - name: Wait for node to come back online
      wait_for_connection:
        delay: 10
        timeout: 120
      when: reboot_required.stat.exists

Run it from the control machine:

ansible-playbook -i inventory.ini update.yml --ask-become-pass

serial: 1 is the key directive. It tells Ansible to complete the full task sequence on one node before moving to the next. Node1 updates and reboots if needed, comes back online, then Ansible moves to node2, and so on. The cluster is never fully offline during the update.

max_fail_percentage: 0 stops the entire playbook if any node fails during its update cycle. A failed update on node2 will not proceed to node3. This prevents a bad package from propagating across the cluster before you can investigate.

The reboot tasks are conditional: nodes that do not require a reboot after upgrading skip those steps entirely. On a cluster that was recently updated, most nodes will complete the playbook with only ok results and no reboots.
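
Stopping on failure is most useful when the play actually checks that a node is healthy before moving on. As a sketch, a task like this could be appended to the tasks list in update.yml. It is an illustrative addition that reuses the docker info check from base.yml, on the assumption that Docker is the service you care about on every node.

```yaml
    # Optional health gate, appended as the last task in update.yml
    - name: Confirm Docker answers before moving to the next node
      command: docker info
      register: docker_health
      retries: 5
      delay: 5
      until: docker_health.rc == 0
      changed_when: false
```

If Docker does not respond after the retries, the task fails, and with serial: 1 and max_fail_percentage: 0 the remaining nodes are left untouched.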

Expected output on a cluster with pending updates:

PLAY [Rolling package updates across cluster] ****

TASK [Update apt cache] **************************
ok: [node1]

TASK [Upgrade packages] **************************
changed: [node1]

TASK [Check if reboot is required] ***************
ok: [node1]

TASK [Reboot if required] ************************
changed: [node1]

TASK [Wait for node to come back online] *********
ok: [node1]

# Repeats for node2, then node3

Part 6: Ansible Vault for Secrets

Storing the Pi-hole admin password in plaintext is a habit that compounds as the cluster grows. Vault encrypts secrets at rest and integrates cleanly into the existing playbook structure.

Create a secrets file:

EDITOR=nano ansible-vault create secrets.yml

Ansible defaults to vi for editing Vault files. Setting EDITOR=nano uses nano instead, which is consistent with the rest of this guide.

Add your secrets inside the editor that opens:

# secrets.yml (stored encrypted on disk)
vault_pihole_password: "your-pihole-admin-password"

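Vault variables are referenced exactly like any other variable, but the playbook has to load the encrypted file first. The services.yml playbook from Part 4 sets pihole_password to an inline placeholder, so two small changes are needed: list secrets.yml under vars_files in the play, and change the pihole_password value to "{{ vault_pihole_password }}". The tasks themselves stay the same.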
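
The services.yml shown in Part 4 sets pihole_password inline, so one way to wire in the Vault value is sketched below. It assumes secrets.yml sits in the same directory as services.yml; only the play header is shown, and the tasks are unchanged.

```yaml
# services.yml (play header only; tasks unchanged)
---
- name: Deploy Pi-hole on services node
  hosts: services_node
  become: true

  # Load the encrypted file; Ansible decrypts it at runtime
  vars_files:
    - secrets.yml

  vars:
    # Map the Vault variable onto the name the tasks already use
    pihole_password: "{{ vault_pihole_password }}"
```

Keeping the vault_ prefix on the encrypted variable makes it easy to see which values come from Vault when reading the playbook.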

Run any playbook that loads the Vault file with decryption enabled, keeping --ask-become-pass for plays that use become:

ansible-playbook -i inventory.ini services.yml --ask-vault-pass --ask-become-pass

The same workflow applies to any credentials added as the cluster grows: API keys, database passwords, or auth tokens for additional services. As you expand with services from the self-hosted apps catalog, Vault keeps all secrets out of plaintext files without changing how playbooks reference them.

Part 7: The Self-Healing Demo

This part is the operational payoff of everything built above. The proof is recovery speed.

Before continuing: this test deliberately breaks a worker node. Only perform it on a homelab cluster you can safely reprovision.

Step 1: Break Node 2

SSH into node2 and simulate a failed state:

ssh ubuntu@<node2-ip>
sudo apt remove --purge docker.io docker-compose -y
exit

Pi-hole is now down. Docker is gone. The services node is in a failed state.

Step 2: Verify the Failure

Confirm Docker is gone from the control machine:

ansible node2 -i inventory.ini -m shell -a "docker --version"

Expected output:

node2 | FAILED | rc=127 >>
/bin/sh: docker: command not found

Step 3: Run Recovery

One command from the control machine (add --ask-vault-pass if services.yml reads secrets from Vault, as in Part 6):

ansible-playbook -i inventory.ini base.yml services.yml --ask-become-pass

base.yml runs against every node, and services.yml runs against node2 only. On node1 and node3, nearly every task returns ok because nothing changed. On node2, Docker is reinstalled, docker-compose is restored, and Pi-hole comes back up automatically.
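
If the recovery command is something you expect to run under pressure, the playbook order can be captured in a small wrapper so there is only one file to remember. This is a sketch, not part of the original project; the site.yml filename is a common convention but hypothetical here.

```yaml
# site.yml (hypothetical wrapper; runs the playbooks in order)
---
- name: Apply base provisioning to all nodes
  import_playbook: base.yml

- name: Deploy services on the services node
  import_playbook: services.yml
```

Recovery would then be ansible-playbook -i inventory.ini site.yml with the same password flags as above.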

Step 4: Verify Recovery

ansible node2 -i inventory.ini -m shell -a "docker ps"

Expected output:

node2 | SUCCESS | rc=0 >>
CONTAINER ID   IMAGE                  COMMAND      STATUS
xxxxxxxxxxxx   pihole/pihole:latest   "/s6-init"   Up x seconds

The entire recovery completes in under 5 minutes on a healthy cluster with a good network connection. No troubleshooting steps recalled from memory. No SSH sessions into multiple nodes. One command restoring a known good state.

That is the operational payoff of treating the cluster as code.

Part 8: Recommended Project Structure

As playbook count grows, flat files become hard to navigate. A clean structure keeps automation maintainable:

turing-pi-ansible/
├── inventory.ini
├── secrets.yml
├── group_vars/
│   ├── all.yml
│   ├── inference_node.yml
│   └── services_node.yml
├── base.yml
├── inference.yml
├── services.yml
├── update.yml
└── teardown.yml

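group_vars/ replaces inline vars blocks inside playbooks. Variables in inference_node.yml automatically apply to every host in that group, and variables in all.yml apply cluster-wide. Because play-level vars take precedence over group_vars, remove a variable from the playbook's vars block once it moves into group_vars/. Playbooks stay clean.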
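
As a rough illustration of how the group_vars files might look, the sketch below shows three separate files in one listing, divided by comments. cluster_timezone comes from base.yml; ollama_num_parallel and pihole_dir are hypothetical variable names that the playbooks in this guide do not yet reference.

```yaml
# --- group_vars/all.yml: applies to every host ---
cluster_timezone: "UTC"

# --- group_vars/inference_node.yml: applies to node3 only ---
# Hypothetical variable; the override task would need to use it
ollama_num_parallel: 4

# --- group_vars/services_node.yml: applies to node2 only ---
# Hypothetical variable; replaces the hardcoded /opt/pihole paths
pihole_dir: /opt/pihole
```

Ansible picks these files up automatically when group_vars/ sits next to the inventory file or the playbook, so no vars_files entry is needed for them.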

update.yml handles rolling package updates across the cluster as covered in Part 5. teardown.yml is the inverse of provisioning: it stops services, removes packages, and resets nodes to a clean state. Write it while the cluster is functional so you have a reliable decommission path. This pattern is also useful in Incus container workflows where environments cycle frequently.
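
The guide does not show teardown.yml, so the following is only one possible shape for it, mirroring what base.yml and services.yml install. Treat it as a sketch to adapt: it removes the Pi-hole stack and its data directory, then purges the Docker packages, and it does not touch Ollama.

```yaml
# teardown.yml (sketch; destructive, review before running)
---
- name: Remove Pi-hole from the services node
  hosts: services_node
  become: true

  tasks:
    - name: Stop and remove the Pi-hole stack
      command: docker-compose down
      args:
        chdir: /opt/pihole
        # Skip when the compose file is already gone
        removes: /opt/pihole/docker-compose.yml

    - name: Delete the Pi-hole directory and its data
      file:
        path: /opt/pihole
        state: absent

- name: Remove Docker from all nodes
  hosts: cluster
  become: true

  tasks:
    - name: Purge Docker packages
      apt:
        name:
          - docker-compose
          - docker.io
        state: absent
        purge: true
```

The removes argument keeps the first task safe to rerun, since the command is skipped once the Compose file no longer exists.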

The deeper point: when nodes are reproducible, they stop being precious. A failed node is a known state Ansible can restore, not an emergency requiring careful manual intervention. You can verify post-recovery performance against the RK1 benchmark baselines to confirm the node returned to expected throughput after reprovisioning.

What You’ve Built

The cluster now runs as reproducible infrastructure-as-code. Every node has a defined base state enforced by base.yml. Role-specific workloads deploy to the correct systems without duplicated configuration. Package updates roll across the cluster safely without taking nodes offline simultaneously. Secrets stay encrypted and centralized in Vault. Recovering from a failed node becomes a predictable provisioning workflow instead of a manual rebuild process.

Ansible automation on Turing Pi 2.5 removes the dependency on memory and ad-hoc SSH sessions. The cluster is no longer a collection of carefully configured machines that only exist in one administrator’s head. It becomes a reproducible environment that can be rebuilt, expanded, and audited through version-controlled playbooks.

That operational consistency is what makes infrastructure automation valuable long after the initial setup is finished.

FAQ

Does Ansible work on ARM64 clusters like Turing Pi 2.5?

Yes. Ansible runs on the control machine and communicates over SSH. Most modules are copied to the nodes and executed there with the node's own Python interpreter, which is why the inventory sets ansible_python_interpreter=/usr/bin/python3. The RK1 nodes run Ubuntu 22.04 ARM64, which requires no special Ansible configuration beyond that setting. ARM64 is fully supported as a managed target.

How do I manage secrets in Ansible playbooks for a homelab cluster?

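Use Ansible Vault. Run EDITOR=nano ansible-vault create secrets.yml, add variables like vault_pihole_password and vault_tailscale_authkey, load the file in the playbook (for example through vars_files), and reference the variables using standard variable syntax. Run playbooks with --ask-vault-pass. The encrypted file is safe to commit to version control as long as the Vault password itself is kept elsewhere. On the control machine, the secret exists only inside the Vault file. A value rendered into a file on a node, such as the Pi-hole Compose file, is stored there in plaintext, so restrict that file's permissions accordingly.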

Is Ansible worth using for a small ARM homelab cluster?

Yes, primarily because of recovery speed and operational consistency. Small clusters are manageable manually until services drift out of sync, a node fails, or infrastructure needs to be rebuilt quickly. Without automation, recovery depends on remembered SSH commands and undocumented configuration changes. Ansible pays for the initial setup effort the first time a system needs to be reprovisioned or recovered.

How long does it take to reprovision a Turing Pi 2.5 node with Ansible?

Under 5 minutes for a full base provisioning and service recovery on a healthy cluster with a good network connection. Most of that time is package installation. On subsequent runs against a node with no changes, ansible-playbook completes in under 60 seconds and the play recap shows changed=0 or close to it. The self-healing demo in Part 7 exercises this against a node with Docker removed and Pi-hole down.