You know that feeling when a service crashes at 3am and you have to SSH in, restart it, and hope it doesn't happen again tomorrow? I got so tired of being my server's babysitter that I built something better: a home server cluster that fixes itself.
Here's what it actually does: When my Jupyter notebook server crashes, it restarts automatically. When PostgreSQL runs out of memory, it gets more. When I deploy broken code, it rolls back to the last working version. All without me touching anything.
The magic comes from combining K3s (think "Kubernetes for humans" - all the self-healing powers without the enterprise complexity) with NixOS (a Linux system where every change can be undone). Together, they create infrastructure that's smarter than the sum of its parts.
Let me show you exactly how I built this, why it took 15 iterations to get right, and how you can avoid my mistakes.
What I Actually Built (And Why)
Before diving into the technical weeds, let me explain what this setup actually is. I built a 2-computer cluster where the computers work together like a team. If one gets sick, the other picks up the slack. If software breaks, it automatically reverts to the last working version.
The Tech Stack (In Plain English)
I chose K3s + NixOS because they solve two specific problems:
- K3s: It's Kubernetes stripped down to just the good parts. Think of regular Kubernetes as a semi-truck - powerful but complex. K3s is like a pickup truck - still hauls your stuff but way easier to park and maintain.
- NixOS: A Linux system where you describe what you want in a config file, and it builds that exact system every time. More importantly, when you mess up (and you will), you can roll back instantly.
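To make that concrete, here's a hypothetical fragment of the kind of declaration NixOS works from (option names are from the NixOS manual; the service choice is illustrative, not my exact config):

```nix
# Declare a service; NixOS builds a system that provides exactly this.
services.postgresql.enable = true;
services.postgresql.settings.shared_buffers = "2GB";

# Every `nixos-rebuild switch` creates a new "generation". If the result
# is broken, `nixos-rebuild switch --rollback` (or picking the previous
# generation in the boot menu) restores the last working system.
```

That rollback property is what makes the rest of this post possible: every experiment below was undoable.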
The initial architecture centered on two primary nodes:
- k3s-control: Control plane server (24-core Threadripper, 32GB RAM) running PostgreSQL analytics and cluster coordination
- k3s-worker: Worker node (4-core i3, 8GB RAM) configured as Tailscale exit node with hardened security policies
Network Architecture Evolution
The most critical design decision involved networking. My initial Tailscale mesh worked perfectly for SSH access and service discovery, but Kubernetes requires more sophisticated pod-to-pod communication that Tailscale couldn't reliably handle.
The solution: a dedicated Wireguard mesh for cluster traffic (10.200.0.0/24) while keeping Tailscale for management access. This separation of concerns proved essential for operational stability.
# Wireguard Configuration for Cluster Networking
networking.wireguard.interfaces.wg-k3s = {
  ips = [ "10.200.0.1/24" ]; # k3s-control
  listenPort = 51821;
  peers = [{
    publicKey = "EXAMPLE_PUBLIC_KEY_REDACTED_FOR_SECURITY";
    allowedIPs = [
      "10.200.0.2/32" # k3s-worker
      "10.42.0.0/16"  # k3s pod CIDR
      "10.43.0.0/16"  # k3s service CIDR
    ];
    endpoint = "k3s-worker.example.ts.net:51821";
    persistentKeepalive = 25;
  }];
};
Secrets Management Architecture
Production Kubernetes demands robust secrets management. I implemented agenix for encrypting cluster tokens, Wireguard keys, and service credentials directly in the NixOS configuration repository. This approach provides version control for secrets while maintaining security through age encryption.
# Secure Token Management
age.secrets.k3s-token = {
  file = ./secrets/k3s-token.age;
  owner = "root";
  mode = "0400";
};
services.k3s.tokenFile = config.age.secrets.k3s-token.path;
Implementation Journey & Problem Solving
Building production infrastructure means encountering edge cases that don't appear in tutorials. Each challenge taught valuable lessons about system design and operational excellence.
Challenge 1: The K3s v1.32.6 Kube-Proxy Bug
Problem: After upgrading to K3s v1.32.6, agents entered restart loops with logging configuration conflicts. The kube-proxy component couldn't reinitialize its logging system on restarts.
Investigation: Systemd logs revealed the specific error pattern, and GitHub issues confirmed this as a known regression in that version. The temporary workaround required disabling kube-proxy on agent nodes.
Solution: Modified agent configuration to disable the problematic component while maintaining cluster functionality:
extraFlags = [
  "--disable-kube-proxy"
  "--kubelet-arg=v=2" # pin kubelet log verbosity
];
Lesson Learned: Always test K3s upgrades in isolated environments first. Version-specific bugs can impact production stability, and having rollback procedures is essential.
Challenge 2: Systemd Service Conflicts
Problem: Custom systemd service configurations conflicted with NixOS module defaults, causing configuration validation errors.
Investigation: The issue stemmed from explicitly overriding restart policies that the K3s module already configured optimally. NixOS modules often include sophisticated default configurations that shouldn't be overridden without good reason.
Solution: Trust module defaults and only override when necessary:
# WRONG: overriding module defaults
systemd.services.k3s.serviceConfig = {
  Restart = "always";
  RestartSec = "5s";
};

# RIGHT: trust module defaults, add only what's needed
systemd.services.k3s = {
  wants = [ "network-online.target" "k3s-wireguard-ready.service" ];
  after = [ "network-online.target" "k3s-wireguard-ready.service" ];
};
Lesson Learned: NixOS modules encode best practices from the community. Understanding what modules already provide prevents configuration conflicts and reduces maintenance overhead.
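When an override genuinely is needed, Nix's priority helpers make the intent explicit instead of fighting the module. A sketch (the specific values are illustrative; `lib` must be in the module's argument list):

```nix
{ lib, ... }:
{
  # mkForce overrides whatever the K3s module sets for this field...
  systemd.services.k3s.serviceConfig.TimeoutStartSec = lib.mkForce "180s";
  # ...while mkDefault supplies a value only if nothing else defines one.
  networking.firewall.enable = lib.mkDefault true;
}
```

The advantage over a bare assignment is that the evaluation error you'd otherwise get ("conflicting definitions") disappears, and the override is self-documenting.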
Challenge 3: Network Configuration Auto-Detection
Problem: Hard-coding node IPs caused issues when Tailscale addresses changed or when hostname resolution failed during bootstrap.
Investigation: K3s supports automatic IP detection through interface selection, which works more reliably than explicit IP configuration in dynamic environments.
Solution: Bind flannel to the Wireguard interface and pin the node IP to its static address, rather than relying on hostname resolution or dynamic Tailscale IPs:
# Specify the Wireguard interface for flannel networking
extraFlags = [
  "--flannel-iface wg-k3s"
  "--node-ip 10.200.0.1"           # explicit for consistency
  "--advertise-address 10.200.0.1"
];
Lesson Learned: Kubernetes networking benefits from explicit configuration in production environments, but the approach should accommodate network changes gracefully.
Production Migration: Tailscale to Wireguard
The most significant operational challenge involved migrating from Tailscale overlay networking to dedicated Wireguard for cluster communication. This migration required careful coordination to avoid cluster downtime.
Why the Migration Was Necessary
Tailscale excels at secure device connectivity and service discovery, but Kubernetes introduces networking requirements that conflict with Tailscale's design assumptions:
- Pod-to-pod communication needs direct Layer 3 routing that Tailscale's NAT traversal interferes with
- Service discovery requires consistent IP addressing that Tailscale's dynamic allocation complicates
- Network policies and security controls work better with dedicated VPN infrastructure
Migration Execution Strategy
The migration followed a systematic approach to minimize risk:
Phase 1: Parallel Network Setup
I deployed Wireguard alongside the existing Tailscale mesh, creating dual connectivity without disrupting running services. The new network used 10.200.0.0/24 addressing with static assignments for predictable routing.
Phase 2: K3s Configuration Updates
Updated both server and agent configurations to use Wireguard interfaces while maintaining backward compatibility. The key insight was using flannel interface selection rather than hardcoded endpoint addresses.
Phase 3: Controlled Cutover
Deployed changes using NixOS boot mode first, then executed coordinated reboots to activate the new network configuration. This approach avoided the race conditions that could occur with live service restarts.
Operational Challenges During Migration
The migration revealed several operational insights:
Service Ordering Dependencies: K3s must start after Wireguard interface initialization. I implemented a custom systemd service to ensure proper dependency management:
systemd.services.k3s-wireguard-ready = {
  description = "Wait for Wireguard interface to be ready for k3s";
  wantedBy = [ "multi-user.target" ];
  before = [ "k3s.service" ];
  script = ''
    timeout=60
    while [ $timeout -gt 0 ]; do
      if ip link show wg-k3s >/dev/null 2>&1; then
        echo "Wireguard interface wg-k3s is ready"
        exit 0
      fi
      sleep 2
      timeout=$((timeout - 2))
    done
    exit 1
  '';
};
etcd Cluster Membership: Changing the advertise address required etcd cluster reset, which meant temporary loss of cluster state. This reinforced the importance of external backup procedures for critical cluster data.
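One mitigation worth noting: with embedded etcd, k3s takes scheduled snapshots out of the box, and the behavior is tunable through server flags (flag names from the k3s server docs — verify them against your version). A sketch of how that slots into the same configuration:

```nix
services.k3s.extraFlags = [
  # Keep more snapshots than the default before rotation kicks in
  "--etcd-snapshot-retention=14"
  # Snapshots land under /var/lib/rancher/k3s/server/db/snapshots;
  # copy them off-node (restic, rsync, etc.) for real disaster recovery.
];
```

On-node snapshots alone wouldn't have helped here, since the reset was deliberate, but they make the "temporary loss of cluster state" recoverable instead of permanent.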
Real Production Lessons
Operating Kubernetes in production, even at small scale, teaches lessons that apply directly to enterprise environments. Here are the key insights that transformed my operational approach.
Hardware-First Analysis for Resource Planning
The Core m3-6Y30 processor in nixbase operates under strict thermal constraints that traditional server hardware doesn't face. This limitation taught me to approach resource planning from hardware capabilities upward, not workload requirements downward.
Interactive Hardware Testing Process:
I developed a real-time testing methodology using evtest to correlate user interactions with system performance:
# Monitor thermal throttling during workload execution
watch -n 1 'grep "MHz" /proc/cpuinfo'
# Correlate with interactive testing
evtest /dev/input/event0 # Monitor actual user input lag
This approach revealed that sustained CPU usage above 60% caused interactive lag, leading to workload scheduling policies that reserve capacity for human interaction.
Buffer Optimization for Thermal Constraints
Low-power hardware requires different operational patterns than datacenter equipment. Instead of maximizing utilization, production requires buffer management:
- CPU reservation: 40% capacity buffer for thermal management
- Memory allocation: Conservative limits prevent swap thrashing
- I/O throttling: Background processes yield to interactive workloads
These constraints forced me to design more efficient workload scheduling and resource allocation policies than I would have developed with unlimited resources.
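The CPU buffer translates directly into Kubernetes resource limits. A quick sketch of the arithmetic in pure shell (the 24-thread count and 40% buffer are the numbers from this cluster; the helper itself is illustrative):

```shell
# Compute a cluster-wide CPU limit that leaves a 40% thermal buffer.
threads=24                      # Threadripper 2920X: 12 cores / 24 threads
buffer_pct=40
total_m=$(( threads * 1000 ))   # total capacity in millicores
limit_m=$(( total_m * (100 - buffer_pct) / 100 ))
echo "schedulable CPU: ${limit_m}m of ${total_m}m"
```

The resulting value becomes the ceiling for the sum of pod CPU limits on that node, leaving the remaining ~9.6 cores for interactive use and thermal headroom.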
Systematic Problem Analysis with 5 Whys
Example: Agent Join Failures
- Why do agents fail to join the cluster? → Network connectivity issues
- Why are there network connectivity issues? → Wireguard tunnel not establishing
- Why won't the Wireguard tunnel establish? → Service starts before interface is ready
- Why does the service start too early? → systemd dependencies incomplete
- Why are dependencies incomplete? → K3s module doesn't know about custom Wireguard setup
Root Cause: Need explicit service ordering for custom network dependencies.
This systematic approach prevented quick fixes that mask underlying system design issues.
What This Actually Solves: Real Production Benefits
After 15+ days of continuous operation, this K3s setup has transformed how I approach infrastructure problems. Here's what it actually delivers in practice:
Development Velocity: From Hours to Minutes
Before: Setting up a development environment for analytics work meant manually configuring PostgreSQL, installing dependencies, fighting with port conflicts, and spending 2-3 hours just to start coding.
After: kubectl apply -f dev-env.yaml and I have an isolated development environment in 30 seconds. Need GPU access? Another 30 seconds. Need a specific Python version? It's containerized and ready.
Evidence: Current workloads running seamlessly include Jupyter notebooks, n8n automation workflows, and PostgreSQL analytics - all deployed in minutes, not hours.
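For context, a dev-env manifest in this style might look like the following sketch. The image, names, and numbers are illustrative, not the exact file from my repo; the `nvidia.com/gpu` line assumes the NVIDIA device plugin shown later:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dev-jupyter
spec:
  replicas: 1
  selector:
    matchLabels: { app: dev-jupyter }
  template:
    metadata:
      labels: { app: dev-jupyter }
    spec:
      containers:
        - name: notebook
          image: jupyter/scipy-notebook:latest
          ports:
            - containerPort: 8888
          resources:
            requests: { cpu: "1", memory: 2Gi }
            limits:
              cpu: "4"    # stays inside the thermal buffer
              memory: 8Gi
              # nvidia.com/gpu: 1   # uncomment for GPU access
```

`kubectl apply -f` that file and delete it when done; the cluster handles scheduling, restarts, and cleanup.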
Infrastructure Reliability: Zero Surprise Downtime
Before: Manual service management meant forgetting to restart services after reboots, configuration drift between environments, and "works on my machine" debugging sessions that lasted hours.
After: NixOS declarative configuration means identical deployments every time. Services automatically restart after system updates. Zero configuration drift because the configuration IS the documentation.
Evidence: Control plane uptime of 15+ days through multiple NixOS updates, automatic service recovery, and consistent behavior across both nodes.
Resource Optimization: Maximum Efficiency from Limited Hardware
Before: Running analytics workloads on the Threadripper meant either maxing out CPU and making the system unusable, or manually babysitting resource allocation.
After: Kubernetes resource limits and thermal-aware scheduling mean I can run intensive workloads while keeping the system responsive for interactive use. GPU workloads automatically get dedicated resources without manual intervention.
Evidence: Successfully running 6 production pods (Jupyter, CoreDNS, metrics-server, NVIDIA plugin, n8n, resume-analytics) with stable resource utilization and responsive interactive performance.
Security Without Complexity
Before: Managing firewall rules, SSH access, and service exposure across multiple machines meant either leaving things too open or spending hours debugging connectivity issues.
After: Dual-network architecture provides automatic security. Cluster traffic isolated on Wireguard. Management access through Tailscale. agenix handles secrets without me thinking about encryption keys.
Evidence: Zero security incidents. Cluster traffic properly isolated on the 10.200.0.x network. Management access seamless through the Tailscale mesh.
Current Production Reality
The cluster currently handles real production workloads with zero manual intervention:
- Analytics Pipeline: Daily PostgreSQL data processing with automatic scaling
- Development Environments: On-demand Jupyter notebooks with GPU access
- Automation Services: n8n workflows processing resume data
- Infrastructure Services: DNS, metrics collection, device plugins
Resource utilization stays comfortably within thermal limits while providing room for growth - exactly what production infrastructure should do.
Security and Access Control
The dual-network approach provides layered security:
- Tailscale mesh: Administrative access and service discovery
- Wireguard cluster network: Isolated Kubernetes communication
- agenix secrets management: Encrypted credentials with proper rotation
- NixOS firewall: Declarative port management and network policies
Integration with Broader Infrastructure
The K3s cluster integrates seamlessly with existing services:
- PostgreSQL analytics: Shared data access for processing workloads
- File sharing: Distributed storage for application data
- Backup systems: Automated backup of both cluster state and application data
- Monitoring stack: Prometheus and Grafana deployment planned for comprehensive observability
Technical Roadmap & Enterprise Lessons
This foundation enables several next-level improvements that mirror enterprise Kubernetes adoption patterns:
Immediate Next Steps
- Longhorn distributed storage: Replace local storage with resilient, replicated volumes
- MetalLB load balancing: Proper service exposure without NodePort limitations
- cert-manager: Automated TLS certificate management for internal services
- Prometheus + Grafana: Production monitoring and alerting infrastructure
Lessons Applicable to Enterprise Environments
The operational patterns developed here translate directly to larger deployments:
Infrastructure as Code: NixOS configuration management provides the same benefits as Terraform + Ansible at enterprise scale, with better reproducibility guarantees.
Network Segmentation: The dual-network approach (management + cluster) mirrors enterprise patterns of separating control plane from data plane traffic.
Secrets Management: agenix demonstrates how declarative configuration can include secure secrets handling without compromising operational simplicity.
Gradual Migration: The Tailscale → Wireguard migration approach applies to any infrastructure change requiring zero-downtime transitions.
Scaling Considerations
This architecture supports horizontal scaling through additional worker nodes without fundamental changes to the networking or security model. The Wireguard mesh can accommodate dozens of nodes with the same operational characteristics.
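That headroom claim is easy to sanity-check with address arithmetic (pure shell):

```shell
# How many peers can the /24 Wireguard mesh address?
prefix=24
usable=$(( (1 << (32 - prefix)) - 2 ))   # minus network and broadcast
echo "usable addresses in 10.200.0.0/${prefix}: ${usable}"
```

254 usable addresses against 2 nodes in use: the addressing plan isn't the scaling bottleneck; key distribution and etcd quorum are.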
Resource constraints that drove optimization decisions here—thermal management, memory pressure, I/O limitations—exist in enterprise environments too, just at different scales. The monitoring and reservation strategies transfer directly.
Want to Build This Yourself? Essential Questions for Implementation
This infrastructure solves real operational problems, but implementation requires careful planning. If you're working with another agent or consultant to build something similar, here are the critical questions they need answered:
Infrastructure Assessment Questions
- Hardware Inventory: What specific hardware do you have available? (CPU cores, RAM, storage type, network connectivity)
- Network Environment: Do you have static IPs, dynamic DNS, or need to work with NAT/firewall restrictions?
- Current NixOS Experience: Are you already running NixOS, or starting from another Linux distribution?
- Security Requirements: What level of network isolation do you need? Any compliance requirements?
- Workload Planning: What applications will you run? Do you need GPU access, persistent storage, or specific resource limits?
Implementation Approach Questions
- Secrets Management: Are you comfortable with agenix/age encryption, or prefer another secrets solution?
- Network Strategy: Will you use Tailscale for management, or need a different VPN solution?
- Monitoring Needs: What level of observability do you need from day one?
- Backup Strategy: How will you handle etcd backups and persistent volume data?
- Deployment Timeline: Are you doing this incrementally or need everything working immediately?
Key Technical Implementation Points
The agent implementing this needs to understand these critical architecture decisions:
Dual Network Architecture: Separate Tailscale (management) from Wireguard (cluster traffic) networks. This prevents Kubernetes networking conflicts while maintaining secure access.
NixOS + agenix: Declarative configuration with encrypted secrets. The entire cluster configuration should be reproducible from git repository.
Service Dependencies: K3s must start after Wireguard interface is ready. Systemd service ordering prevents race conditions during boot.
Resource Planning: Hardware constraints drive scheduling policies. Low-power systems need CPU reservation; high-performance systems can maximize utilization.
Complete Implementation Guide: Build This Yourself
Here's the actual NixOS configuration and implementation steps to build this exact infrastructure. This is the 5% technical detail that makes the 95% operational benefits possible.
Network Architecture Diagram
┌─────────────────────────────────────────────────────────────────┐
│                     Production K3s Cluster                      │
│                       Dual Network Design                       │
└─────────────────────────────────────────────────────────────────┘

 Management Network (Tailscale)         Cluster Network (Wireguard)
   ┌─────────────────────┐                ┌─────────────────────┐
   │     SSH & Admin     │                │     K8s Traffic     │
   │    100.x.x.x/32     │                │    10.200.0.0/24    │
   └─────────────────────┘                └─────────────────────┘
              │                                      │
              ▼                                      ▼
   ┌─────────────────┐    Tailscale Mesh    ┌─────────────────┐
   │   k3s-control   │◄──── (SSH/Mgmt) ────►│   k3s-worker    │
   │   Threadripper  │                      │    Intel i3     │
   │    32GB RAM     │    Wireguard Mesh    │     8GB RAM     │
   │  10.200.0.1/24  │◄──── (K8s Only) ────►│  10.200.0.2/24  │
   └─────────────────┘                      └─────────────────┘
           │                                        │
     ┌─────▼────┐                             ┌─────▼────┐
     │ k3s API  │                             │ kubelet  │
     │   etcd   │                             │ flannel  │
     │PostgreSQL│                             │containers│
     └──────────┘                             └──────────┘

 Pod Network: 10.42.0.0/16         Service Network: 10.43.0.0/16
Complete NixOS Configuration
Flake-based System Structure
# flake.nix - root flake configuration
{
  description = "K3s cluster infrastructure";

  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
    agenix.url = "github:ryantm/agenix";
    agenix.inputs.nixpkgs.follows = "nixpkgs";
  };

  outputs = { self, nixpkgs, agenix }: {
    nixosConfigurations = {
      k3s-control = nixpkgs.lib.nixosSystem {
        system = "x86_64-linux";
        modules = [
          agenix.nixosModules.default
          ./hosts/k3s-control/configuration.nix
          ./hosts/k3s-control/hardware-configuration.nix
        ];
      };
      k3s-worker = nixpkgs.lib.nixosSystem {
        system = "x86_64-linux";
        modules = [
          agenix.nixosModules.default
          ./hosts/k3s-worker/configuration.nix
          ./hosts/k3s-worker/hardware-configuration.nix
        ];
      };
    };
  };
}
Control Plane Configuration (hosts/k3s-control/configuration.nix)
# hosts/k3s-control/configuration.nix
{ config, pkgs, ... }:
{
  # NOTE: secrets/secrets.nix is consumed by the agenix CLI only;
  # it is not imported as a NixOS module.

  # Basic system configuration
  networking.hostName = "k3s-control";
  networking.networkmanager.enable = true;

  # Enable flakes
  nix.settings.experimental-features = [ "nix-command" "flakes" ];

  # Wireguard cluster network configuration
  networking.wireguard.interfaces.wg-k3s = {
    ips = [ "10.200.0.1/24" ];
    listenPort = 51821;
    privateKeyFile = config.age.secrets.wg-control-private.path;
    peers = [{
      # Worker node public key (generated during setup)
      publicKey = "WORKER_WG_PUBLIC_KEY_HERE";
      allowedIPs = [
        "10.200.0.2/32" # worker node IP
        "10.42.0.0/16"  # K3s pod CIDR
        "10.43.0.0/16"  # K3s service CIDR
      ];
      endpoint = "k3s-worker.your-tailnet.ts.net:51821";
      persistentKeepalive = 25;
    }];
  };

  # K3s server configuration
  services.k3s = {
    enable = true;
    role = "server";
    tokenFile = config.age.secrets.k3s-token.path;
    extraFlags = [
      # Use the Wireguard interface for cluster networking
      "--flannel-iface wg-k3s"
      "--node-ip 10.200.0.1"
      "--advertise-address 10.200.0.1"
      # Disable the built-in load balancer (use MetalLB instead)
      "--disable servicelb"
      # Cluster networking configuration
      "--cluster-cidr 10.42.0.0/16"
      "--service-cidr 10.43.0.0/16"
      # Logging and audit settings
      "--kubelet-arg=v=2"
      "--kube-apiserver-arg=audit-log-maxage=7"
    ];
  };

  # agenix secrets (paths are relative to this file; secrets/ lives at the repo root)
  age.secrets = {
    k3s-token = {
      file = ../../secrets/k3s-token.age;
      owner = "root";
      group = "root";
      mode = "0400";
    };
    wg-control-private = {
      file = ../../secrets/wg-control-private.age;
      owner = "root";
      group = "root";
      mode = "0400";
    };
  };

  # Firewall configuration
  networking.firewall = {
    enable = true;
    allowedTCPPorts = [
      22    # SSH
      6443  # K3s API server
      10250 # kubelet
    ];
    allowedUDPPorts = [
      51821 # Wireguard
    ];
    # Allow cluster traffic on the Wireguard interface
    interfaces.wg-k3s.allowedTCPPorts = [
      6443  # K3s API
      10250 # kubelet
      2379  # etcd client
      2380  # etcd peer
    ];
  };

  # Tailscale for management access
  services.tailscale.enable = true;

  # System packages
  environment.systemPackages = with pkgs; [
    wireguard-tools
    kubectl
    kubernetes-helm
    htop
    curl
  ];

  # Service dependencies and ordering
  systemd.services.k3s = {
    wants = [ "network-online.target" "wireguard-wg-k3s.service" ];
    after = [ "network-online.target" "wireguard-wg-k3s.service" ];
  };
}
Worker Node Configuration (hosts/k3s-worker/configuration.nix)
# hosts/k3s-worker/configuration.nix
{ config, pkgs, ... }:
{
  # Basic system configuration
  nix.settings.experimental-features = [ "nix-command" "flakes" ];
  networking.hostName = "k3s-worker";
  networking.networkmanager.enable = true;

  # Wireguard cluster network
  networking.wireguard.interfaces.wg-k3s = {
    ips = [ "10.200.0.2/24" ];
    listenPort = 51821;
    privateKeyFile = config.age.secrets.wg-worker-private.path;
    peers = [{
      # Control plane public key
      publicKey = "CONTROL_WG_PUBLIC_KEY_HERE";
      allowedIPs = [
        "10.200.0.1/32" # control plane IP
        "10.42.0.0/16"  # pod network
        "10.43.0.0/16"  # service network
      ];
      endpoint = "k3s-control.your-tailnet.ts.net:51821";
      persistentKeepalive = 25;
    }];
  };

  # K3s agent configuration
  services.k3s = {
    enable = true;
    role = "agent";
    serverAddr = "https://10.200.0.1:6443";
    tokenFile = config.age.secrets.k3s-token.path;
    extraFlags = [
      "--flannel-iface wg-k3s"
      "--node-ip 10.200.0.2"
      "--kubelet-arg=v=2"
    ];
  };

  # GPU support (if applicable; hardware.opengl was renamed to
  # hardware.graphics in recent NixOS releases)
  hardware.graphics.enable = true;
  hardware.nvidia-container-toolkit.enable = true;

  # agenix secrets (secrets/ lives at the repo root)
  age.secrets = {
    k3s-token = {
      file = ../../secrets/k3s-token.age;
      owner = "root";
      mode = "0400";
    };
    wg-worker-private = {
      file = ../../secrets/wg-worker-private.age;
      owner = "root";
      mode = "0400";
    };
  };

  # Firewall for the worker node
  networking.firewall = {
    enable = true;
    allowedTCPPorts = [ 22 10250 ];
    allowedUDPPorts = [ 51821 ];
    interfaces.wg-k3s.allowedTCPPorts = [ 10250 ];
  };

  # Tailscale and essential packages
  services.tailscale.enable = true;
  environment.systemPackages = with pkgs; [
    wireguard-tools
    kubectl
    htop
  ];

  # Ensure Wireguard starts before K3s
  systemd.services.k3s = {
    wants = [ "wireguard-wg-k3s.service" ];
    after = [ "wireguard-wg-k3s.service" ];
  };
}
Secrets Management with agenix
Generate and Encrypt Secrets
# 1. Generate the K3s cluster token
openssl rand -hex 32 > k3s-token-plaintext

# 2. Generate Wireguard keypairs
wg genkey > control-private.key
wg pubkey < control-private.key > control-public.key
wg genkey > worker-private.key
wg pubkey < worker-private.key > worker-public.key

# 3. Set up the agenix secrets.nix
cat > secrets/secrets.nix << 'EOF'
let
  control-key = "ssh-ed25519 YOUR_CONTROL_SSH_KEY";
  worker-key = "ssh-ed25519 YOUR_WORKER_SSH_KEY";
  user-key = "ssh-ed25519 YOUR_USER_SSH_KEY";
  keys = [ control-key worker-key user-key ];
in
{
  "k3s-token.age".publicKeys = keys;
  "wg-control-private.age".publicKeys = keys;
  "wg-worker-private.age".publicKeys = keys;
}
EOF

# 4. Encrypt secrets with agenix (run from the directory containing secrets.nix)
cd secrets
agenix -e k3s-token.age
agenix -e wg-control-private.age
agenix -e wg-worker-private.age
Deployment Steps
Step 1: Prepare Both Nodes
# On both nodes, enable Tailscale first
sudo tailscale up
sudo tailscale status # Verify connectivity
# Install agenix
nix-env -iA nixpkgs.agenix
Directory Structure
k3s-infrastructure/
├── flake.nix
├── flake.lock
├── hosts/
│   ├── k3s-control/
│   │   ├── configuration.nix
│   │   └── hardware-configuration.nix
│   └── k3s-worker/
│       ├── configuration.nix
│       └── hardware-configuration.nix
└── secrets/
    ├── secrets.nix
    ├── k3s-token.age
    ├── wg-control-private.age
    └── wg-worker-private.age
Step 2: Deploy Control Plane
# On control node, from flake directory
sudo nixos-rebuild switch --flake .#k3s-control
# Verify Wireguard interface
ip addr show wg-k3s
wg show
# Check K3s server status
sudo systemctl status k3s
sudo k3s kubectl get nodes
Step 3: Deploy Worker Node
# On worker node, from flake directory
sudo nixos-rebuild switch --flake .#k3s-worker
# Test Wireguard connectivity to control plane
ping 10.200.0.1
# Verify K3s agent joined cluster
sudo systemctl status k3s
# On control plane, verify worker joined
sudo k3s kubectl get nodes -o wide
Production Cluster Evidence
Control Plane Node (k3s-control)
justin@k3s-control
--------------
OS: NixOS 25.05.20250729.1f08a4d (Warbler) x86_64
Host: X399 AORUS XTREME
Kernel: 6.12.40
Uptime: 15 days, 21 hours, 53 mins
Packages: 2590 (nix-system), 916 (nix-user)
Shell: fish 4.0.2
CPU: AMD Ryzen Threadripper 2920X (24) @ 3.500GHz
GPU: NVIDIA GeForce RTX 2070 SUPER
Memory: 2.9 GiB / 32.0 GiB (9%)
Worker Node (k3s-worker)
justin@k3s-worker
--------------
OS: NixOS 25.05.20250729.1f08a4d (Warbler) x86_64
Host: 10MQS37000 ThinkCentre M710q
Kernel: 6.12.40
Uptime: 13 days, 21 hours, 1 min
Packages: 1489 (nix-system), 1248 (nix-user)
Shell: fish 4.0.2
CPU: Intel i3-7100T (4) @ 3.4GHz
GPU: Intel HD Graphics 630
Memory: 1.3 GiB / 7.7 GiB (16%)
Network: 1 Gbps
Cluster Status
# Active K3s cluster with 6 production workloads
NAME          STATUS   ROLES                       AGE   VERSION        INTERNAL-IP
k3s-control   Ready    control-plane,etcd,master   15d   v1.32.6+k3s1   10.200.0.1
k3s-worker    Ready    &lt;none&gt;                      13d   v1.32.6+k3s1   10.200.0.2
NAMESPACE          NAME                                   READY   STATUS    RESTARTS
jupyter            jupyter-notebook-84d944f9d9-jxmnz      1/1     Running   1 (54m)
kube-system        coredns-5688667fd4-xm9nf               1/1     Running   3 (54m)
kube-system        metrics-server-6f4c6675d5-hnf9b        1/1     Running   3 (54m)
kube-system        nvidia-device-plugin-daemonset-5lh9n   1/1     Running   0 (54m)
n8n                n8n-899444cb6-cfwjs                    1/1     Running   1 (54m)
resume-analytics   resume-analytics-568ddc5994-6m4fw      1/1     Running   1 (54m)
Verification Commands
# Network connectivity tests
ping 10.200.0.1 # Control plane from worker
ping 10.200.0.2 # Worker from control plane
# Cluster health checks
k3s kubectl get nodes
k3s kubectl get pods -A
k3s kubectl cluster-info
# Deploy test workload
k3s kubectl create deployment nginx --image=nginx
k3s kubectl expose deployment nginx --port=80 --type=NodePort
k3s kubectl get services
# Resource monitoring
htop
k3s kubectl top nodes
k3s kubectl top pods -A
The Real Value: Infrastructure That Just Works
This isn't about building impressive technical architecture—it's about eliminating the operational overhead that prevents you from building valuable applications. When infrastructure deployment is atomic and reversible, when services restart reliably after updates, when development environments provision in seconds instead of hours, you can focus on solving actual business problems instead of fighting with servers.
The combination of NixOS declarative configuration with K3s lightweight orchestration creates infrastructure that gets out of your way. That's the real competitive advantage: not the technology stack, but the operational velocity it enables.
Ready to eliminate infrastructure friction from your development workflow? The questions above will get another agent started on building something similar for your specific environment.
Photo by GuerrillaBuzz on Unsplash
Content on this blog was created using human and AI-assisted workflows described here. Original ideas and editorial decisions by Justin Quaintance.