Building an In-App Auto-Updater for a Containerized NixOS Deployment

I've been building Grapheon, a network graph analysis and visualization tool, and deploying it as a pair of Podman containers on NixOS. The stack is pretty straightforward: a FastAPI backend, a React frontend, both published as container images to GitHub Container Registry (GHCR), and wired together with a NixOS module that manages everything through systemd services.

It works great. But every time I cut a new release, I had to SSH into the box, pull the new images, and restart the services. Not exactly the "set it and forget it" experience I was going for.

So I decided to build an in-app auto-updater — the kind where you see a little banner saying "hey, there's a new version" and you click a button and it just... upgrades itself. Sounds simple enough, right?

What started as a straightforward trigger-file mechanism has evolved through a dozen iterations into something with pre-upgrade backups, separate backend/frontend version tracking, step-by-step progress reporting, and post-upgrade health checks. Here's where it stands today.

The Problem

Here's the thing about containerized deployments: the app running inside the container can't exactly restart itself. It doesn't have access to Podman, it doesn't know about systemd, and it definitely shouldn't be pulling its own replacement image. The container is a fish that needs to convince the ocean to swap it out for a different fish.

So the architecture needed two halves:

  1. The app side — the backend checks GitHub for new releases, the frontend shows a banner and lets you click "upgrade"
  2. The NixOS side — a systemd path unit watches for a trigger file, kicks off an upgrade handler that backs up data, pulls new images from GHCR, restarts services, and verifies the health of the new deployment

The container talks to the host through a shared volume. That's it. A file on disk is the entire IPC mechanism. Sometimes the simplest approach is the right one.

The App Side

Backend: Checking for Updates

The backend has an /api/updates router with three endpoints:

  • GET /api/updates — checks GitHub Releases API for the latest version
  • POST /api/updates/upgrade — writes a trigger file to kick off the host-side upgrade
  • GET /api/updates/status — reads the upgrade status file so the frontend can poll progress

One thing that matters here: Grapheon uses separate release tags for backend and frontend — backend-v0.8.7 and frontend-v0.9.1, for example. The two components version independently, so the update check has to handle each one:

def _extract_latest_versions(releases: list[dict]) -> tuple[Optional[str], Optional[str]]:
    """
    Extract the latest backend and frontend version tags from releases.
    Returns (backend_version, frontend_version) tuple.
    """
    backend_version = None
    frontend_version = None

    for release in releases:
        tag = release.get("tag_name", "")
        if release.get("prerelease", False):
            continue
        if tag.startswith("backend-v") and backend_version is None:
            backend_version = tag
        elif tag.startswith("frontend-v") and frontend_version is None:
            frontend_version = tag
        if backend_version and frontend_version:
            break

    return backend_version, frontend_version

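The actual GitHub API call is straightforward: hit the releases endpoint, cache the result for an hour, and fall back to the stale cache if the API fails. The fetch itself looks like this (the caching layer isn't shown):

from typing import Optional

import httpx

CACHE_TTL_SECONDS = 3600  # 1 hour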

async def _fetch_github_releases() -> Optional[list[dict]]:
    async with httpx.AsyncClient() as client:
        response = await client.get(
            "https://api.github.com/repos/BadgerOps/grapheon/releases",
            timeout=10.0,
        )
        response.raise_for_status()
        return response.json()
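The excerpt above covers only the fetch. Here is a minimal sketch of how a one-hour cache with stale fallback could wrap it. ReleaseCache and its injectable clock are hypothetical names for illustration, not Grapheon's actual code:

```python
import asyncio
import time
from typing import Any, Awaitable, Callable, Optional

CACHE_TTL_SECONDS = 3600  # 1 hour, matching the constant above


class ReleaseCache:
    """Hypothetical helper: caches a fetch result and serves stale data on failure."""

    def __init__(
        self,
        fetch: Callable[[], Awaitable[Any]],
        ttl: float = CACHE_TTL_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl
        self._clock = clock
        self._value: Optional[Any] = None
        self._fetched_at: Optional[float] = None

    async def get(self) -> Optional[Any]:
        now = self._clock()
        if self._fetched_at is not None and now - self._fetched_at < self._ttl:
            return self._value  # still fresh, no network call
        try:
            self._value = await self._fetch()
            self._fetched_at = now
        except Exception:
            # Fetch failed: keep the old value (None if we never had one).
            # _fetched_at is left alone, so the next call retries.
            pass
        return self._value
```

Usage would look like cache = ReleaseCache(_fetch_github_releases) followed by releases = await cache.get(). Passing the clock in makes the expiry logic testable without waiting an hour.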

The version comparison is tuple-based — parse "backend-v0.8.7" into (0, 8, 7), compare with the current version, done. The parser strips the backend-v or frontend-v prefix before splitting on dots.
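A sketch of that parse-and-compare step, for illustration. The function names here are made up, and the real parser may treat malformed tags differently:

```python
from typing import Optional, Tuple

# Prefixes to strip before splitting on dots (plain "v" included for convenience)
PREFIXES = ("backend-v", "frontend-v", "v")


def parse_version(tag: str) -> Optional[Tuple[int, ...]]:
    """Turn 'backend-v0.8.7' into (0, 8, 7); return None if it doesn't parse."""
    for prefix in PREFIXES:
        if tag.startswith(prefix):
            tag = tag[len(prefix):]
            break
    try:
        return tuple(int(part) for part in tag.split("."))
    except ValueError:
        return None


def is_newer(latest: str, current: str) -> bool:
    """True only when both versions parse and latest is strictly greater."""
    latest_v, current_v = parse_version(latest), parse_version(current)
    if latest_v is None or current_v is None:
        return False
    return latest_v > current_v
```

Comparing integer tuples rather than strings is what makes 0.10.0 sort above 0.9.9.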

One fun bug I hit early on: the frontend version kept coming back as None because the update check was only reading the FRONTEND_VERSION environment variable, which isn't always set. The fix was adding a _detect_frontend_version() function that checks the env var first, then falls back to reading frontend/package.json. Small thing, but it meant the update check was silently skipping the frontend comparison entirely.
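The env-var-then-package.json fallback can be sketched roughly like this. The function name and the default path are assumptions for illustration. The real _detect_frontend_version() may differ in its details:

```python
import json
import os
from pathlib import Path
from typing import Optional


def detect_frontend_version(
    package_json: Path = Path("frontend/package.json"),  # assumed location
) -> Optional[str]:
    """Env var first, then the "version" field of package.json, else None."""
    env_version = os.environ.get("FRONTEND_VERSION", "").strip()
    if env_version:
        return env_version
    try:
        with open(package_json, encoding="utf-8") as f:
            data = json.load(f)
    except (OSError, ValueError):
        # Missing or unreadable file, or invalid JSON
        return None
    if not isinstance(data, dict):
        return None
    return data.get("version") or None
```

Returning None instead of raising lets the caller decide whether to skip the frontend comparison or report it as unknown.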

Backend: Triggering an Upgrade

When a user clicks "Upgrade now" in the UI, the backend writes a JSON file to /data/upgrade-requested. The key evolution from the initial version is that it now writes separate version fields for backend and frontend:

@router.post("/upgrade")
async def trigger_upgrade():
    # Refuse to start a second upgrade while one is already running
    status_file = os.path.join(DATA_DIR, "upgrade-status.json")
    if os.path.exists(status_file):
        with open(status_file) as f:
            status_data = json.load(f)
        if status_data.get("status") == "running":
            raise HTTPException(status_code=409, detail="An upgrade is already in progress")

    # Extract separate backend/frontend target versions.
    # (How `releases` is obtained is elided in this excerpt.)
    target_backend_version, target_frontend_version = _extract_latest_versions(releases)

    upgrade_request = {
        "requested_at": datetime.utcnow().isoformat() + "Z",
        "current_version": settings.APP_VERSION,
        "target_backend_version": target_backend_version,
        "target_frontend_version": target_frontend_version,
    }
    with open(os.path.join(DATA_DIR, "upgrade-requested"), "w") as f:
        json.dump(upgrade_request, f, indent=2)

That /data directory is a bind mount from the host's /srv/grapheon/data. So when the container writes upgrade-requested, it appears on the host filesystem. And that's where systemd picks it up.
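One refinement worth considering, which the handler above does not do: because the host reacts as soon as the file exists, writing to a temporary file and renaming it into place means the watcher can never see a half-written request. A sketch, with write_trigger_atomically as a hypothetical helper name:

```python
import json
import os
import tempfile


def write_trigger_atomically(
    data_dir: str, payload: dict, name: str = "upgrade-requested"
) -> str:
    """Write payload to a temp file in data_dir, then rename it into place."""
    # The temp file must live in the same directory so the rename stays
    # on one filesystem, which is what makes os.replace atomic.
    fd, tmp_path = tempfile.mkstemp(dir=data_dir, prefix=f".{name}.")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(payload, f, indent=2)
            f.flush()
            os.fsync(f.fileno())
        final_path = os.path.join(data_dir, name)
        os.replace(tmp_path, final_path)
        return final_path
    except BaseException:
        # Clean up the temp file if anything went wrong before the rename
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Note that mkstemp creates the file with owner-only permissions, so this assumes the host-side handler runs as a user that can still read it (root, for example).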

Writing separate version fields fixed a real problem — when backend was at v0.8.6 and frontend was at v0.9.1, the old code would try to pull grapheon-frontend:v0.8.6, which didn't exist. "Manifest unknown" errors at 11pm are not fun.

Frontend: The Update Banner

The React frontend has two places where updates surface. First, an UpdateBanner component that lives at the top of the app and polls /api/updates every 60 minutes:

const POLL_INTERVAL = 60 * 60 * 1000; // 60 minutes

useEffect(() => {
    checkForUpdatesHandler();
    pollIntervalRef.current = setInterval(checkForUpdatesHandler, POLL_INTERVAL);
    return () => clearInterval(pollIntervalRef.current);
}, []);

When an update is available, you get a blue gradient banner with "What's new" (expands to show release notes) and "Upgrade now." The banner is dismissible per-version via localStorage, so it won't nag you if you choose to skip a release.

The upgrade flow goes through a few states: null → confirm → in_progress → completed (auto-refresh after 3 seconds) or error (with retry). What changed since the first version is the in_progress state: it now shows a step-by-step progress timeline instead of just a spinner:

statusPollIntervalRef.current = setInterval(async () => {
    const statusResponse = await getUpgradeStatus();

    if (statusResponse.status === 'running') {
        // upgradeProgress is now a structured object
        setUpgradeProgress({
            message: statusResponse.message,
            step: statusResponse.step,
            totalSteps: statusResponse.total_steps,
            progress: statusResponse.progress,
        });
    } else if (statusResponse.status === 'completed') {
        clearInterval(statusPollIntervalRef.current);
        setUpgradeStep('completed');
        setTimeout(() => window.location.reload(), 3000);
    } else if (statusResponse.status === 'failed') {
        clearInterval(statusPollIntervalRef.current);
        setUpgradeStep('error');
    }
}, 5000);

The UI renders an animated progress bar showing completion percentage, a "Step N/5" counter, and a visual timeline with checkmarks for completed steps, a pulsing dot for the active step, and dimmed indicators for pending steps. It's a much better experience than staring at "Upgrading..." and wondering if anything is happening.

Second, there's a "Check for Updates" button on the Settings page with a modal that shows version comparison (with separate badges for UI and API versions), release notes, release date, and a GitHub link. Same upgrade flow, just triggered manually instead of by the polling interval.

The NixOS Side

This is where the real fun begins. The NixOS module (grapheon.nix) manages everything: the Podman network, both containers, a cloudflared tunnel, authentication credentials, and the auto-update machinery.

Dynamic Tags: The Version State File

The first thing I had to sort out was how to avoid hardcoding image tags in the nix config. If the NixOS module says ghcr.io/badgerops/grapheon-backend:v0.1.0, then that's what systemd starts, and you'd need a nixos-rebuild to change it. That defeats the whole point of an auto-updater.

The solution is a version state file. The nix module defines a defaultTag (used only on first boot), but after that, everything reads from /srv/grapheon/data/current-tag:

let
  defaultTag = "v0.3.0";
  versionFile = "${dataDirDb}/current-tag";

  readTag = ''
    if [ -f "${versionFile}" ]; then
      GRAPHEON_TAG="$(${pkgs.coreutils}/bin/cat "${versionFile}" | ${pkgs.coreutils}/bin/tr -d '[:space:]')"
    else
      GRAPHEON_TAG="${defaultTag}"
    fi
  '';

Each container service uses a wrapper script instead of inlining the podman run command. The wrapper reads the tag at start time:

backendStartScript = pkgs.writeShellScript "grapheon-backend-start" ''
    set -euo pipefail
    ${readTag}
    echo "Starting grapheon-backend with tag: $GRAPHEON_TAG"
    exec ${pkgs.podman}/bin/podman run \
      --rm \
      --name=grapheon-backend \
      --network=${grapheonNetwork} \
      --network-alias=grapheon-backend \
      --hostname=grapheon-backend \
      -v ${dataDirDb}:/data:Z \
      --env-file ${grapheonAuthEnvFile} \
      -e DATABASE_URL=sqlite:////data/network.db \
      -e APP_NAME=Grapheon \
      -e AUTH_ENABLED=True \
      -e ENFORCE_AUTH=True \
      -e JWT_ALGORITHM=HS256 \
      -e JWT_EXPIRATION_MINUTES=60 \
      --label io.containers.autoupdate=registry \
      ${backendImageBase}:$GRAPHEON_TAG
'';

Notice there's no -e APP_VERSION being injected. That was actually a bug in my first iteration — the nix config was passing the hardcoded version as an env var, which overrode whatever version the container image had baked in. The backend's config.py already reads a VERSION file from inside the container, so we just let it do its thing.

Also notice the auth-related environment variables and --env-file — Grapheon picked up OIDC and local admin authentication along the way, and those credentials get injected from a separate env file on the host that the NixOS activation script creates on first deploy.

The Auto-Update Script (Daily Timer)

The NixOS module has an inline auto-update script for the daily timer that handles the GHCR query and image pull:

autoUpdateScript = pkgs.writeShellScript "grapheon-auto-update" ''
    set -euo pipefail

    # GHCR requires an anonymous bearer token even for public images
    token="$(${pkgs.curl}/bin/curl -fsSL \
      'https://ghcr.io/token?scope=repository:badgerops/grapheon-backend:pull' \
      | ${pkgs.jq}/bin/jq -r '.token' \
    )"

    latest_tag="$(${pkgs.curl}/bin/curl -fsSL \
      -H "Authorization: Bearer $token" \
      https://ghcr.io/v2/badgerops/grapheon-backend/tags/list \
      | ${pkgs.jq}/bin/jq -r '.tags[]' \
      | ${pkgs.gnugrep}/bin/grep -E '^v[0-9]+\.[0-9]+\.[0-9]+$' \
      | ${pkgs.coreutils}/bin/sort -V \
      | ${pkgs.coreutils}/bin/tail -n1 \
    )"

    # Compare against what we're running
    ${readTag}
    if [ "$GRAPHEON_TAG" = "$latest_tag" ]; then
      echo "Already running $latest_tag — nothing to do"
      exit 0
    fi

    # Pull both images
    ${pkgs.podman}/bin/podman pull ${backendImageBase}:$latest_tag
    ${pkgs.podman}/bin/podman pull ${frontendImageBase}:$latest_tag

    # Persist the new tag — services read this on next start
    ${pkgs.coreutils}/bin/echo "$latest_tag" > "${versionFile}"

    # Restart picks up the new tag from the state file
    systemctl restart podman-grapheon-backend.service
    systemctl restart podman-grapheon-frontend.service
'';

Note the fully-qualified Nix store paths (${pkgs.curl}/bin/curl instead of bare curl) — this is how NixOS scripts ensure they use the exact versions of tools declared in the system configuration, not whatever happens to be on $PATH.
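For readers less fluent in shell pipelines, the grep and sort -V stage of the script above boils down to "keep only vX.Y.Z tags, then take the numerically highest". The same logic as a Python sketch (latest_release_tag is an illustrative name, not part of the project):

```python
import re
from typing import Iterable, Optional

# Same shape as the grep pattern: v<major>.<minor>.<patch>, nothing else
RELEASE_TAG = re.compile(r"v([0-9]+)\.([0-9]+)\.([0-9]+)")


def latest_release_tag(tags: Iterable[str]) -> Optional[str]:
    """Pick the highest vX.Y.Z tag, ignoring everything else (e.g. 'latest')."""
    best = None
    best_key = None
    for tag in tags:
        match = RELEASE_TAG.fullmatch(tag)
        if match is None:
            continue
        key = tuple(int(group) for group in match.groups())
        if best_key is None or key > best_key:
            best, best_key = tag, key
    return best
```

Returning None for an empty or non-matching list is a deliberate choice here: the caller can then skip the update instead of comparing against an empty string.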

The Upgrade Handler Script (grapheon-upgrade.sh)

The in-app upgrade trigger runs through a more evolved path than the daily auto-update. The Grapheon repo includes a standalone scripts/grapheon-upgrade.sh that implements a five-step upgrade process with granular status reporting:

#!/usr/bin/env bash
# grapheon-upgrade.sh — Host-level upgrade watcher script
set -euo pipefail

DATA_DIR="${DATA_DIR:-/data}"
REQUEST_FILE="${DATA_DIR}/upgrade-requested"
STATUS_FILE="${DATA_DIR}/upgrade-status.json"
BACKUP_DIR="${DATA_DIR}/backups"
HEALTH_URL="http://localhost:8000/api/health"
TOTAL_STEPS=5

The script reads separate backend and frontend versions from the trigger file, with backward-compat fallback to the old single target_version field:

read -r BACKEND_VERSION FRONTEND_VERSION < <(python3 -c "
import json, sys
try:
    data = json.load(open('${REQUEST_FILE}'))
    bv = data.get('target_backend_version', data.get('target_version', ''))
    fv = data.get('target_frontend_version', bv)
    print(bv, fv)
except Exception as e:
    print('', file=sys.stderr)
    sys.exit(1)
" 2>/dev/null || echo "")

Then it runs through five steps, writing structured JSON status after each one so the frontend can track progress:

Step 1: Back up data. Before touching anything, tar up the SQLite database, WAL, config, and env files to /data/backups/grapheon-backup-YYYY-MM-DD-HHMMSS.tar.gz. If the upgrade goes sideways, you've got a snapshot.

Step 2: Pull backend image. podman pull ghcr.io/badgerops/grapheon-backend:v${BACKEND_VERSION} with a 5-minute timeout.

Step 3: Pull frontend image. podman pull ghcr.io/badgerops/grapheon-frontend:v${FRONTEND_VERSION} — pulled separately with its own version tag.

Step 4: Restart services. systemctl restart both containers.

Step 5: Health check. Curl the /api/health endpoint every second for up to 30 seconds. If it never responds, the upgrade is marked as failed.
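The shape of that health-check loop, sketched in Python for illustration (the actual script uses curl in bash). Both function names are hypothetical, and the probe is passed in so the retry logic can be exercised without a running server:

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """True if the URL answers with a 2xx status, False on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        return False


def wait_until_healthy(
    probe: Callable[[], bool],
    attempts: int = 30,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe up to `attempts` times, sleeping `delay` between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)  # no point sleeping after the final attempt
    return False
```

With the defaults, wait_until_healthy(lambda: http_ok("http://localhost:8000/api/health")) gives roughly the "once a second for up to 30 seconds" behavior described above.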

Each step writes a status update that includes step progress:

write_status() {
    local status="$1" step="$2" msg="$3"
    local progress=$(( (step * 100) / TOTAL_STEPS ))
    [[ "${status}" == "completed" ]] && progress=100
    cat > "${STATUS_FILE}" <<EOF
{
    "status": "${status}",
    "message": "${msg}",
    "step": ${step},
    "total_steps": ${TOTAL_STEPS},
    "progress": ${progress},
    "updated_at": "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
}
EOF
}
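One caveat with building JSON through a heredoc: a message containing a double quote or a backslash produces invalid JSON. For comparison, here is the same payload built with a real serializer. This is a Python sketch with a hypothetical build_status helper that mirrors the fields above, not a drop-in replacement for the bash function:

```python
import json
from datetime import datetime, timezone

TOTAL_STEPS = 5


def build_status(status: str, step: int, msg: str, total_steps: int = TOTAL_STEPS) -> str:
    """Return the status payload as a JSON string, with msg safely escaped."""
    # Same arithmetic as the shell version: integer percent, forced to 100 on completion
    progress = 100 if status == "completed" else (step * 100) // total_steps
    payload = {
        "status": status,
        "message": msg,
        "step": step,
        "total_steps": total_steps,
        "progress": progress,
        "updated_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    return json.dumps(payload, indent=4)
```

In a shell-only setup, jq (already used elsewhere in this stack) can do the same escaping.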

The GHCR Authentication Saga

This one bit me. My first version of the script just curled the GHCR v2 API directly:

curl -fsSL https://ghcr.io/v2/badgerops/grapheon-backend/tags/list

401 Unauthorized. Even though the images are public.

Turns out GHCR implements the Docker Registry v2 authentication spec, which requires a token exchange even for anonymous access to public repositories. You have to:

  1. Hit https://ghcr.io/token?scope=repository:OWNER/REPO:pull to get an anonymous bearer token
  2. Pass that token as Authorization: Bearer $token on subsequent API calls
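The two steps above, sketched in Python with only the standard library. The function names are illustrative. The endpoints are the ones shown in this post, and error handling is left out to keep the flow visible:

```python
import json
import urllib.parse
import urllib.request
from typing import List


def token_url(repository: str) -> str:
    """Build the anonymous-token URL for a repository like 'owner/name'."""
    scope = f"repository:{repository}:pull"
    return "https://ghcr.io/token?" + urllib.parse.urlencode({"scope": scope})


def list_tags(repository: str, timeout: float = 10.0) -> List[str]:
    """Step 1: fetch an anonymous token. Step 2: list tags with it."""
    with urllib.request.urlopen(token_url(repository), timeout=timeout) as resp:
        token = json.load(resp)["token"]
    request = urllib.request.Request(
        f"https://ghcr.io/v2/{repository}/tags/list",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp).get("tags") or []
```

Calling list_tags("badgerops/grapheon-backend") would perform both requests. The token is short-lived, so it is fetched fresh on each call and never stored.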

One of those things that makes total sense in retrospect but is completely non-obvious when you're staring at a 401 from a public registry at 11pm.

Systemd Path Unit: The Glue

The bridge between "container wrote a file" and "host pulls new images" is a systemd path unit:

systemd.paths.grapheon-upgrade-trigger = {
    description = "Watch for Grapheon in-app upgrade request";
    wantedBy = [ "paths.target" ];
    pathConfig = {
        PathExists = "${dataDirDb}/upgrade-requested";
        Unit = "grapheon-upgrade-watcher.service";
    };
};

When /srv/grapheon/data/upgrade-requested appears, systemd activates the grapheon-upgrade-watcher service, which runs the upgrade handler. In the NixOS module, this is currently an inline wrapper that calls the auto-update script with status reporting:

upgradeHandlerScript = pkgs.writeShellScript "grapheon-upgrade-handler" ''
    set -euo pipefail

    STATUS_FILE="${dataDirDb}/upgrade-status.json"
    TRIGGER_FILE="${dataDirDb}/upgrade-requested"

    write_status() {
      ${pkgs.coreutils}/bin/echo "$1" > "$STATUS_FILE"
    }

    write_status "{\"status\":\"running\",\"message\":\"Pulling latest images from GHCR...\",\"started_at\":\"$(${pkgs.coreutils}/bin/date -Iseconds)\"}"

    # Remove trigger so the path unit re-arms
    ${pkgs.coreutils}/bin/rm -f "$TRIGGER_FILE"

    if ${autoUpdateScript}; then
      write_status "{\"status\":\"completed\",\"message\":\"Upgrade completed successfully.\",\"completed_at\":\"$(${pkgs.coreutils}/bin/date -Iseconds)\"}"
    else
      write_status "{\"status\":\"failed\",\"message\":\"Auto-update script exited with an error. Check journalctl -u grapheon-upgrade-watcher for details.\",\"completed_at\":\"$(${pkgs.coreutils}/bin/date -Iseconds)\"}"
    fi
'';

The next step is migrating this inline handler to call grapheon-upgrade.sh instead, which would bring the NixOS in-app upgrade path in line with the standalone script's five-step backup-pull-restart-healthcheck flow. For now, the daily timer uses the inline script (which queries GHCR for the latest tag), and the in-app path uses the wrapper above.

There's also a daily timer for unattended updates:

systemd.timers.grapheon-auto-update = {
    description = "Daily Grapheon auto-update check";
    wantedBy = [ "timers.target" ];
    timerConfig = {
        OnCalendar = "daily";
        Unit = "grapheon-auto-update.service";
        Persistent = true;
    };
};

This runs the auto-update script on a schedule, so even if nobody is looking at the UI, the deployment stays current.

Two Tag Schemes

One thing worth calling out: there are two different tag patterns in play.

GHCR container tags use a plain v prefix: v0.3.0, v0.8.7. The NixOS daily auto-update script queries these from the GHCR v2 tags API and filters for ^v[0-9]+\.[0-9]+\.[0-9]+$. Both backend and frontend containers get tagged with the same version when CI publishes them.

GitHub release tags use component prefixes: backend-v0.8.7, frontend-v0.9.1. The in-app update check queries these from the GitHub Releases API. Backend and frontend can version independently here — the frontend might be at v0.9.1 while the backend is at v0.8.7.

The in-app upgrade writes the separate versions to the trigger file, and the upgrade handler pulls each image with its correct tag. The daily auto-update uses a single GHCR tag for both. This works because CI publishes matching images to GHCR under the unified tag, even while GitHub releases track them separately.

The Full Picture

Here's the flow when someone clicks "Upgrade now", as implemented by the five-step grapheon-upgrade.sh handler:

  1. Frontend calls POST /api/updates/upgrade
  2. Backend writes /data/upgrade-requested with separate target_backend_version and target_frontend_version fields (bind-mounted from host)
  3. Systemd path unit detects the file, activates the upgrade handler
  4. Handler writes {"status":"running","step":1,"total_steps":5,"progress":20} for Step 1: Backing up data
  5. Handler creates a tar.gz backup of the database and config files
  6. Handler deletes the trigger file (re-arms the path unit)
  7. Step 2: Pull backend image from GHCR with the backend version tag
  8. Step 3: Pull frontend image from GHCR with the frontend version tag
  9. Step 4: Restart both Podman services
  10. Step 5: Health check — poll /api/health until it responds (up to 30s)
  11. Handler writes {"status":"completed","step":5,"total_steps":5,"progress":100}
  12. Frontend polls GET /api/updates/status, renders the progress bar and step timeline, sees "completed," auto-refreshes
  13. User sees the new version. Hopefully.

If any step fails, the handler writes {"status":"failed","step":N} with a message, and the frontend shows the error with a retry button.

The entire IPC is two files on a shared volume. No message queues, no sockets, no D-Bus. Just a trigger file and a status file. The container doesn't need any special privileges, and the host-side scripts are pure NixOS — fully declarative, reproducible, and auditable.

Lessons Learned

GHCR auth is not optional. Even for public images, you need the token dance. Don't assume that a public image means the registry API will answer unauthenticated requests.

File-based IPC is underrated. systemd path units are built for exactly this kind of thing. They're reliable, they re-arm automatically, and they require zero custom infrastructure.

A version state file beats hardcoded tags. Instead of pinning an image tag in your nix config and retagging at runtime (gross) or requiring a nixos-rebuild for every release, just store the current tag in a file on disk. The service wrapper reads it at start time, and the auto-update script writes it before restarting. The nix module's defaultTag is only there for first boot.

Don't override what the container already knows. My first pass had the nix config injecting -e APP_VERSION=0.1.0 into the container. The backend already reads its version from a VERSION file baked into the image, but the env var trumped it. So even after a successful upgrade to v0.3.0, the UI still showed 0.1.0. The fix was just... deleting the env var and letting the container report its own version.

Version your components independently. The backend and frontend don't always change in lockstep. Early on, I used a single version tag for both, which caused "manifest unknown" errors when they diverged. Now the trigger file carries target_backend_version and target_frontend_version separately, and the upgrade script pulls each with its own tag. The fallback to the old single target_version field keeps it backward-compatible.
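That fallback chain is easier to read outside a shell heredoc. A standalone sketch of the same logic follows. read_target_versions is a made-up name, and unlike a plain dict.get default it also treats a key that is present but null as missing:

```python
import json
from typing import Tuple


def read_target_versions(request_path: str) -> Tuple[str, str]:
    """Return (backend_version, frontend_version) from the trigger file."""
    with open(request_path, encoding="utf-8") as f:
        data = json.load(f)
    # New-style field first, then the legacy single field, then empty string.
    # Using `or` also covers keys that exist but hold null.
    backend = data.get("target_backend_version") or data.get("target_version") or ""
    # Frontend falls back to whatever the backend resolved to
    frontend = data.get("target_frontend_version") or backend
    return backend, frontend
```

An old-format file with only target_version yields the same version for both components, which matches the pre-split behavior.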

Back up before you upgrade. The pre-upgrade backup was added after a close call. It's a tar.gz of the SQLite database, WAL file, and config. For a database this size it takes almost no time and gives you a rollback point. Cheap insurance.

Health checks close the loop. The original version would restart the services and call it done. But "services restarted" doesn't mean "services healthy." Adding a 30-second health check loop that polls /api/health catches startup failures that would otherwise silently leave the app down.

Show progress, not just status. The first version just showed "Upgrading..." with a spinner. Now the frontend renders a progress bar, a step counter ("Step 3/5"), and a visual timeline with checkmarks. Users don't wonder if it's stuck anymore.

Cache with a fallback. The 1-hour cache on GitHub API responses keeps us well under GitHub's rate limit for unauthenticated requests, and falling back to stale cache on API failure means the update check never hard-fails from the user's perspective.

Status polling is fine. I considered WebSockets for the upgrade progress, but polling every 5 seconds is plenty responsive for an operation that takes 30-60 seconds. Keep it simple.

The whole thing has grown from the initial implementation to roughly 500 lines of Python, 550 lines of React, 170 lines of standalone shell script, and 300 lines of Nix. Not bad for a feature that means I never have to SSH in to deploy again.

This grew from a whim ("hey, how can I make this auto-update?") to "hey, it would be cool if I could do in-app updates" to the current iteration. I think I'll create a git repo and a blog post on how to implement this in your own application, for ease of reference.

All for now,

-BadgerOps