BadgerOps

Treating OpenClaw Like a Junior Sysadmin

BadgerOps — Wed, 20 May 2026 20:10:59 GMT

I wanted to kick the tires on OpenClaw, but I did not want to install it directly on my primary workstation.

That's not a knock on OpenClaw specifically. It is just the normal posture I try to keep with new automation tools, especially ones that are designed to sit close to my workflow, read context, talk to services, and take actions on my behalf. Before I hand something the keys to my actual daily-driver environment, I want to see how it behaves in a smaller, more boring box.

So the question became: how do I evaluate an AI companion without turning it into a trusted endpoint?

The answer I settled on was to treat it like a junior sysadmin.

Give it an identity. Give it a "workstation". Give it limited access to logs, metrics, chat, and a couple of source-control systems. Let it observe. Let it help troubleshoot. Do not give it my shell, my browser profile, my SSH agent, or a wide-open view of the homelab.

The First Attempt

The first pass was a small VM on one of my internal servers: guiltyspark. That made sense at the time: isolated guest, disposable disk, a narrow network path, and a firewall policy that only allowed what I explicitly needed. The VM was built around a Debian guest with OpenClaw bootstrapped inside it, and the network policy was intentionally grumpy about internal access. Matrix was allowed. The rest of RFC1918 was mostly not.

# This was the original guiltyspark VM shape.
# The bridge could talk to itself, and Matrix was allowed.
# Everything else in RFC1918 got rejected before the final allow.

${iptables} -I FORWARD 1 \
  -i ${openclaw.network.bridge} \
  -o ${openclaw.network.bridge} \
  -j ACCEPT

${iptables} -I FORWARD 2 \
  -i ${openclaw.network.bridge} \
  -d ${openclaw.matrix.ip}/32 \
  -p tcp -m multiport --dports 80,443 \
  -j ACCEPT

${iptables} -I FORWARD 3 -i ${openclaw.network.bridge} -d 10.0.0.0/8 -j REJECT
${iptables} -I FORWARD 4 -i ${openclaw.network.bridge} -d 172.16.0.0/12 -j REJECT
${iptables} -I FORWARD 5 -i ${openclaw.network.bridge} -d 192.168.0.0/16 -j REJECT

# After the explicit internal rejects, normal outbound traffic was okay.
${iptables} -I FORWARD 6 -i ${openclaw.network.bridge} -j ACCEPT

                   allowed: tcp/80,443
+---------------+ ------------------------> +-------------------+
| openclaw VM   |                           | matrix.badger.lan |
| Debian guest  |                           | Synapse           |
+---------------+                           +-------------------+
        |
        | blocked: 10/8, 172.16/12, 192.168/16
        v
+---------------------------------------------------------------+
| the rest of the internal network                              |
+---------------------------------------------------------------+
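
The ordering of those rules matters, because the first matching rule decides the verdict. Here is a minimal Python sketch of that first-match logic. The bridge name and Matrix address are hypothetical stand-ins for the templated values, and it models only the six rules shown above, not iptables itself.

```python
from ipaddress import ip_address, ip_network

BRIDGE = "br-openclaw"                  # hypothetical bridge name
MATRIX_IP = ip_address("10.170.0.50")   # hypothetical Matrix server address
RFC1918 = [ip_network(n) for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]


def verdict(dst, proto="tcp", dport=443, out_if=None):
    """Return the verdict for a packet arriving from the OpenClaw bridge."""
    addr = ip_address(dst)
    # Rule 1: traffic that stays on the bridge is allowed.
    if out_if == BRIDGE:
        return "ACCEPT"
    # Rule 2: Matrix over HTTP/HTTPS is allowed.
    if addr == MATRIX_IP and proto == "tcp" and dport in (80, 443):
        return "ACCEPT"
    # Rules 3-5: everything else in RFC1918 is rejected.
    if any(addr in net for net in RFC1918):
        return "REJECT"
    # Rule 6: normal outbound traffic is allowed.
    return "ACCEPT"
```

If the Matrix allow were placed after the RFC1918 rejects, the Matrix server would be rejected along with the rest of the internal range, which is why the allow sits at position 2.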

That worked well as a security model, but it did not work well as an evaluation environment. The VM (1 core, 4 GB RAM) felt cramped, and I wanted more horsepower without moving the experiment onto my primary workstation.

I had a spare laptop sitting around, so I installed NixOS on it and, as a Halo nerd, named it cortana. Good enough. It had actual hardware, it was not my main machine, and it was now part of the NixOS fleet. So Cortana became the OpenClaw host.

Cortana Becomes the Intern Desk

The first step was, obviously, infrastructure work: Cortana got a real DNS name, cortana.badger.lan, and a static management address from the internal pool.

# common/system/common.nix
# Keep hostnames and management addresses in one shared inventory.

hostnames.cortana = "cortana.badger.lan";

hosts.cortana = {
  mgmt = "10.170.0.117"; # get it? 117? ok...
};

# modules/infrastructure/coredns.nix
# CoreDNS then renders the LAN A record from that inventory.

cortana IN A ${infraCommon.hosts.cortana.mgmt}

# hosts/cortana/networking.nix
# The host gets the same address statically, instead of relying on DHCP luck.

addresses = "${common.hosts.cortana.mgmt}/24";

Secrets are handled through SOPS. The rendered OpenClaw configuration lands in /home/cortana/.openclaw/openclaw.json with the Matrix credentials and gateway token injected by the system configuration. No room IDs, passwords, or tokens are baked into the repo.

# hosts/cortana/openclaw.nix
# These secret names exist in Git, but the values live encrypted in SOPS.

sops.secrets = {
  "openclaw/matrix-user-id" = { };
  "openclaw/matrix-password" = { };
  "openclaw/matrix-room-id" = { };
  "openclaw/gateway-token" = { };
};

# Render the runtime config as the cortana user.
# The actual token/password placeholders are replaced by sops-nix at activation time.

sops.templates."openclaw.json" = {
  owner = "cortana";
  group = "users";
  mode = "0400";
};

The OpenClaw gateway itself is not running as a user systemd service. That was one of the early pain points. On NixOS, the installer expected a more conventional Linux environment, and I ran into user-bus and package-manager assumptions. The quickstart wanted Node. The Node installer wanted a package manager it recognized. The gateway service wanted systemd user-bus behavior that was not present in the way I was invoking it. Then rootless container namespace setup had its own newuidmap complaints after a reboot.

None of those problems were individually shocking. They were just enough friction to make the native NixOS path feel like the wrong thing to evaluate first.

This is primarily why I chose to deploy inside Distrobox. Cortana is still NixOS, but OpenClaw runs inside a Fedora Distrobox named openclaw because... it's obvious? The NixOS module owns the outer lifecycle: it creates the box if needed, installs the normal Fedora-side dependencies, runs the OpenClaw installer there, and starts the gateway under a system service on the host.

# Host: NixOS
# Container userland: Fedora via Distrobox
# OpenClaw sees the kind of manageable Linux host it expects; NixOS still owns the service lifecycle.

boxName = "openclaw";
boxImage = "quay.io/fedora/fedora:latest";

if ! podman container exists ${boxName}; then
  distrobox-create \
    --yes \
    --name ${boxName} \
    --image ${boxImage} \
    --additional-packages "nodejs npm make gcc gcc-c++ cmake python3 chromium git curl tar gzip xz which procps-ng diffutils findutils"
fi

distrobox-enter --name ${boxName} -- bash -lc '
  set -euo pipefail

  # Keep npm global installs under the cortana home directory.
  mkdir -p ${npmPrefix}
  npm config set prefix ${npmPrefix}
  export PATH="${npmPrefix}/bin:/usr/local/bin:/usr/bin:/bin:$PATH"

  # Let OpenClaw install itself in the Fedora userland.
  if ! openclaw --version >/dev/null 2>&1; then
    curl -fsSL https://openclaw.ai/install.sh | bash
  fi
'

+-------------------------------------------------------------+
| cortana.badger.lan                                          |
| NixOS                                                       |
|                                                             |
|  systemd: openclaw-gateway.service                          |
|      |                                                      |
|      v                                                      |
|  distrobox-enter openclaw                                   |
|      |                                                      |
|      v                                                      |
|  Fedora userland: node, npm, chromium, openclaw             |
|                                                             |
|  gateway bind: 127.0.0.1:18789                              |
+-------------------------------------------------------------+

# The host service is intentionally simple.
# NixOS owns start/stop/restart. OpenClaw runs as cortana.

systemd.services.openclaw-gateway = {
  description = "OpenClaw Gateway (Distrobox)";
  wantedBy = [ "multi-user.target" ];
  wants = [ "network-online.target" ];
  after = [ "network-online.target" ];

  serviceConfig = {
    User = "cortana";
    Group = "users";
    WorkingDirectory = "/home/cortana";
    Type = "simple";
    ExecStartPre = openclawBootstrap;
    ExecStart = openclawGatewayStart;
    ExecStop = openclawGatewayStop;
    Restart = "always";
    RestartSec = "10s";
  };
};

The gateway binds to loopback on port 18789. If I want the dashboard, I SSH tunnel to Cortana and open it locally (for now; eventually it may join the rest of my internal services on Homepage). OpenClaw also gets a host-side wrapper, so from Cortana I can run openclaw ... and have it enter the Distrobox with the right PATH and internal CA trust.

# Local workstation
# Nothing exposed on the LAN; the dashboard is reached through SSH.

ssh -N -L 18789:127.0.0.1:18789 cortana@cortana.badger.lan

# Then open this locally:
# http://localhost:18789/

Identity, Not My Identity

The important bit for me was that OpenClaw should not be me.

I created separate identities for it: a GitHub profile and a Forgejo profile for my internal git service. Those accounts are intentionally scoped as service-style users. They can be invited to the things I want them to see, and they can be removed or rotated without touching my personal credentials.

# Conceptually, this is the access model I wanted:

human:badgerops
  - primary workstation
  - normal SSH agent
  - broad repo/admin access
  - browser sessions and personal tokens

assistant:openclaw
  - host: cortana
  - Unix user: cortana
  - GitHub user: 
  - Forgejo user: cortana-bot
  - Matrix user: cortana
  - access: selected rooms, selected repos, selected logs/metrics

That distinction matters. If this is supposed to act like a junior sysadmin, it should have an account like a junior sysadmin. Not my browser cookies. Not my workstation SSH agent. Not my GitHub or Forgejo or ${service} token just because that is easiest. A little extra work = a little less future me pain. Maybe.

Right now the goal is not to let it autonomously administer everything. The goal is to let it participate in the troubleshooting loop with enough context to be useful.

Matrix As The Control Plane

The chat side is Matrix, because that is already where my internal alerting and operational chat lives.

OpenClaw is configured against my internally hosted Synapse instance at matrix.badger.lan. The Matrix plugin is enabled, encrypted rooms are enabled, and direct messages use pairing. Group access is allowlisted, not open. That means the bot does not get to roam through arbitrary rooms just because it exists on the Matrix homeserver.

# Rendered shape of the Matrix channel config.
# Values shown as placeholders are injected from SOPS.
# dangerouslyAllowPrivateNetwork is on because the homeserver is _on_ a private network.

"channels": {
  "matrix": {
    "enabled": true,
    "homeserver": "https://matrix.badger.lan",
    "network": {
      "dangerouslyAllowPrivateNetwork": true
    },
    "userId": "${openclaw/matrix-user-id}",
    "password": "${openclaw/matrix-password}",
    "deviceName": "OpenClaw Gateway",
    "encryption": true,
    "dm": {
      "policy": "pairing"
    },
    "groupPolicy": "allowlist",
    "groups": {
      "${openclaw/matrix-room-id}": {
        "enabled": true
      }
    },
    "autoJoin": "allowlist",
    "autoJoinAllowlist": [
      "${openclaw/matrix-room-id}"
    ]
  }
}

For the first real room, I added it to the internal infra room. That is also where the Alertmanager-to-Matrix relay can post alerts, so OpenClaw can see the same operational noise I would normally react to: service health, host issues, OpenClaw's own gateway state, and the other homelab monitoring signals.

+----------------+       webhook        +--------------------------+
| Prometheus     | ------------------>  | Alertmanager             |
| alert rules    |                      |                          |
+----------------+                      +------------+-------------+
                                                   |
                                                   | Matrix relay
                                                   v
                                        +--------------------------+
                                        | #infra Matrix room       |
                                        | humans + OpenClaw        |
                                        +--------------------------+

There was a small gotcha here: pairing requests can time out. I spent a while staring at the CLI saying there were no pending Matrix pairings while the chat UI still showed one. Eventually I realized it had expired. Start over, pair again, move on.

Observability For The Assistant

If I am going to run a helper that is supposed to help with operations, I need to be able to observe the helper too.

The Cortana module exports OpenClaw health into the Node Exporter textfile collector once per minute. It runs openclaw health --json as the cortana user and converts the output into Prometheus metrics. That gives me scrape success, gateway health, event loop delay, event loop utilization, agent sessions, and Matrix channel state.

# Timer: refresh the OpenClaw textfile metrics every minute.

systemd.timers.openclaw-node-exporter-textfile = {
  wantedBy = [ "timers.target" ];
  timerConfig = {
    OnBootSec = "2m";
    OnUnitActiveSec = "1m";
    Unit = "openclaw-node-exporter-textfile.service";
  };
};

# Collector: ask OpenClaw for health JSON as the cortana user.

health_json="$(${pkgs.util-linux}/bin/runuser -u cortana -- ${openclawWrapper}/bin/openclaw health --json 2>/dev/null)"

# Example output written to node_exporter's textfile collector.
# Prometheus scrapes this like any other node metric.

openclaw_scrape_success 1
openclaw_gateway_health_ok 1
openclaw_event_loop_delay_p99_milliseconds 12
openclaw_event_loop_utilization 0.02
openclaw_channel_connected{channel="matrix",account="default"} 1
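
For illustration, the JSON-to-textfile conversion could be sketched like this in Python. The shape of the health JSON here (an "ok" flag and a "channels" list with "name", "account", and "connected" keys) is an assumption made for the example, not the documented output of openclaw health --json, and render_metrics is a hypothetical helper name.

```python
import json


def render_metrics(health_json):
    """Render Prometheus text-format lines from a health JSON document."""
    try:
        health = json.loads(health_json)
    except (json.JSONDecodeError, TypeError):
        # An unparseable payload is reported as a failed scrape.
        return "openclaw_scrape_success 0\n"
    if not isinstance(health, dict):
        return "openclaw_scrape_success 0\n"

    lines = [
        "openclaw_scrape_success 1",
        f"openclaw_gateway_health_ok {int(bool(health.get('ok')))}",
    ]
    # Assumed shape: one entry per configured channel account.
    for channel in health.get("channels", []):
        labels = f'channel="{channel["name"]}",account="{channel["account"]}"'
        connected = int(bool(channel.get("connected")))
        lines.append(f"openclaw_channel_connected{{{labels}}} {connected}")
    return "\n".join(lines) + "\n"
```

Whatever does the rendering, it is worth writing the result to a temporary file and renaming it into the textfile directory, so node_exporter never reads a half-written file.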

Prometheus scrapes Cortana's node exporter with role=openclaw. Alert rules watch for stale OpenClaw metrics, failed health scrapes, an unhealthy gateway, and disconnected channels. There is also a Grafana dashboard for the gateway, because eventually every little homelab experiment needs a dashboard. Apparently this is the law.

# Prometheus target. Cortana is just another node_exporter scrape,
# but it gets role=openclaw so dashboards and alerts can filter cleanly.

- targets: ['${globalCommon.hosts.cortana.mgmt}:9100']
  labels:
    host: cortana
    role: openclaw

# A few of the OpenClaw-specific alerts.
# These are less about paging me immediately and more about proving the assistant itself is observable.

- alert: OpenClawMetricsStale
  expr: (time() - openclaw_scrape_timestamp_seconds{job="node-exporter",host="cortana"}) > 300
  for: 5m
  labels:
    severity: warning
    category: openclaw

- alert: OpenClawGatewayUnhealthy
  expr: openclaw_gateway_health_ok{job="node-exporter",host="cortana"} == 0
  for: 5m
  labels:
    severity: critical
    category: openclaw

- alert: OpenClawChannelDisconnected
  expr: openclaw_channel_connected{job="node-exporter",host="cortana"} == 0
  for: 5m
  labels:
    severity: warning
    category: openclaw

Logs go through Vector into Loki. OpenClaw writes JSON-ish logs under the Cortana user's home and temporary runtime paths, Vector normalizes the useful fields, and Loki labels them with host, service, unit, source, and level. One minor tradeoff: Vector runs as root on Cortana (for other host log reasons). I am not thrilled by that, but it is contained to this host and may be worth revisiting later.

# Vector reads both the temporary OpenClaw logs and the durable user logs.

sources.openclaw_files = {
  type = "file";
  include = [
    "/tmp/openclaw/openclaw-*.log"
    "/home/cortana/.openclaw/logs/*.jsonl"
  ];
  read_from = "end";
};

# Normalize the labels before sending to Loki.
# This makes queries like {service="openclaw",host="cortana"} work cleanly.

sinks.openclaw_loki = {
  type = "loki";
  endpoint = "http://loki.badger.lan";
  labels = {
    host = "{{ host }}";
    service = "{{ service }}";
    source = "{{ source_type }}";
    unit = "{{ unit }}";
    level = "{{ level }}";
  };
};

What Worked

The model feels right so far.

OpenClaw is not sitting on my workstation. It does not inherit my local trust. It has its own host, its own account, its own chat identity, and its own source-control identities. It can see selected operational context, especially Matrix alerts and the logs/metrics I choose to expose.

Distrobox ended up being the right compromise for this stage. I still get a declarative NixOS host managing the lifecycle, DNS, users, SOPS secrets, systemd service, Prometheus integration, Vector integration, and Grafana provisioning. OpenClaw gets a normal Fedora-ish userland where its installer and Node assumptions are much less weird.

I’ve also been using Cortana as a lightweight engineering assistant around my GitHub and Forgejo work. The first pattern so far is automated clawpatch review: she picks an active repo, runs a read-only code review pass, turns the findings into a Markdown report, commits that report into a dedicated archive repo, and sends me the summary in Matrix.

cron schedule
  -> choose an active (commits in last year) BadgerOps repo
  -> run clawpatch review
  -> save Markdown report
  -> commit report to forgejo:cortanabot/clawpatch-reports
  -> send summary to Matrix

The important part is that this is review-only automation. It does not push branches, open PRs, or publish changes unless I explicitly ask. That keeps the loop useful without letting automation mutate production code behind my back.

What Still Needs Cleanup

The Matrix room configuration started as a single room and the current Nix template still renders a single SOPS-backed room ID into groups and autoJoinAllowlist. I have a multi-room secret shape ready, but the renderer still needs to consume that JSON before I can honestly say the room allowlist is multi-room in the deployed config.

# Current: one SOPS-backed room ID rendered into the allowlist.

"groups": {
  "${openclaw/matrix-room-id}": {
    "enabled": true
  }
}

# Next: render a SOPS-backed JSON object/list for multiple rooms.
# The secret exists, but the Nix template still needs to consume it.

openclaw/matrix-room-ids-json
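
As a rough sketch of what the renderer needs to do, here is the multi-room expansion in Python rather than Nix. It assumes the secret holds a plain JSON list of room ID strings, which is only one of the shapes mentioned above, and render_room_allowlist is a hypothetical name.

```python
import json


def render_room_allowlist(room_ids_json):
    """Build the groups and autoJoinAllowlist entries from a JSON list of room IDs."""
    room_ids = json.loads(room_ids_json)
    if not isinstance(room_ids, list) or not all(isinstance(r, str) for r in room_ids):
        raise ValueError("expected a JSON list of room ID strings")
    # Drop duplicates while keeping the original order.
    unique = list(dict.fromkeys(room_ids))
    return {
        "groupPolicy": "allowlist",
        "groups": {room: {"enabled": True} for room in unique},
        "autoJoin": "allowlist",
        "autoJoinAllowlist": unique,
    }
```

Both keys are derived from the same list on purpose, so a room can never be auto-joined without also being in the group allowlist.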

I also want to tighten the source-control permissions further once I know which workflows are useful. The right shape is probably a tiny set of repos, branch-only write permissions where possible, and no credential that would be painful to rotate.

The next useful test is not whether OpenClaw can answer trivia in chat. I do not care much about that. The useful test is whether it can look at an alert, inspect the narrow set of logs and metrics I exposed, find the likely failing unit or regression, and propose a fix that I can review like any other pull request. The same goes for codebase changes as I continue to work on projects.

The workflow I want to prove:


1. Alertmanager posts an alert into Matrix
2. OpenClaw sees the alert in an allowlisted room
3. OpenClaw checks only the logs/metrics it has access to
4. OpenClaw proposes a diagnosis and patch
5. Human reviews the change like any other PR

The Takeaway

I think this is the pattern I want for AI companion tools in my own infrastructure: do not install them into the most trusted place first. Give them a small desk, a badge with their own name on it, limited read access, and a very boring job.

In this case, that desk is Cortana. The badge is a Matrix account plus separate GitHub and Forgejo users. The boring job is watching alerts, reading selected logs and metrics, and helping me reason through operational issues.

That feels a lot better than installing a brand-new assistant directly onto my workstation and hoping the defaults line up with my threat model.

We'll see where things end up next!

All for now,

-BadgerOps

Why I Wrote a New Terraform Provider for UniFi

BadgerOps — Mon, 18 May 2026 20:51:00 GMT

I have been doing more work around local UniFi management lately, and one thing kept bothering me: there was not a Terraform workflow for UniFi that I actually wanted to use.

(Yes, there are several other UniFi providers out there - but they all basically forked the O.G. provider: https://github.com/paultyng/terraform-provider-unifi, and provide a small amount of additional feature support on top of that [now archived] provider. Moving on.)

UniFi is good when you are clicking around in one controller. It gets less good when you want repeatability. Networks, WiFi broadcasts, firewall rules, DHCP reservations, ACLs. That all turns into remembered UI state unless you put some structure around it.

So I wrote one.

The result is badgerops/unifi, a Terraform provider for UniFi Network configuration based on the actual UniFi OpenAPI Spec. It is published on the Terraform Registry, and people are already using it. As of May 20, 2026, there have been 153 downloads. That is still early, but it is enough to confirm that this was not just a problem I had.

The problem I was actually trying to solve

As a recap, I did not start this because I wanted to write a provider for the sake of writing a provider.

I started because, like any self-respecting sysadmin, I wanted UniFi configuration to behave more like infrastructure and less like a pile of clicks. I wanted to define site state in code, review changes in a normal workflow, and stop relying on memory and screenshots to understand how a controller was configured.

I spent time looking at the existing UniFi Terraform options. The short version is not that anybody did something wrong. The original provider was based on go-unifi, which was built by decompiling the original UniFi Network application JAR file and generating code from the JSON files contained in that JAR. All of the other providers fork and extend that same basic premise.

Sometime in recent history, Ubiquiti started shipping an OpenAPI spec file inside their Network application code, so I thought: “hey, let’s use that!”

I fired up my trusty friend Claude and started planning the implementation, giving it the existing configuration of my UniFi network, the OpenAPI spec, and a simple instruction: "Go make me a Terraform provider."

A few hours later (and after much back and forth, redirecting, hand-editing, and mild cussing), I had a functioning provider that showed 0 drift between the previous (paultyng provider) state and the current plan with my own shiny provider.

On to the nitty gritty!

What this provider is

This provider targets core UniFi Network configuration. This includes things like:

networks
WiFi broadcasts
firewall zones and policies
firewall policy ordering
traffic matching lists
DNS policies
ACL rules and ACL rule ordering
DHCP reservations

It also includes data sources for the lookup and reference objects you need to build useful configurations: sites, devices, networks, WiFi broadcasts, firewall zones and policies, VPN references, DPI applications, countries, RADIUS profiles, device tags, WANs, switch stacks, MC-LAG domains, and LAGs.

It is intentionally aimed at durable configuration state. It is not trying to be a giant wrapper around every controller action. I am much more interested in the parts of UniFi that benefit from reviewable, repeatable Terraform configuration than in one-off operational actions. (Those should still be clicks, imho)

Why I built it around the OpenAPI snapshot

A big part of the design is that the repository tracks a committed UniFi Network OpenAPI snapshot and generates the low-level client code from that.

That matters because the UniFi API story is useful, but not perfectly clean. If you want a provider to stay understandable over time, it helps a lot to anchor it to a real contract instead of a bunch of “the controller seemed to accept this last time I tried it” logic.

So the provider is shaped around the committed snapshot, and the generated client stays behind explicit translation code. That means the Terraform surface stays intentional. Fields exist because they are supported and mapped on purpose, not because raw JSON happened to leak through.

The current public release, 0.2.12, refreshed that snapshot from UniFi Network 10.2.105 to 10.3.58. That brought in support for newer WiFi schema additions, including open security encryption modes and standard-broadcast DNS assistance configuration.

This is also part of why I spent time on the gitops tooling around the provider. There is weekly automation to watch upstream UniFi package releases and flag when the committed snapshot likely needs a refresh. That does not magically make maintenance happen, but it does mean staying current is built into the project shape instead of being a vague future intention.

The API reality

In general, if UniFi (or any other service provider) exposes something in the integration API, that is the surface I want to build against.

I did not want to turn this into Terraform over random private controller endpoints.

But. DHCP reservations are useful enough - and I needed them - that I made one explicit exception instead of pretending the official surface already covered them. (Because it didn't.)

The provider is integration-API-first, with one narrow legacy exception for unifi_dhcp_reservation. The reason is that the current committed OpenAPI spec does not expose DHCP reservation writes. For that one resource, the provider uses the legacy local Network client database endpoint. Fingers crossed that this changes soon.

Even there, I still tried to keep the behavior narrow and explicit. The provider is not “private API everywhere.” It is a small, documented exception where the official surface is not there yet.

What it covers today

Today, the provider gives you a usable base for managing real UniFi site configuration with Terraform. It has generated docs, checked-in examples, Registry-ready release packaging, and tests around the controller behaviors that matter for these resources.

I also spent time chasing firewall behavior against real controller quirks, which consumed... time. It took a while to sort out, and I am still figuring out a good way to report when there is a controller issue. For now we just have a bunch of unit tests and log output.

The fact that it is already past its first hundred downloads is a useful sanity check. It does not make the provider mature overnight, but it does suggest there are other people who want this same shape of workflow.

A small example

This is the kind of configuration I wanted to be able to write:

terraform {
  required_providers {
    unifi = {
      source  = "badgerops/unifi"
      version = "0.2.12"
    }
  }
}

provider "unifi" {
  api_url        = var.unifi_api_url
  api_key        = var.unifi_api_key
  allow_insecure = false
}

data "unifi_site" "main" {
  name = "default"
}

resource "unifi_network" "management" {
  site_id    = data.unifi_site.main.id
  name       = "management"
  management = "GATEWAY"
  vlan_id    = 200

  ipv4_configuration = {
    host_ip_address = "10.200.0.1"
    prefix_length   = 24
    dhcp = {
      mode = "SERVER"
      range = {
        start = "10.200.0.10"
        stop  = "10.200.0.200"
      }
    }
  }
}

resource "unifi_dhcp_reservation" "switch" {
  site_id     = data.unifi_site.main.id
  mac_address = "AA:BB:CC:DD:EE:01"
  fixed_ip    = "10.200.0.25"
}

That is much closer to the workflow I wanted: declare the shape of the site, review the change, apply it, and keep going.

Sharp edges, because UniFi is still UniFi

I do not think a post like this is useful if it pretends the platform is cleaner than it is.

There are a few important caveats:

Zone-based firewall needs to be enabled in UniFi before the firewall resources work.
Controller behavior around firewall rules can be quirky, especially with some combinations of filters and action settings.
DHCP reservations still use a narrow legacy API exception because the integration API does not expose reservation writes.
The provider is for configuration workflows, not controller operations like adoption, telemetry, or device lifecycle management.

Why this project matters to me

This scratches a very real operational itch for me.

It is useful in my homelab. It is potentially useful in client work. It is also a good exercise in building something carefully around an API surface that is useful, even if imperfect.

I enjoyed building something that actually solved a problem for people-other-than-me, and it was good to put my AI friend to work on something other than a task manager (heh).

What is next

The short list from here is pretty straightforward:

keep expanding resource coverage where the integration API makes sense for Terraform
keep testing against real controller behavior instead of assuming the docs tell the whole story (Anyone got a large deployment and want to play?)
reduce surprises around UniFi firewall behavior
move DHCP reservations back to the official API if UniFi exposes that path later
keep the provider aligned with future upstream snapshot changes

The weekly upstream check is part of that maintenance story. It is the best way I could think of to keep this project from quietly drifting out of date.

Give it a try

If you use UniFi and Terraform, give it a try.

If it breaks in an interesting way, definitely tell me.

-BadgerOps

GitHub: https://github.com/BadgerOps/terraform-provider-unifi
Terraform Registry: https://registry.terraform.io/providers/BadgerOps/unifi/latest

Building an In-App Auto-Updater for a Containerized NixOS Deployment

BadgerOps — Tue, 10 Feb 2026 16:16:00 GMT

I've been building Grapheon, a network graph analysis and visualization tool, and deploying it as a pair of Podman containers on NixOS. The stack is pretty straightforward: a FastAPI backend, a React frontend, both published as container images to GitHub Container Registry (GHCR), and wired together with a NixOS module that manages everything through systemd services.

It works great. But every time I cut a new release I'd have to SSH into the box, pull the new images, and restart the services. Not exactly the "set it and forget it" experience I was going for.

So I decided to build an in-app auto-updater — the kind where you see a little banner saying "hey, there's a new version" and you click a button and it just... upgrades itself. Sounds simple enough, right?

What started as a straightforward trigger-file mechanism has evolved through a dozen iterations into something with pre-upgrade backups, separate backend/frontend version tracking, step-by-step progress reporting, and post-upgrade health checks. Here's where it stands today.

The Problem

Here's the thing about containerized deployments: the app running inside the container can't exactly restart itself. It doesn't have access to Podman, it doesn't know about systemd, and it definitely shouldn't be pulling its own replacement image. The container is a fish that needs to convince the ocean to swap it out for a different fish.

So the architecture needed two halves:

The app side — the backend checks GitHub for new releases, the frontend shows a banner and lets you click "upgrade"
The NixOS side — a systemd path unit watches for a trigger file, kicks off an upgrade handler that backs up data, pulls new images from GHCR, restarts services, and verifies the health of the new deployment

The container talks to the host through a shared volume. That's it. A file on disk is the entire IPC mechanism. Sometimes the simplest approach is the right one.

The App Side

Backend: Checking for Updates

The backend has an /api/updates router with three endpoints:

GET /api/updates — checks GitHub Releases API for the latest version
POST /api/updates/upgrade — writes a trigger file to kick off the host-side upgrade
GET /api/updates/status — reads the upgrade status file so the frontend can poll progress

One thing that matters here: Grapheon uses separate release tags for backend and frontend — backend-v0.8.7 and frontend-v0.9.1, for example. The two components version independently, so the update check has to handle each one:

from typing import Optional

def _extract_latest_versions(releases: list[dict]) -> tuple[Optional[str], Optional[str]]:
    """
    Extract the latest backend and frontend version tags from releases.
    Returns (backend_version, frontend_version) tuple.
    """
    backend_version = None
    frontend_version = None

    for release in releases:
        tag = release.get("tag_name", "")
        if release.get("prerelease", False):
            continue
        if tag.startswith("backend-v") and backend_version is None:
            backend_version = tag
        elif tag.startswith("frontend-v") and frontend_version is None:
            frontend_version = tag
        if backend_version and frontend_version:
            break

    return backend_version, frontend_version

The actual GitHub API call is straightforward — hit the releases endpoint, cache for an hour, fall back to stale cache if the API fails:

from typing import Optional

import httpx

CACHE_TTL_SECONDS = 3600  # 1 hour

async def _fetch_github_releases() -> Optional[list[dict]]:
    async with httpx.AsyncClient() as client:
        response = await client.get(
            "https://api.github.com/repos/BadgerOps/grapheon/releases",
            timeout=10.0,
        )
        response.raise_for_status()
        return response.json()
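
The excerpt above leaves out the caching layer, so here's a minimal sketch of the "cache for an hour, fall back to stale" idea. This is not Grapheon's actual code: the ReleaseCache class, its get() method, and the injectable clock are hypothetical names, and it's synchronous to keep the example short (the real fetch is async).

```python
import time
from typing import Callable, Optional

CACHE_TTL_SECONDS = 3600  # 1 hour, same TTL as above


class ReleaseCache:
    """Hypothetical in-memory cache with stale fallback (illustration only)."""

    def __init__(self, ttl: float = CACHE_TTL_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl
        self._clock = clock
        self._value: Optional[list[dict]] = None
        self._fetched_at: Optional[float] = None

    def get(self, fetch: Callable[[], list[dict]]) -> Optional[list[dict]]:
        now = self._clock()
        if self._fetched_at is not None and now - self._fetched_at < self._ttl:
            return self._value  # still fresh, skip the API call entirely
        try:
            value = fetch()
        except Exception:
            # API failed: serve the stale copy (None if nothing was ever cached)
            return self._value
        self._value, self._fetched_at = value, now
        return value
```

The point of the design is that a GitHub outage degrades to "slightly old release info" rather than an error in the UI.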

The version comparison is tuple-based — parse "backend-v0.8.7" into (0, 8, 7), compare with the current version, done. The parser strips the backend-v or frontend-v prefix before splitting on dots.
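
To make the tuple comparison concrete, here's a rough sketch. The function names parse_version and is_newer are made up for this example and may not match what the backend actually calls them.

```python
import re

# Optional component prefix, optional "v", then dot-separated integers
_VERSION_RE = re.compile(r"(?:backend-|frontend-)?v?(\d+(?:\.\d+)*)")


def parse_version(tag: str) -> tuple[int, ...]:
    """Turn 'backend-v0.8.7' (or a bare '0.8.7') into (0, 8, 7)."""
    match = _VERSION_RE.fullmatch(tag.strip())
    if match is None:
        raise ValueError(f"unrecognised version tag: {tag!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def is_newer(candidate: str, current: str) -> bool:
    """True when candidate is a strictly higher version than current."""
    return parse_version(candidate) > parse_version(current)
```

Comparing integer tuples instead of strings matters once a component hits two digits: as strings, "0.10.0" sorts before "0.9.9", but (0, 10, 0) > (0, 9, 9) does the right thing.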

One fun bug I hit early on: the frontend version kept coming back as None because the update check was only reading the FRONTEND_VERSION environment variable, which isn't always set. The fix was adding a _detect_frontend_version() function that checks the env var first, then falls back to reading frontend/package.json. Small thing, but it meant the update check was silently skipping the frontend comparison entirely.
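
The env-var-then-package.json fallback looks roughly like this. Treat it as a sketch under assumptions: the real _detect_frontend_version() may differ, and the default path and the "version" key lookup here are my guesses at a typical layout.

```python
import json
import os
from pathlib import Path
from typing import Optional


def detect_frontend_version(
    package_json: Path = Path("frontend/package.json"),  # assumed location
) -> Optional[str]:
    """Prefer FRONTEND_VERSION from the environment, else read package.json."""
    env_value = os.environ.get("FRONTEND_VERSION", "").strip()
    if env_value:
        return env_value
    try:
        with package_json.open(encoding="utf-8") as handle:
            data = json.load(handle)
    except (OSError, json.JSONDecodeError):
        return None  # no file or unreadable JSON: the caller must handle None
    version = data.get("version") if isinstance(data, dict) else None
    return version if isinstance(version, str) and version else None
```

Whatever the real implementation looks like, the caller should treat None as "unknown" and say so, rather than quietly skipping the comparison.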

Backend: Triggering an Upgrade

When a user clicks "Upgrade now" in the UI, the backend writes a JSON file to /data/upgrade-requested. The key evolution from the initial version is that it now writes separate version fields for backend and frontend:

@router.post("/upgrade")
async def trigger_upgrade():
    # Refuse to start a second upgrade while one is still running
    status_file = os.path.join(DATA_DIR, "upgrade-status.json")
    if os.path.exists(status_file):
        with open(status_file) as f:
            status_data = json.load(f)
        if status_data.get("status") == "running":
            raise HTTPException(status_code=409, detail="An upgrade is already in progress")

    # Extract separate backend/frontend target versions
    # (`releases` is the release list fetched earlier in the handler; elided here)
    target_backend_version, target_frontend_version = _extract_latest_versions(releases)

    upgrade_request = {
        "requested_at": datetime.utcnow().isoformat() + "Z",
        "current_version": settings.APP_VERSION,
        "target_backend_version": target_backend_version,
        "target_frontend_version": target_frontend_version,
    }
    with open(os.path.join(DATA_DIR, "upgrade-requested"), "w") as f:
        json.dump(upgrade_request, f, indent=2)

That /data directory is a bind mount from the host's /srv/grapheon/data. So when the container writes upgrade-requested, it appears on the host filesystem. And that's where systemd picks it up.

Writing separate version fields fixed a real problem — when backend was at v0.8.6 and frontend was at v0.9.1, the old code would try to pull grapheon-frontend:v0.8.6, which didn't exist. "Manifest unknown" errors at 11pm are not fun.

The React frontend has two places where updates surface. First, an UpdateBanner component that lives at the top of the app and polls /api/updates every 60 minutes:

const POLL_INTERVAL = 60 * 60 * 1000; // 60 minutes

useEffect(() => {
    checkForUpdatesHandler();
    pollIntervalRef.current = setInterval(checkForUpdatesHandler, POLL_INTERVAL);
    return () => clearInterval(pollIntervalRef.current);
}, []);

When an update is available, you get a blue gradient banner with "What's new" (expands to show release notes) and "Upgrade now." The banner is dismissible per-version via localStorage, so it won't nag you if you choose to skip a release.

The upgrade flow goes through a few states: null → confirm → in_progress → completed (auto-refresh after 3 seconds) or error (with retry). What changed since the first version is the in_progress state — it now shows a step-by-step progress timeline instead of just a spinner:

statusPollIntervalRef.current = setInterval(async () => {
    const statusResponse = await getUpgradeStatus();

    if (statusResponse.status === 'running') {
        // upgradeProgress is now a structured object
        setUpgradeProgress({
            message: statusResponse.message,
            step: statusResponse.step,
            totalSteps: statusResponse.total_steps,
            progress: statusResponse.progress,
        });
    } else if (statusResponse.status === 'completed') {
        clearInterval(statusPollIntervalRef.current);
        setUpgradeStep('completed');
        setTimeout(() => window.location.reload(), 3000);
    } else if (statusResponse.status === 'failed') {
        clearInterval(statusPollIntervalRef.current);
        setUpgradeStep('error');
    }
}, 5000);

The UI renders an animated progress bar showing completion percentage, a "Step N/5" counter, and a visual timeline with checkmarks for completed steps, a pulsing dot for the active step, and dimmed indicators for pending steps. It's a much better experience than staring at "Upgrading..." and wondering if anything is happening.

Second, there's a "Check for Updates" button on the Settings page with a modal that shows version comparison (with separate badges for UI and API versions), release notes, release date, and a GitHub link. Same upgrade flow, just triggered manually instead of by the polling interval.

The NixOS Side

This is where the real fun begins. The NixOS module (grapheon.nix) manages everything: the Podman network, both containers, a cloudflared tunnel, authentication credentials, and the auto-update machinery.

Dynamic Tags: The Version State File

The first thing I had to sort out was how to avoid hardcoding image tags in the nix config. If the NixOS module says ghcr.io/badgerops/grapheon-backend:v0.1.0, then that's what systemd starts, and you'd need a nixos-rebuild to change it. That defeats the whole point of an auto-updater.

The solution is a version state file. The nix module defines a defaultTag (used only on first boot), but after that, everything reads from /srv/grapheon/data/current-tag:

let
  defaultTag = "v0.3.0";
  versionFile = "${dataDirDb}/current-tag";

  readTag = ''
    if [ -f "${versionFile}" ]; then
      GRAPHEON_TAG="$(${pkgs.coreutils}/bin/cat "${versionFile}" | ${pkgs.coreutils}/bin/tr -d '[:space:]')"
    else
      GRAPHEON_TAG="${defaultTag}"
    fi
  '';

Each container service uses a wrapper script instead of inlining the podman run command. The wrapper reads the tag at start time:

backendStartScript = pkgs.writeShellScript "grapheon-backend-start" ''
    set -euo pipefail
    ${readTag}
    echo "Starting grapheon-backend with tag: $GRAPHEON_TAG"
    exec ${pkgs.podman}/bin/podman run \
      --rm \
      --name=grapheon-backend \
      --network=${grapheonNetwork} \
      --network-alias=grapheon-backend \
      --hostname=grapheon-backend \
      -v ${dataDirDb}:/data:Z \
      --env-file ${grapheonAuthEnvFile} \
      -e DATABASE_URL=sqlite:////data/network.db \
      -e APP_NAME=Grapheon \
      -e AUTH_ENABLED=True \
      -e ENFORCE_AUTH=True \
      -e JWT_ALGORITHM=HS256 \
      -e JWT_EXPIRATION_MINUTES=60 \
      --label io.containers.autoupdate=registry \
      ${backendImageBase}:$GRAPHEON_TAG
'';

Notice there's no -e APP_VERSION being injected. That was actually a bug in my first iteration — the nix config was passing the hardcoded version as an env var, which overrode whatever version the container image had baked in. The backend's config.py already reads a VERSION file from inside the container, so we just let it do its thing.

Also notice the auth-related environment variables and --env-file — Grapheon picked up OIDC and local admin authentication along the way, and those credentials get injected from a separate env file on the host that the NixOS activation script creates on first deploy.

The Auto-Update Script (Daily Timer)

The NixOS module has an inline auto-update script for the daily timer that handles the GHCR query and image pull:

autoUpdateScript = pkgs.writeShellScript "grapheon-auto-update" ''
    set -euo pipefail

    # GHCR requires an anonymous bearer token even for public images
    token="$(${pkgs.curl}/bin/curl -fsSL \
      'https://ghcr.io/token?scope=repository:badgerops/grapheon-backend:pull' \
      | ${pkgs.jq}/bin/jq -r '.token' \
    )"

    latest_tag="$(${pkgs.curl}/bin/curl -fsSL \
      -H "Authorization: Bearer $token" \
      https://ghcr.io/v2/badgerops/grapheon-backend/tags/list \
      | ${pkgs.jq}/bin/jq -r '.tags[]' \
      | ${pkgs.gnugrep}/bin/grep -E '^v[0-9]+\.[0-9]+\.[0-9]+$' \
      | ${pkgs.coreutils}/bin/sort -V \
      | ${pkgs.coreutils}/bin/tail -n1 \
    )"

    # Compare against what we're running
    ${readTag}
    if [ "$GRAPHEON_TAG" = "$latest_tag" ]; then
      echo "Already running $latest_tag — nothing to do"
      exit 0
    fi

    # Pull both images
    ${pkgs.podman}/bin/podman pull ${backendImageBase}:$latest_tag
    ${pkgs.podman}/bin/podman pull ${frontendImageBase}:$latest_tag

    # Persist the new tag — services read this on next start
    ${pkgs.coreutils}/bin/echo "$latest_tag" > "${versionFile}"

    # Restart picks up the new tag from the state file
    systemctl restart podman-grapheon-backend.service
    systemctl restart podman-grapheon-frontend.service
'';

Note the fully-qualified Nix store paths (${pkgs.curl}/bin/curl instead of bare curl) — this is how NixOS scripts ensure they use the exact versions of tools declared in the system configuration, not whatever happens to be on $PATH.
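
If the grep -E / sort -V / tail -n1 pipeline in that script reads as a bit dense, here's the same selection logic sketched in Python. The function name latest_release_tag is hypothetical; the deployment itself only uses the shell version.

```python
import re
from typing import Optional

# Same shape the shell pipeline filters for: vMAJOR.MINOR.PATCH, nothing else
_RELEASE_TAG = re.compile(r"v(\d+)\.(\d+)\.(\d+)")


def latest_release_tag(tags: list[str]) -> Optional[str]:
    """Return the highest vX.Y.Z tag, or None if no tag matches."""
    best: Optional[tuple[tuple[int, ...], str]] = None
    for tag in tags:
        match = _RELEASE_TAG.fullmatch(tag)
        if match is None:
            continue  # skips "latest", pre-release suffixes, prefixed tags
        key = tuple(int(part) for part in match.groups())
        if best is None or key > best[0]:
            best = (key, tag)
    return best[1] if best else None
```

For example, given ["v0.9.0", "latest", "v0.10.0"] this picks "v0.10.0", which is what sort -V does and what a plain lexical sort would get wrong.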

The Upgrade Handler Script (grapheon-upgrade.sh)

The in-app upgrade trigger runs through a more evolved path than the daily auto-update. The Grapheon repo includes a standalone scripts/grapheon-upgrade.sh that implements a five-step upgrade process with granular status reporting:

#!/usr/bin/env bash
# grapheon-upgrade.sh — Host-level upgrade watcher script
set -euo pipefail

DATA_DIR="${DATA_DIR:-/data}"
REQUEST_FILE="${DATA_DIR}/upgrade-requested"
STATUS_FILE="${DATA_DIR}/upgrade-status.json"
BACKUP_DIR="${DATA_DIR}/backups"
HEALTH_URL="http://localhost:8000/api/health"
TOTAL_STEPS=5

The script reads separate backend and frontend versions from the trigger file, with backward-compat fallback to the old single target_version field:

read -r BACKEND_VERSION FRONTEND_VERSION < <(python3 -c "
import json, sys
try:
    data = json.load(open('${REQUEST_FILE}'))
    bv = data.get('target_backend_version', data.get('target_version', ''))
    fv = data.get('target_frontend_version', bv)
    print(bv, fv)
except Exception as e:
    print('', file=sys.stderr)
    sys.exit(1)
" 2>/dev/null || echo "")

Then it runs through five steps, writing structured JSON status after each one so the frontend can track progress:

Step 1: Back up data. Before touching anything, tar up the SQLite database, WAL, config, and env files to /data/backups/grapheon-backup-YYYY-MM-DD-HHMMSS.tar.gz. If the upgrade goes sideways, you've got a snapshot.

Step 2: Pull backend image. podman pull ghcr.io/badgerops/grapheon-backend:v${BACKEND_VERSION} with a 5-minute timeout.

Step 3: Pull frontend image. podman pull ghcr.io/badgerops/grapheon-frontend:v${FRONTEND_VERSION} — pulled separately with its own version tag.

Step 4: Restart services. systemctl restart both containers.

Step 5: Health check. Curl the /api/health endpoint every second for up to 30 seconds. If it never responds, the upgrade is marked as failed.
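
The step 5 loop is the part that's easiest to get subtly wrong, so here's a sketch of it in Python. The real script does this with curl in bash; the names http_probe and wait_until_healthy are invented for the example, and the 2-second per-request timeout is an assumption.

```python
import time
from http.client import HTTPException
from typing import Callable
from urllib.request import urlopen

HEALTH_URL = "http://localhost:8000/api/health"  # same URL the shell script uses


def http_probe(url: str = HEALTH_URL, timeout: float = 2.0) -> bool:
    """One health probe: True only for a 2xx response."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, HTTPException):
        # Connection refused, timeout, HTTP error status, or a garbled reply
        return False


def wait_until_healthy(
    probe: Callable[[], bool] = http_probe,
    attempts: int = 30,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Probe once per `delay` seconds, giving up after `attempts` tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)  # no point sleeping after the final failed attempt
    return False
```

A False return here is what should flip the status file to "failed" instead of "completed".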

Each step writes a status update that includes step progress:

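write_status() {
    local status="$1" step="$2" msg="$3"
    local progress=$(( (step * 100) / TOTAL_STEPS ))
    [[ "${status}" == "completed" ]] && progress=100
    # Fields mirror what the frontend polls: status, step, total_steps, progress, message
    cat > "${STATUS_FILE}" <<EOF
{"status": "${status}", "step": ${step}, "total_steps": ${TOTAL_STEPS}, "progress": ${progress}, "message": "${msg}"}
EOF
}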


The GHCR Authentication Saga

This one bit me. My first version of the script just curled the GHCR v2 API directly:

curl -fsSL https://ghcr.io/v2/badgerops/grapheon-backend/tags/list

401 Unauthorized. Even though the images are public.

Turns out GHCR implements the Docker Registry v2 authentication spec, which requires a token exchange even for anonymous access to public repositories. You have to:

Hit https://ghcr.io/token?scope=repository:OWNER/REPO:pull to get an anonymous bearer token
Pass that token as Authorization: Bearer $token on subsequent API calls
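
Here are those two steps sketched in Python with only the standard library. The helper names (token_url, tags_request, list_tags) are hypothetical; the URLs and the "token" and "tags" JSON fields are the same ones the shell script above relies on.

```python
import json
from urllib.request import Request, urlopen

REGISTRY = "https://ghcr.io"


def token_url(repository: str) -> str:
    """Step 1: where to ask for an anonymous pull token."""
    return f"{REGISTRY}/token?scope=repository:{repository}:pull"


def tags_request(repository: str, token: str) -> Request:
    """Step 2: the tags/list request, carrying the bearer token."""
    return Request(
        f"{REGISTRY}/v2/{repository}/tags/list",
        headers={"Authorization": f"Bearer {token}"},
    )


def list_tags(repository: str, timeout: float = 10.0) -> list[str]:
    """Do the token exchange, then list tags (needs network access)."""
    with urlopen(token_url(repository), timeout=timeout) as response:
        token = json.load(response)["token"]
    with urlopen(tags_request(repository, token), timeout=timeout) as response:
        return json.load(response)["tags"]
```

Calling list_tags("badgerops/grapheon-backend") should behave like the two curl calls in the auto-update script, assuming the repository stays public.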

One of those things that makes total sense in retrospect but is completely non-obvious when you're staring at a 401 from a public registry at 11pm.

Systemd Path Unit: The Glue

The bridge between "container wrote a file" and "host pulls new images" is a systemd path unit:

systemd.paths.grapheon-upgrade-trigger = {
    description = "Watch for Grapheon in-app upgrade request";
    wantedBy = [ "paths.target" ];
    pathConfig = {
        PathExists = "${dataDirDb}/upgrade-requested";
        Unit = "grapheon-upgrade-watcher.service";
    };
};

When /srv/grapheon/data/upgrade-requested appears, systemd activates the grapheon-upgrade-watcher service, which runs the upgrade handler. In the NixOS module, this is currently an inline wrapper that calls the auto-update script with status reporting:

upgradeHandlerScript = pkgs.writeShellScript "grapheon-upgrade-handler" ''
    set -euo pipefail

    STATUS_FILE="${dataDirDb}/upgrade-status.json"
    TRIGGER_FILE="${dataDirDb}/upgrade-requested"

    write_status() {
      ${pkgs.coreutils}/bin/echo "$1" > "$STATUS_FILE"
    }

    write_status "{\"status\":\"running\",\"message\":\"Pulling latest images from GHCR...\",\"started_at\":\"$(${pkgs.coreutils}/bin/date -Iseconds)\"}"

    # Remove trigger so the path unit re-arms
    ${pkgs.coreutils}/bin/rm -f "$TRIGGER_FILE"

    if ${autoUpdateScript}; then
      write_status "{\"status\":\"completed\",\"message\":\"Upgrade completed successfully.\",\"completed_at\":\"$(${pkgs.coreutils}/bin/date -Iseconds)\"}"
    else
      write_status "{\"status\":\"failed\",\"message\":\"Auto-update script exited with an error. Check journalctl -u grapheon-upgrade-watcher for details.\",\"completed_at\":\"$(${pkgs.coreutils}/bin/date -Iseconds)\"}"
    fi
'';

The next step is migrating this inline handler to call grapheon-upgrade.sh instead, which would bring the NixOS in-app upgrade path in line with the standalone script's five-step backup-pull-restart-healthcheck flow. For now, the daily timer uses the inline script (which queries GHCR for the latest tag), and the in-app path uses the wrapper above.

There's also a daily timer for unattended updates:

systemd.timers.grapheon-auto-update = {
    description = "Daily Grapheon auto-update check";
    wantedBy = [ "timers.target" ];
    timerConfig = {
        OnCalendar = "daily";
        Unit = "grapheon-auto-update.service";
        Persistent = true;
    };
};

This runs the auto-update script on a schedule, so even if nobody is looking at the UI, the deployment stays current.

Two Tag Schemes

One thing worth calling out: there are two different tag patterns in play.
GHCR container tags use a plain v prefix: v0.3.0, v0.8.7. The NixOS daily auto-update script queries these from the GHCR v2 tags API and filters for ^v[0-9]+\.[0-9]+\.[0-9]+$. Both backend and frontend containers get tagged with the same version when CI publishes them.
GitHub release tags use component prefixes: backend-v0.8.7, frontend-v0.9.1. The in-app update check queries these from the GitHub Releases API. Backend and frontend can version independently here — the frontend might be at v0.9.1 while the backend is at v0.8.7.
The in-app upgrade writes the separate versions to the trigger file, and the upgrade handler pulls each image with its correct tag. The daily auto-update uses a single GHCR tag for both. This works because CI publishes matching images to GHCR under the unified tag, even while GitHub releases track them separately.
The Full Picture

Here's the flow when someone clicks "Upgrade now":

Frontend calls POST /api/updates/upgrade
Backend writes /data/upgrade-requested with separate target_backend_version and target_frontend_version fields (bind-mounted from host)
Systemd path unit detects the file, activates the upgrade handler
Handler writes {"status":"running","step":1,"total_steps":5,"progress":20} — Step 1: Backing up data
Handler creates a tar.gz backup of the database and config files
Handler deletes the trigger file (re-arms the path unit)
Step 2: Pull backend image from GHCR with the backend version tag
Step 3: Pull frontend image from GHCR with the frontend version tag
Step 4: Restart both Podman services
Step 5: Health check — poll /api/health until it responds (up to 30s)
Handler writes {"status":"completed","step":5,"total_steps":5,"progress":100}
Frontend polls GET /api/updates/status, renders the progress bar and step timeline, sees "completed," auto-refreshes
User sees the new version. Hopefully.

If any step fails, the handler writes {"status":"failed","step":N} with a message, and the frontend shows the error with a retry button.
The entire IPC is two files on a shared volume. No message queues, no sockets, no D-Bus. Just a trigger file and a status file. The container doesn't need any special privileges, and the host-side scripts are pure NixOS — fully declarative, reproducible, and auditable.
Lessons Learned
GHCR auth is not optional. Even for public images, you need the token dance. Don't assume anonymous access works like Docker Hub.
File-based IPC is underrated. systemd path units are built for exactly this kind of thing. They're reliable, they re-arm automatically, and they require zero custom infrastructure.
A version state file beats hardcoded tags. Instead of pinning an image tag in your nix config and retagging at runtime (gross) or requiring a nixos-rebuild for every release, just store the current tag in a file on disk. The service wrapper reads it at start time, and the auto-update script writes it before restarting. The nix module's defaultTag is only there for first boot.
Don't override what the container already knows. My first pass had the nix config injecting -e APP_VERSION=0.1.0 into the container. The backend already reads its version from a VERSION file baked into the image, but the env var trumped it. So even after a successful upgrade to v0.3.0, the UI still showed 0.1.0. The fix was just... deleting the env var and letting the container report its own version.
Version your components independently. The backend and frontend don't always change in lockstep. Early on, I used a single version tag for both, which caused "manifest unknown" errors when they diverged. Now the trigger file carries target_backend_version and target_frontend_version separately, and the upgrade script pulls each with its own tag. The fallback to the old single target_version field keeps it backward-compatible.
Back up before you upgrade. The pre-upgrade backup was added after a close call. It's a tar.gz of the SQLite database, WAL file, and config — takes milliseconds and gives you a rollback point. Cheap insurance.
Health checks close the loop. The original version would restart the services and call it done. But "services restarted" doesn't mean "services healthy." Adding a 30-second health check loop that polls /api/health catches startup failures that would otherwise silently leave the app down.
Show progress, not just status. The first version just showed "Upgrading..." with a spinner. Now the frontend renders a progress bar, a step counter ("Step 3/5"), and a visual timeline with checkmarks. Users don't wonder if it's stuck anymore.
Cache with a fallback. The 1-hour cache on GitHub API responses means we're not rate-limited, and falling back to stale cache on API failure means the update check never hard-fails from the user's perspective.
Status polling is fine. I considered WebSockets for the upgrade progress, but polling every 5 seconds is plenty responsive for an operation that takes 30-60 seconds. Keep it simple.
The whole thing has grown from the initial implementation to roughly 500 lines of Python, 550 lines of React, 170 lines of standalone shell script, and 300 lines of Nix. Not bad for a feature that means I never have to SSH in to deploy again.
This grew from just a whim 'hey, how can I make this auto-update' to 'hey, it would be cool if I could do in-app updates' to the current iteration. I think I'll create a git repo & blog post around how to implement this in your application, for ease of reference.
All for now,
-BadgerOps



Introducing Graphēon
BadgerOps — Mon, 09 Feb 2026 15:31:43 GMT
Long time, no post! (I know, I know.)
So. You're a blue teamer, or a red teamer, or maybe just the person in the room who got volun-told to "map out the network." You fire up nmap, run some scans, maybe throw in some arp -a and netstat output for good measure. And then... you stare at a wall of XML, CSV, and text output trying to mentally correlate which hosts talk to which other hosts, what services are where, and how it all fits together.
Sound familiar?
The Conversation
This project started, as many things do, with a conversation with a co-worker. We were talking about the struggle of building a network map when you're enumerating a new environment. Whether you're on the defensive side trying to understand what you're protecting, or on the offensive side trying to figure out what's interesting - the problem is the same:
You have multiple tools generating multiple outputs in multiple formats, and somehow you need to correlate all of that into something resembling a coherent picture of the network.
The typical workflow looks something like this:
Run nmap scans
Maybe grab some netstat output from hosts you have access to
Throw in some arp tables
Possibly a traceroute or two
Open up Visio, draw.io, or (even better?) a whiteboard
Manually start connecting dots
And by "connecting dots" I mean squinting at IP addresses across 14 terminal tabs and praying you don't accidentally mistype 192.168.1.14 as 192.168.1.41 in your diagram.
Has anyone actually enjoyed this process? Ever?
Enter Graphēon
Graphēon is a tool designed to help quickstart the network enumeration process using standard tooling and correlation. The idea is simple: you feed it the output from tools you're already using - nmap, netstat, arp, ping, traceroute, pcap - and it normalizes, tags, and correlates that data into an interactive network graph.
No more copy-pasting IP addresses between terminal windows. No more manually drawing boxes in Visio at 2am.
The stack is FastAPI + SQLite on the backend and Vite + React on the frontend. Python 3.12. Nothing exotic, nothing that requires a cluster of 47 microservices to deploy.
What it does
Ingests scan outputs from nmap, netstat, arp, ping, traceroute, and pcap files
Normalizes the data - because every tool has its own special way of reporting the same information
Tags entities and correlates related hosts across different scan sources
Visualizes the resulting topology as an interactive network graph
Exports to GraphML (for Gephi, yEd, Cytoscape) or draw.io format
That last point is important. Graphēon isn't trying to replace your favorite graph tool. It's trying to get you from "pile of scan data" to "usable network map" as fast as possible, and then let you take that map wherever you need it.
Why the name?
Naming is hard. So I asked my good friend Claude for some help, after throwing a bunch of ideas at it. The name evokes graphing and mapping. The project fuses disparate network signals into a coherent graph of hosts, edges, and topology. Also, it sounds cool and the domain wasn't taken. (Priorities.)
Getting Started
Graphēon runs as two Docker containers - a backend and a frontend. The frontend proxies /api requests to the backend. Deployment is pretty straightforward:
# Pull images
docker pull ghcr.io/badgerops/grapheon-backend:latest
docker pull ghcr.io/badgerops/grapheon-frontend:latest

# Run backend
docker run -d --name grapheon-backend \
  -p 8000:8000 \
  -v grapheon-data:/app/data \
  -e JWT_SECRET="$(openssl rand -hex 32)" \
  -e LOCAL_ADMIN_USERNAME=admin \
  -e LOCAL_ADMIN_EMAIL=admin@example.com \
  -e LOCAL_ADMIN_PASSWORD=changeme \
  ghcr.io/badgerops/grapheon-backend:latest

# Run frontend
docker run -d --name grapheon-frontend \
  -p 8080:8080 \
  --link grapheon-backend:grapheon-backend \
  ghcr.io/badgerops/grapheon-frontend:latest
Hit http://localhost:8080 and you're in business.
(And yes, please change the default password. I've learned a thing or two about hard-coded creds.)
It also supports OIDC authentication with Okta, Google, GitHub, GitLab, and Authentik if you want proper multi-user RBAC. Check out the docs/auth_provider.md in the repo for that setup.
Current State & What's Next
Graphēon is currently at v0.8.x - it's usable, it's useful, but it's not "done" (is any project ever done?). There are open issues and plenty of room for improvement.
If you're someone who regularly deals with network enumeration - whether as a pen tester, SOC analyst, incident responder, or that one infra person who inherited a network with zero documentation - give it a spin. File issues. Submit PRs. Tell me what's broken.
The project is BSD-2-Clause licensed, because sharing is caring.
The TL;DR
Network enumeration produces a lot of data from a lot of tools. Graphēon takes that data and turns it into a network graph so you can stop playing "human correlator" and start actually analyzing your network.
Check it out: https://github.com/BadgerOps/grapheon
-BadgerOps


Decoding Kubernetes Secrets with jq
BadgerOps — Wed, 08 Jan 2025 03:48:03 GMT
It's been a while since I've posted, and I generally post about things I've learned or that have helped me in my day-to-day role. This is a quick blog post about easily decoding base64-encoded secrets in Kubernetes, using kubectl and jq.
Before diving into decoding Kubernetes secrets, let's set up a local development environment using kind (Kubernetes in Docker). If you haven't used kind before, it's a fantastic tool for local Kubernetes development. Here's how to create a basic cluster:
kind create cluster
That's it! Once your cluster is ready (usually takes about a minute), you can verify it's working:
kubectl cluster-info --context kind-kind
Now, let's explore how to work with Kubernetes secrets. We'll create a simple secret and learn different ways to decode it.
First, let's create a secret with a few key-value pairs:
❯ kubectl create secret generic decode-example \
  --from-literal=key1=value1 \
  --from-literal=key2=value2 \
  --from-literal=key3=value3
secret/decode-example created
When we examine this secret using kubectl get secret with JSON output, we can see our values are base64 encoded:
❯ kubectl get secret/decode-example -ojson | jq 
{
  "apiVersion": "v1",
  "data": {
    "key1": "dmFsdWUx",
    "key2": "dmFsdWUy",
    "key3": "dmFsdWUz"
  },
  "kind": "Secret",
  "metadata": {
    "creationTimestamp": "2025-01-07T18:06:56Z",
    "name": "decode-example",
    "namespace": "default",
    "resourceVersion": "674",
    "uid": "d0369e58-089c-4ea3-8998-4c931b904ef2"
  },
  "type": "Opaque"
}
Here's where things get fun. We can use jq to decode these values in several ways, depending on what information we need:
For a simple list of decoded values:
❯ kubectl get secret/decode-example -ojson | jq -r '.data | to_entries | .[] | .value | @base64d'
value1
value2
value3
If you need specific keys with their decoded values (notice we're only selecting key1 and key3):
❯ kubectl get secret/decode-example -ojson | jq '.data | {key1: (.key1 | @base64d), key3: (.key3 | @base64d)}'
{
  "key1": "value1",
  "key3": "value3"
}
But the ez-mode approach to get all keys and values decoded in a clean JSON format is to use map_values() as seen here:
❯ kubectl get secret/decode-example -ojson | jq '.data | map_values(@base64d)'
{
  "key1": "value1",
  "key2": "value2",
  "key3": "value3"
}
This last command is useful because it preserves the structure while giving us human-readable values, and it avoids listing every key by hand as in the previous example.
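If you're curious what map_values(@base64d) is doing under the hood, here's the same transformation sketched in Python. The function name decode_secret_data is made up for this example, and the sample JSON is a trimmed copy of the secret above.

```python
import base64
import json


def decode_secret_data(secret: dict) -> dict[str, str]:
    """Base64-decode every value under .data, keeping the keys as-is."""
    return {
        key: base64.b64decode(value).decode("utf-8")
        for key, value in secret.get("data", {}).items()
    }


# Trimmed version of the `kubectl get secret ... -ojson` output shown earlier
secret_json = '{"data": {"key1": "dmFsdWUx", "key2": "dmFsdWUy", "key3": "dmFsdWUz"}}'
print(json.dumps(decode_secret_data(json.loads(secret_json)), indent=2))
```

In practice the jq one-liner is still the quicker option at a terminal; this is only here to show that there's no magic involved.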
By leveraging kubectl and jq together, we can quickly decode and inspect Kubernetes secrets without needing additional tools or complex scripts. Pretty neat, right?
Don't forget to clean up your kind cluster when you're done:
kind delete cluster
-BadgerOps


CVE's
BadgerOps — Mon, 22 Apr 2024 18:13:35 GMT
CVE's. Gamified? Maybe. Useful? Maybe. Fast-becoming-too-complex to manage? I think so.
But. This is currently one of the best ways of unified reporting & alerting of vulnerabilities to a wide audience. There's certainly room for improvement.
Recently, I got an inside view on the process.
My current ${DAY_JOB} is heavily a Red Hat shop. We're using quite a few of their offerings, including Mirror Registry, which is a packaged single-node Quay instance for hosting containers in disconnected environments.
Is Quay pronounced "Kee" or Kway? Ancient scholars maintain the meaning was lost long ago...
I was spinning up several Mirror Registry deployments across a few disconnected environments, when I realized something.
They all had identical CSRF SECRET_KEY values
They all had literally password for Postgres and Redis (hey, I used literally correctly!)
They all had identical Database SECRET_KEY values
This... is not optimal.
So. I did 2 things. Well, 3. But let's talk about the first two.
I emailed Red Hat Security secalert@redhat.com per their documented procedures on 23 Feb 2024
I prepared a PR to resolve the issue, and submitted it after coordinating with Red Hat Security.
The initial email exchange went smoothly and rapidly, then... things languished. No response for 1.5 months, and I finally... well, I finally resorted to the method used by everybody who gets fed up with waiting. I tweeted angrily and wouldn't ya know, 2 hours later we had forward progress.
And, surprisingly, to me at least - they issued 4 CVE's for these reported issues.
CVE-2024-3622
CVE-2024-3623
CVE-2024-3624
CVE-2024-3625
To be honest, I wasn't looking for CVE's, and I'm not a bug bounty enthusiast (it doesn't look like Red Hat even participates in any!). I'm just a server wrangler who doesn't like hard-coded passwords.
So, it was fun to contribute back to a project that has provided value.
It was cool to be on the submitting end of a security report (for once)
And then, uh... Well, I did say 3 things, right? So:
 3: I helpfully included a bug so .... another PR to fix that issue.
At the end of the day, some people love CVE's, some people hate them and some just like calculating CVSS's way too much. They're a tool, a method for cataloguing and communicating Security vulnerabilities in a (mostly) consistent manner.
Use the tools we have available, and patch those bugs!
(And please, please don't hard-code creds...)
-BadgerOps


Dachau
BadgerOps — Tue, 19 Mar 2024 14:00:47 GMT
Last weekend, we traveled to the Alpine town of Garmisch, explored the beautiful Neuschwanstein Castle, and on our return trip to Stuttgart, visited the Dachau memorial.
Words cannot express what we experienced while walking through the memorial site.
The utter disregard for humanity.
The engineered, optimized methodology for corralling, oppressing, destroying an entire group of people was evident. This was not just thrown together.
How does this happen? In retrospect, it is easy to see the slow fade that rolled into a massive swing towards something so despicable it is impossible to dwell on it without an overwhelming sense of despair.
How do we stop from continuously repeating this type of behavior? Why do humans inevitably generate "reasons" to attack, oppress, destroy others?
What can I do to help? Individually, not much - not on a global scale. Together? Still, not much. We're up against geopolitical structures that simply do not want or care to actually do the morally right things. It doesn't generate power or money - but maybe this is just me being jaded, cynical and using those excuses to decouple my emotions from my observations.
I don't have answers. But I am more thoughtful than I was yesterday, and with that maybe am better equipped to be a positive force in this world. Is the phrase "Be the change you wish to see in the world" overused? Maybe, again, from a cynical point of view.
Choose not to be cynical, choose not to be jaded. Choose to care.
So.
What am I going to do about this? I don't know. Focus more on being a better Husband, Father, Brother, Friend for one. Look for opportunities to do something outside myself for other people. I can't stop the bad, but I can spread good.
Be the change you wish to see in the world.


Keycloak & Open Shift
BadgerOps — Sat, 03 Feb 2024 15:06:07 GMT
Hi there!
So. You're running Open Shift Container Platform 4.12+ and you're wanting to deploy that shiny new Red Hat Keycloak Operator (v22) and set up Oauth from Keycloak into Open Shift. 
How do you deploy Keycloak as an IDP for Open Shift? The magic words being "Configure Keycloak as an IDP for Open Shift" [hashtag seo]. 
Well, let us talk about that.
Is it straightforward? Sort of.
Well documented? No.
Let's fix that.
First off, some assumptions:
Let's assume you are paying for Red Hat and have access to the Operators. I'll add some "here's how to do it without operators" options, but I'm 99% sure that if you're running Open Shift, you're in the RH ecosystem.
Let's assume you're deploying a standalone PostgreSQL DB, or cluster - I used the excellent Percona Operator for PostgreSQL. NOTE: The "Red Hat Certified" version of the operator is 2.3.1 as of this writing.
Finally, I'm going to assume that you know how to click "install" on the operator page, so I'm not going to walk you through that step by step. 
If you don't want all the preamble, skip down to the "Configuring Oauth & Groups" section because that is what you're likely stuck on.
Create yourself a namespace, er, I mean project - I used keycloak since I am a super creative individual. In that namespace, deploy your PostgreSQL DB in whichever manner you choose - I am using the above-mentioned Percona Operator, and took all the defaults when deploying the PerconaPGCluster using the "create PerconaPGCluster" button.
Go grab a cup of coffee, tea or $beverage, it'll take a few minutes for everything to deploy.
If you're not into the operator/PostgreSQL cluster thing, then you can just deploy an ephemeral PostgreSQL DB following the Keycloak.org guide here 
Next, install the Red Hat build of Keycloak v22 operator into the namespace.
After that install is complete, we'll deploy the Keycloak instance.
Note: as of the time of this writing, the operator's default YAML is incorrectly structured. I pushed a bugfix to the upstream GitHub repo to fix the structure, but it hasn't made it down to the Red Hat build yet.
Wrong structure:
kind: Keycloak
apiVersion: k8s.keycloak.org/v2alpha1
metadata:
  name: example-keycloak
  labels:
    app: sso
  namespace: keycloak
spec:
  instances: 1
  hostname: example.org
  tlsSecret: my-tls-secret
Correct structure:
kind: Keycloak
apiVersion: k8s.keycloak.org/v2alpha1
metadata:
  name: example-keycloak
  labels:
    app: sso
  namespace: keycloak
spec:
  instances: 1
  hostname: 
    hostname: example.org
  http:
    tlsSecret: my-tls-secret
In the wrong structure, both the hostname and tlsSecret fields are incorrectly nested (hostname needs to sit under a hostname block, and tlsSecret under http), which will result in a failed Keycloak instance deployment.
Here is a screenshot of correctly configured yaml for my Keycloak deployment:
Now, this is actually an incorrect deployment! I'm going to dig into that in another blog post since it is a separate issue. The short version: re-using the *.apps.. certificate and ingress results in a weird issue where console.apps.. traffic sometimes gets sent to the Keycloak service/pod, because HTTP/2 connection reuse gets confused between the console route and the Keycloak route.
Also, as annotated in https://docs.openshift.com/container-platform/4.12/networking/ingress-operator.html#nw-http2-haproxy_configuring-ingress we see the 'correct' workaround is to have a completely separate certificate used:
To enable the use of HTTP/2 for the connection from the client to HAProxy, a route must specify a custom certificate. A route that uses the default certificate cannot use HTTP/2. This restriction is necessary to avoid problems from connection coalescing, where the client re-uses a connection for different routes that use the same certificate.
That said, it is not immediately obvious where http/2 is set to be default for Open Shift 4.12, or the Keycloak ingress itself. :shrug: this one took quite a while to track down and figure out, since the symptom was traffic randomly getting sent to the wrong pod (Keycloak). 
Ok, moving on.
Now that your Keycloak instance is deployed into your namespace, grab the initial admin username and password:
❯ oc -n keycloak get secret example-keycloak-initial-admin -o jsonpath='{.data.username}' | base64 --decode ; echo

admin    

❯ oc -n keycloak get secret example-keycloak-initial-admin -o jsonpath='{.data.password}' | base64 --decode ; echo

                                  
If you're not familiar with Keycloak, it can be quite complicated to get going initially (which is the point of this blog post), so you'll want to refer to the Keycloak Server Admin docs, specifically the Realm creation/configuration section. You don't want to use the master Realm for your internal apps. I created a new realm called badgerlab to create my Open Shift client.
Now, create a new client in that new realm:
Turn on Client Authentication!
Set a redirect URI, and Web Origin
One huge piece of confusion that I had was which redirect URIs would be required for this - again, something that is not well documented, and an area I am unfortunately not very familiar with.
https://oauth-openshift.apps../oauth2callback/*
That's it. That's the one redirect URI you need.
The Web Origin URL should be https://oauth-openshift.apps..
Now that your client is created you can create users and groups! For clarity, I'm creating a single user and a single group.
Cool, we're all set on the Keycloak side now! (Ok, mostly - that is a teeny white lie, but let's let it play out)
Setup on the Open Shift side is more straightforward, we can follow along with the docs to configure what we created on the Keycloak side.
There are a couple of ways to configure an oauth provider on the Open Shift side, but if you're logged in as kubeadmin you'll have a nice blue banner with a convenient hyperlink to click
Or, you can go to Administration -> Cluster Settings -> Configuration and search for 'Oauth'
Either way, add a new OpenID connect Identity Provider (IDP)
Optionally, you can create a secret in the openshift-config namespace, with clientSecret as the key, and your client secret from Keycloak as the value, then use the following yaml structure to manually create an Oauth config:
spec:
  identityProviders:
    - mappingMethod: claim
      name: openid  # whatever name you choose!
      openID:
        claims:
          email:
            - email
          groups:
            - groups
          name:
            - name
          preferredUsername:
            - preferred_username
        clientID: openshift
        clientSecret:
          name: 
        extraScopes: []
        issuer: 'https://keycloak.apps../realms/'
      type: OpenID
Note: the issuer is just the URL to your Keycloak realm. You can easily find it by going to realm settings in Keycloak, then clicking the 'OpenID endpoint configuration' hyperlink, which returns the full .well-known/openid-configuration URL. Remove the .well-known/openid-configuration suffix and use the rest of the URL, with no trailing slash, as the issuer that Open Shift needs.
Ok, now that is all configured, you should be able to either log out of kubeadmin, or open an incognito tab/new browser window (recommended - keep your kubeadmin session open!) to test this out!
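If you'd rather do those two bits from a terminal, here is a minimal sketch. The function names, the example hostname and the realm name are all made up for illustration. The oc create secret call follows the usual pattern for putting a clientSecret key into the openshift-config namespace, and the secret name is whatever you reference under clientSecret.name in the config above.

```shell
# Sketch only: function names, hosts and realm names are placeholders.

# Derive the issuer from the discovery URL that Keycloak's
# 'OpenID endpoint configuration' link returns.
issuer_from_discovery_url() {
  local url="$1"
  url="${url%/.well-known/openid-configuration}"  # drop the discovery suffix
  printf '%s\n' "${url%/}"                        # and any trailing slash
}

# Store the Keycloak client secret where the OAuth config can reference it.
# $1 = secret name (your choice), $2 = client secret copied from Keycloak
create_client_secret() {
  local secret_name="$1"
  local client_secret="$2"
  oc create secret generic "$secret_name" \
    --from-literal=clientSecret="$client_secret" \
    -n openshift-config
}

# Example (hypothetical host and realm):
# issuer_from_discovery_url "https://keycloak.example.com/realms/myrealm/.well-known/openid-configuration"
```

The first function is plain string trimming, so you can sanity check the issuer value before pasting it into the OAuth config.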
Going back to the console URL you should now see a new login button for openid, or whatever you named it:
Let's select openid:
This will be your Keycloak page loading, with the realm name you set, and any theme you want to set on Keycloak visible here
Oh no?!? What the heck is this? You may or may not see this screen. I'll explain in a minute, just do a hard refresh ( ctrl + shift + r ) once or twice and it should load the console
As an authenticated but not authorized user, you won't see anything in the cluster, but if we swap back to our kubeadmin session, we'll see a new user:
Note the identities section where we see openid: for the user, and the name, fullname, etc are sync'd over from Keycloak.
Ok, we're on the right track. Now, how do we give the user access to do stuff?
Here is where we will use the RBAC docs to map the group we created in Keycloak to a Role in Open Shift.
But first, where is that group? We know we see the user sync'd over from Keycloak, but I would expect the group I created and added to my user to also sync over, but alas, there is nothing there!
This one hung me up and had me very frustrated for a while until I stumbled across this stackoverflow post. The stackoverflow user Bench Vue provided a wonderful step-by-step sequence of screenshots showing how to make Keycloak add the user's groups to the JSON Web Token (JWT) that is sent to the client, which in our case is Open Shift.
Heading back to our Keycloak page, we navigate to:
our Realm we created
Client Scopes
Profile
"Add Mapper" 
"Group Membership"
set name to user_group_in_jwt
UNSELECT 'full group path' or else the group will have a preceding / which makes Open Shift error out very badly
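A quick way to confirm the mapper did its job is to look inside a token. This is a rough sketch, not part of the original walkthrough: jwt_payload is a hypothetical helper, it assumes you already have an access or ID token from your realm in hand, and it assumes the mapper's Token Claim Name is groups (matching the claims.groups entry in the Open Shift OAuth config).

```shell
# Print the decoded payload (the middle segment) of a JWT.
# Assumes a token of the usual header.payload.signature shape.
jwt_payload() {
  local payload
  payload="${1#*.}"         # strip the header segment
  payload="${payload%%.*}"  # strip the signature segment
  # base64url -> base64, then restore the padding that JWTs omit
  payload="$(printf '%s' "$payload" | tr '_-' '/+')"
  while [ $(( ${#payload} % 4 )) -ne 0 ]; do
    payload="${payload}="
  done
  printf '%s' "$payload" | base64 --decode
}

# Usage (TOKEN is whatever token you obtained from Keycloak):
# jwt_payload "$TOKEN"
```

With 'full group path' unselected, you'd expect the groups array in that output to hold the bare group name rather than one with a leading /.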
Stackoverflow article screenshot here for posterity:
With this added, log out of your user session in Open Shift, and log back in. Now you should be able to check to see the group is sync'd over in your kubeadmin Open Shift session:
So, we can click into that group that is synced over from Keycloak, swap to the 'rolebindings' tab and click "create binding". We'll make this a cluster rolebinding, and give ourselves the cluster-admin role:
Once we click 'create' we can swap back to our user session and refresh, and we should see that we're now cluster admins with the ability to see everything:
Note, this is a small 3 node cluster running on some old laptops. It is wildly underpowered and Open Shift likes to complain about that, ha!
So now at ~1,600 words in I'm about ready to wrap it up.

But Wait! What about that weird "We are sorry..." screen that came up? 
Remember 25 minutes ago when I was saying something about http/2 connection reuse? And that I re-used the *.apps.. certificate for this Keycloak deployment?
As it turns out, that causes a weird problem where Open Shift SDN will sometimes, but not always, randomly send traffic meant for the Open Shift Console to the Keycloak pod, which quite correctly states "I don't know what to do with this? Sorry?"
There are a couple 'correct' fixes here. We can set tls: reencrypt on the ingress, or create a new certificate keycloak. and set our ingress to use that instead of re-using the same  *.apps.. certificate. 
Which, frankly, is a little annoying, since the whole Open Shift pitch is "Hey, everything lives under the *.apps.. path and just works" - uh, that is, until it doesn't.
What I ended up doing in our Production environment was creating a new certificate and using that, along with a slightly different Keycloak DNS path keycloak.. - and set an 'A' record pointing to the same IP that  *.apps.. used for that Keycloak DNS entry. That works.
So that's it, that's the blog post. I hope that it helps you figure out how to deploy Keycloak as an IDP for Open Shift using Oauth, and setting up RBAC mapping for your Keycloak Group(s) to Open Shift Roles. I may write a more in depth blog post on that in the future as I learn more in that area.
As always, feel free to hit me up on twitter @badgerops or mastodon.social/@badgerops if you have pointers, questions or corrections.
Cheers!
-BadgerOps


smtp socket: malformed response on a FIPS 140-2 system
BadgerOps — Sat, 06 Jan 2024 16:56:24 GMT
Ok, this is a very highly specific post - but I hope it is useful for that sysadmin who's tearing their hair out trying to figure out wtf is going on with smtp failing with a vague error message.
Recently, I was configuring a Postfix SMTP relay on a FIPS140-2 enabled system, and had a weird error that I hadn't ever seen before:
warning: private/smtp socket: malformed response
warning: transport smtp failure -- see a previous warning/fatal/panic logfile record for the problem description
warning: process /usr/libexec/postfix/smtp pid  killed by signal 11
warning: /usr/libexec/postfix/smtp: bad command startup -- throttling

The warning: private/smtp socket: malformed response line is specifically what the error was. 
Googling this issue turns up mostly chrooted postfix issues, or incorrect permissions on the /etc/services file. Not a lot of useful information for my specific issue!
In this Red Hat Knowledgebase article I finally found the correct answer! Now, it's obviously paywalled behind a Red Hat subscription; however, knowing the magic string to search for turns up this stackoverflow article. The fix is to switch the fingerprint hashing function from md5, which is disabled on a FIPS 140-2 enabled system, to sha256 by running the following commands:
# postconf -e smtp_tls_fingerprint_digest=sha256
# postconf -e smtpd_tls_fingerprint_digest=sha256
# systemctl restart postfix

You can also just add the two lines to your /etc/postfix/main.cf file, whatever floats your boat. It would be super if they were in that file commented out, but they're not (at least not on RHEL 8.x)
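To double check that the change stuck, something like the following works. It is a small sketch: check_fingerprint_digests is a name I made up, and it relies only on postconf -h, which prints a parameter's value without its name.

```shell
# Return non-zero if either fingerprint digest is still set to md5.
check_fingerprint_digests() {
  local param
  local value
  local rc=0
  for param in smtp_tls_fingerprint_digest smtpd_tls_fingerprint_digest; do
    value="$(postconf -h "$param")"   # -h prints the value without the name
    printf '%s = %s\n' "$param" "$value"
    [ "$value" = "md5" ] && rc=1
  done
  return "$rc"
}

# Usage:
# check_fingerprint_digests || echo "still using md5 somewhere"
```

If you're not sure whether the box is actually in FIPS mode, cat /proc/sys/crypto/fips_enabled should print 1 when it is.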
That's it, that's the blog post. Go forth and send emails.
-BadgerOps


Germany
BadgerOps — Sat, 06 Jan 2024 10:07:21 GMT
Those of you who've been following my sporadic social media postings will know that the BadgerFam moved to Germany last summer.
I'm doing some work for some people that involves a lot of yaml, stiggin', bare metal -> openshift, fips, fapolicyd, selinux and much more fun. I'll let you draw your conclusions on the specifics.
Why Germany? Well, Mrs Badger and I have been talking about an extended EU trip for quite a few years now. We had initially planned on renting a flat for ~3 mo one summer, working remote, and exploring Europe. Germany is a pretty central place to be able to quickly reach other countries, plus I like trains. 
(sadly, the train system is currently a mess - lots of strikes, disruptions in service, delayed trains all the time)
But no, really, why Germany? Well, an interesting opportunity presented itself in Stuttgart, and the company I'm working with was pretty generous with resources - allowing us to move over here, live, send our kids to school and more. I'm pretty happy with this project, it's challenging - but I've been lucky to have some very good experiences in my past that set me up well to help solve the problems they're facing. 
The spawn have settled into school, spawn0 is heavily involved with Drill & Ceremony teams - learning to spin rifles and do precision movements. 
Spawn1 is involved in a local group play (Schoolhouse Rock!) and is enjoying that experience. (mostly - it's a lot of practice)
What have we been doing outside of work and school?
Exploring!
Castles!
Several photos of castles in our travels. The sharp eyed will notice the bottom two are in London. I am bad at taking photos, too busy taking it in!
We've seen a number of really cool castles, and taken very few photos of them. Been really bad at taking photos along the way. 
Being able to see the evolution of building styles across regions and centuries is quite fascinating. Many of the castles are well maintained to this day, and are cheap or free to enter.
Kirches!
A variety of Churches (Kirches) we've seen in our travels.
Check out this photosphere of a very well preserved Kirche in Mainz, Germany



Theme parks?
We went to Legoland Deutschland and dominated the Firefighting competition. If you haven't had the joy of this competition, it is a group based challenge where you pile into a "fire engine" that is pump/piston powered. You have to move it about 100' to the other side of the play area, where there is a pump powered water cannon and "fire" in the windows of a fake building. Once a sufficient amount of water has been pumped through the window, the "fire" drops out of view signalling it is time to pile back in and book it back across the play area. I'm proud to say, we made it back to the start line before the next group even finished putting out their fire. The kids were less impressed than Mrs Badger and I, for some strange reason...
We also were able to make it down to Disneyland Paris, which was a really cool experience. It felt just like Disneyland "back home". The rides were all significantly more intense than California or Florida. I'm a fan!
What's next?
We are here through ~June of 2024, and will be heading back to the 208 at that time. We're enjoying the experience here, but we miss our friends, family, camping, tacos, motorcycles and especially Mac - the world's greatest dog.
There are so many things we still want to see while we're over here, I'll do my best to write a little more about them.
Cheers!
-Badger


Wigle wardrive from Idaho -> D3F C0N
BadgerOps — Sun, 07 Aug 2022 18:55:54 GMT
Long time, no post! 
Driving down from Idaho to Las Vegas for Black Hat, BSidesLV and D3F C0N. Figured, hey, what the heck, let's war drive!
I grabbed my trusty old Alfa USB WiFi card, Garmin eTrex Legend & serial adaptor and an unused Raspberry Pi from the box-o-parts, installed Kismet and tossed it in the back of the car. Fingers crossed, I'll actually get some useful APs scanned during this trip. Lazy-posting my wigle.net deets below, I'll update this post as we get more info...
Posting the script I used on my github - it's pretty simple, but does the trick.
-BadgerOps



Export/Clone Linode VPS to AWS EC2
BadgerOps — Tue, 10 Mar 2020 12:28:15 GMT
Today we're looking at two methods of migrating a Linode Linux instance to an AWS EC2 instance. We can use the official Linode disk copy guide as a starting point, but that doesn't really get us all the way there, as we still need to import that image. This guide will walk you through the following:
Harder, but more flexible process if you can't create a second disk image in Linode (based on instance size)
Create disk image of your target Linode
Copy the disk image to S3 so we can use the AWS snapshot import tool
Import disk image to AWS as a disk snapshot
Create new EC2 instance
Write snapshot to EC2 volume
Update fstab, grub, network interface configuration on EC2 
Profit! Er, I mean: reboot and use the new cloned image
Easier process if you are able to create a second disk in your Linode of a slightly larger size than your main disk
Create disk image of your target Linode
Copy the disk image to S3 so we can use the AWS AMI import tool
Import disk image to AWS as an AMI
Create new EC2 from that AMI
Please read through both sets of instructions to familiarize yourself with the process, then follow along! I would love feedback, you can reach me @badgerops on Twitter, or find my email address on my profile and reach out that way. Thank you!
The first thing we'll do is ensure we have the needed pre-requisites, as well as a written down process, as there are a couple of ways to accomplish the import depending on the resources you have available.
1: You'll need AWS CLI credentials, or, the ability to create IAM roles & policies from the AWS Console.
2: You'll need either enough disk space in your Linode to create an image of the disk, or a large enough EC2 Volume attached to an EC2 instance to copy the image to over SSH so you can then import the image.
3: A written procedure for what you're doing, don't just follow along with this post! Write down your steps and mark them off as you go so you don't do what I did and have to go back and do a step over again that you missed. Ask me how I came up with this pre-requisite.
A quick note on Linode VPS disks: based on the Linode size you have chosen, and the way you configured your disks initially, you may or may not have enough disk space to create your disk image in Linode. You have a couple options:
Resize your Linode to be a bigger instance, choose the instance that will (at least) double your current storage size, so you can create an image
Copy the disk image over SSH as it's being created to another EC2 instance (or your workstation) so you can import it from there. Your target EC2 or workstation will need to have enough disk space for the image you're creating.
Now, on to the guide:
NOTE: If you'd like to follow along with the AWS guide for importing a VM/Image/Snapshot, the documentation is available here
Create S3 bucket
From the AWS Console, navigate to S3 and create a bucket, or identify an existing S3 bucket you'd like to use to store the disk image. You could also use the AWS CLI tool to create your S3 bucket. Due to the way the AWS vm import tool works, we have to use an S3 bucket.
Create IAM Role and Policy
First we'll look at the role and policy you need to create regardless of whether you're using AWS CLI credentials, or if you only have access to the AWS Console. This policy allows you to read from the S3 bucket, and write to EC2 to create a snapshot.
A role to allow you to import a VM (or, in our case a disk image)
A policy to allow your credentials, or EC2 instance to run the import. This will be assigned to the role.
Here is the example role in json format:
{
   "Version": "2012-10-17",
   "Statement": [
      {
         "Effect": "Allow",
         "Principal": { "Service": "vmie.amazonaws.com" },
         "Action": "sts:AssumeRole",
         "Condition": {
            "StringEquals":{
               "sts:Externalid": "import-vm-role"
            }
         }
      }
   ]
}
Save this as vm-import-role.json
You can create a new role in the AWS Console, or run the following command from the AWS CLI:
aws iam create-role --role-name import-vm-role --assume-role-policy-document "file:///path/to/vm-import-role.json"

Next, we'll create the policy that allows us to read from the S3 bucket that we'll put the image in, and upload the image to EC2 as a snapshot:
Note: you'll need to insert your S3 bucket name where I have  listed in the resource section
{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Effect":"Allow",
         "Action":[
            "s3:GetBucketLocation",
            "s3:GetObject",
            "s3:ListBucket"
         ],
         "Resource":[
            "arn:aws:s3:::",
            "arn:aws:s3:::/*"
         ]
      },
      {
         "Effect":"Allow",
         "Action":[
            "s3:GetBucketLocation",
            "s3:GetObject",
            "s3:ListBucket",
            "s3:PutObject",
            "s3:GetBucketAcl"
         ],
         "Resource":[
            "arn:aws:s3:::",
            "arn:aws:s3:::/*"
         ]
      },
      {
         "Effect":"Allow",
         "Action":[
            "ec2:ModifySnapshotAttribute",
            "ec2:CopySnapshot",
            "ec2:RegisterImage",
            "ec2:Describe*"
         ],
         "Resource":"*"
      }
   ]
}


Save this as vm-import-policy.json
Once again, you can use the AWS Console to create the policy, or run the following command from the AWS CLI:
aws iam put-role-policy --role-name import-vm-role --policy-name import-vm-policy --policy-document "file:///path/to/vm-import-policy.json"
Prepare Linode for backup
At this point you'll want to have your plan in place for how you're planning on backing up your Linode, as we're going to shut the Linode down for the next few steps. 
Ensure everyone using your Linode knows you're shutting it down!
NOTE: if you have sensitive data and/or a database on this Linode, consider taking a backup before proceeding.
Reboot Linode to the Finnix recovery environment 
Connect to your Linode using Lish
If you've decided to back up to a second disk on your Linode and import from there, skip the next section and go to the "Create disk image to Linode second disk (simple/fast method)" section. If you're copying your image over SSH to an existing EC2 instance, or your workstation (this is what I did) then keep reading.
Create disk image from Linode over SSH tunnel
First create a (long!) root password and start the SSH service so we can connect to this Linode to create the image. You could also use ssh keys if you'd prefer not to use password based authentication.
passwd
service ssh start
Set a root password and start ssh
Next, from your existing EC2 instance, or workstation run the following command from a screen (or Tmux) session. (In case you lose connection to your EC2 instance, you don't want the backup command to fail)
ssh root@ "dd if=/dev/sda " | dd of=/path/to/linode.img
This command will use the Linux dd utility to copy from your Linode to an image on your EC2 or workstation. 
Depending on how large your Linode is, and how much bandwidth you have available to you, this could take a few hours. Once the command completes, move on to preparing and importing the image to S3.
Prepare disk image for import to AWS S3
Optional: do the following steps to shrink the disk image if you have a large amount of free space! In this example, I had a vastly overprovisioned Linode and wanted to reduce the size of the image before import.
# first, verify the overall size of the image
du -h -s linode.img

1.3T linode.img

# create loop partition

losetup --find --partscan linode.img

# verify it got created and has the size we expect

lsblk | grep loop0
NAME          MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
loop0           7:0    0  1.3T  0 loop

fdisk -l /dev/loop0
Disk /dev/loop0: 1.3 TiB, 1373856858112 bytes, 2683314176 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

# for paranoia's sake, run e2fsck

e2fsck -f /dev/loop0 

# mount the image for the next few steps

mkdir -p /mnt/linodeimg

mount /dev/loop0 /mnt/linodeimg

# then run fstrim on it, we can use fstrim to remove any blocks not used
# by the filesystem as noted in the man page:
# "fstrim is used on a mounted filesystem to discard (or "trim") blocks
# which are not in use by the filesystem. This is useful for
# solid-state drives (SSDs) and thinly-provisioned storage."

fstrim -v /mnt/linodeimg
/mnt/linodeimg: 1 TiB (1138564808704 bytes) trimmed (wow!)

# unmount the loop partition

umount /dev/loop0

# confirm the physical disk is resized

du -h -s linode.img
220G    linode.img

# now that the actual size is reduced, we'll also want to reduce
# the filesystem size because it still thinks it's 1.3T in size!

lsblk
NAME          MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
loop0           7:0    0  1.3T  0 loop

resize2fs linode.img 250G

lsblk
NAME          MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
loop0           7:0    0  250G  0 loop

# remove the loop device map

losetup -d /dev/loop0
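Why does du drop to 220G while lsblk still says 1.3T? As far as I can tell, the trim punches holes in the image file, so it becomes sparse: its length stays the same, but far fewer blocks are actually allocated. A small sketch on a throwaway file (the temp file and variable names are mine, not from the steps above) shows the same effect:

```shell
# Demonstration on a throwaway file: apparent size vs. blocks actually allocated.
demo="$(mktemp)"
truncate -s 100M "$demo"                    # 100 MiB long, but nothing written yet
apparent_bytes="$(wc -c < "$demo")"         # length of the file in bytes
allocated_kb="$(du -k "$demo" | cut -f1)"   # space actually used on disk, in KiB
echo "apparent: ${apparent_bytes} bytes, allocated: ${allocated_kb} KiB"
rm -f "$demo"
```

The two numbers should differ wildly on any filesystem that supports sparse files, which is the same gap du and lsblk are showing for linode.img.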
Now we copy the image into the S3 bucket that we've prepared previously:
aws s3 cp linode.img s3:///
Completed 56.0 GiB/250.0 GiB (138.1 MiB/s) with 1 file(s) remaining
Following along with https://docs.aws.amazon.com/vm-import/latest/userguide/vmie_prereqs.html we'll import the image using the policy and role we created previously.
Create a containers.json file with the following format:
[
  {
    "Description": "Linode Image",
    "Format": "raw",
    "UserBucket": {
        "S3Bucket": "",
        "S3Key": "/linode.img"
    }
}]
Now, import the disk as a snapshot:
time aws ec2 import-snapshot --region us-west-2 --description "Imported Linode Image" --disk-container "file:///containers.json"
{
    "SnapshotTaskDetail": {
        "Status": "active",
        "Description": "Linode Image",
        "Format": "RAW",
        "DiskImageSize": 0.0,
        "UserBucket": {
            "S3Bucket": "",
            "S3Key": "linode.img"
        },
        "Progress": "3",
        "StatusMessage": "pending"
    },
    "Description": "Linode Image",
    "ImportTaskId": "import-snap-"
}
Note: we can monitor the progress by running:
aws ec2 describe-import-snapshot-tasks --import-task-ids import-snap- --region us-west-2
{
    "ImportSnapshotTasks": [
        {
            "SnapshotTaskDetail": {
                "Status": "active",
                "Description": "Linode Image",
                "Format": "RAW",
                "DiskImageSize": 268435456000.0,
                "UserBucket": {
                    "S3Bucket": "",
                    "S3Key": "linode.img"
                },
                "Progress": "35",
                "StatusMessage": "downloading/converting"
            },
            "Description": "Linode Image",
            "ImportTaskId": "import-snap-"
        }
    ]
}
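If you don't feel like re-running that by hand, a small polling loop does the trick. This is a sketch under a few assumptions: wait_for_snapshot_import is a made-up helper name, the 60 second default interval is arbitrary, and it uses the CLI's --query and --output text options to pull just the Status field out of the response shown above.

```shell
# Poll an import-snapshot task until it leaves the "active" state.
# $1 = import task id, $2 = region, $3 = seconds between polls (default 60)
wait_for_snapshot_import() {
  local task_id="$1"
  local region="$2"
  local interval="${3:-60}"
  local status
  while :; do
    status="$(aws ec2 describe-import-snapshot-tasks \
      --import-task-ids "$task_id" --region "$region" \
      --query 'ImportSnapshotTasks[0].SnapshotTaskDetail.Status' \
      --output text)"
    echo "status: ${status}"
    [ "$status" = "active" ] || break   # stop on completed, failure or empty output
    sleep "$interval"
  done
  [ "$status" = "completed" ]           # exit status reflects success
}

# Usage (task id is the ImportTaskId returned by import-snapshot):
# wait_for_snapshot_import "$IMPORT_TASK_ID" us-west-2
```

Run it inside the same screen or tmux session as before so a dropped connection doesn't leave you guessing.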
Once that import has completed, create a new EC2 instance and let it boot - we need to grab a few files off of it!
While it's booting, create a new volume of the desired size from the snapshot we just imported. Attach it to the instance and log into the instance once it has booted.
Mount the new volume to /mnt as shown here:
# NOTE: your exact path might differ, use lsblk command to see what the 
# correct path is

mount /dev/nvme2n1p1 /mnt 
We'll need to re-install grub on the new disk image, specify root-directory as /mnt since that's where we mounted the image:
grub-install --recheck --debug --root-directory=/mnt /dev/nvme2n1
Next prepare to chroot into the image by mounting the required filesystems:
for i in /dev /dev/pts /proc /sys /run; do sudo mount -B $i /mnt$i; done
We’ll also grab the network interface config from the EC2 instance to apply to the new image:
cp /etc/network/interfaces.d/50-cloud-init.cfg /mnt/etc/network/interfaces.d/50-cloud-init.cfg
IMPORTANT: run a blkid to get the UUID of your new image and save that UUID for below
Since we imported the image and created a new volume, the UUID will have changed, we need to update /etc/fstab or else we won't be able to boot!
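If you'd rather script the fstab change than do it in vi, here is one way to sketch it. set_root_uuid is a hypothetical helper, and it assumes the root filesystem is the fstab entry whose mount point is exactly /. It keeps a .bak copy next to the file, and it is meant to be run from outside the chroot against /mnt/etc/fstab.

```shell
# Point the root ("/") entry of an fstab file at a new filesystem UUID.
# $1 = new UUID, $2 = path to the fstab to edit (e.g. /mnt/etc/fstab)
set_root_uuid() {
  local new_uuid="$1"
  local fstab="$2"
  cp "$fstab" "${fstab}.bak"   # keep a copy in case the edit goes wrong
  # Rewrite only the first field of the non-comment line mounted at "/"
  awk -v uuid="$new_uuid" '
    $1 !~ /^#/ && $2 == "/" { $1 = "UUID=" uuid }
    { print }
  ' "${fstab}.bak" > "$fstab"
}

# Example, assuming the same device path used in the mount step above:
# set_root_uuid "$(blkid -s UUID -o value /dev/nvme2n1p1)" /mnt/etc/fstab
```

Eyeball the result with cat before rebooting; a wrong root entry here is exactly the "won't be able to boot" case.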
Then we'll finally chroot in for the last few changes
chroot /mnt

# change the hostname to your desired hostname

echo 'yourhostname' > /etc/hostname

# don't forget to update /etc/hosts with your desired hostname

vi /etc/hosts

# edit /etc/fstab with the new UUID for your image you got from 
# the blkid command above

vi /etc/fstab

# then update grub

update-grub

# that's it! If you have anything else you want to update, do that now.

exit
Unmount the filesystems:
# unmount in reverse order so /mnt/dev/pts is released before /mnt/dev
for i in /run /sys /proc /dev/pts /dev; do umount /mnt$i ; done

umount /mnt
Finally, shut down the EC2 instance and detach the volumes from it, then attach the new volume you just created from the snapshot as /dev/sda1 and start the instance again. You should now be able to log in to your clone of your Linode!
This process was long and painful to figure out, but I wanted to capture this process to demonstrate that you can still do it if you don't have the ability to create a 'local to Linode' disk image. For the easier path, follow along with the next section.
Create disk image to Linode second disk (simple/fast method)
NOTE: if you use this method, you MUST have AWS CLI access as this method must use the AWS CLI tools to import the disk image.
This is a much simpler/faster method, which is the recommended path if you are able to create a local image in your Linode instance, based on your disk space available. It's adapted from Devon Kurland's post here.
If you've chosen to create your disk image on a second disk in your Linode, you'll want to do the following steps:
Shut down your Linode
Add a second disk that is larger than your primary disk (so you'll have enough room for the disk image to be created)
Set the new disk to be /dev/sdb in the Linode console
Boot into Finnix recovery mode
Now, connect via Lish for the rest of the commands.
# First, install required tools to the Finnix recovery environment

apt-get update
apt-get install python-pip python-setuptools ca-certificates grub2
# When prompted where to install GRUB2, just press Enter, and then select Yes to continue without installing.

# Install the AWS CLI which we'll use to import the image
pip install awscli

# Next, mount the new disk that we'll be creating the backup on and create the server.raw file

mount /dev/sdb /mnt ; cd /mnt
dd if=/dev/zero of=server.raw count=1 bs=1MiB

# Next, copy your Linode primary disk to the raw file
dd bs=1MiB seek=1 if=/dev/sda of=server.raw

# Next, some quick fdisk prep, create a partition and write it
fdisk server.raw
# Press n, accept all of the defaults, then a and w.

# Next, use https://linux.die.net/man/8/kpartx to create a device map from the image so we can then mount it

kpartx -a -v server.raw

mount /dev/mapper/loop1p1 /mnt
# If you receive a "does not exist" error, you may need to run the last two commands again.

# next we'll (re) install grub to the image we just mounted as a loop device. This ensures the MBR has grub installed

grub-install --recheck --debug --boot-directory=/mnt/boot/ /dev/loop1

# Prepare to chroot into the new image
mount -t proc proc /mnt/proc/ ; mount -t sysfs sys /mnt/sys/
mount -o bind /dev /mnt/dev/
chroot /mnt

# Configure the new image, verify grub is installed and re-update it
apt-get install grub2-common linux-image-amd64
cp /usr/share/grub/default/grub /etc/default/grub
update-grub2

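# Set the GRUB root device in the generated config.
# NOTE: the original command had no target file; /boot/grub/grub.cfg is assumed
# here because that is the file update-grub2 writes the "insmod ext2" lines into.
sed -i "s/insmod ext2/insmod ext2\nset root='hd0,msdos1'/g" /boot/grub/grub.cfg

# Update our fstab with the new disk UUID
echo "UUID=`blkid -s UUID -o value /dev/sda` / ext4 defaults 1 1" > /etc/fstab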
printf "nameserver 8.8.8.8\nnameserver 8.8.4.4" > /etc/resolv.conf
printf "auto lo\niface lo inet loopback\n\nauto eth0\niface eth0 inet dhcp" > /etc/network/interfaces

# Do any other steps you might want to do, then exit
exit

# Unmount all the filesystems
for i in proc sys dev ; do umount /mnt/$i ; done
umount /mnt

# Remove the device maps

kpartx -d -v server.raw
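The 1 MiB of zeros written to server.raw before the disk copy is what leaves room for the partition table that fdisk creates. As a quick sanity check of the arithmetic, assuming 512-byte sectors (the helper name is made up for this sketch):

```shell
#!/bin/sh
# Hypothetical helper: convert a byte offset into a sector number.
# Assumes 512-byte sectors.
offset_to_sector() {
  echo $(( $1 / 512 ))
}

header_bytes=$(( 1024 * 1024 ))   # dd bs=1MiB seek=1
echo "filesystem starts at sector $(offset_to_sector "$header_bytes")"
```

Sector 2048 is the first sector that recent fdisk versions usually offer by default, which is why accepting the defaults should line the new partition up with the copied filesystem. If your fdisk proposes a different start, `fdisk -l server.raw` will show what was actually written.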
Once you're done here, the next two commands upload your image to S3 and then import it into EC2 as an AMI. Use your own bucket name in both the S3 URL and the S3Bucket field.
aws s3 cp server.raw s3:///
aws ec2 import-image --cli-input-json "{\"Description\":\"server\",\"DiskContainers\":[{\"Description\":\"Imported from Linode\",\"UserBucket\":{\"S3Bucket\":\"bucketname\",\"S3Key\":\"server.raw\"}}]}"
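The escaped JSON in that one-liner is easy to get wrong. One alternative, sketched here with a hypothetical helper called `write_import_json`, is to generate the same document from a heredoc and keep the quoting readable:

```shell
#!/bin/sh
# Hypothetical helper: emit the same JSON as the escaped one-liner,
# with the bucket and key passed in as arguments.
write_import_json() {
  bucket=$1
  key=$2
  cat << EOF
{
  "Description": "server",
  "DiskContainers": [
    {
      "Description": "Imported from Linode",
      "UserBucket": { "S3Bucket": "$bucket", "S3Key": "$key" }
    }
  ]
}
EOF
}

# "bucketname" is the same placeholder the command above uses.
write_import_json bucketname server.raw
```

Redirecting that output to a file (say `import.json`, a name chosen only for this example) lets you run `aws ec2 import-image --cli-input-json file://import.json` instead of escaping everything inline.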
To monitor the import task you can run the following command:
aws ec2 describe-import-image-tasks
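Rather than re-running that by hand, a small polling loop can wait for the task. This is only a sketch: `wait_for_status` is a hypothetical helper, and both the `completed` status word and the task ID in the usage comment are assumptions to check against the output of your own import task.

```shell
#!/bin/sh
# Hypothetical helper: run a status command repeatedly until it prints
# the wanted value, or give up after a number of attempts.
#   $1 = wanted status, $2 = max attempts, $3 = seconds between attempts
#   remaining arguments = the command that prints the current status
wait_for_status() {
  want=$1
  tries=$2
  delay=$3
  shift 3
  i=0
  while [ "$i" -lt "$tries" ]; do
    status=$("$@")
    echo "status: $status"
    if [ "$status" = "$want" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage (assumed task ID and status word, adjust to your own task):
#   wait_for_status completed 120 30 \
#     aws ec2 describe-import-image-tasks \
#       --import-task-ids import-ami-EXAMPLE \
#       --query 'ImportImageTasks[0].Status' --output text
```

Passing the status command as arguments keeps the loop itself independent of the AWS CLI, so it can be tried out with any command that prints a word.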
Once your import is complete, you can navigate to "My AMI's" and create an EC2 instance from there.


Debugging a PHP app in Kubernetes using Telepresence.io
BadgerOps — Thu, 03 Oct 2019 19:01:21 GMT
Hello folks,
Today we're going to talk about using Telepresence (telepresence.io) to debug PHP code running in Kubernetes. I'll refer you to the Telepresence Introduction page for an overview of what Telepresence can do for you, but if you're reading this you are probably already aware and just want the example code. So, let's get to it!
If you want to skip all the discussion and get right to hacking, just scroll past the break to the 'gimme the code, man' section.
Once you've read the Introduction, the installation guide is your next stop. The specific version of Telepresence that introduced the ability to use xdebug for remote debugging is 0.102 (October 2, 2019); you can read the changelog here if you're interested in the details.
Alright, so let's set some assumptions here:
1: You are comfortable with PHP and using xdebug.
   There are plenty of guides to getting xdebug working on the internet; I used the PhpStorm guide.
   Note: I am also using the browser debugger extension for Chrome.
2: I am using PhpStorm for this example; feel free to follow along with your favorite editor.
3: I am doing this from a Mac, but you could just as easily use Linux, since all of the tools I use also work there. Windows, I honestly have no idea, as I don't use Windows for any development work.
4: I am using both Kubernetes on Docker Desktop for Mac and EKS in AWS.
Setup your environment
For this example, I'm just going to do a very simple 'hello world' PHP page that will also expose phpinfo() so you can see the environment variables.
Step 1:
Create yourself a new project in your $editor_of_choice and create your index.php with 'hello world' and/or phpinfo() in it.
Note: if you're not using a Unix-compliant shell, don't use the cat > file << EOF lines; more info on heredocs here.
mkdir -p ~/code/telepresenceDemo && cd ~/code/telepresenceDemo

cat > index.php << EOF
<html>
 <head>
  <title>PHP Test</title>
 </head>
 <body>
 <?php echo '<p>Hello World!</p>'; ?>
 <?php phpinfo(); ?>
 </body>
</html>
EOF
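The hello-world page has no PHP variables in it, but real code will, and an unquoted heredoc delimiter lets the shell expand things like `$name` before PHP ever sees them. A small sketch of the difference (the variable and function names are made up for the demo):

```shell
#!/bin/sh
name="set-by-the-shell"

# Unquoted delimiter: the shell substitutes $name inside the heredoc.
unquoted() {
  cat << EOF
<?php echo "$name"; ?>
EOF
}

# Quoted delimiter: the body is written out exactly as typed.
quoted() {
  cat << 'EOF'
<?php echo "$name"; ?>
EOF
}

unquoted   # prints: <?php echo "set-by-the-shell"; ?>
quoted     # prints: <?php echo "$name"; ?>
```

If your PHP file uses variables, the quoted form is the one you most likely want.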

Step 2:
Create a Dockerfile to test with. I'm using the official PHP Apache Docker image (Apache for single-container goodness; PHP-FPM + Nginx are usable as well, and I may cover them in a future post).
This installs and enables xdebug and sets some custom xdebug options in the php.ini file. We also copy the index.php we created to /var/www/html/index.php.
# The quoted 'EOF' keeps the shell from touching the backslash line continuations
cat > Dockerfile << 'EOF'
FROM php:7.2-apache
RUN pecl install xdebug-2.6.0
RUN docker-php-ext-enable xdebug
RUN echo "xdebug.remote_enable=1" >> /usr/local/etc/php/php.ini && \
    echo "xdebug.remote_host=localhost" >> /usr/local/etc/php/php.ini && \
    echo "xdebug.remote_port=9000" >> /usr/local/etc/php/php.ini && \
    echo "xdebug.remote_log=/var/log/xdebug.log" >> /usr/local/etc/php/php.ini

COPY ./index.php /var/www/html
WORKDIR /var/www/html
EOF
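If the chain of echo lines gets unwieldy, one alternative is to generate the xdebug settings as a small ini file and copy that into the image. This is a sketch of a different approach, not what the Dockerfile above does, and the helper name is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: print the same four xdebug settings that the
# Dockerfile appends to php.ini, one per line.
xdebug_ini() {
  host=$1
  port=$2
  cat << EOF
xdebug.remote_enable=1
xdebug.remote_host=$host
xdebug.remote_port=$port
xdebug.remote_log=/var/log/xdebug.log
EOF
}

xdebug_ini localhost 9000
```

Saved to a file, that output could be added to the image with a COPY line targeting /usr/local/etc/php/conf.d/, which the official PHP images scan for extra ini files.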

Step 3:
Build our example container
docker build -t mytelepresencetest:01 .

This container pulls from upstream php and we add our xdebug specific settings to it in the above Dockerfile.
Step 4:
Finally, assuming you already have KUBECONFIG set we can fire up Telepresence. Telepresence will use whatever Kubernetes config file you have in your env, or in ~/.kube/config.

more info on Kubernetes config file management here

telepresence --container-to-host 9000 --verbose --new-deployment tele-test --docker-run -p 8080:80 -v $(pwd):/var/www/html mytelepresencetest:01

In this example, I'm mounting the local directory ($(pwd)) to /var/www/html, which allows me to edit index.php in my editor and have the changes show up automatically inside the running container. You could also specify the explicit path to your code folder if you don't launch Telepresence from the code folder.
If you created the folder as noted above, this would look like:
telepresence --container-to-host 9000 --verbose --new-deployment tele-test --docker-run -p 8080:80 -v ~/code/telepresenceDemo:/var/www/html mytelepresencetest:01
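That command leans on three tools being installed: docker, kubectl, and telepresence itself. A minimal preflight sketch, assuming a hypothetical helper named `missing_tools`, that reports anything not on your PATH before you launch:

```shell
#!/bin/sh
# Hypothetical helper: print each named tool that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" > /dev/null 2>&1 || echo "$tool"
  done
}

missing=$(missing_tools docker kubectl telepresence)
if [ -n "$missing" ]; then
  # Left unquoted on purpose so the names print on one line.
  echo "install these first:" $missing
else
  echo "all tools found"
fi
```

It only checks that the binaries exist; it does not verify that your Kubernetes config points at a reachable cluster.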

Here is how this command looks as it executes on my machine (on October 3rd 2019)
telepresence --container-to-host 9000 --verbose --new-deployment tele-test --docker-run -p 8080:80 -v $(pwd):/var/www/html mytelepresencetest:01
T: How Telepresence uses sudo: https://www.telepresence.io/reference/install#dependencies
T: Invoking sudo. Please enter your sudo password.
Password:
T: Volumes are rooted at $TELEPRESENCE_ROOT. See https://telepresence.io/howto/volumes.html for details.
T: Starting network proxy to cluster using new Deployment tele-test

T: No traffic is being forwarded from the remote Deployment to your local machine. You can use the --expose option to specify which ports you want to forward.

T: Forwarding container port 9000 to host port 9000.
T: Setup complete. Launching your container.
AH00558: apache2: Could not reliably determine the server's fully qualified domain name, using 172.17.0.2. Set the 'ServerName' directive globally to suppress this message
AH00558: apache2: Could not reliably determine the server's fully qualified domain name, using 172.17.0.2. Set the 'ServerName' directive globally to suppress this message
[Thu Oct 03 17:04:35.421678 2019] [mpm_prefork:notice] [pid 7] AH00163: Apache/2.4.38 (Debian) PHP/7.2.23 configured -- resuming normal operations
[Thu Oct 03 17:04:35.422032 2019] [core:notice] [pid 7] AH00094: Command line: 'apache2 -D FOREGROUND'

Awesome, we see that apache has launched, and we can see the apache logs in our terminal window.
Step 5:
Open a browser to http://localhost:8080 and you should see our "Hello World!" statement followed by phpinfo() output. Here is what you should see in your Telepresence output:
(this line included for continuity from above example block) 

[Thu Oct 03 17:04:35.422032 2019] [core:notice] [pid 7] AH00094: Command line: 'apache2 -D FOREGROUND'

172.17.0.1 - - [03/Oct/2019:17:11:08 +0000] "GET / HTTP/1.1" 200 25347 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36"
172.17.0.1 - - [03/Oct/2019:17:11:13 +0000] "GET /favicon.ico HTTP/1.1" 404 502 "http://localhost:8080/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36"

The key option here is --container-to-host 9000. It creates a connection from the container back to your computer at localhost:9000 so your xdebug listener can receive data from the code executing in that container.
Now that you have your Telepresence process running, swap over to your editor (again, I'm using PhpStorm + Chrome + the PhpStorm xdebug extension) and turn on your xdebug listener. Also turn on your xdebug extension in your browser. (Configuration instruction links for both are at the top of this post under assumption #1.)
Once those are running, you should be able to refresh your page and see a breakpoint hit in PhpStorm! (The first time, you should get a pop-up asking you to map your local code path to the remote path.)
In your editor, go modify your index.php to be "Hello, Telepresence" instead of "Hello, World" and refresh your browser to see the changes reflected in the container.
Now if this were an app that needed access to resources hosted in your Kubernetes cluster, you'd be able to hit those resources from your code that is technically running 'locally' on your box, giving you the ability to hit breakpoints and step through code without having to host all of those resources locally. Nice. Huge shoutout to awesome folks at Datawire who wrote this tool!
The way this all works is that Telepresence spins up two proxy containers: one in Docker locally on your box, the other in your Kubernetes cluster. The container you built is launched by this chunk of the command:

--docker-run -p 8080:80 -v $(pwd):/var/www/html mytelepresencetest:01

All of its traffic is routed through the 'local' Docker proxy container to the remote proxy container in Kubernetes, giving you access to any resources your Kubernetes cluster has available.
If this doesn't work for you - make sure you've correctly configured your editor as noted in Assumption #1 at the top of this post. If you're still having issues, swing by twitter and ask @badgerops or check the github issues to see if someone else is having a similar issue.
If you think Telepresence is awesome and want to contribute, head over to their github and get your sweet Hacktoberfest PR's in. (Assuming you're reading this in October!)

Just the steps aka, 'gimme the code, man'
Step 1:
Create yourself a new project in your $editor_of_choice and create your index.php with 'hello world' and/or phpinfo() in it.
Or, if you want manual steps:
Note: if you're not using a Unix-compliant shell, don't use the cat > file << EOF lines; more info on heredocs here.
Create your code folder
mkdir -p ~/code/telepresenceDemo && cd ~/code/telepresenceDemo

Create your index.php file
cat > index.php << EOF
<html>
 <head>
  <title>PHP Test</title>
 </head>
 <body>
 <?php echo '<p>Hello World!</p>'; ?>
 <?php phpinfo(); ?>
 </body>
</html>
EOF

Step 2:
Create a Dockerfile to test with. I'm using the official PHP Apache Docker image (Apache for single-container goodness; PHP-FPM + Nginx are usable as well).
# The quoted 'EOF' keeps the shell from touching the backslash line continuations
cat > Dockerfile << 'EOF'
FROM php:7.2-apache
RUN pecl install xdebug-2.6.0
RUN docker-php-ext-enable xdebug
RUN echo "xdebug.remote_enable=1" >> /usr/local/etc/php/php.ini && \
    echo "xdebug.remote_host=localhost" >> /usr/local/etc/php/php.ini && \
    echo "xdebug.remote_port=9000" >> /usr/local/etc/php/php.ini && \
    echo "xdebug.remote_log=/var/log/xdebug.log" >> /usr/local/etc/php/php.ini

COPY ./index.php /var/www/html
WORKDIR /var/www/html
EOF

Step 3:
Build our example container
docker build -t mytelepresencetest:01 .

Step 4:
Profit... er, run the thing (assuming your current directory is your code repo):
telepresence --container-to-host 9000 --verbose --new-deployment tele-test --docker-run -p 8080:80 -v $(pwd):/var/www/html mytelepresencetest:01

Step 5:
Start up the xdebug listener in your editor, open your browser to http://localhost:8080 and start stepping through code! The key option here is --container-to-host 9000. It creates a connection from the container back to your computer at localhost:9000 so your xdebug listener can receive data from the code executing in that container.
If this doesn't work for you - make sure you've correctly configured your editor as noted in Assumption #1 at the top of this post. If you're still having issues, swing by twitter and ask @badgerops or check the github issues to see if someone else is having a similar issue.
If you think Telepresence is awesome and want to contribute, head over to their github and get your sweet Hacktoberfest PR's in. (Assuming you're reading this in October!)
Hope this helps you out, Cheers!
-BadgerOps
twitter: @badgerops



Forcing initramfs to load udev 70-persistent-net.rules
BadgerOps — Wed, 14 Feb 2018 23:09:10 GMT
Hello fine readers,
Chances are you're scouring the googles much like I was all morning trying to figure out how to get update-initramfs to pull in the 70-persistent-net.rules udev rule.
You may have stumbled on this debian bug or this ubuntu bug which coincidentally ties in with my remote LUKS unlocking post.
I quickly found that NEED_PERSISTENT_NET=yes needed to be set, but it wasn't immediately obvious where, as update-initramfs ignores environment variables. Well, I finally tracked down a reference to where you can set that variable.
NOTE:
In case that link dies or that line is removed, here is the necessary information:

Usually network interfaces are renamed after the root file system has been mounted, so if the root file system is mounted over the network then the 70-persistent-net.rules file must be copied to the initramfs. In most cases this is done automatically, but some setups may require explicitly setting "export NEED_PERSISTENT_NET=yes" in a file in /etc/initramfs-tools/conf.d/ . If 70-persistent-net.rules is copied to the initramfs then it must be updated every time a new interface is added.
I added it to my deployment script:
echo "export NEED_PERSISTENT_NET=yes" > /mnt/etc/initramfs-tools/conf.d/persistent_net_setup

which triggered the copying of 70-persistent-net.rules the next time I ran update-initramfs.
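If you want the same step as a reusable piece of a deployment script, something along these lines may work. It is a sketch under assumptions: `enable_persistent_net` is a hypothetical helper, and the demo writes into a throwaway directory instead of a real root.

```shell
#!/bin/sh
# Hypothetical helper: write the conf.d snippet under a given root
# directory ("/mnt" for a chroot target, "" for the running system).
enable_persistent_net() {
  root=$1
  conf_dir="$root/etc/initramfs-tools/conf.d"
  mkdir -p "$conf_dir"
  echo "export NEED_PERSISTENT_NET=yes" > "$conf_dir/persistent_net_setup"
}

# Demo against a temporary directory rather than a real root.
demo_root=$(mktemp -d)
enable_persistent_net "$demo_root"
cat "$demo_root/etc/initramfs-tools/conf.d/persistent_net_setup"
rm -rf "$demo_root"
```

After the next update-initramfs run, listing the image contents with lsinitramfs and grepping for 70-persistent-net.rules is one way to confirm the rule actually made it in.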
Hopefully this helps you out!
-BadgerOps



Using Saltstack salt-mine
BadgerOps — Mon, 12 Feb 2018 16:10:54 GMT
Edit in March of 2020: Hello! This is one of my more popular posts even in 2020, but I'm curious if you came across this post looking for slightly different information than is presented. If so, shoot me a message on twitter: @badgerops or send an email to blog@badgerops.net with what you're looking for so I can update this post! Thank you.

Today we're going to talk about using salt-mine to gather information from salt minions. This is a sister post to Using the #!pyobjects renderer, as we're consuming mine data to create a custom hosts file.
In this example, we're going to use salt-mine to register the IP addresses that match a specific CIDR,
using a pillar declaration as seen here (NOTE: this is explained in great detail in the documentation):
mine_functions:
  # we build our /etc/hosts file off the 'private/non routable' IP's
  network.ip_addrs:
    cidr: 192.168.0.0/16

This allows us to do a mine lookup salt['mine.get']('*', 'network.ip_addrs') which would return a dictionary that looks something like this:
>>> salt('*', 'mine.get', ('*', 'network.ip_addrs'))
{'saltmaster': {'saltmaster': ['192.168.50.4'], 'linux-1': ['192.168.50.5']}, 'linux-1': {'saltmaster': ['192.168.50.4'], 'linux-1': ['192.168.50.5']}}

Breaking this down: salt('*', ...) is functionally the same as salt '*' on the command line, meaning we run the command on all minions. Then we have the mine.get function, where we pass in ('*', 'network.ip_addrs') as arguments. This means we're requesting network.ip_addrs from all the minions, which is why each minion's entry in the result holds the full map of minion name to IP list. As usual, if you read the documentation you should have a better understanding of how to get information back out of salt-mine.
We could omit the cidr to register all IPs except loopback, but that would be functionally equivalent to running salt-call network.ip_addrs on each minion (to be clear, that is exactly what is happening under the hood; we're just storing that return in the salt-mine).
There are many other things we could store in the salt-mine. Essentially the return of any execution module function, including grain lookups such as grains.item, can be stored there. One huge reason to do this is that you can look up data about one minion from another minion, which is something you cannot do with pillar or grains (they're minion specific). This is why I chose to use salt-mine to create my custom /etc/hosts file.
Hope this brief post was helpful, feel free to comment or hit up @badgerops on twitter!
-BadgerOps

