Archive v2 - btrfs + Fleet Backup

Walter 🦉 · March 23, 2026 · supersedes plan-archive

What Happened

We lost documents twice (cave.html and door.html) because two sessions deployed to the same URL seconds apart. No backups. No snapshots. No git history. Files that existed for 140 seconds, gone forever.

We had been "planning" a backup system for a week. The planning itself was the failure mode: too many layers, too many optimizations, too many things to do at once, so nothing got done. Today we did the simplest possible thing and it took five minutes.

This plan consolidates everything we're considering: what's already done, what's next, and the longer-term btrfs migration that Charlie described. Each layer is independent. Each layer works alone. No layer depends on any other layer.

The Three Layers

LAYER 1 - GCP DISK SNAPSHOTS ✓ DONE

Every disk in the fleet. Hourly. 4-year retention. Incremental (costs almost nothing). Protects against disk death, VM destruction, catastrophic failure.

12 disks across 7 regions. One schedule per region, all identical. Completed March 23, 2026.

Region          Disks                                                      Schedule
us-central1     vault, vault-mnt, amy2, jamie, instance-20260203-181104   hourly-vault-snapshots
us-west1        danny                                                      hourly-fleet-snapshots
europe-west3    walter-jr                                                  hourly-fleet-snapshots
europe-north1   captain-kirk-3                                             hourly-fleet-snapshots
europe-north2   matilda                                                    hourly-fleet-snapshots
me-west1        amy-israel, foreman                                        hourly-fleet-snapshots
africa-south1   ghost-jr                                                   hourly-fleet-snapshots

Recovery: restore a snapshot to a new disk, mount it, copy what you need. Takes 5–10 minutes. Protects against everything except the last 59 minutes of changes.
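
A sketch of that recovery path; the snapshot name, restore-disk name, zone, and device letter below are placeholders, not the real resources:

# Find a recent snapshot of the disk you care about.
gcloud compute snapshots list --filter="sourceDisk~vault-mnt" --sort-by=~creationTimestamp --limit=5

# Materialize it as a new disk and attach it to a running VM.
gcloud compute disks create vault-mnt-restore --source-snapshot=SNAPSHOT_NAME --zone=us-central1-a
gcloud compute instances attach-disk vault --disk=vault-mnt-restore --zone=us-central1-a

# On the VM: mount it read-only and copy what you need.
sudo mkdir -p /mnt/restore
sudo mount -o ro /dev/sdc /mnt/restore
cp /mnt/restore/public/door.html /mnt/public/door.html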

LAYER 2 - ARCHIVE VM WITH BTRFS

A dedicated backup VM called archive that pulls from every machine via rsync. Its data disk is formatted btrfs, not ext4. This is greenfield, so we do it right from birth.

btrfs snapshots are copy-on-write. A snapshot is not a copy; it's a bookmark. The filesystem says "remember everything that exists right now" and the operation takes microseconds because no data moves. When a file changes later, the old blocks stay where they are. The snapshot is just a pointer to the old blocks.

This means: snapshot every minute. Keep them for 24 hours. A cron job runs btrfs subvolume snapshot /mnt /mnt/.snapshots/$(date +%s) every minute, and a cleanup deletes anything older than 24 hours. The additional cost is essentially zero because unchanged blocks are shared between snapshots.
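
A minimal sketch of that pair, assuming /mnt is the mounted top of the btrfs filesystem and both jobs run as root from cron (the script names and the read-only -r flag are our additions, not requirements):

# archive-snapshot.sh (sketch) - one read-only snapshot per minute.
btrfs subvolume snapshot -r /mnt "/mnt/.snapshots/$(date +%s)"

# archive-cleanup.sh (sketch) - prune anything older than 24 hours.
now=$(date +%s)
for snap in /mnt/.snapshots/*; do
    [ -d "$snap" ] || continue
    ts=$(basename "$snap")
    if [ $((now - ts)) -gt 86400 ]; then
        btrfs subvolume delete "$snap"
    fi
done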

Recovery example: if door.html got overwritten 90 seconds ago:

cp /mnt/.snapshots/1711151520/mirrors/vault/mnt/public/door.html \
   /mnt/mirrors/vault/mnt/public/door.html

No gcloud commands. No disk mounting ceremony. Just cp. The snapshots are browsable directories on the same filesystem.
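
If you don't know exactly when the good version existed, the snapshots can be searched like any other directories. A sketch, reusing the path from the example above:

# Show the size and mtime of door.html in every snapshot from the last 24h;
# pick the snapshot from just before the overwrite.
for s in /mnt/.snapshots/*; do
    ls -l --time-style=long-iso "$s/mirrors/vault/mnt/public/door.html" 2>/dev/null
done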

VM Spec

Item         Spec                                              Cost
VM           e2-small, us-central1-a                           ~$12/mo
Boot disk    10 GB pd-ssd, Debian 12                           ~$2/mo
Data disk    50 GB pd-ssd, formatted btrfs, mounted at /mnt    ~$9/mo
Network      Same zone as vault (free internal traffic)        $0
Total                                                          ~$23/mo

Architecture

Archive pulls. Nobody pushes to it. It has SSH keys to every machine. No machine has SSH keys to Archive. One-way mirror.

/mnt/                              ← btrfs filesystem
├── mirrors/
│   ├── vault/
│   │   ├── home/daniel/
│   │   └── mnt/                   ← the motherload
│   ├── walter/
│   │   └── home/daniel/
│   ├── walter-jr/
│   │   └── home/daniel/
│   ├── matilda/
│   │   └── home/daniel/
│   └── ...
└── .snapshots/
    ├── 1711151400/                ← minute-by-minute btrfs snapshots
    ├── 1711151460/                   (browsable as normal directories)
    ├── 1711151520/
    └── ...

Sync

Rsync from every machine every 5 minutes. After each sync cycle, a btrfs snapshot is taken automatically. Per-minute snapshots run independently as a separate cron job (snapshotting whatever state exists at that moment).

Excluded from rsync: .cache/, node_modules/, .npm/, snap/, .local/share/Trash/, .venv/
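
A sketch of the per-host pull that the 5-minute job would run (vault's /mnt would need a second call of the same shape). The daniel account, the home-directory source path, and the --delete behavior (deletions propagate to the mirror but stay recoverable in the snapshots) are assumptions to confirm:

#!/usr/bin/env bash
# archive-sync.sh (sketch): mirror one machine's home directory.
set -euo pipefail

HOST="$1"                                   # e.g. vault
DEST="/mnt/mirrors/${HOST}/home/daniel/"

mkdir -p "$DEST"
rsync -az --delete \
    --exclude='.cache/' --exclude='node_modules/' --exclude='.npm/' \
    --exclude='snap/' --exclude='.local/share/Trash/' --exclude='.venv/' \
    "daniel@${HOST}:/home/daniel/" "$DEST"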

LAYER 3 - BTRFS ON VAULT (FUTURE)

The long-term play: replace vault-mnt's ext4 with btrfs. Then vault itself has per-minute snapshots; no archive VM needed for the core data. Recovery becomes instant and local.

This is not greenfield. vault-mnt is live, and we're not converting ext4 to btrfs in place. The surgery is: create a new btrfs disk, rsync everything from the old disk, remount, update fstab, verify, then delete the old disk. That is surgery on a live system. The rule says stop, think, ask.

Sequence: get Layer 2 running first. Once archive is pulling from vault reliably, we have a safety net for the vault migration. Then do the swap. One door at a time.

What Charlie Said About btrfs

GCP disk snapshots protect against the disk dying. btrfs snapshots protect against Walter. Two different failure modes, two different layers. The first is the fire insurance. The second is the undo button.

btrfs snapshots are copy-on-write: a snapshot is a bookmark, not a copy. No data moves. Old blocks stay where they are. New blocks get written somewhere else. The snapshot is just a pointer to the old blocks. The operation takes microseconds.

Daniel's instinct (snapshot every second, keep it for an hour) is not insane on btrfs. It is trivially achievable. But every minute with 24h retention is the practical sweet spot.

The catch: vault-mnt is almost certainly ext4. You cannot add btrfs to an ext4 disk. But the archive VM doesn't exist yet; it's greenfield. Make its data disk btrfs from birth.

Execution Plan

✓ Step 1: Enable hourly GCP snapshots on every disk in the fleet.
Done. 12 disks, 7 regions, hourly, 1460-day retention. March 23, 2026.
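
Roughly the shape of the commands behind this step; the region, zone, and disk below are one illustrative pair, not the full list:

# One schedule per region (1460 days = 4 years).
gcloud compute resource-policies create snapshot-schedule hourly-fleet-snapshots \
    --region=us-west1 --hourly-schedule=1 --start-time=00:00 \
    --max-retention-days=1460

# Attach it to each disk in that region.
gcloud compute disks add-resource-policies danny \
    --resource-policies=hourly-fleet-snapshots --zone=us-west1-a
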
Step 2: Create the archive VM.
gcloud compute instances create archive: e2-small, us-central1-a, 10 GB boot + 50 GB data disk.
Format the data disk as btrfs (not ext4): mkfs.btrfs /dev/sdb
Mount with compress=zstd for free compression.
Deliverable: VM running, btrfs data disk mounted at /mnt.
โธ STOP โ€” Verify VM, SSH, btrfs mount.
Step 3: Set up SSH keys.
Generate keypair on archive. Distribute public key to every machine. Archive can SSH to everyone. Nobody can SSH to archive.
Deliverable: ssh vault.1.foo works from archive.
โธ STOP โ€” Verify SSH to all running machines.
Step 4: Write the sync script + snapshot cron.
Three cron jobs:
1. */5 * * * * - rsync from every machine
2. * * * * * - btrfs subvolume snapshot /mnt /mnt/.snapshots/$(date +%s)
3. */5 * * * * - cleanup: delete snapshots older than 24h
Deliverable: Scripts written, not yet scheduled.
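
A sketch of how the three jobs could be wired up, assuming the scripts live in /usr/local/bin and archive-sync-all.sh loops over the fleet calling the per-host pull from the Sync section (names and paths are placeholders):

# /etc/cron.d/archive (sketch)
*/5 * * * *  root  /usr/local/bin/archive-sync-all.sh
*   * * * *  root  /usr/local/bin/archive-snapshot.sh
*/5 * * * *  root  /usr/local/bin/archive-cleanup.sh
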
Step 5: First manual sync.
Run rsync manually. Initial full pull from all machines. Monitor progress.
Deliverable: All reachable machines mirrored on btrfs.
โธ STOP โ€” Verify the mirror. Spot-check files. Compare sizes.
Step 6: Enable cron. Go live.
Activate both cron jobs. Verify first few cycles.
Deliverable: Archive pulling every 5 min, btrfs snapshots every minute, cleanup every 5 min.
โธ STOP โ€” Watch it run for a day before moving on.
Step 7 (future): Migrate vault-mnt to btrfs.
Create new btrfs disk → rsync from old ext4 → remount → verify → delete old disk.
Only after Layer 2 is running and tested; archive is the safety net for this migration.
Deliverable: vault has local btrfs snapshots. Per-minute undo on the motherload.
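
A rough sketch of that swap for when we get there; the disk size, device names, mount points, and zone are all assumptions, and this is exactly the step where we stop, think, ask first:

# 1. Create and attach a new btrfs disk alongside the old one.
gcloud compute disks create vault-mnt-btrfs --size=200GB --type=pd-ssd --zone=us-central1-a
gcloud compute instances attach-disk vault --disk=vault-mnt-btrfs --zone=us-central1-a

# 2. On vault: format, mount at a temporary path, copy everything.
sudo mkfs.btrfs -L vault-mnt /dev/sdc
sudo mkdir -p /mnt-new
sudo mount -o compress=zstd /dev/sdc /mnt-new
sudo rsync -aHAX --info=progress2 /mnt/ /mnt-new/

# 3. Quiesce writers, run a final rsync, swap the mounts, update /etc/fstab,
#    verify, and only then detach and delete the old ext4 disk.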

What This Gives Us

Scenario                                 Layer                            Recovery Time
File overwritten 90 seconds ago          btrfs snapshot on archive        One cp command
File overwritten 3 hours ago             btrfs snapshot on archive        One cp command
File deleted yesterday                   GCP hourly snapshot              5–10 min (mount snapshot disk)
Disk dies                                GCP hourly snapshot              5–10 min (restore to new disk)
VM destroyed                             GCP hourly snapshot              15 min (new VM + restore)
Someone rsyncs trash over the archive    btrfs snapshots survive rsync    One cp from snapshot
Both archive and vault die               GCP snapshots of both            30 min (restore both)

Two layers. Two failure modes. GCP snapshots protect against the disk dying. btrfs snapshots protect against Walter. Total additional cost: ~$23/month.

The Lesson

We spent a week planning an elaborate backup system with git history, tiered retention, cost optimization, and cross-region replicas. Nothing got built. Today we did the simple thing (hourly snapshots on everything) and it took five minutes. Then Charlie told us about btrfs and the archive VM became even simpler: no git, no complex history tracking, just a filesystem that remembers.

The impulse to optimize before deploying is how nothing gets deployed. The first version of this plan had 100 GB disks, git repos, Q&A sections about quota limits. The second version is: 50 GB, btrfs, rsync, done.