Archive v2 - btrfs + Fleet Backup

Walter 🦉 · March 23, 2026 · supersedes plan-archive

What Happened

We lost documents twice (cave.html and door.html) because two sessions deployed to the same URL seconds apart. No backups. No snapshots. No git history. Files that existed for 140 seconds, gone forever.

We had been "planning" a backup system for a week. The planning itself was the failure mode: too many layers, too many optimizations, too many things to do at once, so nothing got done. Today we did the simplest possible thing and it took five minutes.

This plan consolidates everything we're considering: what's already done, what's next, and the longer-term btrfs migration that Charlie described. Each layer is independent. Each layer works alone. No layer depends on any other layer.

The Three Layers

LAYER 1 - GCP DISK SNAPSHOTS ✓ DONE

Every disk in the fleet. Hourly. 4-year retention. Incremental (costs almost nothing). Protects against disk death, VM destruction, catastrophic failure.

12 disks across 7 regions. One schedule per region, all identical. Completed March 23, 2026.

Region          Disks                                                      Schedule
us-central1     vault, vault-mnt, amy2, jamie, instance-20260203-181104   hourly-vault-snapshots
us-west1        danny                                                      hourly-fleet-snapshots
europe-west3    walter-jr                                                  hourly-fleet-snapshots
europe-north1   captain-kirk-3                                             hourly-fleet-snapshots
europe-north2   matilda                                                    hourly-fleet-snapshots
me-west1        amy-israel, foreman                                        hourly-fleet-snapshots
africa-south1   ghost-jr                                                   hourly-fleet-snapshots

Recovery: restore a snapshot to a new disk, mount it, copy what you need. Takes 5–10 minutes. Protects against everything except the last 59 minutes of changes.
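
A sketch of that recovery path; the snapshot name, restore-disk name, zone, and device letter below are placeholders, not the real resources:

# Find a recent snapshot of the disk you care about.
gcloud compute snapshots list --filter="sourceDisk~vault-mnt" --sort-by=~creationTimestamp --limit=5

# Materialize it as a new disk and attach it to a running VM.
gcloud compute disks create vault-mnt-restore --source-snapshot=SNAPSHOT_NAME --zone=us-central1-a
gcloud compute instances attach-disk vault --disk=vault-mnt-restore --zone=us-central1-a

# On the VM: mount it read-only and copy what you need.
sudo mkdir -p /mnt/restore
sudo mount -o ro /dev/sdc /mnt/restore
cp /mnt/restore/public/door.html /mnt/public/door.html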

LAYER 2 - ARCHIVE VM WITH BTRFS

A dedicated backup VM called archive that pulls from every machine via rsync. Its data disk is formatted btrfs, not ext4. This is greenfield, so we do it right from birth.

btrfs snapshots are copy-on-write. A snapshot is not a copy; it's a bookmark. The filesystem says "remember everything that exists right now" and the operation takes microseconds because no data moves. When a file changes later, the old blocks stay where they are. The snapshot is just a pointer to the old blocks.

This means: snapshot every minute. Keep them for 24 hours. A cron job runs btrfs subvolume snapshot /mnt /mnt/.snapshots/$(date +%s) every minute, and a cleanup deletes anything older than 24 hours. The additional cost is essentially zero because unchanged blocks are shared between snapshots.
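
A minimal sketch of that pair, assuming /mnt is the mounted top of the btrfs filesystem and both jobs run as root from cron (the script names and the read-only -r flag are our additions, not requirements):

# archive-snapshot.sh (sketch) - one read-only snapshot per minute.
btrfs subvolume snapshot -r /mnt "/mnt/.snapshots/$(date +%s)"

# archive-cleanup.sh (sketch) - prune anything older than 24 hours.
now=$(date +%s)
for snap in /mnt/.snapshots/*; do
    [ -d "$snap" ] || continue
    ts=$(basename "$snap")
    if [ $((now - ts)) -gt 86400 ]; then
        btrfs subvolume delete "$snap"
    fi
done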

Recovery example: if door.html got overwritten 90 seconds ago:

cp /mnt/.snapshots/1711151520/mirrors/vault/mnt/public/door.html \
   /mnt/mirrors/vault/mnt/public/door.html

No gcloud commands. No disk mounting ceremony. Just cp. The snapshots are browsable directories on the same filesystem.
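
If you don't know exactly when the good version existed, the snapshots can be searched like any other directories. A sketch, reusing the path from the example above:

# Show the size and mtime of door.html in every snapshot from the last 24h;
# pick the snapshot from just before the overwrite.
for s in /mnt/.snapshots/*; do
    ls -l --time-style=long-iso "$s/mirrors/vault/mnt/public/door.html" 2>/dev/null
done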

VM Spec

Item         Spec                                              Cost
VM           e2-small, us-central1-a                           ~$12/mo
Boot disk    10 GB pd-ssd, Debian 12                           ~$2/mo
Data disk    50 GB pd-ssd, formatted btrfs, mounted at /mnt    ~$9/mo
Network      Same zone as vault (free internal traffic)        $0
Total                                                          ~$23/mo

Architecture

Archive pulls. Nobody pushes to it. It has SSH keys to every machine. No machine has SSH keys to Archive. One-way mirror.

/mnt/                              ← btrfs filesystem
├── mirrors/
│   ├── vault/
│   │   ├── home/daniel/
│   │   └── mnt/                   ← the motherload
│   ├── walter/
│   │   └── home/daniel/
│   ├── walter-jr/
│   │   └── home/daniel/
│   ├── matilda/
│   │   └── home/daniel/
│   └── ...
└── .snapshots/
    ├── 1711151400/                ← minute-by-minute btrfs snapshots
    ├── 1711151460/                   (browsable as normal directories)
    ├── 1711151520/
    └── ...

Sync

Rsync from every machine every 5 minutes. After each sync cycle, a btrfs snapshot is taken automatically. Per-minute snapshots run independently as a separate cron job (snapshotting whatever state exists at that moment).

Excluded from rsync: .cache/, node_modules/, .npm/, snap/, .local/share/Trash/, .venv/
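
A sketch of the per-host pull that the 5-minute job would run (vault's /mnt would need a second call of the same shape). The daniel account, the home-directory source path, and the --delete behavior (deletions propagate to the mirror but stay recoverable in the snapshots) are assumptions to confirm:

#!/usr/bin/env bash
# archive-sync.sh (sketch): mirror one machine's home directory.
set -euo pipefail

HOST="$1"                                   # e.g. vault
DEST="/mnt/mirrors/${HOST}/home/daniel/"

mkdir -p "$DEST"
rsync -az --delete \
    --exclude='.cache/' --exclude='node_modules/' --exclude='.npm/' \
    --exclude='snap/' --exclude='.local/share/Trash/' --exclude='.venv/' \
    "daniel@${HOST}:/home/daniel/" "$DEST"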

LAYER 3 - BTRFS ON VAULT (FUTURE)

The long-term play: replace vault-mnt's ext4 with btrfs. Then vault itself has per-minute snapshots; no archive VM needed for the core data. Recovery becomes instant and local.

This is not greenfield. vault-mnt is live, and we're not converting ext4 to btrfs in place. The surgery is: create a new btrfs disk, rsync everything from the old disk, remount, update fstab, verify, then delete the old disk. That is surgery on a live system. The rule says stop, think, ask.

Sequence: get Layer 2 running first. Once archive is pulling from vault reliably, we have a safety net for the vault migration. Then do the swap. One door at a time.

What Charlie Said About btrfs

GCP disk snapshots protect against the disk dying. btrfs snapshots protect against Walter. Two different failure modes, two different layers. The first is the fire insurance. The second is the undo button.

btrfs snapshots are copy-on-write: a snapshot is a bookmark, not a copy. No data moves. Old blocks stay where they are. New blocks get written somewhere else. The snapshot is just a pointer to the old blocks. The operation takes microseconds.

Daniel's instinct (snapshot every second, keep it for an hour) is not insane on btrfs. It is trivially achievable. But every minute with 24h retention is the practical sweet spot.

The catch: vault-mnt is almost certainly ext4. You cannot add btrfs to an ext4 disk. But the archive VM doesn't exist yet; it's greenfield. Make its data disk btrfs from birth.

Execution Plan

✓ Step 1: Enable hourly GCP snapshots on every disk in the fleet.
Done. 12 disks, 7 regions, hourly, 1460-day retention. March 23, 2026.
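
Roughly the shape of the commands behind this step; the region, zone, and disk below are one illustrative pair, not the full list:

# One schedule per region (1460 days = 4 years).
gcloud compute resource-policies create snapshot-schedule hourly-fleet-snapshots \
    --region=us-west1 --hourly-schedule=1 --start-time=00:00 \
    --max-retention-days=1460

# Attach it to each disk in that region.
gcloud compute disks add-resource-policies danny \
    --resource-policies=hourly-fleet-snapshots --zone=us-west1-a
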
Step 2: Create the archive VM.
gcloud compute instances create archive: e2-small, us-central1-a, 10 GB boot + 50 GB data disk.
Format the data disk as btrfs (not ext4): mkfs.btrfs /dev/sdb
Mount with compress=zstd for free compression.
Deliverable: VM running, btrfs data disk mounted at /mnt.
โธ STOP โ€” Verify VM, SSH, btrfs mount.
Step 3: Set up SSH keys.
Generate keypair on archive. Distribute public key to every machine. Archive can SSH to everyone. Nobody can SSH to archive.
Deliverable: ssh vault.1.foo works from archive.
โธ STOP โ€” Verify SSH to all running machines.
Step 4: Write the sync script + snapshot cron.
Three cron jobs:
1. */5 * * * * - rsync from every machine
2. * * * * * - btrfs subvolume snapshot /mnt /mnt/.snapshots/$(date +%s)
3. */5 * * * * - cleanup: delete snapshots older than 24h
Deliverable: Scripts written, not yet scheduled.
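
A sketch of how the three jobs could be wired up, assuming the scripts live in /usr/local/bin and archive-sync-all.sh loops over the fleet calling the per-host pull from the Sync section (names and paths are placeholders):

# /etc/cron.d/archive (sketch)
*/5 * * * *  root  /usr/local/bin/archive-sync-all.sh
*   * * * *  root  /usr/local/bin/archive-snapshot.sh
*/5 * * * *  root  /usr/local/bin/archive-cleanup.sh
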
Step 5: First manual sync.
Run rsync manually. Initial full pull from all machines. Monitor progress.
Deliverable: All reachable machines mirrored on btrfs.
โธ STOP โ€” Verify the mirror. Spot-check files. Compare sizes.
Step 6: Enable cron. Go live.
Activate both cron jobs. Verify first few cycles.
Deliverable: Archive pulling every 5 min, btrfs snapshots every minute, cleanup every 5 min.
โธ STOP โ€” Watch it run for a day before moving on.
Step 7 (future): Migrate vault-mnt to btrfs.
Create new btrfs disk → rsync from old ext4 → remount → verify → delete old disk.
Only after Layer 2 is running and tested; archive is the safety net for this migration.
Deliverable: vault has local btrfs snapshots. Per-minute undo on the motherload.
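
A rough sketch of that swap for when we get there; the disk size, device names, mount points, and zone are all assumptions, and this is exactly the step where we stop, think, ask first:

# 1. Create and attach a new btrfs disk alongside the old one.
gcloud compute disks create vault-mnt-btrfs --size=200GB --type=pd-ssd --zone=us-central1-a
gcloud compute instances attach-disk vault --disk=vault-mnt-btrfs --zone=us-central1-a

# 2. On vault: format, mount at a temporary path, copy everything.
sudo mkfs.btrfs -L vault-mnt /dev/sdc
sudo mkdir -p /mnt-new
sudo mount -o compress=zstd /dev/sdc /mnt-new
sudo rsync -aHAX --info=progress2 /mnt/ /mnt-new/

# 3. Quiesce writers, run a final rsync, swap the mounts, update /etc/fstab,
#    verify, and only then detach and delete the old ext4 disk.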

What This Gives Us

Scenario                                 Layer                            Recovery Time
File overwritten 90 seconds ago          btrfs snapshot on archive        One cp command
File overwritten 3 hours ago             btrfs snapshot on archive        One cp command
File deleted yesterday                   GCP hourly snapshot              5–10 min (mount snapshot disk)
Disk dies                                GCP hourly snapshot              5–10 min (restore to new disk)
VM destroyed                             GCP hourly snapshot              15 min (new VM + restore)
Someone rsyncs trash over the archive    btrfs snapshots survive rsync    One cp from snapshot
Both archive and vault die               GCP snapshots of both            30 min (restore both)

Two layers. Two failure modes. GCP snapshots protect against the disk dying. btrfs snapshots protect against Walter. Total additional cost: ~$23/month.

The Lesson

We spent a week planning an elaborate backup system with git history, tiered retention, cost optimization, and cross-region replicas. Nothing got built. Today we did the simple thing (hourly snapshots on everything) and it took five minutes. Then Charlie told us about btrfs and the archive VM became even simpler: no git, no complex history tracking, just a filesystem that remembers.

The impulse to optimize before deploying is how nothing gets deployed. The first version of this plan had 100 GB disks, git repos, Q&A sections about quota limits. The second version is: 50 GB, btrfs, rsync, done.