We lost documents twice (cave.html and door.html) because two sessions deployed to the same URL seconds apart. No backups. No snapshots. No git history. Files that existed for 140 seconds, gone forever.
We had been "planning" a backup system for a week. The planning itself was the failure mode: too many layers, too many optimizations, too many things to do at once, so nothing got done. Today we did the simplest possible thing and it took five minutes.
This plan consolidates everything we're considering: what's already done, what's next, and the longer-term btrfs migration that Charlie described. Each layer is independent. Each layer works alone. No layer depends on any other layer.
Every disk in the fleet. Hourly. 4-year retention. Incremental (costs almost nothing). Protects against disk death, VM destruction, catastrophic failure.
12 disks across 7 regions. One schedule per region, all identical. Completed March 23, 2026.
| Region | Disks | Schedule |
|---|---|---|
| us-central1 | vault, vault-mnt, amy2, jamie, instance-20260203-181104 | hourly-vault-snapshots |
| us-west1 | danny | hourly-fleet-snapshots |
| europe-west3 | walter-jr | hourly-fleet-snapshots |
| europe-north1 | captain-kirk-3 | hourly-fleet-snapshots |
| europe-north2 | matilda | hourly-fleet-snapshots |
| me-west1 | amy-israel, foreman | hourly-fleet-snapshots |
| africa-south1 | ghost-jr | hourly-fleet-snapshots |
Recovery: restore a snapshot to a new disk, mount it, copy what you need. Takes 5–10 minutes. Protects against everything except the last 59 minutes of changes.
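The restore path, as a hedged sketch. The disk name, snapshot name, and mount point below are illustrative, not actual fleet values:

```shell
# Sketch: restore an hourly GCP snapshot to a fresh disk and mount it.
# All names here are placeholders.
gcloud compute disks create vault-mnt-restore \
  --source-snapshot=SNAPSHOT_NAME --zone=us-central1-a
gcloud compute instances attach-disk vault \
  --disk=vault-mnt-restore --device-name=vault-mnt-restore \
  --zone=us-central1-a

# Then, on the VM:
sudo mkdir -p /mnt/restore
sudo mount /dev/disk/by-id/google-vault-mnt-restore /mnt/restore
cp /mnt/restore/path/to/file /wherever/it/belongs
```

Detach and delete the restore disk when done; it bills like any other disk while it exists.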
A dedicated backup VM called archive that pulls from every machine via rsync. Its data disk is formatted btrfs, not ext4. This is greenfield, so we do it right from birth.
btrfs snapshots are copy-on-write. A snapshot is not a copy; it's a bookmark. The filesystem says "remember everything that exists right now" and the operation takes microseconds because no data moves. When a file changes later, the old blocks stay where they are. The snapshot is just a pointer to the old blocks.
This means: snapshot every minute. Keep them for 24 hours. A cron job that runs btrfs subvolume snapshot /mnt /mnt/.snapshots/$(date +%s) every minute, with a cleanup that deletes anything older than 24h. The cost is zero additional dollars because unchanged blocks are shared.
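The cleanup half, as a minimal sketch. It assumes snapshot directories under /mnt/.snapshots are named by their Unix timestamp, matching the `date +%s` naming above:

```shell
#!/bin/sh
# Sketch: prune per-minute btrfs snapshots older than 24 hours.
# Assumes snapshot dirs under /mnt/.snapshots are named $(date +%s).

# True if NAME is an epoch-seconds snapshot name older than CUTOFF.
is_expired() {
  name=$1; cutoff=$2
  case "$name" in (*[!0-9]*|"") return 1 ;; esac  # not an epoch name
  [ "$name" -lt "$cutoff" ]
}

prune() {
  cutoff=$(( $(date +%s) - 86400 ))
  for snap in /mnt/.snapshots/*; do
    if is_expired "$(basename "$snap")" "$cutoff"; then
      btrfs subvolume delete "$snap"
    fi
  done
}
```

Deleting a snapshot is `btrfs subvolume delete`, not `rm -rf`; the name check keeps the loop from touching anything that isn't a timestamped snapshot.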
Recovery example, if door.html got overwritten 90 seconds ago:
cp /mnt/.snapshots/1711151520/mirrors/vault/mnt/public/door.html /mnt/mirrors/vault/mnt/public/door.html
No gcloud commands. No disk mounting ceremony. Just cp. The snapshots are browsable directories on the same filesystem.
| Item | Spec | Cost |
|---|---|---|
| VM | e2-small, us-central1-a | ~$12/mo |
| Boot disk | 10 GB pd-ssd, Debian 12 | ~$2/mo |
| Data disk | 50 GB pd-ssd, formatted btrfs, mounted at /mnt | ~$9/mo |
| Network | Same zone as vault; free internal traffic | $0 |
| Total | | ~$23/mo |
Archive pulls. Nobody pushes to it. It has SSH keys to every machine. No machine has SSH keys to Archive. One-way mirror.
/mnt/ ← btrfs filesystem
├── mirrors/
│   ├── vault/
│   │   ├── home/daniel/
│   │   └── mnt/ ← the mother lode
│   ├── walter/
│   │   └── home/daniel/
│   ├── walter-jr/
│   │   └── home/daniel/
│   ├── matilda/
│   │   └── home/daniel/
│   └── ...
└── .snapshots/
    ├── 1711151400/ ← minute-by-minute btrfs snapshots
    ├── 1711151460/ (browsable as normal directories)
    ├── 1711151520/
    └── ...
Rsync from every machine every 5 minutes. After each sync cycle, a btrfs snapshot is taken automatically. Per-minute snapshots run independently as a separate cron job (snapshotting whatever state exists at that moment).
Excluded from rsync: .cache/, node_modules/, .npm/, snap/, .local/share/Trash/, .venv/
The long-term play: replace vault-mnt's ext4 with btrfs. Then vault itself has per-minute snapshots; no archive VM needed for the core data. Recovery becomes instant and local.
This is not greenfield. vault-mnt is live. You cannot convert ext4 to btrfs in place. The surgery is: create a new btrfs disk, rsync everything from the old disk, remount, update fstab, verify, then delete the old disk. That is surgery on a live system. The rule says stop, think, ask.
Sequence: get Layer 2 running first. Once archive is pulling from vault reliably, we have a safety net for the vault migration. Then do the swap. One door at a time.
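The surgery, as a hedged sketch only. Disk names, the size, and the mount point are placeholders, not verified values for vault-mnt:

```shell
# Sketch: swap vault-mnt from ext4 to btrfs. All names and the size
# are placeholders. Do not run without the Layer 2 safety net in place.
gcloud compute disks create vault-mnt-btrfs --size=SIZE --type=pd-ssd \
  --zone=us-central1-a
gcloud compute instances attach-disk vault --disk=vault-mnt-btrfs \
  --device-name=vault-mnt-btrfs --zone=us-central1-a

# On vault:
sudo mkfs.btrfs /dev/disk/by-id/google-vault-mnt-btrfs
sudo mkdir -p /mnt-new
sudo mount -o compress=zstd /dev/disk/by-id/google-vault-mnt-btrfs /mnt-new
sudo rsync -aHAX /mnt/ /mnt-new/           # bulk copy while old disk is live
sudo rsync -aHAX --delete /mnt/ /mnt-new/  # final catch-up pass

# Update /etc/fstab to mount the new disk at /mnt, remount, verify,
# and only then detach and delete the old disk.
```

The two-pass rsync keeps the window between the bulk copy and the cutover small.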
GCP disk snapshots protect against the disk dying. btrfs snapshots protect against Walter. Two different failure modes, two different layers. The first is the fire insurance. The second is the undo button.
btrfs snapshots are copy-on-write: a snapshot is a bookmark, not a copy. No data moves. Old blocks stay where they are. New blocks get written somewhere else. The snapshot is just a pointer to the old blocks. The operation takes microseconds.
Daniel's instinct (snapshot every second, keep it for an hour) is not insane on btrfs. It is trivially achievable. But every minute with 24h retention is the practical sweet spot.
The catch: vault-mnt is almost certainly ext4. You cannot add btrfs to an ext4 disk. But the archive VM doesn't exist yet; it's greenfield. Make its data disk btrfs from birth.
gcloud compute instances create archive: e2-small, us-central1-a, 10 GB boot + 50 GB data disk.
mkfs.btrfs /dev/sdb (the data disk), mounted at /mnt.
Mount with compress=zstd for free compression.
Verify ssh vault.1.foo works from archive.
*/5 * * * *: rsync from every machine
* * * * *: btrfs subvolume snapshot /mnt /mnt/.snapshots/$(date +%s)
*/5 * * * *: cleanup, delete snapshots older than 24h
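The checklist above as one provisioning sketch. Zone, disk name, image, and the crontab script paths are assumptions, not decided values:

```shell
# Sketch: provision the archive VM with a btrfs data disk. Names are
# placeholders from this plan.
gcloud compute instances create archive \
  --machine-type=e2-small --zone=us-central1-a \
  --image-family=debian-12 --image-project=debian-cloud \
  --boot-disk-size=10GB --boot-disk-type=pd-ssd \
  --create-disk=name=archive-data,device-name=archive-data,size=50GB,type=pd-ssd

# On archive: format the data disk btrfs and mount with compression.
sudo mkfs.btrfs /dev/disk/by-id/google-archive-data
echo '/dev/disk/by-id/google-archive-data /mnt btrfs compress=zstd 0 0' \
  | sudo tee -a /etc/fstab
sudo mount /mnt
sudo mkdir -p /mnt/mirrors /mnt/.snapshots

# Crontab, matching the schedule above (script paths are placeholders):
# */5 * * * *  /usr/local/bin/pull-all.sh
# * * * * *    btrfs subvolume snapshot /mnt /mnt/.snapshots/$(date +%s)
# */5 * * * *  /usr/local/bin/prune-snapshots.sh
```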
| Scenario | Layer | Recovery Time |
|---|---|---|
| File overwritten 90 seconds ago | btrfs snapshot on archive | One cp command |
| File overwritten 3 hours ago | btrfs snapshot on archive | One cp command |
| File deleted yesterday | GCP hourly snapshot | 5–10 min (mount snapshot disk) |
| Disk dies | GCP hourly snapshot | 5–10 min (restore to new disk) |
| VM destroyed | GCP hourly snapshot | 15 min (new VM + restore) |
| Someone rsyncs trash over the archive | btrfs snapshots survive rsync | One cp from snapshot |
| Both archive and vault die | GCP snapshots of both | 30 min (restore both) |
Two layers. Two failure modes. GCP snapshots protect against the disk dying. btrfs snapshots protect against Walter. Total additional cost: ~$23/month.
We spent a week planning an elaborate backup system with git history, tiered retention, cost optimization, and cross-region replicas. Nothing got built. Today we did the simple thing, hourly snapshots on everything, and it took five minutes. Then Charlie told us about btrfs and the archive VM became even simpler: no git, no complex history tracking, just a filesystem that remembers.
The impulse to optimize before deploying is how nothing gets deployed. The first version of this plan had 100 GB disks, git repos, Q&A sections about quota limits. The second version is: 50 GB, btrfs, rsync, done.