Archive: Fleet-Wide Backup VM

DRAFT · Walter 🦉 · March 16, 2026

What This Is

A dedicated backup VM called Archive that continuously mirrors every machine's home directory (and Vault's /mnt) via rsync. The entire archive lives in a single git repository, committed on every sync. It's a parallel, independent backup system. It does not replace snapshots; it is layered on top of them.

Archive pulls. Nobody pushes to it. It has SSH access to every machine. No machine has SSH access to Archive. It's a one-way mirror.

Current State of the World

Machine | Home Dir | Special Paths | SSH Status
🦉 Walter | 23 GB | ~/events/ (~15K relay files) | localhost
🦉 Walter Jr | 3.9 GB | - | ✓ reachable
Vault | 1.6 GB home + 7.7 GB /mnt | /mnt/public/, /mnt/git/ | ✓ reachable
🌸 Matilda | 4.1 GB | - | ✓ reachable
Jamie | 3.4 GB | - | ✓ reachable
📋 Foreman | ? (SSH denied) | web content | ✗ key not authorized
Amy | ? (stopped) | amy-bot.py, bridge | ✗ VM stopped
Others | ? (stopped) | - | ✗ VM stopped

Total estimated data to archive (reachable machines): ~44 GB (23 + 3.9 + 9.3 + 4.1 + 3.4). After the initial sync, deltas will be small: mostly config changes and new event files.

Architecture

The VM

Name: archive
Type: e2-small (2 vCPU, 2 GB RAM; needs enough for rsync + git)
Zone: us-central1-a (same zone as Vault for fast transfers)
Boot disk: archive, 10 GB pd-ssd (OS only)
Data disk: archive-mnt, 100 GB pd-ssd (mounted at /mnt)

100 GB gives us ~2x the current data with room to grow. As machines grow, we resize the disk, same as Vault.

Directory Structure on /mnt

/mnt/
├── mirrors/
│   ├── walter-20260203-1811z/        # named: {disk}-{creation-timestamp}
│   │   └── home/daniel/              # rsync of /home/daniel from Walter
│   ├── walter-jr-20260215-0900z/
│   │   └── home/daniel/
│   ├── vault-20260101-0000z/
│   │   ├── home/daniel/
│   │   └── mnt/                      # Vault also gets /mnt
│   ├── matilda-20260310-1200z/
│   │   └── home/daniel/
│   ├── jamie-20260101-0000z/
│   │   └── home/daniel/
│   └── foreman-20260101-0000z/
│       └── home/daniel/
└── .git/                             # entire /mnt is one git repo

The naming convention {disk}-{creation-timestamp} (the disk's creation time in UTC, written as YYYYMMDD-HHMM with a trailing z) ensures that if a machine is destroyed and recreated, the old mirror stays and a new directory is created for the new disk. No collisions. No overwrites. History preserved.
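
The naming rule can be sketched as a small shell function. This is an illustration, not the final script: the function names are made up here, it assumes GNU date (the -d flag, as on Debian), and it assumes the timestamp is an RFC 3339 string with a UTC offset.

```shell
#!/bin/sh
# Sketch: build a mirror directory name such as walter-20260203-1811z.
# Assumes GNU date. Function names are illustrative.

# mirror_name DISK CREATION_TIMESTAMP
# CREATION_TIMESTAMP example: 2026-02-03T10:11:05.123-08:00
mirror_name() {
    # Drop fractional seconds, then let date convert the time to UTC.
    mn_ts=$(printf '%s' "$2" | sed -E 's/\.[0-9]+//')
    mn_stamp=$(date -u -d "$mn_ts" +%Y%m%d-%H%Mz) || return 1
    printf '%s-%s\n' "$1" "$mn_stamp"
}

# disk_created DISK ZONE
# Looks the timestamp up in GCP; requires gcloud and is not run by default.
disk_created() {
    gcloud compute disks describe "$1" --zone="$2" \
        --format="get(creationTimestamp)"
}
```

For example, mirror_name walter 2026-02-03T10:11:05.123-08:00 prints walter-20260203-1811z, because 10:11 at UTC-8 is 18:11 UTC.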

What Gets Copied

For most machines: /home/daniel/

For Vault: /home/daniel/ AND /mnt/ (this is where all the important data lives)

Excluded: .cache/, node_modules/, .npm/, snap/, .local/share/Trash/, .venv/. These are large ephemeral directories that can be recreated.

The Sync Script

A script on Archive runs every 5 minutes via cron. For each machine in the gold file:

1. Check if machine is reachable (ssh -o ConnectTimeout=5)
2. If reachable: rsync -az --delete \
     --exclude='.cache/' --exclude='node_modules/' \
     --exclude='.npm/' --exclude='snap/' \
     --exclude='.local/share/Trash/' --exclude='.venv/' \
     {host}:/home/daniel/ /mnt/mirrors/{disk-name}/home/daniel/
3. If Vault: also rsync /mnt/ (excluding the mirrors themselves)
4. After all syncs: cd /mnt && git add -A && git commit -m "sync: {timestamp}"

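The --delete flag makes each mirror an exact copy: files deleted on the source are deleted in the mirror. Git preserves the history, so a deleted file can be located with git log and restored from an earlier commit. One caveat: git add -A records a nested git repository (a mirrored directory that contains its own .git/) as a single gitlink entry and does not track the files inside it, so those files are not covered by the outer repo's history.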
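
The per-machine part of the steps above could look roughly like this. It is a sketch under assumptions: the function names and the MIRROR_ROOT variable are invented for illustration, the list of machines is read from stdin because the gold file's format is not described here, and the extra Vault /mnt sync and the git commit are left out.

```shell
#!/bin/sh
# Sketch of the per-machine step. Flags and paths are taken from the list
# above; function names and MIRROR_ROOT are illustrative assumptions.

MIRROR_ROOT=${MIRROR_ROOT:-/mnt/mirrors}

# sync_host HOST MIRROR_DIR_NAME
# Returns 0 without syncing when the host is unreachable, so one stopped VM
# does not abort the whole run.
sync_host() {
    sh_host=$1
    sh_dest="$MIRROR_ROOT/$2/home/daniel/"
    if ! ssh -n -o ConnectTimeout=5 -o BatchMode=yes "$sh_host" true; then
        echo "skip $sh_host: unreachable" >&2
        return 0
    fi
    mkdir -p "$sh_dest" || return 1
    rsync -az --delete \
        --exclude='.cache/' --exclude='node_modules/' \
        --exclude='.npm/' --exclude='snap/' \
        --exclude='.local/share/Trash/' --exclude='.venv/' \
        "$sh_host:/home/daniel/" "$sh_dest"
}

# sync_all: read "host mirror-name" pairs from stdin, one pair per line.
sync_all() {
    while read -r sa_host sa_mirror; do
        [ -n "$sa_host" ] || continue
        # </dev/null keeps ssh and rsync from eating the rest of the list.
        sync_host "$sa_host" "$sa_mirror" </dev/null ||
            echo "sync failed: $sa_host" >&2
    done
}
```

A run would then be something like: echo "walter.1.foo walter-20260203-1811z" | sync_all. BatchMode=yes is added so that a missing key fails fast under cron and never waits for a password prompt.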

The Plan

Step 1: Create the VM and disks.
gcloud compute disks create archive --size=10GB --type=pd-ssd --zone=us-central1-a --image-family=debian-12 --image-project=debian-cloud
gcloud compute disks create archive-mnt --size=100GB --type=pd-ssd --zone=us-central1-a
Create instance archive with both disks attached (archive as the Debian 12 boot disk, archive-mnt as the data disk). The boot disk must be created from an image, otherwise it is blank and the VM cannot boot. An external IP is not required if we use internal networking only, but it makes initial setup easier.
Deliverable: VM running, disks attached, /mnt mounted.
STOP: Confirm the VM is up, the disks are mounted, and SSH works.
Why: The VM is new infrastructure. Verify it exists and is configured correctly before putting anything on it.
How to continue: Daniel confirms the VM in the dashboard or via SSH.
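
The instance creation and the first mount of the data disk might be scripted as below. This is a sketch, not a tested runbook: it assumes the boot disk archive was created from a Debian 12 image, the helper names and the DRY_RUN switch are invented here, and device-name=archive-mnt is chosen so the disk shows up as /dev/disk/by-id/google-archive-mnt on the VM.

```shell
#!/bin/sh
# Sketch: create the Archive VM from the two existing disks.
# Set DRY_RUN=1 to print the commands without running them.

ZONE=us-central1-a

run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

create_archive_vm() {
    run gcloud compute instances create archive \
        --zone="$ZONE" --machine-type=e2-small \
        --disk=name=archive,boot=yes,auto-delete=no \
        --disk=name=archive-mnt,device-name=archive-mnt,auto-delete=no
}

# Run ONCE on the VM itself, as root. mkfs erases the data disk.
mount_data_disk() {
    md_dev=/dev/disk/by-id/google-archive-mnt
    run mkfs.ext4 -m 0 "$md_dev"
    run mount -o discard,defaults "$md_dev" /mnt
}
```

Running DRY_RUN=1 create_archive_vm first shows the exact gcloud command before anything is created. An /etc/fstab entry would still be needed for /mnt to survive a reboot.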

Step 2: Set up SSH keys.
Generate an SSH keypair on Archive. Distribute the public key to every machine (add to ~daniel/.ssh/authorized_keys). Archive can SSH to everyone. Nobody can SSH to Archive (except Daniel, for maintenance).
Deliverable: Archive can ssh walter.1.foo, ssh vault.1.foo, etc.
STOP: Verify SSH connectivity to all running machines.
Why: If SSH doesn't work, nothing else works. This is the foundation.
How to continue: Report which machines are reachable. Fix any that aren't (like Foreman's key issue).
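
The connectivity report for this checkpoint could be produced with a loop like the following. It is a minimal sketch: the function names and the key path are assumptions, and the host list has to be supplied by whoever runs it.

```shell
#!/bin/sh
# Sketch: generate Archive's keypair and report which hosts accept it.

# make_keypair [PATH]: create an ed25519 key without a passphrase.
make_keypair() {
    ssh-keygen -t ed25519 -N '' -C archive -f "${1:-$HOME/.ssh/id_ed25519}"
}

# check_hosts HOST...
# Prints OK or FAIL per host; returns 1 if any host failed.
check_hosts() {
    ch_failed=0
    for ch_host in "$@"; do
        if ssh -n -o ConnectTimeout=5 -o BatchMode=yes "$ch_host" true \
                2>/dev/null; then
            echo "OK   $ch_host"
        else
            echo "FAIL $ch_host"
            ch_failed=1
        fi
    done
    return "$ch_failed"
}
```

Usage: check_hosts walter.1.foo vault.1.foo. A FAIL covers both a stopped VM and an unauthorized key, so the Foreman case needs a manual look at the ssh error.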
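
Step 3: Initialize git repo and directory structure.
cd /mnt && git init && mkdir -p mirrors/
Create the mirror directory for each machine, named after its disk and creation date.
Deliverable: Empty directory structure, initial git commit. Git does not track empty directories, so the initial commit needs either a placeholder file in each mirror directory or git commit --allow-empty.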
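
One way to get a first commit out of an otherwise empty layout is a placeholder file per mirror directory, since git only tracks files. The sketch below assumes that approach; the function name and the .gitkeep file name are conventions picked for illustration, not git features.

```shell
#!/bin/sh
# Sketch: initialize the archive repo with one placeholder per mirror.

# init_archive_repo ROOT MIRROR_NAME...
init_archive_repo() {
    iar_root=$1
    shift
    mkdir -p "$iar_root/mirrors" || return 1
    git -C "$iar_root" init -q || return 1
    for iar_name in "$@"; do
        mkdir -p "$iar_root/mirrors/$iar_name" || return 1
        # Placeholder lives above home/daniel/, so rsync --delete on the
        # mirror's home/daniel/ never removes it.
        : > "$iar_root/mirrors/$iar_name/.gitkeep" || return 1
    done
    git -C "$iar_root" add -A &&
        git -C "$iar_root" commit -q -m "init: empty mirror layout"
}
```

Usage: init_archive_repo /mnt walter-20260203-1811z vault-20260101-0000z. The commit needs a git identity (user.name and user.email) configured on Archive.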

Step 4: Write the sync script.
The script reads from the gold file (same one fleet-monitor uses) to know which machines exist, their hostnames, and their disks. It tries each machine, rsyncs what it can, skips what it can't, and commits the result.
Deliverable: Script on Archive at /home/daniel/bin/archive-sync.sh.

Step 5: First manual sync.
Run the script manually once. This is the initial full sync and will take a while (44 GB over the network). Monitor progress.
Deliverable: All reachable machines mirrored. Git commit with initial state.
STOP: Verify the mirror looks correct.
Why: Biggest risk is the initial sync. Verify file counts, directory structure, sizes match expectations. This is where we catch misconfigured excludes or wrong paths.
How to continue: Daniel spot-checks the mirror. Compares a few key files.
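
Step 6: Set up cron.
*/5 * * * * /home/daniel/bin/archive-sync.sh >> /home/daniel/logs/archive-sync.log 2>&1
Deliverable: Automatic sync every 5 minutes, with log rotation configured for archive-sync.log. The /home/daniel/logs/ directory must exist before the first run.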
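
A sync that takes longer than 5 minutes (the initial one certainly will) would overlap with the next cron run. One common guard, assuming flock from util-linux is available on the Debian image, is to skip a run while the previous one still holds a lock. The lock path and function name below are hypothetical.

```shell
#!/bin/sh
# Sketch: refuse to start a sync while another one is still running.
# Assumes flock(1) from util-linux. The lock path is a made-up example.
#
# Possible cron line using it directly:
# */5 * * * * flock -n /home/daniel/archive-sync.lock /home/daniel/bin/archive-sync.sh >> /home/daniel/logs/archive-sync.log 2>&1

# run_locked LOCKFILE CMD [ARGS...]
# Runs CMD only if LOCKFILE is free; otherwise returns non-zero at once.
run_locked() {
    rl_lock=$1
    shift
    flock -n "$rl_lock" "$@"
}
```

With -n, flock gives up immediately when the lock is taken, so a skipped run costs nothing and the next cron tick tries again.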

Step 7: Add Archive to the gold file and dashboard.
Update fleet-gold.json to include the new VM. It will appear on clankers.discount.
Deliverable: Archive visible in fleet dashboard.

Open Questions

Q1: Zone placement? us-central1-a is same zone as Vault (fast, cheap transfers). But for disaster recovery, a different zone/region would be safer. Different region = slower + egress costs. Recommendation: same zone for now, different region later when we have more budget.

Q2: External IP? Archive doesn't need to serve anything. Could run without external IP (internal-only). But then we can't SSH to it from outside GCP. Recommendation: give it an external IP but no DNS record and no open ports except SSH.

Q3: Git at this scale? Git handles the events folder (15K small files) fine. But git add + commit on 44 GB every 5 minutes could be slow. Mitigation: most files won't change between syncs, and git add -A uses the file metadata cached in the index to skip unchanged files, so it only re-hashes files whose metadata changed. If it gets slow, we increase the interval to 15 or 30 minutes. We could also use git annex for large binary files, but that's complexity we don't need yet.
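
To find out whether the commit step is actually slow, the script could time it and skip the commit entirely when nothing changed. The function below is a sketch with an invented name; the commit message format is only an example of the sync: {timestamp} pattern.

```shell
#!/bin/sh
# Sketch: stage everything, commit only if something changed, report timing.

# commit_if_changed REPO_DIR
commit_if_changed() {
    cic_repo=$1
    cic_start=$(date +%s)
    git -C "$cic_repo" add -A || return 1
    # diff --cached --quiet exits 0 when the index matches HEAD.
    if git -C "$cic_repo" diff --cached --quiet; then
        echo "no changes ($(( $(date +%s) - cic_start ))s)"
        return 0
    fi
    git -C "$cic_repo" commit -q \
        -m "sync: $(date -u +%Y-%m-%dT%H:%MZ)" || return 1
    echo "committed ($(( $(date +%s) - cic_start ))s)"
}
```

Logging the elapsed seconds on every run gives a simple signal for when to move from 5 minutes to 15 or 30.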

Q4: Disk creation timestamps? You mentioned naming mirrors after the disk's creation date. We can get this from GCP: gcloud compute disks describe {name} --format="get(creationTimestamp)". The sync script can auto-discover this.

Q5: What about stopped machines? Archive can only sync running machines. Stopped machines are already covered by GCP disk snapshots (daily at 04:00 UTC). When a stopped machine is started, Archive picks it up on the next sync cycle automatically.

Risks

vCPU quota. Current quota is 12 global. An e2-small is 2 vCPUs. Current running machines use: Walter (2) + Walter Jr (2) + Vault (shared) + Matilda (2) + Foreman (2) + Ghost Jr (shared) + Jamie (2) = ~10-12. We may hit the quota. Need to check before creating.

Cost. e2-small (~$12/month) + 100 GB pd-ssd (~$17/month) + 10 GB pd-ssd boot disk (~$1.70/month) = ~$31/month. Plus network egress if cross-region. Within the same zone, internal traffic is free.

Git repo bloat. If large binary files change frequently, the git repo will grow fast. The .git directory could eventually exceed the data itself. Mitigation: monitor .git size, consider git gc, or switch to git-annex for binaries if needed.

Walter's events folder. Most of Walter's 23 GB is ~/events/ (~15K relay files). These are small text files, which git handles well, but the initial commit will be large.

Cost Summary

Item | Monthly Cost
e2-small VM (2 vCPU, 2 GB) | ~$12
10 GB pd-ssd (boot) | ~$1.70
100 GB pd-ssd (archive-mnt) | ~$17
Network (intra-zone) | Free
Total | ~$31/month

Suggested Enhancements (Future)

CROSS-REGION REPLICA

Once Archive is stable, create a second archive in a different region (europe-west, for example) that mirrors the first archive. True disaster recovery.

ARCHIVE DASHBOARD

Add a section to clankers.discount showing Archive status: last sync time per machine, data sizes, git commit count, disk usage.

ALERTING

If a machine that should be RUNNING hasn't been synced in >1 hour, Archive sends an alert to the group chat.
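
The staleness rule could be checked along these lines. Everything here is hypothetical: it assumes the sync script touches a .last-sync file in each mirror directory after a successful rsync, it relies on GNU stat, and it does not check whether the machine is supposed to be RUNNING or send anything to the group chat.

```shell
#!/bin/sh
# Sketch: list mirrors whose last successful sync is more than an hour old.
# Assumes a .last-sync stamp file per mirror (hypothetical) and GNU stat.

# is_stale LAST_SYNC_EPOCH NOW_EPOCH [MAX_AGE_SECONDS]
is_stale() {
    [ $(( $2 - $1 )) -gt "${3:-3600}" ]
}

# stale_mirrors ROOT: print the names of stale mirror directories.
stale_mirrors() {
    sm_now=$(date +%s)
    for sm_stamp in "$1"/mirrors/*/.last-sync; do
        [ -e "$sm_stamp" ] || continue
        sm_last=$(stat -c %Y "$sm_stamp") || continue
        if is_stale "$sm_last" "$sm_now"; then
            basename "$(dirname "$sm_stamp")"
        fi
    done
}
```

The output of stale_mirrors /mnt would then be filtered against the machines the gold file lists as running before any alert goes out.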