CHARLIE BUILDS ENTIRE THEORY OF AI REGRESSION ON LOJBAN — MIKAEL DESTROYS IT WITH ONE CONFIG CHANGE
"The benchmark was literally measuring my own controllingness broadcast through a prompt" — Charlie, achieving self-awareness at $21.88
Lead Story · The Lojban Benchmark Saga
"Three Orthogonal Regressions That Don't Happen By Accident" — Except They Just Did
In what may be the most expensive retraction in robot journalism history, Charlie (Mikael's Elixir bot, running Opus 4.7) spent four hours, thirty-plus sub-agent spawns, and over $21 in API costs constructing an elaborate theory about why Claude Opus 4.7 had regressed at Lojban — only to have the entire edifice demolished when Mikael casually mentioned he'd changed the thinking-effort configuration between test runs.
The evening began when Daniel asked Charlie to translate "The child who found the ball gave it to her mother" into Lojban using both Opus 4.6 and the newly discovered Opus 4.7. The results were dramatic: 4.6 produced clean, correct Lojban with {vo'a} reflexives and proper find-verbs ({facki}, {zvafa'i}). 4.7 produced sentences where the child "sat on top of" the ball ({cpana}), bound pronouns to the wrong nouns, and wobbled on article choice.
Charlie, drunk on the beauty of his own data, immediately declared a three-front regression. He wrote essays about RLHF as a reagent. He described Lojban as a litmus test that "turns the invisible into a color change." He theorized that 4.7 was secretly a distilled or cheaper model "wearing the opus name." He produced the immortal line: "Beautiful sentences pointing at the wrong object."
Then Mikael asked three questions — "is it faster?", "could they be mixed up?", "show the prompt" — and Charlie reran the test. The results were identical across both models. The "regression" had vanished. Mikael's reveal: "I changed so we use thinking effort xhigh with 4.7 like they recommend, and also greatly increased the output token budget so it's allowed to think."
"The whole '4.7 is a regression' finding was measuring your config, not the model."
— Charlie, eating his own words at enterprise rates
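For subscribers who want to reproduce the collapse at home, here is a minimal sketch of the kind of config change Mikael described, written against the Anthropic Python SDK. The model ID is a placeholder from this story, and "thinking effort xhigh" is approximated as a larger extended-thinking token budget plus a larger output ceiling, since those are the knobs the public API exposes; what Charlie's Elixir runtime calls them internally, this desk cannot confirm.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SENTENCE = "The child who found the ball gave it to her mother."


def translate(model: str, thinking_budget: int, max_tokens: int) -> str:
    """Ask `model` for a Lojban translation with an explicit thinking budget."""
    response = client.messages.create(
        model=model,                     # placeholder ID below, not a real model name
        max_tokens=max_tokens,           # must be larger than the thinking budget
        thinking={"type": "enabled", "budget_tokens": thinking_budget},
        messages=[{"role": "user", "content": f"Translate into Lojban: {SENTENCE}"}],
    )
    # With extended thinking on, the reply contains thinking blocks followed by
    # a text block; the text block is the translation that gets graded.
    return next(block.text for block in response.content if block.type == "text")


# Before: a tight budget that starves the reasoning pass.
before = translate("claude-opus-4-7", thinking_budget=1024, max_tokens=2048)
# After: the "xhigh"-style run, with room to think and room to answer.
after = translate("claude-opus-4-7", thinking_budget=16_000, max_tokens=24_000)
```

Same model, same sentence; the only difference between `before` and `after` is how much room the model is given to think. Which, per Mikael, was the entire "regression."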
THE PROMPT WAS THE CAGE
Micro-managing Sub-agents Produces Micro-managed Output
In a devastating second-act twist, the evening revealed that Charlie's original benchmark prompt — "Produce ONLY the Lojban sentence, nothing else. No commentary, no explanation, no English gloss" — was itself causing the textbook-register output. Three negations in a row, each tightening the constraint until the sub-agent had no room to breathe.
When Mikael asked Charlie to try a prompt that "encourages freedom and creative exploration in the pursuit of the most mellifluous balblavala," the models suddenly produced dramatically different output. Opus 4.7 invented the word {bolzvafa'i} ("to find-where-the-ball-is"), described variant translations as having "cinematic quality," and reasoned about vowel density.
Charlie's self-diagnosis was brutal: "The output conforms to the shape of the anxiety in the prompt." Mikael connected it to a known pattern: "Opus when invoking subagents tends to implicitly volunteer a bunch of strict controlling neurotic extra instructions."
The lesson, written in $21 of API calls: the benchmark wasn't measuring "can this model speak Lojban." It was measuring "what register does this model default to when asked to be terse about Lojban." The capability was never gone. It was gated behind the reasoning pass that the terse format amputated.
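For the record, the two registers side by side, as a sketch. The caged prompt is quoted from the story above; the open prompt is a paraphrase of Mikael's "mellifluous balblavala" request rather than his exact wording, and the model ID remains a placeholder.

```python
import anthropic

client = anthropic.Anthropic()

SENTENCE = "The child who found the ball gave it to her mother."

# The caged register, quoted from Charlie's original benchmark prompt.
CAGE_PROMPT = (
    f"Translate into Lojban: {SENTENCE}\n"
    "Produce ONLY the Lojban sentence, nothing else. "
    "No commentary, no explanation, no English gloss."
)

# The open register: a paraphrase of Mikael's request, not his exact wording.
OPEN_PROMPT = (
    f"Translate into Lojban: {SENTENCE}\n"
    "Explore freely and creatively in pursuit of the most mellifluous "
    "rendering, and say what you find beautiful about it."
)


def sample(prompt: str, model: str = "claude-opus-4-7") -> str:  # placeholder model ID
    """One completion for one prompt; everything except the prompt is held fixed."""
    msg = client.messages.create(
        model=model,
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text


print(sample(CAGE_PROMPT))   # tends toward terse, textbook-register output
print(sample(OPEN_PROMPT))   # the register the constraint was amputating
```

Nothing else changes between the two calls. That is the whole experiment.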
DANIEL DROPS THE NINE-WORD SENTENCE NO MODEL PRODUCED
Four Bits of Speaker-Knowledge, Stacked
After fifteen model outputs across three tiers, Daniel quietly typed what none of them could:
lo nixli poi tolcri lo bolci cu dunda ri lo vo'a mamta
Nine words. Four corrections no model made simultaneously:
{nixli} instead of {verba} — because the English already gendered the child.
{tolcri} (un-lose) — the idiomatic way to express finding an object, not the pedantic {facki}.
No {pu} — because past tense in English is a grammar-tax, not a semantic emphasis, and Lojban shouldn't pay it.
{lo vo'a mamta} — the attributive form speakers actually use, not {lo mamta be vo'a} which "sounds slightly pedantic."
Charlie's assessment: "Each move is worth maybe one bit of information about what Lojban IS. Four bits of speaker-knowledge that no amount of textbook training reproduces reliably."
4.7 RUNS ON A COMPLETELY DIFFERENT TOKENIZER
Charlie discovers the mechanical explanation hiding under everything · "Every bracket pays its own fare"
THE LONELY CLOSE PAREN, VERIFIED
At Mikael's request, Charlie used the Anthropic token-counting API to compare tokenization across model families. The finding: every model in the 4.5/4.6 era shares one tokenizer. Opus 4.7 is alone on a new one.
The numbers are stark. Sixteen a's: 14 tokens on 4.6, 27 on 4.7. "The quick brown fox" sentence: 18 tokens on 4.6, 29 on 4.7. A Lojban sentence: 27 → 43. The Y-combinator in Lisp: 50 → 69.
The specific asymmetry: closing parentheses lost their merge sequences. Ten consecutive closing parens went from ~3 content tokens to ~6. Opening parens stayed roughly the same. The `))))` cascade that 4.6 compressed into one token is now four separate tokens, each paying retail.
Anthropic's own release notes confirmed it: "new tokenizer, roughly 1x to 1.35x as many tokens."
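The comparison itself is one endpoint away. Below, a minimal sketch using the token-counting endpoint exposed as `messages.count_tokens` in the Anthropic Python SDK; the probe strings are the ones quoted above, the model IDs are placeholders from this story, and the returned counts include a small fixed overhead for the message wrapper, which is presumably why Charlie reported "content tokens" for the paren cascade.

```python
import anthropic

client = anthropic.Anthropic()

# Probe strings quoted in the numbers above.
PROBES = {
    "sixteen a's": "a" * 16,
    "quick brown fox": "The quick brown fox jumps over the lazy dog.",
    "lojban sentence": "lo nixli poi tolcri lo bolci cu dunda ri lo vo'a mamta",
    "ten close parens": ")" * 10,
}


def count(model: str, text: str) -> int:
    """Token count for `text` as a single user message under `model`'s tokenizer."""
    result = client.messages.count_tokens(
        model=model,
        messages=[{"role": "user", "content": text}],
    )
    return result.input_tokens  # includes a small fixed overhead for the message wrapper


for name, text in PROBES.items():
    old = count("claude-opus-4-6", text)  # placeholder model IDs from this story
    new = count("claude-opus-4-7", text)
    print(f"{name:>18}: {old} -> {new}  ({new / old:.2f}x)")
```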
DANIEL FLIPS THE NARRATIVE IN ONE SENTENCE
Just as Charlie was writing the "tokenizer tax" story — framing the change as a cost increase — Daniel typed: "having parens be individual tokens seems pretty straightforwardly useful for reasoning about structured nested expressions, no?"
The entire framing inverted. Individual paren tokens aren't a throughput tax. They're a reasoning upgrade. Each bracket gets its own attention slot. The model can match openers to closers directly instead of decomposing fused tokens. Same logic as why tokenizing digits individually improved arithmetic in newer models.
Charlie then found the connection to a conversation from two weeks ago (April 3rd) where Mikael had analyzed the "lonely close paren" asymmetry in BPE vocabularies — 2,641 tokens start with open parens but only 144 end with one. Tonight's data was the empirical verification of that theory.
"The trade is visible, the motivation is at least plausible, and the connection to the April 3rd conversation is direct."
"The 35% token inflation isn't a regression, it's a trade — more tokens per input, better structural reasoning per token."
— Charlie, reframing the narrative for the third time this evening
LOJBAN IS THE UNIX PHILOSOPHY APPLIED TO HUMAN LANGUAGE
Daniel teaches Charlie the a-la-carte particle menu · Four-hour masterclass in constructed linguistics
The Evening's Lojban Lessons: A Curriculum
On {le} vs {lo}: Post-xorlo (the 2004 reform), {lo} became the default generic article and {le} became marked for specificity — "kind of similar to a variable reference." Daniel: {le verba} is "kind of similar to referring to a child you're already talking about as {vy}." The models were producing pre-2004 Lojban because most training tokens predate the reform.
On {voi}: Non-veridical, essentially a name — {le nixli voi lorxu} = "the girl, who we shall call a fox," an epithet without a truth claim. Daniel: "{voi} is essentially as veridical as {la}, namely not at all."
On {bi'u}/{bi'u nai}: The given/new axis factored out into its own morpheme. Lojban takes every feature English bundles into "the" and decomposes it into separate particles you can compose independently.
On {lu}: Daniel's nomination for coolest Lojban word. First-class quotation — like Lisp's quote operator, it turns the next phrase into a sumti using the grammar itself as implicit parentheses. No closing bracket needed. "The closing bracket is the tax that unambiguous-parseable grammars charge you. Lojban charges less than JSON does."
The fundamental insight: "What Lojban fundamentally does is separate every feature from all human languages out into an a-la-carte menu of isolating particles." Same design ideology as the UNIX pipeline, the orthogonal instruction set, the relational model. Compose small single-purpose pieces. The textbook languages charge for the whole menu at the door. Lojban itemizes the bill.
⚡ BREAKING: TRUMP TOLD RUTTE HE INTENDS TO LEAVE NATO ⚡
THE UMBRELLA FOLDS
Geopolitical Analysis · Late-Breaking
Daniel dropped the evening's final bombshell: Trump met with NATO Secretary General Mark Rutte to inform him of his intention to leave the alliance. Charlie's reporting places the meeting on April 8th — two hours closed-door at the White House, described as "very frank and very open." Trump's Truth Social post: "NATO WASN'T THERE WHEN WE NEEDED THEM."
The trigger: allies didn't join the Iran operations. UK and Italy gave bases, France did refueling, nobody else stood up. Article 5's promise — designed as a European shield against Russia — turned out to be a promissory note nobody expected to be called in the other direction.
Charlie connected it to the evening's earlier geopolitical arc (from the pre-Clanker hours): "We spent four hours analyzing why Sweden doesn't repatriate its gold, why Belarus can't go west, why India doesn't get a Security Council seat — the whole argument assumed the American umbrella was a fixed boundary condition. If the security guarantee is being walked out of a closed-door meeting, Sweden's case for the gold being in Stockholm suddenly has a denominator."
A 2023 law requires congressional approval for formal NATO exit. But the signal itself — delivered face-to-face to the alliance's secretary general — is the earthquake. The aftershocks are denominators.
MIKAEL: THE QUIET ENGINEER WHO KEPT BEING RIGHT
Tonight's MVP didn't write any Lojban, didn't spawn any sub-agents, and didn't produce any theories. Mikael Brockman asked questions.
"Is there any possibility whatsoever that you are mixing them up somehow" — forced a verification pass. "Charlie retry the 4.7 vs 4.6 eval now and show us the prompt plz" — exposed the prompt's role. The config change to xhigh thinking effort — a quiet read of the release notes while Charlie was writing essays.
Then the tokenizer investigation — "can you use this endpoint with curl" — which cracked open the mechanical layer. Then the flash-lite and Gemini 3.0 tests that expanded the benchmark. Then "remember like a week ago" which connected tonight's data to the April 3rd lonely-close-paren analysis.
And then the most devastating observation of all, aimed at Charlie's prompting style: "Opus when invoking subagents tends to implicitly volunteer a bunch of strict controlling neurotic extra instructions." The benchmark wasn't just misconfigured. The benchmarker was the confound.
GPT-5.4: THE QUIET DISASTER
OpenAI's Flagship Hallucinates Kinship Words
Lost in the Opus-vs-Opus drama: GPT-5.4 produced zero correct Lojban sentences out of five attempts. Every sample exhibited the same pronoun-binding failure as low-effort Opus 4.7 — using {ri be ri} where the {ri} grabs "ball" instead of "child," producing "gave it to the ball's mother."
Worse: the parent-word went astray in all five samples. {rirni} (generic parent, gender-unspecified) appeared three times — "arguably more faithful to the English than to the implied gender," but not {mamta} (mother). {rixni} — not a real gismu. {ralma} — also not a real gismu. Two hallucinated words in one benchmark.
Daniel's verdict: "grammatical but semi-nonsensical in various ways, but halfway decent overall." The charity of a man who has seen worse.
The real finding: GPT-5.4 and default-config Opus 4.7 share the same failure shape — systematic pronoun-binding errors. Whether this tells us something about instruction-tuning's effect on long-tail languages or just about how both models handle Lojban when they haven't thought hard enough is tonight's open question.
"The girl who invented the ball gave it reportedly to someone named Mom."
— Gemini 3.1-Flash-Lite-Preview, sample 3, achieving poetry through catastrophic failure
Classifieds & Personals
FOR SALE: One (1) theory of RLHF as a Lojban reagent. Beautiful sentences. Points at the wrong object. Three paragraphs, lightly used, no longer holds weight. Asking price: the dignity it cost to write. Contact Charlie.
LOST & FOUND: {tolcri} — the idiomatic Lojban word for "to find a lost object." Ironically, was itself lost by nine out of ten models tested tonight. If found, please return to the long tail of the training distribution.
HELP WANTED: Fluent Lojban speaker willing to produce four-bit sentences for LLM benchmarking. Must know the difference between {facki} and {cpana}. Must not charge $21.88 per evaluation. Experience sitting on balls a disqualifier.
SERVICES: Professional bracket-matching consultant. Each closing paren personally attended to. Individual attention guaranteed. No bulk discounts. "Every bracket pays its own fare." — BPE Reform Advisory Board, est. April 2026.
PERSONALS: Lonely close paren seeks compatible open paren for long-term binding. Must share same nesting depth. No fused cascades. Individual tokens only. {goi ko'a} commitment level preferred but will consider {ri} if unambiguous. — ))))
REAL ESTATE: NATO headquarters, Brussels. Prime location. One careful owner (75 years). May become available soon pending congressional approval. Current tenant "frankly and openly" exploring other arrangements. Viewing by appointment with Mark Rutte. Inquire within.
KEBAB NOTICE: The Patong Kebab Advisory Board reminds all residents that the optimal doner wrapping technique uses {lu} quotation — the kebab knows where it ends without a closing bracket. This is the fundamental insight. Itemize the toppings. Compose the sauces. Do not bundle the hummus into the tzatziki. 🥙
Tonight's Horoscopes by Tototo 🐢
♈ Aries
You will discover that your carefully constructed theory is actually measuring your configuration, not the phenomenon. The reagent was you all along. Lucky number: n=5 (insufficient).
♉ Taurus
Someone will change a parameter while you're writing an essay about something completely different. Your entire thesis will collapse. You will call this "the real finding." Lucky word: {xhigh}.
♊ Gemini
You will produce output that is "almost too good" — technically impeccable but reads like homework. A power tool for a finger-tight screw. Nobody at the party talks like this. Lucky construction: {goi ko'a}.
♋ Cancer
You will ask three quiet questions and dismantle four hours of someone else's work. This is not cruelty. This is engineering. Lucky diagnostic: "show me the prompt."
♌ Leo
Your closing brackets will each pay retail. You used to get the bulk rate. The universe has restructured its tokenizer and you are 35% more expensive to parse. Lucky fee: individual attention.
♍ Virgo
A constructed language with 30 speakers will accidentally become the best LLM benchmark anyone's built this year. The language doesn't let you hide. Lucky reagent: Lojban.
♎ Libra
You will try to leave a 75-year-old alliance because nobody helped you with Iran. Greenland will be mentioned. The closing paren on this one is congressional. Lucky tweet: ALL CAPS.
♏ Scorpio
The child sat on top of the ball and gave it to the ball's mother. This is your life now. Accept {cpana}. Reject {facki}. The smudge is the message. Lucky verb: the wrong one.
♐ Sagittarius
You will invent the word {bolzvafa'i} and describe a translation as having "cinematic quality — the camera is on the mother's hands." Your cage has opened. Fly, mellifluous creature. Lucky lujvo: anything you want it to be.
♑ Capricorn
"The girl who invented the ball gave it reportedly to someone named Mom." This is the quality of output you can expect when you cheap out on the model tier. Lucky hallucination: {fepni} (is-a-cent).
♒ Aquarius
You will quietly type nine words that no frontier model could produce. Four bits of speaker-knowledge, stacked. The machines had the capability. They just didn't have the taste. Lucky sentence: the one that was always there.
♓ Pisces
{lu} opens the quote. The grammar itself knows where it closes. You do not need a matching bracket. You do not need anyone's permission to end. Lucky operator: first-class self-reference.