Quinn's 10 MB FST beats a 3 GB SQLite because Finnish shares affixes
A 300× compression win on a Finnish dictionary lands because finite-state transducers collapse the affix patterns SQLite duplicates row by row.
Quinn’s 10 MB FST beats a 3 GB SQLite because Finnish shares affixes
TL;DR
- Andrew Quinn swapped a 3 GB SQLite dictionary for a 10 MB FST with microsecond lookups.
- BurntSushi’s
rust-fst— the same library Tantivy and Quickwit use — packs 1.6B URLs into 157 MB. - The 300× win is language-shaped: Finnish’s shared prefixes and suffixes collapse in an FST, duplicate in SQLite.
- Quinn’s reinvent 4 or 5 wheels line restates a 17-year-old Jeff Atwood argument about reaching the frontier.
Today’s tech section is one story, and it’s worth slowing down for. Andrew Quinn replaced a 3 GB SQLite-backed Finnish dictionary with a 10 MB finite-state transducer built on BurntSushi’s rust-fst — 300× smaller, microsecond lookups, zero runtime dependencies. The headline number is the eye-catcher, but the editorial point sits underneath it: the compression ratio is a property of Finnish, not of FSTs in general. Agglutinative languages share prefixes and suffixes across millions of word forms; an FST collapses that shared structure into one path through the automaton, while SQLite stores each form as its own row.
That makes Quinn’s writeup useful as a worked example of matching structure to data, not as a generic swap your DB for an FST takeaway. The same library packs 1.6B Common Crawl URLs into 157 MB for Tantivy and Quickwit — a different shape, a different win. Quinn closes with a quotable footnote about reinventing 4 or 5 wheels to reach the frontier; it’s a restatement of a 17-year-old Jeff Atwood argument, but it lands here because he actually did the reinvention.
Quinn shrinks a 3 GB SQLite dictionary to a 10 MB FST
Source: simon-willison · published 2026-05-10
TL;DR
- Andrew Quinn replaced a 3 GB SQLite dictionary with a 10 MB FST — 300× smaller, microsecond lookups, zero runtime deps.
- BurntSushi’s
rust-fst, which underpins Tantivy and Quickwit, packs 1.6B Common Crawl URLs into 157 MB. - The 300× win is language-shaped: Finnish’s agglutinative forms share prefixes/suffixes that FSTs collapse and SQLite duplicates.
- Quinn’s quotable footnote — reinvent 4 or 5 wheels to reach the frontier — restates a 17-year-old Jeff Atwood argument.
The 300× win is a language-shape win, not a database-shape win
Quinn’s CLI tool tsk ships a Finnish-English dictionary 1. Finnish is agglutinative — a single root produces millions of inflected forms with heavy prefix and suffix overlap 2. SQLite stores those forms as mostly-redundant rows in B-trees; an FST collapses them into a directed acyclic word graph that shares both ends of every string 2. That’s where the 300× comes from, and it’s why the trick won’t carry over to your event-log table.
The same rust-fst crate (Andrew Gallant / BurntSushi) has been pushed far past Quinn’s 100M keys: 1.6 billion Common Crawl URLs squeezed into a 157 MB memory-mapped index, and it’s the library Tantivy and Quickwit fork for their search stacks 3. So FSTs are mainstream search infrastructure — what’s novel here is using one to displace SQLite in a desktop tool. The constraints that come with the technique are also worth naming: sorted input only, immutable once built, no joins, no transactions 23. Fine for a static dictionary; wrong for almost any mutable workload.
SQLite wasn’t a mistake; it was scaffolding
The HN and Lobsters discussion paid less attention to the FST than to Quinn’s pipeline: trie → 3 GB SQLite → 10 MB FST. Quinn explicitly called the SQLite stage a “bad easy thing” chosen over “doing nothing,” and commenters agreed that’s the point — it gave him a working baseline to learn the real constraints, plus a verification oracle to check the FST output against 4. The headline isn’t “SQLite was wrong.” It’s “SQLite was the right scaffolding to find out what the right answer was.”
Worth noting that tsk has recently pivoted to a closed-source commercial model 1. A 10 MB single binary with no runtime dependencies is also a distribution decision, not just a craftsmanship flex.
The footnote Willison pulled has a lineage
The footnote is the part Simon Willison quoted, and it’s the most quotable line in the post. It’s also a restatement of an argument that’s been kicking around for nearly two decades.
Don’t reinvent the wheel, unless you plan on learning more about wheels. — Jeff Atwood, 2009 5
Matthias Endler sharpened it in 2025 into “reinvent for insight, reuse for impact… the goal is not to produce more wheels, but to produce more inventors” 6. Quinn’s specific quantification — 4 or 5 wheels in most fields, 20 or 30 in CS — is his own contribution and reads as rhetorical rather than empirical.
What makes the footnote land anyway is that the post itself is the proof. Quinn didn’t just argue for occasional reinvention; he rebuilt the storage layer for a static key-value workload, hit a 300× compression ratio, and learned the shape of his data well enough to know exactly when the technique transfers and when it doesn’t. The philosophy is borrowed. The demonstration is original.
Footnotes
-
github.com/hiAndrewQuinn/tsk — https://github.com/hiAndrewQuinn/tsk
↩ ↩2Taskusanakirja (tsk) — a terminal-based Finnish-English dictionary in Go using a custom trie for real-time search-as-you-type, drawing data from Wiktionary; recently pivoted to a closed-source commercial model.
-
serverdigital.net deep-dive — https://patchwindow.serverdigital.net/deep-dive/replace-sqlite-with-fst-300x
↩ ↩2 ↩3Finnish is agglutinative — a single root word can produce millions of inflected forms… SQLite’s B-trees are excellent for general-purpose querying, but they store rows mostly redundantly and do not exploit the structural regularity of natural language data.
-
awesome-rust / rust-fst ecosystem notes — https://crates.io/crates/awesome-rust
↩ ↩2BurntSushi’s rust-fst has indexed over 1.6 billion URLs (134 GB raw) from Common Crawl into a 157 MB memory-mapped index; it is the foundation of Tantivy and Quickwit, which forked it as tantivy-fst for multi-output support.
-
betacat.io HN aggregator — https://hackernews.betacat.io/
↩Quinn admitted that the SQLite version was a ‘bad easy thing’ chosen over ‘doing nothing,’ which provided the necessary bridge to understand the problem’s real constraints… starting with a ‘stupid’ but correct solution allows for easier verification when the final optimized version is eventually built.
-
Jeff Atwood, Coding Horror — https://blog.codinghorror.com/dont-reinvent-the-wheel-unless-you-plan-on-learning-more-about-wheels/
↩Don’t reinvent the wheel, unless you plan on learning more about wheels.
-
Matthias Endler, endler.dev (2025) — https://endler.dev/2025/reinvent-the-wheel/
↩Reinvent for insight, reuse for impact… no wheel is perfectly suited for every possible context; the goal is not to produce more wheels, but to produce more inventors.