Ever since the first wave of Large Language Models broke into the public consciousness, I have been quietly more interested in their smaller siblings. The flagship models (Opus, Grok, GPT) have always reminded me of the big ImageNet-winning networks in their prime: enormous, expensive, and spectacular, but ultimately a research milestone that the field would learn to compress, distill, and miniaturize. That lineage eventually gave us vision models that ran on a Raspberry Pi. I have been waiting for the equivalent moment in language modeling.
That moment, for me, arrived with Caltech's Bonsai.
What Makes Bonsai Interesting
I am writing a longer, more technical piece on what Bonsai actually does under the hood — particularly its 1-bit encoding scheme, which my current MLX installation still refuses to run. But even setting the deeper architecture aside, the headline is simple: the model's footprint is negligible on a MacBook Pro. The kind of footprint that makes you stop and reconsider what "deployment" even means. The kind of footprint that, with a little more squeezing, lands comfortably on an iPhone.
That is the part that should make people pay attention. Not the benchmarks. The footprint.
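For a rough sense of scale, here is a back-of-the-envelope sketch of what bit width does to the size of the weights. It assumes a round 8 billion parameters (read off the "8B" in the checkpoint name used later in this post, so the exact count is an assumption) and it counts the weights only, ignoring quantization scales, any layers kept at higher precision, and the KV cache. The function name is made up for illustration.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in gigabytes (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: "8B" in the checkpoint name means roughly 8e9 parameters.
N_PARAMS = 8e9

for label, bits in [("fp16", 16), ("4-bit", 4), ("2-bit", 2), ("1-bit", 1)]:
    print(f"{label:>5}: ~{weight_footprint_gb(N_PARAMS, bits):.1f} GB")
```

Under those assumptions the same network goes from roughly 16 GB at 16 bits per weight to roughly 2 GB at 2 bits, which is the difference between "needs a workstation" and "fits alongside everything else on a laptop".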
Running It
If you want to try it yourself, the entry point is almost embarrassingly small:
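from mlx_lm import load, generate

# Fetch the weights from the Hugging Face Hub on first use, then load them
model, tokenizer = load("prism-ml/Ternary-Bonsai-8B-mlx-2bit")

# Generate a completion for a single prompt
response = generate(
    model,
    tokenizer,
    prompt="Explain quantum computing in simple terms.",
)
print(response)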
That is the whole thing. No cluster, no API key, no rate limits, no per-token bill quietly compounding in the background.
Where This Actually Matters
I run a data/AI company, and the moment I started touching real datasets — the 100-terabyte kind — the economics of frontier-model generative AI fell apart almost immediately. I remember pricing out a project that involved pinging roughly 100,000 call centers through a hosted LLM. The conversation about cost stopped being a footnote and became the project.
Now imagine a different shape of that same problem. A MacBook Pro, or a Mac Studio, running fifty-plus threads of a Bonsai-class model in parallel, with no meaningful change in power draw and no per-call invoice. Suddenly the workloads that were "impossible with generative AI" become a Tuesday afternoon job. The bottleneck stops being your AWS bill and starts being your imagination.
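The fan-out itself is not complicated. Below is a minimal sketch using only the Python standard library. `run_model` is a hypothetical stand-in for whatever function wraps the local model, and `fake_model` exists only so the example runs without any weights. Whether MLX can safely serve many concurrent generation calls from threads in one process is not verified here; separate worker processes or batched generation may turn out to be the right tool.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def fan_out(
    records: Iterable[str],
    run_model: Callable[[str], str],
    workers: int = 50,
) -> List[str]:
    """Apply run_model to every record using a pool of worker threads.

    Results come back in the same order as the input records.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_model, records))


def fake_model(text: str) -> str:
    """Placeholder for a local model call (hypothetical, for illustration)."""
    return text.upper()


results = fan_out(["first record", "second record"], fake_model, workers=4)
print(results)
```

Swapping `fake_model` for a real wrapper around the model is the only change the sketch needs; the pool size is the knob you tune against the machine's memory and throughput.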
This is the part of the story that I think gets missed when people argue about whether small models can match the frontier on benchmarks. They don't have to. They have to be good enough to do useful work at a cost structure that lets you actually deploy them across millions of decisions.
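To make the cost-structure point concrete, the hosted side of the ledger is plain multiplication. The sketch below is illustrative only: the function name and every number passed to it are placeholders, not real prices and not figures from the project mentioned above.

```python
def hosted_llm_cost(
    n_calls: int,
    tokens_per_call: int,
    usd_per_million_tokens: float,
) -> float:
    """Total bill in USD for a per-token-priced API.

    Input and output tokens are lumped together for simplicity.
    """
    total_tokens = n_calls * tokens_per_call
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Placeholder inputs, chosen only to show how the bill scales with volume.
for n_calls in (100_000, 10_000_000, 1_000_000_000):
    cost = hosted_llm_cost(n_calls, tokens_per_call=1_000, usd_per_million_tokens=1.0)
    print(f"{n_calls:>13,} calls -> ${cost:,.0f}")
```

What matters is the shape rather than the numbers: a per-token bill grows linearly with every decision you route through the model, while local hardware is a fixed cost you have already paid.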
What I Am Watching Next
A few questions I am turning over as I keep poking at this model:
How much further can we compress it? Bonsai is already small, but distillation and LoRA / QLoRA finetuning open the door to task-specific versions that might be smaller still — and meaningfully better at the narrow thing you actually care about.
Where does inference like this start to matter outside of text? Once you have a model this cheap to run, you can start putting genuine reasoning capacity inside systems that previously had to make do with hand-coded heuristics. Path-planning decisions for drones. Terminal guidance logic for shells or missiles. Edge medical devices. The class of things where you cannot afford a round-trip to a cloud GPU, and where a few extra IQ points in the loop change the system's character entirely.
I do not have the answers to any of this yet. But I am increasingly convinced that the interesting frontier in language modeling for the next few years is not at the top of the parameter curve — it is at the bottom. The MacBook Pro running fifty Bonsai threads in the background is, I suspect, a much better preview of where this is going than another headline-grabbing trillion-parameter release.
More on the technical internals soon.
Originally published on Substack