@adlrocha - Towards local plug-and-play AI
Local LLM inference optimisations: from attention mechanisms to predictive decoding and software-model-hardware implementations.
Last week I wrote about the hardware side of running AI locally, why memory bandwidth matters more than raw compute, which machines are worth building, and where the market is heading. If you missed it, start there as this post builds directly on top of it.
In the quest to become AI independent, your hardware sets the ceiling, but what decides how close you actually get to it is software.
Take two machines with identical GPUs, identical VRAM and identical bandwidth, one running naive inference and one running an optimised stack: they can show a 3-5x difference in tokens per second. That can mean the difference between running at 5 tok/s and the 20-30 tok/s you need for something usable. Even more, some techniques and software-model-hardware optimised implementations may allow you to fit large models like DeepSeek V4 Flash on a MacBook with at least 96GB of RAM. Same model, same hardware, different choices in the software layer (I feel we have a lot to learn from the video encoding and compression industry in this respect: we have to squeeze the most out of every $ of hardware resources).
Last week I presented my new goal to create an inference box that generates tokens fast enough within a certain price point. So far I haven’t found one that fits my needs fully. What I am hoping is that the outcome of this work finds me a hardware configuration optimised for local inference and that is plug-and-play and not absurdly expensive, and/or a tool that detects your current hardware and suggests the best model and configuration for it.
This post continues that search focusing on the latest techniques to improve my inference stack.
MoE vs dense models
Before we get to the software tricks, there’s an architectural decision that sits underneath all of them, because it changes what the software layer has to deal with.
Most of the interesting models you’re likely to run locally with a decent throughput are Mixture-of-Experts architectures: Qwen3.6-35B-A3B, Qwen3-235B-A22B-250, DeepSeek-V4. The naming convention tells you the structure: 35B total parameters, but only 3B active per token. The model is divided into expert sub-networks, and a router decides which ones fire for each token.
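To make the routing step concrete, here is a minimal sketch of top-k expert selection for a single token. The sizes (8 experts, 2 active, 4-dimensional tokens) and the random router weights are made up for illustration; real models learn the router and differ in how they normalise the gate weights.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(token_vec, router_weights, top_k):
    """Return [(expert_index, gate_weight)] for the top_k experts of one token."""
    # One logit per expert: dot product of the token with that expert's router row.
    logits = [sum(w * x for w, x in zip(row, token_vec)) for row in router_weights]
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Renormalise so the gate weights of the chosen experts sum to 1.
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

random.seed(0)
NUM_EXPERTS, DIM, TOP_K = 8, 4, 2  # hypothetical sizes, not from any real model
router_weights = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
token = [random.gauss(0, 1) for _ in range(DIM)]
selected = route(token, router_weights, TOP_K)
print(selected)  # only these experts' FFN weights are touched for this token
```

Only the selected experts run for that token, which is where the "3B active out of 35B total" comes from.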
The key advantage of these models (and why they have the best chance of fitting your hardware) is that if only 3B of 35B parameters do any work per token, you get something close to 3B-scale inference speed while the model carries 35B-scale knowledge. With the right serving trick, like llama.cpp’s -ngl 99 -ncmoe 99 flags, which keep the attention and shared weights on the GPU and offload cold expert FFN layers to system RAM, a Qwen3.6-35B-A3B can hit 33.5 tok/s on an RTX 3070 Ti with just 8GB of VRAM, provided you have 64GB+ of fast system RAM for the offloaded experts. The floor for running a 35B-knowledge model just dropped further than most people realise.
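A back-of-the-envelope way to see why active parameters matter so much: if decoding is memory-bandwidth-bound, every active weight has to be read once per token, so bandwidth divided by active bytes gives a rough ceiling. This is a sketch under assumptions of mine: 4.5 bits per weight as a stand-in for a Q4-class quant and 80 GB/s as a hypothetical system RAM bandwidth. It ignores KV cache reads, compute, and the part of the model that sits on the GPU.

```python
def decode_ceiling_tok_s(active_params_b, bits_per_weight, bandwidth_gb_s):
    """Upper bound on decode speed if every active weight is read once per token."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

BITS = 4.5        # assumed average bits per weight for a Q4-class quant
BANDWIDTH = 80.0  # assumed memory bandwidth in GB/s (hypothetical machine)

moe = decode_ceiling_tok_s(3, BITS, BANDWIDTH)     # 3B active parameters
dense = decode_ceiling_tok_s(27, BITS, BANDWIDTH)  # 27B dense model
print(f"MoE ceiling: {moe:.1f} tok/s, dense ceiling: {dense:.1f} tok/s")
```

The absolute numbers depend entirely on the assumed bandwidth; the ratio between the two is the point.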
The main downside is consistency. And if you have used one of these MoE models long enough you have probably experienced what I am about to describe.
Think of a MoE model as a hospital. Each patient gets routed to the right specialist. But which specialist fires depends on the token. The model can feel sharp in one domain and noticeably weaker in another depending on which expert activates. Dense models don’t have this problem because every parameter processes every token, every time. Slower, more expensive, but completely consistent. This lack of consistency shows up as tool call loops, performance degradation and catastrophic forgetting.
For reasoning tasks, for long-context coherence, for anything where you need the model to stay sharp across a 50,000-token context, dense models tend to be better. This is why I would always recommend dense models for any agentic task that requires several assistant turns and accurate context keeping.
There’s a serving problem too: when too many tokens in a batch route to the same expert simultaneously, that expert’s buffer overflows and tokens get dropped silently. The model will not warn you of this happening, and it just gets worse. Dense inference has none of that complexity.
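The overflow scenario can be sketched with a toy dispatcher. This assumes a serving stack that gives each expert a fixed per-batch capacity and drops whatever arrives after it is full; whether a particular engine works this way is an assumption, and the batch below is invented.

```python
def dispatch(assignments, num_experts, capacity):
    """Assign tokens to experts in order; tokens past an expert's capacity are dropped."""
    load = [0] * num_experts
    kept, dropped = [], []
    for token_id, expert in enumerate(assignments):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append(token_id)
        else:
            dropped.append(token_id)
    return kept, dropped

# 8 tokens, 4 experts, each expert can buffer 2 tokens per batch (hypothetical).
skewed = [0, 0, 0, 0, 0, 1, 2, 3]  # five tokens all want expert 0
kept, dropped = dispatch(skewed, num_experts=4, capacity=2)
print(f"kept={kept} dropped={dropped}")
```

Nothing raises an error here, which mirrors the point above: the dropped tokens simply never get processed by their expert.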
So how do you choose between dense and MoE models? Here’s the practical decision tree that I currently use myself:
8GB VRAM GPU + 64GB system RAM? MoE with expert offload is your only real option for a capable model. Qwen3.6-35B-A3B at Q4 or Gemma4-26B-A4B with llama.cpp offload fits this profile. Throughput will be CPU-bandwidth-bound, not GPU-bound, so a fast DDR5 system matters more than GPU generation here.
16–24GB VRAM (RTX 3090, RTX 4090, RTX 4080)? You have a genuine choice. Dense Qwen3.6-27B at Q4_K_M fits in ~16GB with no offloading, no serving complexity, consistent quality. MoE Qwen3.6-35B-A3B also fits at this tier with partial offload. If your workload is agentic (long contexts, tool calls, multi-step coherence), the dense model’s consistency advantage is worth the slightly lower raw throughput (which can be painful, as described in my last post).
128GB+ unified memory (Mac M3 Ultra, Strix Halo / Ryzen AI Max+)? This is where MoE becomes unambiguously better. You can hold the entire Qwen3-235B-A22B (235B total, 22B active per token) in unified memory and run it at full context without any offloading. At this tier you get frontier-class capability locally. The 22B active parameters per token still give you strong throughput on unified memory’s high-bandwidth pool. Sharp readers may be wondering: then why are you running dense models like Qwen3.6-27B on your Strix Halo yourself? We are back to the consistency problem. I am looking to run long-running agentic tasks, and with MoE the performance quickly collapses as the context grows.
Multi-GPU / 192GB+ VRAM? You can essentially run decent models for whatever you need. MoE at full precision, no quantisation required. The routing overhead becomes negligible at this level of parallelism.
The short version: if you have limited VRAM and a lot of system RAM, go for MoE with expert offload. If you have enough VRAM to fit a dense model cleanly, prefer dense for anything requiring long-context coherence. If you have a large unified-memory machine, MoE at the top tier (e.g. Qwen3-235B-A22B) becomes the most capable option you can run locally.
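The decision tree above can be written down as a small function. This is only a sketch of the tiers as described in this post: the thresholds are the ones listed above, the function name and return strings are mine, and anything outside those tiers is deliberately left unanswered.

```python
def suggest_setup(vram_gb=0, system_ram_gb=0, unified_gb=0, agentic=False):
    """Map a hardware profile to the tiers described above (thresholds from the post)."""
    if vram_gb >= 192:
        return "MoE at full precision, no quantisation"
    if unified_gb >= 128:
        return "large MoE held fully in unified memory"
    if vram_gb >= 16:
        if agentic:
            return "dense model at Q4, no offload"
        return "dense at Q4 or MoE with partial offload"
    if vram_gb >= 8 and system_ram_gb >= 64:
        return "MoE with expert offload to system RAM"
    return "outside the tiers covered here"

print(suggest_setup(vram_gb=8, system_ram_gb=64))
print(suggest_setup(vram_gb=24, agentic=True))
print(suggest_setup(unified_gb=128))
```

A real auto-config tool would also need to weigh memory bandwidth, context length and the quantisation level, none of which this toy captures.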
Fortunately, Alibaba built both ends of this spectrum with Qwen3.6, and so far I’ve been using them and I am decently happy. The MoE option is Qwen3.6-35B-A3B. The dense option is Qwen3.6-27B. Both run the same hybrid attention core, so the choice between them is almost entirely a hardware and workload question, not a model quality one.
How did I get to these numbers? As I was doing the research for this section, I came across LocalMaxxing.com. This site is just pure gold. It provides benchmarks of different models over different hardware architectures, with clear information about the inference engine used and their configuration. This gave me homework for the next few months, and is a great resource for this new quest of mine.
The attention zoo
The MoE/dense question is only one dimension of the model architecture. The other is which attention mechanism the model uses internally. Every variant is a different answer to the same constraint: standard attention produces an N×N matrix where N is sequence length. At 10,000 tokens that’s 100 million entries, and the memory cost scales quadratically. Every interesting development in attention over the last two years is essentially a different attempt to escape that curve without paying too much in quality. Each has its own trade-offs and may be a better fit for a different underlying architecture.
Sebastian Raschka’s visual guide is the best resource I can recommend to navigate the attention landscape. The progression is worth following because each step reveals a different tradeoff.
Standard Multi Head Attention is what everyone (or maybe it’s me) thinks of when thinking about the attention mechanism in a transformer, where every token attends to every other token, full stop. Quadratic memory, quadratic compute. Nobody building for local hardware chooses it today; it’s just the baseline everything else is measured against.
GQA (Grouped-Query Attention) was the first fix that actually shipped at scale. Instead of each query head maintaining its own independent key-value projection, multiple query heads share a single KV pair. Roughly 50% KV cache savings, almost no quality loss. Llama 3, Qwen3, Gemma 3 all use it. The tradeoff is minimal, which is why it became the default so quickly.
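The KV cache arithmetic is simple enough to do by hand. The sketch below uses a hypothetical 32-layer model with 32 query heads of dimension 128 and an fp16 cache (not the config of any specific model) to show how the cache shrinks as more query heads share each KV head. The saving is set by that sharing ratio: two query heads per KV head halves the cache, four per KV head quarters it.

```python
def kv_cache_gib(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """KV cache size in GiB for one sequence."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value / 2**30

# Hypothetical model: 32 layers, 32 query heads, head_dim 128, 32k context, fp16.
SEQ = 32_768
mha = kv_cache_gib(layers=32, kv_heads=32, head_dim=128, seq_len=SEQ)
gqa_16 = kv_cache_gib(layers=32, kv_heads=16, head_dim=128, seq_len=SEQ)
gqa_8 = kv_cache_gib(layers=32, kv_heads=8, head_dim=128, seq_len=SEQ)
print(f"MHA: {mha:.1f} GiB, GQA 16 KV heads: {gqa_16:.1f} GiB, GQA 8 KV heads: {gqa_8:.1f} GiB")
# prints "MHA: 16.0 GiB, GQA 16 KV heads: 8.0 GiB, GQA 8 KV heads: 4.0 GiB"
```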
MLA (Multi-Head Latent Attention), from DeepSeek, goes deeper. Rather than reducing how many KV pairs you keep, it compresses what’s stored in each one, saving a latent representation and reconstructing the full KV state on demand. More complex to serve than GQA, but at scale the quality-per-byte advantage is real. DeepSeek V3 and Kimi K2 use it; it’s the right choice when you have the memory to absorb the reconstruction overhead and want frontier-class output quality. Don’t ask me why, but I personally love models that implement MLA.
SWA (Sliding Window Attention) takes a different angle entirely. Rather than compressing the cache, it simply restricts how far back each token looks. Gemma 3 uses a 5:1 ratio of local to global layers, with a 1,024-token window and GQA on top. Memory grows linearly rather than quadratically, and for most practical workloads (like code completion, document Q&A, chat) the quality impact is minimal. The tradeoff is that you’re genuinely giving up global context on those local layers. Fine for most tasks; matters for very long-range reasoning.
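To see what the 5:1 local-to-global layout does to the cache, the sketch below counts how many tokens each layer has to keep. The 1,024-token window and the 5:1 ratio are the figures quoted above; the 48-layer depth and the exact position of the global layers are assumptions for illustration.

```python
def cached_tokens_per_layer(seq_len, window=None):
    """Tokens a layer must keep in its KV cache: all of them, or only the window."""
    return seq_len if window is None else min(seq_len, window)

def total_cached_tokens(seq_len, num_layers, local_per_global=5, window=1024):
    """Sum cached tokens over layers laid out as 5 local layers, then 1 global."""
    total = 0
    for layer in range(num_layers):
        is_global = (layer + 1) % (local_per_global + 1) == 0
        total += cached_tokens_per_layer(seq_len, None if is_global else window)
    return total

LAYERS = 48  # hypothetical depth
for seq_len in (1_024, 8_192, 65_536):
    full = seq_len * LAYERS
    mixed = total_cached_tokens(seq_len, LAYERS)
    print(f"{seq_len:>6} tokens: mixed cache is {mixed / full:.0%} of all-global")
```

The local layers stop growing once the context passes the window, so at long contexts almost all of the remaining cache belongs to the few global layers.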
Gated DeltaNet, used in Qwen3.6-27B, pushes further: rather than attending over a stored sequence at all, it maintains a fast-weight memory that gets continuously updated with each new token. Memory footprint stays flat regardless of sequence length. Going from 4k to 65k context costs ~800MB of VRAM instead of several GB, which is the difference between a 16GB card hitting a wall and staying competitive with machines three times its size. The tradeoff is architectural complexity and the fact that very-long-range dependencies that would be trivial for full attention require the model to have learned to compress them into the running memory state.
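A toy version of the fast-weight idea helps here. The sketch below implements a simplified single-head gated delta rule, S ← α(S − β(Sk)kᵀ) + βvkᵀ, in plain Python. It is an illustration only: in a real model α and β are computed from the input at every step, keys are normalised, and the kernels are chunked and multi-headed. The dimensions and vectors are invented.

```python
def matvec(S, x):
    return [sum(s * xi for s, xi in zip(row, x)) for row in S]

def delta_step(S, k, v, alpha=1.0, beta=1.0):
    """One gated delta-rule update of the state matrix S (d_v rows, d_k columns).

    S <- alpha * (S - beta * (S k) k^T) + beta * v k^T
    """
    old = matvec(S, k)  # what the memory currently returns for key k
    return [
        [alpha * (S[i][j] - beta * old[i] * k[j]) + beta * v[i] * k[j]
         for j in range(len(k))]
        for i in range(len(v))
    ]

D_K, D_V = 4, 3  # toy dimensions
S = [[0.0] * D_K for _ in range(D_V)]
k1, v1 = [1.0, 0.0, 0.0, 0.0], [1.0, 2.0, 3.0]
k2, v2 = [0.0, 1.0, 0.0, 0.0], [4.0, 5.0, 6.0]
for k, v in [(k1, v1), (k2, v2)]:
    S = delta_step(S, k, v)

print(matvec(S, k1), matvec(S, k2))  # reading with a key returns its stored value
print(len(S), "x", len(S[0]))        # the state never grows with sequence length
```

However many tokens are written, the state stays a fixed d_v by d_k matrix, which is the flat memory profile described above. The price is that everything the model wants to remember has to fit in that matrix.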
Mamba-2 hybrids (Nvidia’s Nemotron Nano) are the logical extreme, where most attention is replaced by recurrent state machines with constant memory regardless of sequence length. The right choice for edge and embedded hardware where even GQA is too much. Not the first option when you have a proper GPU available; the quality ceiling is lower, but the memory floor is the lowest in the landscape.
These differences show up on the machine. With SWA at long contexts the KV cache barely moves, the serving engine has headroom it wouldn’t have with MHA. Switch to GQA at the same context length and VRAM climbs noticeably. Run a DeltaNet hybrid and the memory profile nearly flatlines. When you’re fitting multiple agents on a fixed memory budget, which attention variant the model uses matters as much as how many parameters it has.
There is one more dimension on top of this that I’ve decided to leave out because it deserves its own post: which quantisation mechanism to use and the different flavours available (a beast in itself). See my post about TurboQuant for a primer on this.
Speculative decoding
A few weeks ago Google announced that Gemma 4 was shipping with dedicated MTP drafters: small companion models that could push inference speed up to 3x without touching quality.
The problem speculative decoding solves is a fundamental one of transformers. As we’ve described a few times in this newsletter, LLMs generate text autoregressively, i.e. one token at a time, each token depending on everything before it. The large model has to do a full forward pass for every single token. You can’t parallelise the generation itself. So no matter how fast your hardware is, you’re paying the full model cost on every step.
The insight is that you don’t need the big model to propose tokens, only to verify them (a bit like speculative execution in processors). This is where the drafter comes in.
A drafter is a much smaller model (or a lightweight prediction head attached to the main model) that runs first and quickly guesses the next several tokens. Then the large target model takes all those guesses and verifies them in a single parallel forward pass. If the drafter’s guesses are right (and at 70-80% acceptance rates on typical queries, most of them are) you get several tokens confirmed in roughly the same time it would take the large model to generate just one. When the drafter is wrong, the target model corrects from that point and the drafter tries again. The output is mathematically identical to what the large model would have produced alone.
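The draft-then-verify loop can be sketched in a few lines. This is the greedy variant, with two toy deterministic functions standing in for the target and draft models (both are invented for the example). A real engine verifies all drafted positions in one batched forward pass, and the sampling variant uses an accept/reject rule to preserve the target's output distribution rather than exact token equality.

```python
def greedy_decode(target_next, prompt, max_new):
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new):
        tokens.append(target_next(tokens))
    return tokens

def speculative_decode(target_next, draft_next, prompt, max_new, draft_len=4):
    """Greedy speculative decoding. Returns (tokens, number_of_target_passes)."""
    tokens = list(prompt)
    passes = 0
    while len(tokens) - len(prompt) < max_new:
        # 1. The cheap drafter guesses draft_len tokens autoregressively.
        draft = []
        for _ in range(draft_len):
            draft.append(draft_next(tokens + draft))
        # 2. The target checks every drafted position. A real engine does this
        #    in one batched forward pass; here it is a loop counted as one pass.
        passes += 1
        accepted = []
        for guess in draft:
            correct = target_next(tokens + accepted)
            accepted.append(correct)
            if guess != correct:
                break  # first mismatch: keep the target's token, discard the rest
        else:
            # All guesses accepted: the same pass also yields one bonus token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new], passes

# Toy stand-ins (hypothetical): the drafter is wrong at every third position.
def target_next(ctx):
    return (ctx[-1] * 3 + len(ctx)) % 11

def draft_next(ctx):
    guess = target_next(ctx)
    return (guess + 1) % 11 if len(ctx) % 3 == 0 else guess

out, passes = speculative_decode(target_next, draft_next, [1], max_new=20)
print(out == greedy_decode(target_next, [1], 20), passes)
```

The output matches the target-only decode token for token, while the number of target passes is well below the number of generated tokens.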
The drafter doesn’t need to be good in absolute terms. It just needs to be right often enough on the specific distribution of text the target model typically generates, and a small model fine-tuned on that same domain usually clears that bar easily. As mentioned above, we are getting closer and closer to the kind of optimisations that we made for encoders and processors.
Google’s MTP drafters for Gemma 4 achieve up to 3x speedup with no measurable quality degradation, roughly 2.2x on Apple Silicon at batch sizes of 4-8. Qwen3.6-35B-A3B with MTP enabled via vLLM hits 80 tok/s on 12GB VRAM, compared to 20-30 tok/s without. Same hardware, just by setting a flag.
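A rough way to connect acceptance rates to speedups: if each drafted token is accepted independently with probability a (a simplifying assumption, real acceptance varies by position and content), a draft of length k yields on average (1 − a^(k+1)) / (1 − a) tokens per target pass. The sketch below just evaluates that expression; it does not account for the cost of running the drafter, so the real speedup is lower.

```python
def expected_tokens_per_pass(acceptance, draft_len):
    """Expected tokens confirmed per target pass under i.i.d. acceptance."""
    if acceptance >= 1.0:
        return float(draft_len + 1)  # every guess accepted, plus the bonus token
    return (1 - acceptance ** (draft_len + 1)) / (1 - acceptance)

for a in (0.6, 0.7, 0.8):
    print(f"acceptance {a:.0%}, draft length 4: "
          f"{expected_tokens_per_pass(a, 4):.2f} tokens per pass")
```

At the 70-80% acceptance rates mentioned above, this lands at roughly 3 tokens per pass with a draft length of 4, in the same ballpark as the reported speedups.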
For models that don’t ship with native MTP heads, EAGLE-3 takes a slightly different approach: it attaches a lightweight prediction head directly to the target model’s internal layers rather than using a separate companion model. No extra weights to download or host, just a plugin that learns the target model’s output distribution and proposes candidates the main model verifies. Production benchmarks in vLLM and SGLang show 2-3x throughput gains at low-to-medium concurrency.
One caveat: the gains are real at 1-10 concurrent requests but diminish as concurrency rises. At high concurrency the overhead of running the drafter plus verification starts eating into the benefit. For a personal inference box or a small team setup, it’s the highest-return optimisation on this list, but it may not be a silver bullet as soon as you want to serve this to several concurrent users (not my case at least for now).
FlashAttention
We’ve touched on this a few times in this newsletter. The KV cache problem has a structural cause, and as your context requirements grow, this is something that is immediately noticeable (try a Qwen3.6-27B model running on a high-end Strix Halo using Hermes agent’s minimum 65K recommended context, and feel the pain as the context grows and your conversation evolves). Standard attention produces an N×N matrix where N is sequence length. At 10,000 tokens, that’s 100 million entries.
This is why you should always have FlashAttention enabled. First published by Tri Dao’s lab, this technique tiles the attention computation into blocks. The model never materialises the full N×N matrix; it computes attention block by block, keeping intermediate results in fast on-chip memory. Same mathematical result, dramatically fewer memory round-trips (enable -fa in llama.cpp, or the equivalent in your inference engine of choice, to see the magic happen).
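The trick that makes block-by-block attention exact is a running softmax: keep a running maximum, a running normaliser and a running weighted sum, and rescale them whenever a new block raises the maximum. Below is a minimal single-query sketch in plain Python, compared against the naive version. It leaves out the 1/sqrt(d) scaling, masking and everything GPU-specific, so treat it as an illustration of the arithmetic rather than of the actual kernel.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def naive_attention(q, keys, values):
    """Materialise all scores, softmax them, then take the weighted sum."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]

def tiled_attention(q, keys, values, block=2):
    """Same result, but only one block of scores exists at any time."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        scores = [dot(q, k) for k in keys[start:start + block]]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated so far to the new maximum
        # (exp(-inf) is 0.0, so the first block starts from zero).
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, values[start:start + block]):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

random.seed(1)
q = [random.gauss(0, 1) for _ in range(8)]
keys = [[random.gauss(0, 1) for _ in range(8)] for _ in range(7)]
values = [[random.gauss(0, 1) for _ in range(8)] for _ in range(7)]
diff = max(abs(a - b) for a, b in zip(naive_attention(q, keys, values),
                                      tiled_attention(q, keys, values)))
print(f"max difference between naive and tiled: {diff:.2e}")
```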
Liquid Nanos: a different foundation
Everything above is about making transformers run better with the hardware available, but there’s one approach worth knowing about that I recently became aware of, and that aligns with a hypothesis I’ve been having for a while.
Liquid AI built their models on Liquid Time-Constant Networks, a class of neural network originally inspired by C. elegans, the nematode with 302 neurons that navigates and forages with surprising reliability. The underlying idea is the following: biological networks don’t process information in discrete steps, they evolve continuously according to differential equations. The time constant that governs each neuron’s behaviour adapts based on current input. The model isn’t static and it adjusts its own processing as it reads.
The production translation is the LFM2 family: a hybrid with 10 gated short-range convolution blocks and 6 grouped-query attention blocks. Most sequence processing happens in the convolution layers, which scale linearly with context length and carry no KV cache overhead at all. Liquid calls the unifying design ‘Linear Input-Varying operators’, weights generated on-the-fly from input, rather than fixed parameters. And the architecture was found by optimising under real CPU and mobile SoC latency constraints from the start. Hardware-in-the-loop search, not adapted for hardware after the fact.
LFM2.5-1.2B hits 2,975 tok/s prefill and 116 tok/s decode on an AMD Ryzen AI 9 laptop via llama.cpp without GPU. The Liquid Nano models, task-specific fine-tunes for extraction, RAG, and summarisation, range from 350M to 2.6B. The 350M variant runs on a Raspberry Pi 5 in 300MB at int8 quantisation.
The standard instinct with LLMs is to reach for the biggest model you can run. But the Nano line is a concrete example of the alternative: distil the capability you actually need into a model small enough to run anywhere, then compose several of them. A 350M extraction model pulls structured fields from a document. A small RAG model retrieves relevant context. A slightly larger reasoning model makes the decision. Each one doing its piece efficiently, orchestrated into a pipeline, and the total compute cost is a fraction of a single large generalist doing everything at once.
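As a sketch of what "compose several of them" could look like in code, the pipeline below chains three stages over a shared state. The stage bodies are trivial stand-ins I made up; in practice each one would wrap a call to a small local model, and the names are purely illustrative.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Stage:
    name: str
    run: Callable[[dict], dict]  # takes the shared state, returns new keys to merge

def run_pipeline(stages, state):
    """Run stages in order, merging each stage's output into the shared state."""
    for stage in stages:
        state = {**state, **stage.run(state)}
    return state

# Hypothetical stand-ins: each lambda is where a narrow model would be called.
stages = [
    Stage("extract", lambda s: {"fields": {"total": s["document"].split()[-1]}}),
    Stage("retrieve", lambda s: {"context": f"policy for {s['fields']['total']}"}),
    Stage("decide", lambda s: {"decision": "approve" if s["context"] else "reject"}),
]
result = run_pipeline(stages, {"document": "Invoice total 120"})
print(result["fields"], result["decision"])
```

Each stage only sees the keys it needs, which is what makes it possible to swap a stage's model without touching the rest.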
This is, I think, one of the more underrated directions for local inference as the hardware improves. The question stops being “can I fit a big enough model on this machine?” and starts being “what’s the right composition of narrow models for this workflow?” The distillation approach isn’t new but the combination of capable base models to distil from, hardware that can run the small results locally, and serving stacks that already support them changes what’s practical.
This is the approach that we used to build Baselight AI: narrow agents that run on smaller models, composed to achieve the global goal.
Software-Model-Hardware Optimised Implementation
I want to close by sharing something that blew my mind, and that I am hoping to study in detail and come back to soon. An example of optimising the model to the underlying hardware. It reminds me of my time researching video compression. The author of Redis published ds4: a native inference engine for DeepSeek V4 Flash, written specifically for Apple Silicon in a few thousand lines of C. No generic inference engine, no GGUF general-purpose runner abstractions. One model, done properly and over-optimised for a specific hardware architecture.
It runs DeepSeek V4 Flash with a 1 million token context window on a 128GB MacBook Pro. I haven’t tried it myself, but antirez says that “coding agent works great, reliably calls tools.”
The performance numbers from the repo are the following: 26.68 tok/s generation on an M3 Max 128GB. 36.86 tok/s on an M3 Ultra 512GB. Not the fastest numbers you’ll find in this series of posts. But for a coding agent running a full tool-call loop locally, with 1M context, on a laptop that costs less than a used RTX 3090 rig, it’s more than enough. I think this setup would solve my needs.
The three engineering decisions that made this possible are the following:
Asymmetric 2-bit quantisation. The MoE experts make up roughly 90% of DeepSeek V4 Flash’s parameter volume, but they’re not all equally critical. Antirez applies an aggressive 2-bit compression (IQ2_XXS and Q2_K) only to the routed expert layers, the ones that activate a subset of the model per token. Shared experts, projections, and the routing logic itself stay at full precision. You get most of the memory savings with a fraction of the quality loss. This is the same asymmetric logic that makes MoE offloading work in llama.cpp, pushed further.
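The arithmetic behind asymmetric quantisation is worth spelling out. The sketch below uses a hypothetical 100B-parameter model with the 90% expert share quoted above. The 2.5 bits per expert weight and 16 bits for everything else are my assumptions, not figures from the ds4 repo.

```python
def model_size_gb(total_params_b, expert_fraction, expert_bits, other_bits):
    """Weight footprint in GB when experts and the rest use different bit widths."""
    experts = total_params_b * expert_fraction * expert_bits / 8
    others = total_params_b * (1 - expert_fraction) * other_bits / 8
    return experts + others

TOTAL_B = 100  # hypothetical parameter count in billions, for illustration only
uniform_16 = model_size_gb(TOTAL_B, 0.9, 16, 16)
mixed = model_size_gb(TOTAL_B, 0.9, 2.5, 16)  # assumed ~2.5 bits for routed experts
print(f"all 16-bit: {uniform_16:.0f} GB, experts at ~2.5 bits: {mixed:.1f} GB")
```

Under these assumptions the 10% kept at high precision ends up being a large share of what is left, which shows how much of the saving comes from the experts alone.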
KV cache to SSD. The conventional assumption is that the KV cache has to stay in RAM, and at 1M tokens it would blow well past 128GB. Antirez treats the compressed KV cache as something that can live on disk: checkpointed to Apple’s high-speed SSD, with SHA1-based resumption across restarts. The engine tracks what’s in memory, what’s on disk, and fetches as needed. It works because Apple Silicon’s NVMe storage, the same architecture I mentioned in my Apple post, reads at speeds that make SSD-as-extended-memory viable in a way it simply isn’t on a standard laptop.
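To give a feel for what hash-keyed checkpointing could look like, here is a toy sketch. I haven't read how ds4 actually does it, so everything here is an assumption: that the SHA1 is taken over the token prefix, the file layout, and the use of pickle as a stand-in for a real KV serialisation format. The class and function names are hypothetical.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path

def prefix_key(token_ids):
    """SHA1 of the token prefix: the same prefix always maps to the same file."""
    raw = ",".join(map(str, token_ids)).encode()
    return hashlib.sha1(raw).hexdigest()

class DiskKVCache:
    """Toy checkpoint store: one file per token prefix (assumed layout)."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def save(self, token_ids, kv_state):
        path = self.root / f"{prefix_key(token_ids)}.kv"
        path.write_bytes(pickle.dumps(kv_state))

    def load_longest_prefix(self, token_ids):
        """Return (n_tokens_covered, kv_state) for the longest checkpointed prefix."""
        # Linear scan from longest to shortest: fine for a toy, too slow for 1M tokens.
        for n in range(len(token_ids), 0, -1):
            path = self.root / f"{prefix_key(token_ids[:n])}.kv"
            if path.exists():
                return n, pickle.loads(path.read_bytes())
        return 0, None

with tempfile.TemporaryDirectory() as tmp:
    cache = DiskKVCache(tmp)
    cache.save([1, 2, 3], {"layers": "stand-in for real KV tensors"})
    # After a restart, a longer prompt with the same prefix resumes from token 3.
    covered, state = DiskKVCache(tmp).load_longest_prefix([1, 2, 3, 4, 5])
    print(covered, state)
```

The useful property is that a restarted process can find its checkpoint from the prompt alone, without any in-memory index surviving the restart.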
Pure Metal, no wrappers. Every kernel hand-written for Apple Silicon. No intermediate runtime, no portability tax. The result is that the Metal backend isn’t an afterthought, it’s the target the entire design was optimised for.
What I really love about this work on ds4 is that it aligns with this thesis I’ve been having that we need software-model-hardware optimised implementations (it may be confirmation bias, I’ll take it). It takes what we’ve been exploring throughout the post to its logical extreme. FlashAttention improved on standard attention by understanding exactly what the hardware does with memory. Speculative decoding improves throughput by understanding how to exploit the verification step. Liquid’s hardware-in-the-loop search improves inference by designing the architecture for the hardware constraint from the start. Antirez did all three things at once, in C, for one specific model, and published it in a few thousand lines.
I feel that we are going to see more and more of this (I personally want to do the exercise of implementing something like this for one of my favourite small models). Specific models optimised for inference in specific hardware. This is what I was referring to when I mentioned that if I can’t get a plug-and-play box for general-purpose inference, at least I want specific usable models that fit my hardware and my use case.
Towards plug-and-play local AI
Last week I wrote about the hardware side of AI independence, focusing on the setup and machines that made local inference viable without spending months of work and breaking your bank account. This week it’s the software side: the optimisations and model choices that turn decent hardware into something that feels fast and usable.
The truth is, the pieces are already here. MoE expert offloading, speculative decoding (MTP or EAGLE-3), FlashAttention, DeltaNet hybrids, narrow distilled models, everything we need exists today in open source. What I feel is missing is the glue: one opinionated stack that just works for your specific machine.
YC entrepreneurship gurus always say that you should build things that scratch your own itch. Plug-and-play local inference on a budget is the itch that I’ve set myself to scratch in the coming months (or years?). I want a tool that looks at your hardware and gives you the best possible setup (right model, right quant, right serving engine, right flags, etc.) without you having to navigate the local inference landscape. No more weekend debugging sessions. Just install once and run capable agentic AI locally, with a clear expectation of what your hardware will allow you to run (because I’ve also been burnt by false promises of how “useful” a local model could be running on specific hardware).
This lines up directly with the bigger fight @0xSero laid out: making AI education, training, self-hosting, and inference available to everyone on this planet, whatever your budget. Open source has to win. The alternative is locking frontier capability behind cloud subscriptions and six-figure rigs.
I’m working towards this starting with serious benchmarking on the Strix Halo (because it’s the hardware that I have). The more real data we have, the better the auto-config can become.
I’m completely open to collaborations on this. If you have hardware to test on, a rig or box that you don’t use anymore and that you would be willing to donate for the cause, engineering time, or want to help push the benchmark suite forward, reach out.
Until next week!

On Sunday I wrote about speculative decoding, and immediately we get Qwen3.6 with MTP and llama.cpp support: https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF
I just tested it and it looks really promising. I'll report back with some numbers.