While building LingoDesk and IRWorks, we kept running into the same technical decision: where should local LLM inference live, and how should it be delivered to the user? Here are the pitfalls we hit.

Pitfall one: bundle size. A 7B GGUF model quantized to Q4 weighs in at about 4 GB, roughly ten times the size of a full Electron app installer. We ended up shipping just the framework in the installer and pulling the model on first launch over a resumable CDN download. The first run takes an extra 1-2 minutes, but routine upgrades stay in the seconds range.
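A minimal sketch of the resumable pull, assuming a blocking reqwest client inside a Tauri command; the URL, destination path, and error handling are illustrative, and the real app also streams progress events back to the UI:

```rust
use std::fs::OpenOptions;
use std::io::Write;

/// Download the model file, resuming from a previous partial download if one exists.
/// `url` and `dest` are placeholders; in practice they come from Tauri state.
fn fetch_model(url: &str, dest: &str) -> Result<(), Box<dyn std::error::Error>> {
    let mut file = OpenOptions::new().create(true).append(true).open(dest)?;
    let offset = file.metadata()?.len();

    // Ask the CDN to continue from wherever the last attempt stopped.
    let client = reqwest::blocking::Client::new();
    let mut resp = client
        .get(url)
        .header(reqwest::header::RANGE, format!("bytes={}-", offset))
        .send()?
        .error_for_status()?;

    // 206 Partial Content means the server honored the range; a plain 200
    // means it restarted from byte zero, so drop the partial file first.
    if resp.status() == reqwest::StatusCode::OK && offset > 0 {
        file.set_len(0)?;
    }
    resp.copy_to(&mut file)?;
    file.flush()?;
    Ok(())
}
```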

Pitfall two: the cross-platform inference backend. llama.cpp goes straight to Metal on macOS; on Windows it's CUDA or plain CPU. Rather than build one binary for all platforms, we ship a separate backend module per OS and let the Tauri Rust side pick and load the right one at runtime.
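A sketch of that selection logic, assuming the backends are packaged as dynamic libraries and loaded with the libloading crate; the filenames and the CUDA probe are illustrative, not our exact layout:

```rust
/// Pick the inference backend module for the current platform.
/// Each module is a llama.cpp build with the matching acceleration enabled.
fn backend_filename() -> &'static str {
    if cfg!(target_os = "macos") {
        "llama_metal.dylib" // Metal on Apple hardware
    } else if cfg!(target_os = "windows") {
        if cuda_available() {
            "llama_cuda.dll" // NVIDIA GPU with a usable driver
        } else {
            "llama_cpu.dll" // CPU fallback
        }
    } else {
        "llama_cpu.so" // Linux CPU fallback
    }
}

/// Hypothetical CUDA probe: try to load the NVIDIA driver library and
/// fall back to CPU if it is missing.
fn cuda_available() -> bool {
    unsafe { libloading::Library::new("nvcuda.dll").is_ok() }
}

/// Load the chosen backend at runtime, so one installer can carry several
/// modules without linking them all into the main binary.
fn load_backend() -> Result<libloading::Library, libloading::Error> {
    unsafe { libloading::Library::new(backend_filename()) }
}
```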

Pitfall three: model updates. Users bought the software once, yet version bumps often come with model bumps. We decoupled model versioning from app versioning entirely: models get their own update channel and version numbers, so the app doesn't need a fresh installer every time a model changes.
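Roughly what that decoupling looks like: the app periodically fetches a small manifest from the model channel and compares it against the installed model, independent of the regular app updater. The schema below is an assumption for illustration, not our exact format:

```rust
use serde::Deserialize;

/// Model release manifest published on its own channel, separate from the
/// app updater. Field names are illustrative.
#[derive(Deserialize)]
struct ModelManifest {
    name: String,    // e.g. "qwen2-7b-q4"
    version: String, // model version, independent of the app version
    url: String,     // CDN location of the GGUF file
    sha256: String,  // integrity check after download
}

/// Compare the locally installed model version against the manifest and
/// decide whether to schedule a background download.
fn needs_model_update(installed: &str, manifest: &ModelManifest) -> bool {
    installed != manifest.version.as_str()
}
```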

Compared to Electron, Tauri's advantages in bundle size and memory footprint really show on desktop LLM workloads — every megabyte matters. If you're building something similar, we'd love to compare notes.