Local AI has had a credibility problem. The models were capable enough for demos but fell short for real work. You'd run something locally, get a mediocre result, and quietly go back to Claude or GPT-4.
Gemma 4, released April 2, changes that calculus.
This isn't a marginal improvement on Gemma 3. It's a different class of model — 4x faster, 60% lower power draw, a 256K context window, and native multimodal support out of the box. The gap between running locally and running in the cloud just got significantly smaller.
## The specs that matter
Speed: 4x faster than Gemma 3. This isn't a benchmark quirk — it's a real difference in day-to-day usability. Waiting two seconds for a response is fine. Waiting eight is not.
Battery: 60% less power consumption. If you're running inference on a laptop, this is the difference between viable and impractical. Gemma 3 would toast battery life on sustained use. Gemma 4 doesn't.
Context window: 256K tokens. That's long enough to load an entire codebase, a full research paper, or a lengthy document without chunking. Context window has been the limiting factor on useful local AI work — 256K removes it for most real tasks.
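To check whether a given codebase or document actually fits, a quick estimate is enough. This sketch uses the common rule of thumb of roughly four characters per token for English text; Gemma's real tokenizer will differ, so treat the result as a ballpark rather than a guarantee.

```python
from pathlib import Path

CONTEXT_TOKENS = 256_000   # Gemma 4's advertised window
CHARS_PER_TOKEN = 4        # rough rule of thumb; real tokenizers vary

def estimated_tokens(text: str) -> int:
    """Cheap token estimate without loading a tokenizer."""
    return len(text) // CHARS_PER_TOKEN

def fits_in_context(paths: list[Path], budget: int = CONTEXT_TOKENS) -> bool:
    """True if the concatenated files likely fit in one prompt."""
    total = sum(estimated_tokens(p.read_text(errors="ignore")) for p in paths)
    return total <= budget
```

Point `fits_in_context` at the files you'd stuff into a prompt; if the estimate comes in well under budget, you can skip chunking entirely.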
Multimodal: Native vision and audio processing built in. Not bolted on, not via a separate model — native. Send it an image and ask a question. Feed it audio. The model handles both.
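If you run models through Ollama, its `/api/chat` endpoint accepts base64-encoded images attached directly to a message, so exercising the vision side is just a matter of building the right request body. A sketch of that; the `gemma4` model tag here is a placeholder assumption, not a confirmed tag:

```python
import base64
import json
from pathlib import Path

def vision_chat_payload(image_path: str, question: str,
                        model: str = "gemma4") -> str:
    """Build a JSON body for Ollama's /api/chat endpoint with one image.

    Ollama expects images as base64 strings on the message itself.
    The "gemma4" tag is a placeholder; check the model library for
    the real tag once the model lands there.
    """
    image_b64 = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return json.dumps({
        "model": model,
        "stream": False,
        "messages": [{
            "role": "user",
            "content": question,
            "images": [image_b64],
        }],
    })

# POST the result to http://localhost:11434/api/chat once the model is pulled.
```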
Languages: Fluent in 140+. This matters if you're building tools for non-English content or working across multiple markets.
## The model sizes
Gemma 4 ships in four variants:
| Model | Best for |
|---|---|
| E2B | Edge devices, mobile, low-power hardware |
| E4B | Laptops, consumer GPUs, mid-range local inference |
| 31B | Workstation-class hardware, serious local inference |
| 26B A4B | Mixture-of-experts architecture, efficient at scale |
The E2B and E4B are the interesting ones for most people — they're designed to run on hardware you actually have rather than hardware you'd need to buy.
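To translate those variant names into hardware requirements, a back-of-envelope footprint helps: at 4-bit quantization, weights take about half a byte per parameter, plus headroom for the KV cache and runtime buffers. The parameter counts below are read off the variant names, which is an assumption, so treat the results as order-of-magnitude only:

```python
def quantized_footprint_gb(params_billion: float,
                           bits_per_weight: int = 4,
                           overhead: float = 1.2) -> float:
    """Rough RAM/VRAM estimate: weight storage at the given
    quantization, times a fudge factor for KV cache and buffers."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

# Parameter counts inferred from the variant names (assumption):
for name, params in [("E2B", 2), ("E4B", 4), ("31B", 31)]:
    print(f"{name}: ~{quantized_footprint_gb(params):.1f} GB at 4-bit")
```

By this estimate the E2B and E4B land comfortably inside ordinary laptop RAM, which is consistent with Google positioning them as the consumer-hardware variants.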
## The licence
Apache 2.0. Commercially permissive. You can use it in products you sell. You can fine-tune it, deploy it, build on top of it without asking Google's permission or paying a licence fee.
This matters more than it sounds. A lot of "open" models come with non-commercial restrictions or require attribution in ways that make commercial use awkward. Gemma 4 doesn't. It's actually open in the way that's useful.
## What this means for local vs cloud AI
The honest framing until recently was: use cloud AI (Claude, GPT-4, Gemini) for anything where quality matters, and local models for experiments or privacy-sensitive tasks where "good enough" was acceptable.
Gemma 4 blurs that line. A 256K context window with native vision and audio at 4x the previous speed starts to compete with cloud models on capability, not just privacy. The intelligence-per-parameter efficiency — Google's framing, but it's accurate — means you're getting significantly more from the same hardware.
For tasks where you want:
- Privacy — nothing leaves your machine
- Cost control — no per-token billing
- Latency — no round-trip to a cloud API
- Offline capability — works without internet
...Gemma 4 is now a genuine first-choice option rather than a fallback.
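The cost-control point is easy to put numbers on. A toy comparison, where the cloud rate is a hypothetical placeholder rather than any provider's actual pricing:

```python
def cloud_cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Per-token billing: cost scales linearly with usage."""
    return tokens / 1e6 * usd_per_million_tokens

# Hypothetical $10 / million tokens, for illustration only.
monthly_tokens = 50_000_000
print(f"cloud: ${cloud_cost_usd(monthly_tokens, 10.0):,.2f}/month")
print("local: $0 marginal cost once the hardware is paid for")
```

The crossover point depends entirely on your volume and hardware, but the shape of the curve is the argument: cloud cost grows with every token, local cost doesn't.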
## Where to get it
Gemma 4 is available through Google AI Studio, Hugging Face, and Kaggle. If you're using Ollama for local inference, models typically appear within days of a major release — check the Ollama model library.
The Gemma series has had over 400 million downloads and 100,000 community variants since launching in 2024. The ecosystem around tooling, fine-tuning resources, and integration guides is substantial.
If you haven't revisited local AI since Gemma 3, this is the update worth picking back up.