At some point when you’re using the Claude Haiku API to filter web search results or narrow down relevant files in a codebase, a question creeps in: “Does this really need to go to a cloud API?” There’s also the nagging feeling that internal code or potentially sensitive file contents are passing through an external server.
Running gemma4:e4b locally through Ollama changed my perspective on this. For high-repetition, lightweight tasks, the quality gap with Haiku is barely noticeable — and the API cost is zero. This post is a record of that experience.
What gemma4:e4b actually is
gemma4 is the Gemma 4 series released by Google DeepMind in 2025. The e4b variant stands for “early 4-bit quantized” — it fits within roughly 4GB of memory. The quantization trades a bit of quality against a full-precision model, but it lands at a practical sweet spot for local inference. On Apple Silicon Macs, inference runs directly on unified memory without a discrete GPU. My setup is an M5 Pro with 48GB of unified memory. At that spec, the model loads in a few seconds and short responses come back in under a second. With a ~4GB model sitting in 48GB of unified memory, there’s no memory pressure at all.
Setup
On macOS, installing Ollama itself is a one-liner.
brew install ollama
Pulling the model is equally straightforward.
ollama pull gemma4:e4b
The download is about 4–5GB. After that, everything runs locally with no network dependency. Start the server with ollama serve, or just launch the app — it starts in the background automatically. The API is exposed as REST on localhost:11434 by default. It also supports an OpenAI-compatible endpoint at /v1/chat/completions, which means existing code needs almost no changes.
Calling it from Python looks like this:
from openai import OpenAI

# Ollama's OpenAI-compatible endpoint ignores the API key value; any placeholder works
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="gemma4:e4b",
    messages=[{"role": "user", "content": "..."}],
)
print(response.choices[0].message.content)
Filtering web search results
The first thing I wired it up to was relevance filtering on web search output. The pattern: fire off a technical keyword search, get back 10–20 results, pass the whole batch to gemma4:e4b, and ask it to return only the links actually related to what I’m looking for.
In practice, I’d fetch search result JSON from the DuckDuckGo or Brave Search API, bundle each title and snippet together, and prompt the model with something like “from these results, pick the ones that contain real usage examples for the XXX library.” The results held up well. It reliably separated official docs, release notes, and actual code examples from SEO-bait and loosely related content.
The speed was what surprised me most. Even with 20 search results passed in one shot, responses came back in around 2 seconds on the M5 Pro. With a cloud API you’re paying for network round-trips plus token costs. Locally, none of that matters — you can run the same query a hundred times in a script without a second thought.
One thing that does matter: prompt structure. A vague instruction like “pick the relevant ones” produces inconsistent results. The more specific the criteria, the better the output. If you want JSON back, include an example format directly in the prompt.
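A minimal sketch of that loop, assuming the search results have already been fetched into a list of title/url/snippet dicts (the results and the criterion below are placeholders):
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# Placeholder: whatever your search API returned
results = [{"title": "...", "url": "...", "snippet": "..."}]

prompt = (
    "From the search results below, pick only the ones that contain real "
    "usage examples for the XXX library.\n"
    'Respond as JSON: {"relevant": [{"url": "...", "reason": "..."}]}\n\n'
    + json.dumps(results, ensure_ascii=False)
)

response = client.chat.completions.create(
    model="gemma4:e4b",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)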
File exploration and code search
The second use case I kept coming back to is navigating unfamiliar repositories. When you land in a codebase for the first time and want a quick map of what lives where, it works surprisingly well.
The workflow: dump the directory tree with find or tree, pass it to gemma4:e4b, and ask “where would the config files likely be?” or “which directory probably holds the routing logic?” Working from filename patterns and directory names alone, the model makes reasonable guesses. Feed it a slice of the relevant file and it gives a useful summary — “this looks like middleware configuration, and the auth logic appears to chain from here.”
Piping grep output into it is similarly effective. When a symbol search returns 20–30 lines, it’s tedious to scan manually and figure out which hits are definitions versus call sites. Paste the full grep output and ask “find just the definition location” — the model returns the file path and line number accurately more often than not. It does get it wrong sometimes, but it narrows down the first candidate far faster than reading through the raw output yourself.
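A rough sketch of that grep hand-off (the symbol name and search path are made up; any symbol search that produces file:line hits works the same way):
import subprocess
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# Illustrative search; substitute your own symbol and directory
hits = subprocess.run(
    ["grep", "-rn", "parse_config", "src/"],
    capture_output=True, text=True,
).stdout

response = client.chat.completions.create(
    model="gemma4:e4b",
    messages=[{
        "role": "user",
        "content": "Below is grep output for the symbol parse_config. "
                   "Find just the definition location and answer with the "
                   "file path and line number.\n\n" + hits,
    }],
)
print(response.choices[0].message.content)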
The single most useful case was log analysis. Feeding it a few hundred lines of a stack trace and asking “summarize the lines most closely related to the root cause” produced output on par with what Haiku would give — core lines identified cleanly, noise filtered out.
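Log analysis follows the same shape; this sketch trims the input to the tail of the file (the path and the character cap are arbitrary) so it stays well inside the model's small context window:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# Arbitrary path and cap; the root cause usually sits near the end of a trace
trace = open("error.log").read()[-8000:]

response = client.chat.completions.create(
    model="gemma4:e4b",
    messages=[{
        "role": "user",
        "content": "Summarize the lines most closely related to the root "
                   "cause of this stack trace:\n\n" + trace,
    }],
)
print(response.choices[0].message.content)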
Automating delegation with Claude Code skills
Rewriting a Python script every time you want to call local Ollama gets old fast. In practice, I use Claude Code’s skill system to define trigger rules once — essentially “for this class of task, route to Ollama” — and reuse them across sessions.
A skill is a Markdown file stored at ~/.claude/skills/<skill-name>/SKILL.md. The YAML frontmatter at the top defines the name and a description of when to invoke it. Claude Code reads that description during a conversation and loads the skill automatically when the conditions match.
My ollama-local-delegate skill looks roughly like this:
---
name: ollama-local-delegate
description: Use when facing repetitive file scanning, bulk text processing,
  pattern extraction tasks, or web research where preserving Claude's context
  window matters. Delegates grunt work to local Ollama LLM (gemma4:e4b) via API.
---
# Ollama Local Delegate
## When to Use
- Scanning dozens of files for patterns or classification
- Bulk text summarization or transformation
- Repeated binary judgments ("does this file match condition X?")
- Filtering web search results
## Core Pattern
python3 - /path/to/files/*.md <<'PYEOF'
import json, sys, urllib.request

for path in sys.argv[1:]:
    content = open(path).read()[:2000]
    payload = json.dumps({
        "model": "gemma4:e4b",
        "prompt": f"...",
        "stream": False
    }).encode()
    req = urllib.request.Request("http://localhost:11434/api/generate", data=payload)
    with urllib.request.urlopen(req) as r:
        print(f"{path}: {json.load(r)['response'].strip()}")
PYEOF
The description field is where most of the work happens. The more specific it is about when to trigger the skill, the more reliably Claude invokes it at the right moment. Write it too loosely and it fires on irrelevant tasks; too narrowly and it gets missed when it would actually help. Tuning that description over a few real uses is most of what skill authoring involves.
It’s also worth including explicit “don’t do this” notes in the skill body. For example, interpolating file contents directly into a curl command will silently corrupt the JSON payload — having that written down in the skill prevents running into the same issue twice.
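To see why (the file path here is a placeholder): a hand-built JSON string breaks as soon as the file contains quotes, backslashes, or newlines, while json.dumps escapes all of that for you.
import json

# Placeholder path standing in for an arbitrary source file
content = open("some_file.py").read()

# Fragile: raw quotes and newlines in `content` silently corrupt the body
bad_body = '{"model": "gemma4:e4b", "prompt": "' + content + '", "stream": false}'

# Safe: let json.dumps handle the escaping
good_body = json.dumps({"model": "gemma4:e4b", "prompt": content, "stream": False})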
Creating a skill from scratch
The mechanics are minimal:
mkdir -p ~/.claude/skills/my-skill-name
touch ~/.claude/skills/my-skill-name/SKILL.md
The file follows this structure:
---
name: skill-name
description: When to invoke this skill — describe the specific situation in plain language
---
# Skill Title
## When to Use
(conditions that should trigger this skill)
## Pattern or Instructions
(the procedure or code template Claude should follow)
No restart required. Claude Code picks up the new skill immediately and it’s available via the Skill tool during any conversation.
A practical tip: don’t try to write a perfect skill upfront. The better approach is to notice mid-task when you’ve written something you’ll want to reuse, then copy it into a skill file on the spot. Moving a working Python snippet into a skill takes two or three minutes. Once you’ve accumulated a handful of skills, routing repetitive work to the local model starts to feel natural rather than deliberate.
Where it falls short
The limits are real. Instruction-following precision is noticeably below Haiku. Complex multi-condition prompts or requests that chain several steps together tend to lose coherence partway through. The context window is also constrained enough that long files can cause the model to drop content from earlier in the input or ignore instructions that appear toward the end.
The operating principle that works: one task, one input, one output. Anything requiring nuanced judgment or complex reasoning goes to a cloud model. Repetitive tasks with clear boundaries stay local. Splitting responsibilities that way has meaningfully reduced API costs, and there’s a quiet satisfaction in knowing that internal code and files aren’t leaving the machine. It’s less about finding a drop-in replacement for Haiku and more about having the right tool for each job.