If you’re a developer these days, there’s a good chance you’re using at least one AI CLI daily. Command-line tools that let you call AI models directly from the terminal have become the new productivity standard — moving well beyond browser-based chat interfaces. But spend enough time with any single tool and one thing becomes clear: no one model is best at everything.
This post walks through how to set up Claude Code, Gemini CLI, and Codex CLI as a multi-model orchestration environment — covering installation and authentication for each, the actual invocation patterns, how to pass context between models, and the chaining scenarios I reach for most. Getting this set up takes an hour or two the first time, but once it clicks, the productivity difference is hard to ignore.
Where a Single Model Hits Its Ceiling
It’s no secret that different AI tools have different strengths. Some models are better at long-range reasoning and planning, some are unusually good at reading large documents accurately, and some are fast and precise at generating code. If you route the same task through different models, the quality gap becomes obvious fast.
Take a concrete example: you receive a 100-page PRD and need to implement a backend API from it. Using Claude Opus 4.7 alone means paying a significant context cost to load the whole document. Relying only on GPT-5.5 gives you fast code generation, but the model frequently misreads the overall structure in the early stages. Gemini 3.1 Pro alone handles document analysis brilliantly, but consistency drops on long code generation tasks. Hand the entire job to any single model and you pay in either cost or quality at some point in the pipeline.
The fix is simple: assign each step to the model that handles it best. That’s the core idea behind multi-model orchestration — less about exotic architecture, more about putting the right tool in the right slot.
Claude, Gemini, Codex — Who Does What
In my setup, the three models have clearly defined roles.
Claude is the control tower. It breaks down tasks, decides which model handles what, and reviews the outputs. Its strengths in planning, reasoning, and code review make it the natural orchestrator. Claude Code in particular handles file editing, tool use, and multi-step reasoning simultaneously, which makes it well-suited to managing the overall flow. It’s the most expensive of the three, but if you reserve it for decisions that actually require precise judgment, the cost justifies itself.
Gemini CLI takes on long documents and large files. Gemini 3.1 Pro’s context window runs to 1 million tokens, and its document parsing, summarization, and extraction capabilities are genuinely strong. It’s the obvious choice for loading an entire PRD or a large existing codebase and pulling out just what you need. The output pricing is also the lowest of the three, so batch processing doesn’t sting.
Codex CLI handles code generation and completion. Given a processed spec, it efficiently generates functions and modules or fills in boilerplate. It’s particularly well-balanced between response speed and code quality on targeted implementation requests. Running on the GPT-5.5 Codex variant, its code context awareness sits a level above standard GPT.
In one line: Gemini reads and organizes, Codex writes and fills, Claude judges and ties it together.
Installing and Authenticating All Three
Before anything else: you’ll need Node.js 18 or later and Homebrew (macOS) or an equivalent package manager. All three tools install globally and run from the command line.
Claude Code installs with a single npm command:
npm install -g @anthropic-ai/claude-code
On first launch, type claude and the browser opens for authentication. You can log in with an API key from the Anthropic Console or with a Claude Pro/Max subscription account. Once authenticated, the token is stored locally in ~/.claude/ — no environment variables to manage.
Gemini CLI is Google’s official tool, also installed via npm:
npm install -g @google/gemini-cli
Running gemini for the first time opens the browser for a Google account OAuth flow. After that, the token is saved locally and subsequent calls just work. In environments where you can’t open a browser (CI, headless servers), you can set GEMINI_API_KEY with a key from Google AI Studio — but for a personal dev environment, direct login is far simpler.
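For those headless cases, the whole fallback is one environment variable before the call. A minimal sketch (the file name design_doc.md is just a placeholder):
# Headless/CI fallback: use a key from Google AI Studio instead of OAuth
export GEMINI_API_KEY="your-key-from-ai-studio"
gemini -p "Summarize this document." < design_doc.md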
Codex CLI is OpenAI’s official tool, available via npm or Homebrew:
# npm
npm install -g @openai/codex
# or Homebrew (macOS)
brew install codex
The authentication pattern is the same. Run codex or codex login and an OAuth flow for your ChatGPT account launches in the browser. A ChatGPT Plus or Pro subscription is all you need — no separate API key required, though the key-based flow is also supported if you prefer it.
All three follow the same pattern: install, do the browser OAuth once, and you’re done. No API key generation, no .env files to maintain. The credentials tie naturally to your subscription billing for each platform.
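A quick smoke test confirms all three are authenticated before you start chaining; claude -p (print mode), gemini -p, and codex exec are the documented one-shot modes, but treat this as a sketch if your versions differ:
# Each CLI should print a short reply without opening an interactive session
claude -p "Reply with the word OK."
gemini -p "Reply with the word OK."
codex exec "Reply with the word OK."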
Chaining — Patterns for Connecting Models
The core of chaining is explicitly passing one model’s output as context for the next. Since each model only knows what’s in front of it, you need to be deliberate about what you include.
The pattern I reach for most is a four-stage flow: analyze → plan → implement → review. For building a new API, it looks like this:
# Stage 1: Gemini analyzes the PRD and extracts the spec
gemini -p "Analyze this PRD and extract the API endpoints, data models, and business logic. Output as markdown." < prd.md > spec.md
# Stage 2: Claude Code produces an implementation plan (-p runs in non-interactive print mode)
claude -p "Read spec.md and suggest a step-by-step plan and file structure for implementing this API." > plan.md
# Stage 3: Codex generates the actual code (exec runs non-interactively; the context rides along inside the prompt)
codex exec "Based on the following spec and plan, generate the FastAPI endpoints and Pydantic models: $(cat spec.md plan.md)" > implementation.py
# Stage 4: Claude Code reviews the output
claude -p "@implementation.py Review this code and identify any edge cases or security issues."
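Once each stage works on its own, it's worth wrapping the chain in a script so a failed stage halts the run instead of feeding an empty file downstream. A minimal sketch reusing the file names above (run_chain.sh is my own naming):
#!/usr/bin/env bash
# run_chain.sh: analyze → plan → implement → review, halting on the first failure
set -euo pipefail
gemini -p "Analyze this PRD and extract the API spec as markdown." < prd.md > spec.md
claude -p "Read spec.md and suggest an implementation plan and file structure." > plan.md
codex exec "Generate the FastAPI endpoints and Pydantic models from this spec and plan: $(cat spec.md plan.md)" > implementation.py
claude -p "@implementation.py Review this code for edge cases and security issues."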
When the planning stage is lightweight, I skip Claude and chain Gemini directly to Codex:
# Extract the core functions from a codebase, then generate tests
find src -name "*.py" | xargs cat | gemini -p "Extract the key functions and their signatures from this codebase." > functions.md
codex exec "Write pytest unit tests for each of these functions: $(cat functions.md)" > test_suite.py
The reverse pattern is also useful — build the code first, then produce documentation from it:
# Implement first, document after
codex exec "Add a Redis-based caching layer to the GraphQL server in this repo." > cache_layer.py
gemini -p "Explain how this code works and write a technical blog post about it." < cache_layer.py > blog_post.md
The triangle pattern — routing through all three models — is what I use when output quality really matters. Gemini handles analysis, Claude handles design and review, Codex handles code generation, and Claude does a final pass on the combined output. It takes more time and costs more, but for production code it’s easily worth it.
Automatic Routing with a Shell Function
If manually specifying which CLI to call each time feels tedious, a shell function can route automatically based on the task type or input size:
# Add to ~/.zshrc or a separate script
ai() {
  local prompt="$1"
  local input="$2"
  # Large file input → Gemini
  if [[ -f "$input" && $(wc -c < "$input") -gt 50000 ]]; then
    gemini -p "$prompt" < "$input"
  # Code generation keywords → Codex
  elif [[ "$prompt" =~ (write|implement|create|generate|refactor) ]]; then
    codex exec "$prompt"
  # Default → Claude (print mode, so it behaves like the other two)
  else
    claude -p "$prompt"
  fi
}
One or two functions like this smooth out the daily workflow considerably. If you want to go further, you can pipe outputs through jq for JSON post-processing, or integrate the whole chain into GitHub Actions for an automated PR review pipeline.
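As an example of the jq route, you can ask Gemini for machine-readable output and slice it downstream. A sketch; it assumes the model actually returns bare JSON, which is worth stating explicitly in the prompt:
# Ask for JSON, then filter with jq
gemini -p "List the API endpoints in this spec as a JSON array of {method, path} objects. Output only the JSON, no prose." < spec.md | jq -r '.[] | "\(.method) \(.path)"'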
A more advanced option is Claude Code’s slash commands or custom skills. Define a /route command inside a Claude session and you can say “send this to Gemini” in natural language. Claude calls the external CLI, retrieves the result, and passes it forward.
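The shape of such a command is a markdown file under .claude/commands/ whose body becomes the prompt, with $ARGUMENTS substituted at call time. A sketch for a hypothetical /route command (the frontmatter fields follow the documented custom-command format; the routing logic itself is up to you):
---
allowed-tools: Bash(gemini:*), Bash(codex:*)
description: Forward a request to another model CLI and report back
---
Decide whether Gemini or Codex is better suited to the request below, run that CLI via Bash, and summarize its output.
$ARGUMENTS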
Preventing Context Loss
The biggest pitfall in chaining is context loss. If the spec extracted precisely in stage one doesn’t reach the code generation model intact in stage three, output quality drops fast.
Three practices prevent this. First, always write intermediate results to files. Passing data through shell variables is tempting but fragile — they have length limits and are prone to breaking on special characters. Write to a file, then read from it in the next stage.
Second, use explicit structural markers when handing off between models:
=== Previous Stage Output (Gemini PRD Analysis) ===
{spec content}
=== Next Stage Instructions ===
Using the spec above, write the FastAPI endpoints. Requirements:
- async/await throughout
- Pydantic v2 models
- Error responses following RFC 7807
With clear structure, the model never confuses what’s context and what’s a new instruction. The more stages you chain, the bigger the impact this structure has on final quality.
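In shell, assembling that handoff is plain concatenation. A sketch using the stage-one output from earlier:
# Build the handoff file with explicit markers, then feed it forward
{
  echo "=== Previous Stage Output (Gemini PRD Analysis) ==="
  cat spec.md
  echo "=== Next Stage Instructions ==="
  echo "Using the spec above, write the FastAPI endpoints (async/await, Pydantic v2, RFC 7807 errors)."
} > handoff.md
codex exec "$(cat handoff.md)" > implementation.py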
Third: validate at each step. Once you have an automated chain, it’s tempting to run it end to end and just look at the final output. When you’re first building the pipeline, check the output at each stage manually before moving forward. You’ll quickly identify which stage is degrading quality, and you can tune just that prompt rather than debugging the whole chain.
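Once the chain is automated, the cheapest version of that check is asserting each intermediate file is non-empty before the next stage runs. For example:
# Guard between stages: an empty spec means stage one silently failed
[[ -s spec.md ]] || { echo "spec.md is empty; rerun the analysis stage" >&2; exit 1; }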
Keeping Costs and Speed in Check
Running multiple API calls per task raises a natural question about cost. In practice, routing each stage to the right model often makes the total cost lower than using a single expensive model end to end.
Long document analysis is where Gemini’s pricing advantage is most obvious. Fast, targeted code generation is Codex’s sweet spot. Expensive Claude Opus 4.7 stays reserved for the stages that genuinely need precise reasoning or review. In the six months since I set this up, the same output has cost 30–50% less than running everything through Claude alone.
Speed also benefits from specialization. Running Gemini Flash variants for analysis in parallel while Codex handles code generation independently cuts total elapsed time, especially for tasks where analysis and implementation are cleanly separable.
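In shell terms that’s just background jobs plus wait. A sketch for a task where the two streams are independent (constraints.md and scaffold.md are placeholder names):
# Run analysis and scaffolding in parallel, then join before the merge stage
gemini -p "Summarize the data model constraints in this PRD." < prd.md > constraints.md &
codex exec "Scaffold a FastAPI project layout for a CRUD service." > scaffold.md &
wait  # both outputs are now ready for the next stage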
For cost monitoring, the usage dashboards from OpenAI, Anthropic, and Google are a good start; checking them once a day is usually enough to catch unexpected token spikes. If you want fine-grained visibility into which stage consumes the most, a usage tracking layer like Helicone or LangFuse makes it quantitative.
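Before reaching for a full tracking library, even a crude per-stage log shows where the volume goes; byte counts are a rough proxy for tokens. A sketch (the helper name and log path are my own):
# Append one line per stage: timestamp, stage name, input/output sizes
log_stage() {
  local stage="$1" infile="$2" outfile="$3"
  printf '%s\t%s\tin=%sB\tout=%sB\n' "$(date +%Y-%m-%dT%H:%M:%S)" "$stage" \
    "$(wc -c < "$infile")" "$(wc -c < "$outfile")" >> ~/.ai_chain_usage.log
}
log_stage analyze prd.md spec.md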
A Few Practical Notes from Running This Daily
A few smaller things worth mentioning from actually running this setup day to day.
Prefer OAuth over API keys wherever possible. No key generation, no environment variables to manage, billing ties cleanly to your subscription, and there’s no key to accidentally leak. Save the key-based flow for CI environments where a browser can’t open — and in those cases, store the key in GitHub Actions Secrets or AWS Secrets Manager rather than in plain files.
Request markdown output from each CLI whenever possible. It’s easier for the next model to parse, and easier for you to read when reviewing intermediate results. JSON output is worth it only when you have a specific post-processing step downstream.
For long-running jobs, run them inside tmux or screen. If your SSH connection drops, the work continues; you can also kick off a job in the background and switch to something else while it runs.
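For example, running the wrapper script sketched earlier in a detached tmux session:
# Start detached; reattach later with: tmux attach -t chain
tmux new-session -d -s chain './run_chain.sh'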
Finally, this setup is not meant for every task. For a quick one-liner or a fast question, a single CLI is all you need. Multi-model orchestration earns its overhead on complex multi-step work or tasks with large context requirements. It’s a tool, not a default.
The AI CLI ecosystem will keep expanding, and the era of picking one model and sticking with it exclusively is fading. Understanding each model’s strengths, knowing where to place them, and having the setup to automate that routing is becoming a baseline developer skill. It takes a couple of hours to set up the first time — but once you’re past that initial friction, the compounding effect on daily output is real. Putting the right tool in the right slot sounds obvious, but the gap it creates in the final work is anything but small.