Moonshot Kimi K2.5 on OpenRouter
The multimodal coding heavyweight, with the Fireworks routing play
Moonshot AI's Kimi K2.5 delivers a multimodal coding model that outperforms top proprietary systems on key benchmarks, with OpenRouter making it practical at roughly sixty cents per million input tokens (and real-world blended costs with caching closer to $0.30 / M tokens).
See my video discussion on this model and my workaround to use it with custom provider routing in Claude Code and other agent harnesses:
The new heavyweight contender
Kimi K2.5 arrived as a “use this right now” model for agentic coding workflows: long context, true multimodal inputs, and a cost structure that makes high-token loops feel economically sane.
Key specs that matter in practice
Kimi K2.5 is positioned as Moonshot AI’s native multimodal model, built on Kimi K2 and continued pretraining over an additional 15 trillion mixed visual and text tokens, with an emphasis on visual coding and agentic tool use. It ships with a 262,144-token context window on OpenRouter, which separates “toy demo” from “real repo scale” for many agent harnesses.
The multimodal capability here is a technical leap for Kimi. Moonshot’s own framing (and early reporting) centers on feeding screenshots or videos and asking the model to recreate interfaces and interactions. That workflow describes exactly the kind of input that front-end work usually struggles to express in pure text.
The benchmarks people keep screenshotting
Kimi K2.5 shows up strongly on coding and multimodal leaderboards that many teams already track.
SWE-bench Multilingual: reported as scoring higher than GPT-5.2 and Gemini 3 Pro.
Humanity’s Last Exam (HLE): reported as surpassing GPT-5.2 (xhigh) and Claude Opus 4.5 when tools are available.
VideoMMMU: reported as beating GPT-5.2 and Claude Opus 4.5 on video reasoning.
These results explain the “leaderboard disruption” feeling. K2.5 performs well without needing a niche scenario to look good.
Price that changes the workflow
The OpenRouter model card lists Kimi K2.5 at $0.50 per million input tokens and $2.80 per million output tokens. Fireworks’ pricing shows $0.60 per million input tokens, $0.10 per million cached input tokens, and $3 per million output tokens.
Compare that to the usual default choices:
Claude Sonnet 4.5: $3 per million input tokens and $15 per million output tokens.
Claude Opus 4.5: $5 per million input tokens and $25 per million output tokens.
Even with the $0.60 figure, K2.5 lands around 5x cheaper than Sonnet on both input and output, and around 8x cheaper than Opus on input. That cost gap lets you run agents “wide” and “deep” without immediately developing a billing-induced personality disorder.
Kimi K2.5 is also available on OpenRouter right now, which matters because it turns a model release into something you can actually route into your existing stack in minutes.
Geography and tooling
The “China constraint”
Moonshot AI is a Chinese company headquartered in Beijing. For many developers and companies, that immediately triggers policy constraints around where prompts, code, documents, and customer context can be sent.
Teams can admire Kimi K2.5’s performance and still have to pass because their compliance posture treats direct calls to certain endpoints as a nonstarter.
The provider-level solution
OpenRouter offers a clean abstraction layer: one API surface, multiple providers, and a routing configuration that can select which upstream actually serves the request.
If your concern is “I want access to these weights, but I want inference served through infrastructure that fits my org’s data handling requirements,” provider routing is the lever you need.
Fireworks is a US-based company with headquarters in Redwood City, California, making it a natural target for teams that want a US inference footprint while still using K2.5.
Getting your harness to obey provider routing
OpenRouter supports a provider object in the request body for chat completions, including fields like order and allow_fallbacks. Many agent harnesses expose a way to add extra_body JSON. Some do not. Some support it partially. Some support it until tool calls appear and then everything turns into a haunted house.
The model is easy to access, the routing knob exists, and your tooling sometimes refuses to grab it.
Setting up the agents
Before the harness-specific recipes, here is the core idea you are trying to express to OpenRouter:
{
“model”: “moonshotai/kimi-k2.5”,
“messages”: [{”role”: “user”, “content”: “...” }],
“provider”: {
“order”: [”fireworks”],
“allow_fallbacks”: false
}
}That provider object is first-class in OpenRouter’s routing docs.
OpenCode (native support)
OpenCode can lock OpenRouter routing at the config layer, so you can pin K2.5 to Fireworks without duct-taping a proxy.
In ~/.config/opencode/opencode.json, use a model-specific provider policy:
{
“provider”: {
“openrouter”: {
“models”: {
“moonshotai/kimi-k2.5”: {
“options”: {
“provider”: {
“order”: [”fireworks”],
“allow_fallbacks”: false
}
}
}
}
}
}
}OpenCode exposes the routing surface area directly. They are also promoting the model with a limited-time free use:
OpenHands (LiteLLM extra-body injection)
OpenHands uses LiteLLM SDK, which provides a convenient place to inject OpenRouter’s routing object via an environment variable.
A minimal setup looks like this:
export LLM_LITELLM_EXTRA_BODY=\
'{"provider":{"order":["fireworks"],"allow_fallbacks":false}}'
export LLM_API_KEY="$OPENROUTER_API_KEY"
export LLM_MODEL="openrouter/moonshotai/kimi-k2.5"
openhands --override-with-envsOpenHands can express the right request body this way.
Claude Code (the shim workaround)
Claude Code speaks in Anthropic Messages API terms. That interface gives you no clean hook for OpenRouter’s per-request provider routing object. The shim exists for exactly this reason.
The shim runs locally and injects the provider policy server-side.
Start the shim:
export OPENROUTER_API_KEY=”sk-or-v1-...”
export ANTHROPIC_MODEL=”moonshotai/kimi-k2.5”
npx openrouter-provider-shim --provider-only fireworks
# In another shell, point Claude Code at the shim:
export ANTHROPIC_BASE_URL=”http://127.0.0.1:8787”
claudeThe shim also supports “smart substitution,” where it detects an inbound Anthropic-style key and swaps to your OpenRouter key when OPENROUTER_API_KEY is present.
Claude Code becomes a drop-in high-throughput Kimi K2.5 client with Fireworks pinned as the upstream provider.
Droid from Factory.ai (shim for reliable tool calls)
Factory’s Droid supports custom model configs that can target OpenRouter, including an extraArgs section where a provider policy can live. The shim repo includes an example config that expresses Fireworks pinning in that native format.
In practice, the shim route remains the stable path when tool calls misbehave, because the shim presents an OpenAI-compatible /v1 surface and injects routing in a consistent way.
The shim repo’s recommended approach:
Point Droid’s custom model baseUrl to the shim. In ~/.factory/settings.json:
{
“model”: “moonshotai/kimi-k2.5”,
“id”: “custom:Kimi-K2.5-[shim->-Fireworks]-2”,
“index”: 2,
“baseUrl”: “http://127.0.0.1:8787/v1”,
“apiKey”: “sk-or-v1-REDACTED”,
“displayName”: “Kimi K2.5 [shim -> Fireworks]”,
“maxOutputTokens”: 131072,
“noImageSupport”: false,
“provider”: “generic-chat-completion-api”
}Start the shim:
export OPENROUTER_API_KEY=”sk-or-v1-...”
npx openrouter-provider-shim serve --port 8787 --provider-only fireworks
# In another shell, run droid:
droidThat config is included as a working pattern in the shim repo docs.
Beating rate limits and costs
Rate limits
When a new model becomes the hotness, shared capacity gets slammed. The shim explicitly calls this out and includes automatic retry logic for Claude Code style traffic, with escalating backoff delays.
That retry logic keeps long agent runs from dying to transient 429s (HTTP errors indicating rate limits), especially when your harness insists on firing requests like a metronome.
The capacity hack with BYOK
The cleanest way to increase headroom is to add your own Fireworks key to OpenRouter’s “Bring Your Own Keys” integration. OpenRouter prioritizes provider keys when available, and it supports combining BYOK with provider ordering so your preferred route gets tried first.
BYOK shifts the situation from “shared pool” to “my account’s capacity,” which is exactly what you want when running high-throughput agent loops.
Why the blended cost can look lower than list price
Fireworks publishes a cached-input price for K2.5 of $0.10 per million tokens. If your workflow has stable system prompts, repeated repo context, or long-running agent sessions with a lot of overlapping prefix tokens, cache hits can drag the effective input cost down meaningfully.
You can see a blended input/cache/output cost closer to $0.30 per million tokens in real usage logs, even when the list input price says sixty cents. A healthy fraction of your “input tokens” can bill at the cached rate.
Vibe check
Kimi K2.5 combines three traits that rarely show up together: strong coding benchmarks, real multimodal inputs (including image and video), and pricing that makes long-context agent work feel routine.
I used Kimi Code through the VS Code extension to develop the shim, and had a delightful experience.
Pin K2.5 to Fireworks when your data residency posture calls for it, use native provider routing where your harness supports it, and use the local shim when your harness cannot express OpenRouter’s provider object cleanly.
For the shim, installation and execution is as simple as:
npx openrouter-provider-shim --provider-only fireworks







Smart to route through OpenRouter rather than hitting Moonshot's API directly. At least that way your data goes through OpenRouter's pipeline first. Kimi's privacy policy lets them train on everything you send though. I looked into the specifics and it's a pretty different story from what Anthropic or even some other Chinese providers do: https://reading.sh/which-ai-providers-wont-train-on-your-data-e38280ff9887
The Fireworks routing tip is solid. I've been running K2.5 through Synthetic instead for the flat-rate pricing — no per-token anxiety during long agentic loops. OpenCode's provider config makes the swap trivial. Wrote up the full setup end-to-end here: https://reading.sh/the-definitive-guide-to-opencode-from-first-install-to-production-workflows-aae1e95855fb