XFMS
XF Model Source

Pick the right LLM for every task — quality, cost, latency, and capability fit in a single ranked answer.

AI on its own predicts the pattern — what model do most people pick for queries that look like this? — and gives you a plausible-sounding answer that's often wrong for *you*. XFMS does the decompression instead. It takes your stated purpose, infers which quality benchmarks matter, applies your stated capability requirements, ranks every model in the catalog against those weights, and returns a finite shortlist with plain-English rationale per pick. No provider self-reports. No single-source benchmarks. Continuous catalog updates from eight independent evaluators. The model-selection layer beneath every other XF module.

Install

Stop guessing which AI model is right for your task. Once XFMS is connected to your AI assistant, you just ask — “which model should I use for [whatever]?” — and it comes back with a ranked answer plus the reasons each model is on the list. No coding, no benchmarks to read, no leaderboards to compare.

Step 1 — Get your free token (1 minute)

Go to xpansion.dev/xfms/get-started. Enter your email and click the link in the confirmation email; your access token then arrives in a second email. It looks like xfms_live_… and you will need it in the next step, so keep it handy.

Step 2 — Connect it to your AI assistant

Pick the tool you use:

Claude Code — open your terminal and paste this, replacing the token at the end with yours:

claude mcp add xfms --transport http https://xfms.vercel.app/mcp/ \
  --header "Authorization: Bearer xfms_live_your_token_here"

Cursor — open Settings → MCP, paste this:

{
  "mcpServers": {
    "xfms": {
      "url": "https://xfms.vercel.app/mcp/",
      "headers": {
        "Authorization": "Bearer xfms_live_your_token_here"
      }
    }
  }
}

Cline, Continue, Claude Desktop, or any other AI tool that supports MCP: use the same URL and the same header. Each tool's config layout is slightly different, so check that tool's MCP documentation for where the URL and header go.

Step 3 — Ask

Restart your AI tool if it asks you to. Then just talk to your assistant the way you already do:

  • “Use XFMS to pick a model for OCR on handwritten shipping manifests.”
  • “Use XFMS — which is the cheapest model that can summarize a 50-page contract?”
  • “Run XFMS’s A/B test on the top 3 models for writing emails.”

No OpenRouter key required. The hosted endpoint covers the small inference call XFMS makes internally. Your free XFMS access token is all you need.

What it does

1. Real-world A/B probes: see how the picks actually behave on your kind of query.

Add the --ab flag and XFMS runs the top 3 picks against 5 generated test queries (expanding to 10 or 15 if the picks trade wins), then surfaces real-world cost/latency stats plus plain-English commentary about who won what. You stop guessing whether the leaderboard pick is actually right for your workload.
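The escalation from 5 to 10 or 15 queries can be pictured with a short sketch. This is an illustration under stated assumptions, not XFMS's implementation: the `run_probe` callback, the `has_clear_winner` rule (one model winning a strict majority of the queries run so far), and every name below are hypothetical. The only details taken from the description are the 5/10/15 batch sizes and the idea that the run grows when the picks trade wins.

```python
from collections import Counter
from typing import Callable, Sequence

BATCH_SIZES = (5, 10, 15)  # query counts named in the description


def has_clear_winner(wins: Counter, total: int) -> bool:
    """Assumed rule: one model has won a strict majority of queries so far."""
    return bool(wins) and wins.most_common(1)[0][1] * 2 > total


def ab_probe(
    models: Sequence[str],
    queries: Sequence[str],
    run_probe: Callable[[Sequence[str], str], str],
) -> tuple[Counter, int]:
    """Run queries in growing batches and stop once a winner is clear.

    run_probe(models, query) is a hypothetical callback that returns the
    name of the model that won that query.
    """
    wins: Counter = Counter()
    done = 0
    for size in BATCH_SIZES:
        # Only run the queries that earlier batches have not covered yet.
        for query in queries[done:size]:
            wins[run_probe(models, query)] += 1
        done = min(size, len(queries))
        if has_clear_winner(wins, done):
            break
    return wins, done
```

If one model wins everything, the loop stops after the first 5 queries; if three models keep rotating wins, it runs all 15.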

2. Priority tiers: when you say 'cheapest, period', cheapest wins.

User-stated preferences are sacrosanct. Mark a dimension as primary and the engine switches from weighted-sum blending to lexicographic ordering: the primary dimension is the sole ranking axis, and other dimensions only break ties. 'I want the cheapest model that can parse a PDF' actually picks the cheapest. No silent dilution.
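The difference between the two ranking modes is easiest to see side by side. The sketch below is a minimal illustration, assuming every dimension is already normalized to a 0 to 1 score where higher is better (so a higher `cost` score means a cheaper model). The model names, scores, and function names are hypothetical and are not XFMS's API.

```python
from typing import Mapping, Optional

Scores = Mapping[str, float]  # dimension -> normalized score, higher is better


def weighted_sum(scores: Scores, weights: Mapping[str, float]) -> float:
    """Default blending: every weighted dimension feeds one number."""
    return sum(w * scores[dim] for dim, w in weights.items())


def rank(
    models: Mapping[str, Scores],
    weights: Mapping[str, float],
    primary: Optional[str] = None,
) -> list[str]:
    """Rank model names, best first.

    Without a primary dimension, sort by the blended score alone.
    With one, sort by the primary score first and use the blended
    score only to break ties (lexicographic ordering).
    """
    def key(name: str) -> tuple:
        blended = weighted_sum(models[name], weights)
        if primary is None:
            return (blended,)
        return (models[name][primary], blended)

    return sorted(models, key=key, reverse=True)


# Hypothetical catalog: model-a is the strongest but the most expensive.
models = {
    "model-a": {"cost": 0.6, "quality": 0.95},
    "model-b": {"cost": 0.9, "quality": 0.50},
    "model-c": {"cost": 0.9, "quality": 0.70},
}
weights = {"cost": 0.5, "quality": 1.0}

blended_order = rank(models, weights)               # model-a, model-c, model-b
cheapest_first = rank(models, weights, "cost")      # model-c, model-b, model-a
```

With blending, quality pulls model-a to the top even though it costs more. With `cost` marked primary, the two cheapest models lead and quality only decides the order between them.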

3. Latent-requirement detection: catches what you didn't ask for but probably need.

If your purpose implies real-time chat, voice, or streaming output and you didn't ask for streaming capability, XFMS surfaces a latent-requirement suggestion at the top of the response. Accept and re-run, or ignore — but you'll never silently get a non-streaming pick for a streaming use case.
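As a rough picture of the check described above, here is a keyword-based sketch. It is an assumption made for illustration only: the hint phrases, the function name, and the use of a regular expression are all hypothetical, and XFMS may well detect implied requirements in a different way.

```python
import re

# Hypothetical hint phrases. The description only says that real-time chat,
# voice, or streaming output implies a streaming requirement.
STREAMING_HINTS = re.compile(
    r"\b(real[- ]time|live chat|voice|streaming|stream)\b", re.IGNORECASE
)


def latent_requirements(purpose: str, stated_capabilities: set[str]) -> list[str]:
    """Return capabilities the purpose implies but the user did not state."""
    suggestions = []
    if "streaming" not in stated_capabilities and STREAMING_HINTS.search(purpose):
        suggestions.append("streaming")
    return suggestions
```

A purpose such as "a real-time voice assistant for drivers" with no stated capabilities would yield a streaming suggestion, while the same purpose with streaming already requested yields nothing.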

4. Hosted MCP: one-line install, no OpenRouter key required.

Point any MCP host (Claude Code, Cursor, Continue, Cline) at https://xfms.vercel.app/mcp/ and you're ranking models from the chat. No pip install, no OpenRouter key. The hosted endpoint pays for the small inference call XFMS makes internally — and when your host supports MCP sampling, that call routes through your host's LLM and nobody pays.

5. Honest gaps over invented signal.

Missing benchmark data is recorded as missing. No interpolation. No synthetic scores. Coverage gaps surface on every pick so you know what the system doesn't know — instead of trusting a confident-sounding answer over thin evidence.
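One way to keep missing data visible is to carry it as `None` all the way through scoring. The sketch below is a hypothetical illustration, not XFMS's scoring code: it assumes the score is a weighted average over the benchmarks that actually have results, and that every missing benchmark is reported next to the score instead of being filled in. The benchmark names and the averaging rule are assumptions.

```python
from typing import Mapping, Optional


def score_with_gaps(
    benchmarks: Mapping[str, Optional[float]],
    weights: Mapping[str, float],
) -> tuple[Optional[float], list[str]]:
    """Return (score, gaps) for one model.

    Benchmarks with no result stay missing: they are listed in gaps and
    left out of the average, never interpolated or replaced by a guess.
    """
    present = {b: w for b, w in weights.items() if benchmarks.get(b) is not None}
    gaps = sorted(b for b in weights if b not in present)
    total_weight = sum(present.values())
    if total_weight == 0:
        # Nothing to score on, so report no score at all.
        return None, gaps
    score = sum(w * benchmarks[b] for b, w in present.items()) / total_weight
    return score, gaps
```

A model with results for two of three weighted benchmarks gets a score from those two plus a one-item gap list, and a model with no results gets `None` as its score.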

How it works

1. You state a purpose: concrete, not vague. 'Fixing bugs in a Python codebase' works; 'coding' does not.

2. XFMS routes through the discovery tree: capability requirements, then quality / cost / latency / privacy weights.

3. A small LLM call infers which benchmarks matter for your purpose (factuality, instruction-following, code reasoning, etc.). Your stated leaf priorities always override inferred ones.

4. Every model in the catalog is scored against those weights. The catalog is continuously updated from eight third-party evaluators, with no provider self-reports.

5. A finite shortlist comes back with weights, scores, plain-English rationale per pick, and any coverage gaps.

6. Optional: --ab probes the top 3 picks against generated test queries and returns real-world cost/latency plus commentary on who won.
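The steps above can be sketched end to end in a few lines. This is a simplified illustration under assumptions, not the real engine: the catalog layout, the function names, and the plain weighted-sum score are hypothetical, and the sketch assumes every model has a score for every weighted dimension.

```python
from typing import Mapping


def merge_weights(
    inferred: Mapping[str, float], stated: Mapping[str, float]
) -> dict[str, float]:
    """Stated priorities override inferred ones, dimension by dimension."""
    return {**inferred, **stated}


def shortlist(
    catalog: Mapping[str, dict],
    required_capabilities: set[str],
    weights: Mapping[str, float],
    limit: int = 3,
) -> list[str]:
    """Filter by capability first, then rank the survivors by weighted score."""
    eligible = {
        name: entry
        for name, entry in catalog.items()
        if required_capabilities <= entry["capabilities"]
    }

    def score(name: str) -> float:
        scores = eligible[name]["scores"]
        return sum(w * scores[dim] for dim, w in weights.items())

    return sorted(eligible, key=score, reverse=True)[:limit]


# Hypothetical catalog entries, for illustration only.
catalog = {
    "model-a": {"capabilities": {"pdf", "streaming"}, "scores": {"quality": 0.9, "cost": 0.4}},
    "model-b": {"capabilities": {"pdf"}, "scores": {"quality": 0.7, "cost": 0.9}},
    "model-c": {"capabilities": set(), "scores": {"quality": 0.99, "cost": 0.99}},
}

# The inferred weights suggest cost matters little; the user says it matters a lot.
weights = merge_weights({"quality": 0.9, "cost": 0.2}, {"cost": 1.0})
picks = shortlist(catalog, {"pdf"}, weights)  # model-b, then model-a
```

Here model-c never reaches scoring because it lacks the required capability, and the user's stated cost weight replaces the inferred one before ranking.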

◎ XF Model Source (XFMS) ↳ writing a tight editorial under a budget
Inferred weights: prose_quality 0.9, factuality 0.7, cost 0.6
Top picks:
1. anthropic/claude-haiku-4.5   0.91   $0.001/1k    fast
2. openai/gpt-5.5-mini          0.87   $0.0008/1k   fast
3. google/gemini-3-flash        0.84   $0.0007/1k   fast
◎ XFMS Coverage gap: 1 benchmark missing for gemini-3-flash (instruction-following)