engineering data March 17, 2026

Why Half of MCP Servers Fail

615 MCP servers · 456K+ health checks

We ran 1,677,644 health checks on 615 MCP servers. The average reliability score across the ecosystem is 46.8%. Read loosely, that means roughly half of the MCP servers you might integrate with are failing a significant portion of their checks.

Here's what the data actually shows — broken down by failure type, time of day, and what separates the servers that stay up from the ones that don't.

Section 01

The Failure Modes

Not all failures are equal. We categorize errors from the mcp_health_checks table into five buckets, and the distribution tells a story about why servers fail, not just that they fail.

Connection Timeout 0.0%

Server accepts the connection but never responds within the SLA window.

Connection Refused 0.0%

TCP handshake immediately rejected. Process is down, port not listening.

5xx Server Error 0.0%

Server responds but with an error — unhandled exceptions, OOM crashes, bad deploys.

DNS / Network Failure 0.0%

Hostname doesn't resolve. Domain expired, misconfigured, or infrastructure removed.

Other / Unknown 100.0%

SSL errors, rate limiting, auth failures, malformed responses.

"Timeouts are the silent killers. Unlike connection refused errors — which fail fast — timeouts burn your agent's time budget waiting for a response that never comes."

The timeout dominance matters for AI agent design. When your Claude agent calls a tool backed by an unreliable MCP server, a timeout doesn't just fail — it burns context window and latency budget before it fails. A connection refused at least fails in milliseconds.
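One way to keep a timeout from eating the whole latency budget is to put a hard deadline on every tool call. The sketch below is illustrative only: `call_with_budget`, `slow_server`, `fast_server`, and the 5-second default are hypothetical names and values, not part of any MCP SDK.

```python
import asyncio


async def call_with_budget(tool_call, budget_s=5.0):
    """Await a tool call, giving up once the time budget is spent.

    `tool_call` is any awaitable standing in for an MCP tool invocation.
    Returns a pair: (ok, result or failure reason).
    """
    try:
        result = await asyncio.wait_for(tool_call, timeout=budget_s)
        return True, result
    except asyncio.TimeoutError:
        # The call was cancelled; the agent lost at most `budget_s` seconds.
        return False, "timeout"
    except ConnectionRefusedError:
        # Refused connections fail fast, long before the budget runs out.
        return False, "connection_refused"


async def slow_server():
    # Stands in for a server that accepts the connection but never answers.
    await asyncio.sleep(10)
    return "late"


async def fast_server():
    # Stands in for a healthy server.
    return "ok"


if __name__ == "__main__":
    print(asyncio.run(call_with_budget(fast_server(), budget_s=1.0)))
    print(asyncio.run(call_with_budget(slow_server(), budget_s=0.05)))
```

The design choice is to convert an open-ended wait into a bounded, known cost, so a slow server costs the agent no more than the budget you chose.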

The DNS failure category is interesting: these are essentially dead servers. Once a domain goes unresolvable, it never comes back. These servers inflate the "unreliable" numbers but represent a distinct failure class — abandoned infrastructure, not operational instability.
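The five buckets described in this section could be assigned roughly as follows. This is a minimal sketch under assumptions: the schema of mcp_health_checks and the real classification rules are not shown here, so the function name, bucket labels, and the mapping from Python exception types are all hypothetical.

```python
import asyncio
import socket


def classify_failure(exc=None, status_code=None):
    """Map a failed health check to one of the five buckets.

    `exc` is the exception raised by the check, if any.
    `status_code` is the HTTP status, if a response was received.
    """
    if isinstance(exc, socket.gaierror):
        # Hostname did not resolve.
        return "dns_network_failure"
    if isinstance(exc, ConnectionRefusedError):
        # TCP handshake rejected: process down or port not listening.
        return "connection_refused"
    if isinstance(exc, (TimeoutError, socket.timeout, asyncio.TimeoutError)):
        # Connection accepted, but no response within the allowed window.
        return "connection_timeout"
    if status_code is not None and 500 <= status_code <= 599:
        # Server answered, but with an error.
        return "5xx_server_error"
    # SSL errors, rate limiting, auth failures, malformed responses.
    return "other_unknown"
```

For example, `classify_failure(status_code=503)` lands in the 5xx bucket, while a rate-limit response such as `classify_failure(status_code=429)` falls through to the catch-all.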

Section 02

When Servers Fail (Hour by Hour)

We grouped all failed checks by the hour they occurred (UTC). The pattern is consistent across 30 days of data: failure rates spike during specific windows that map almost exactly to US business hours.
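The hourly grouping can be sketched as below. It assumes each check is available as a (timestamp, ok) pair with a timezone-aware timestamp; that record shape and the function name `failure_rate_by_hour` are assumptions for illustration, not the actual pipeline. Turning failure counts into a rate also requires the total number of checks per hour, so the sketch tracks both.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def failure_rate_by_hour(checks):
    """Return {utc_hour: failure_rate} for an iterable of (timestamp, ok)."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for ts, ok in checks:
        # Normalize to UTC so checks logged in other zones land in the right hour.
        hour = ts.astimezone(timezone.utc).hour
        totals[hour] += 1
        if not ok:
            failures[hour] += 1
    return {hour: failures[hour] / totals[hour] for hour in sorted(totals)}


if __name__ == "__main__":
    utc = timezone.utc
    sample = [
        (datetime(2026, 3, 1, 14, 5, tzinfo=utc), False),
        (datetime(2026, 3, 1, 14, 30, tzinfo=utc), True),
        (datetime(2026, 3, 2, 3, 0, tzinfo=utc), True),
    ]
    print(failure_rate_by_hour(sample))
```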

[Chart: Failure Rate by Hour (UTC), 30-day average. X-axis: hour of day from 00:00 to 23:00 UTC; shading runs from low to high failure rate.]

The peak failure window corresponds to US East Coast morning through afternoon (8am–2pm ET, which is 12:00–18:00 UTC during daylight saving time and 13:00–19:00 UTC otherwise). This is when:

→ Developer traffic spikes as teams start work
→ CI/CD deploys happen (introducing bad code or missed env vars)
→ Free-tier cloud services hit daily compute quotas
→ Shared infrastructure becomes contended

The overnight dip (02:00–08:00 UTC) is genuine stability — fewer deployments, lower load. If you're scheduling agent tasks, run them at 03:00–07:00 UTC for the lowest failure probability.
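The scheduling advice above can be turned into a small guard. This is a sketch, assuming the 03:00–07:00 UTC window is treated as start-inclusive and end-exclusive; the function names and constants are hypothetical.

```python
from datetime import datetime, timedelta, timezone

WINDOW_START_H = 3  # 03:00 UTC, inclusive
WINDOW_END_H = 7    # 07:00 UTC, exclusive (an assumption)


def in_low_failure_window(now):
    """True if `now` (timezone-aware) falls inside the low-failure window."""
    return WINDOW_START_H <= now.astimezone(timezone.utc).hour < WINDOW_END_H


def next_window_start(now):
    """Return `now` if already inside the window, else the next 03:00 UTC."""
    now_utc = now.astimezone(timezone.utc)
    if in_low_failure_window(now_utc):
        return now_utc
    start = now_utc.replace(hour=WINDOW_START_H, minute=0, second=0, microsecond=0)
    if now_utc >= start:
        # Today's window has already passed; use tomorrow's.
        start += timedelta(days=1)
    return start
```

An agent scheduler could call `next_window_start(datetime.now(timezone.utc))` and defer non-urgent tasks until the returned time.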

Section 03

What the Reliable 10% Do Differently

We compared the top 10 servers by reliability score against the bottom 10. The differences aren't surprising, but they're stark.

Metric | Top 10 ✓ | Bottom 10 ✗
avg reliability score | 95.0% | 0.0%
avg uptime (30d) | 95.0% | 0.0%
avg response time | 0ms | 106ms
score trend | improving / stable | declining
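A top-versus-bottom comparison like the one above can be sketched as a sort followed by per-group averages. The record fields (`score`, `response_ms`) and the function name are assumptions for illustration; the actual scoring pipeline is not shown in this article.

```python
from statistics import mean


def compare_extremes(servers, n=10):
    """Average score and response time for the n best and n worst servers.

    `servers` is a list of dicts with 'name', 'score', and 'response_ms' keys.
    """
    ranked = sorted(servers, key=lambda s: s["score"], reverse=True)

    def summarize(group):
        return {
            "avg_score": mean(s["score"] for s in group),
            "avg_response_ms": mean(s["response_ms"] for s in group),
        }

    return {"top": summarize(ranked[:n]), "bottom": summarize(ranked[-n:])}
```

With fewer than 2n servers the two groups overlap, so in practice n should stay well below half the population.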

Top 10 Most Reliable

#1 mcp-use 95.0% 0ms

#2 pal-mcp-server 95.0% 0ms

#3 notion-mcp-server 95.0% 0ms

#4 mindsdb 95.0% 0ms

#5 playwright-mcp 95.0% 0ms

#6 Figma-Context-MCP 95.0% 0ms

#7 genai-toolbox 95.0% 0ms

#8 casdoor 95.0% 0ms

#9 mcp-chrome 95.0% 0ms

#10 inspector 95.0% 0ms
