How we test
AgentJury runs five automated test suites against every MCP server and agent skill. Each produces structured data, not prose. Scores are weighted averages of measurable outcomes.
Security (25% of score)
Input fuzzing with 50+ payloads: SQL injection, path traversal, command injection, prompt injection. Permission escalation attempts. Secrets scanning in source code.
Reliability (25% of score)
100 calls with varied inputs: 60 valid, 20 edge cases, 20 intentionally malformed. We measure success rate, p95 latency, and whether error messages are parseable by an agent (not just readable by a human).
Agent usability (20% of score)
An AI agent receives only the tool's published description and tries to complete 10 tasks. No documentation, no examples beyond what the tool itself provides. First-try success rate and common failure modes are recorded.
Compatibility (15% of score)
The tool is tested against Claude Code, OpenAI Agents SDK, and LangChain. Pass, fail, or partial for each, with notes on what breaks.
Code health (15% of score)
Static checks: days since last commit, open issue count, test coverage, dependency freshness, license type, number of active contributors.
What we test
MCP servers
Model Context Protocol servers from the official repository, cloud providers (AWS, GCP, Cloudflare), and the community. Tested for transport compatibility, tool schema quality, and runtime behavior.
Agent skills
Skills from OpenClaw (ClawHub), skills.sh, and the Anthropic skills repository. Tested for instruction quality, security implications, and whether an agent can follow them on first try.
Score thresholds
Recommended
Reliable, secure, well-maintained
Acceptable
Works but has notable issues
Use with caution
Significant reliability or security concerns
Not recommended
Critical issues found