Evaluation Criteria
This skill generates tailored evaluation criteria for projects or tasks, supporting objective assessment and more consistent outcomes.
Install on your platform
Run in terminal (recommended)
claude mcp add evaluation-criteria -- npx -y @trustedskills/evaluation-criteria
Or manually add to ~/.claude/settings.json:

{
  "mcpServers": {
    "evaluation-criteria": {
      "command": "npx",
      "args": ["-y", "@trustedskills/evaluation-criteria"]
    }
  }
}

Requires Claude Code (the claude CLI). Run claude --version to verify your install.
About This Skill
What it does
This skill provides evaluation criteria for assessing the performance of AI agents. It allows users to define specific metrics and thresholds, enabling a structured approach to evaluating agent behavior and outputs. The skill can be used to determine if an agent meets predefined standards or identify areas needing improvement.
When to use it
- Performance Review: Evaluate an agent's response quality against established benchmarks for customer service chatbots.
- A/B Testing: Compare the effectiveness of two different AI agents by applying consistent evaluation criteria.
- Task Completion Assessment: Determine if an agent successfully completed a complex task, such as summarizing a document or generating code.
- Bias Detection: Assess whether an agent's responses exhibit unintended biases across demographic groups.
Key capabilities
- Defines custom evaluation metrics.
- Sets thresholds for acceptable performance.
- Provides structured assessment of AI agent outputs.
- Facilitates comparison between different agents or versions.
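The capabilities above (custom metrics, thresholds, structured assessment) can be sketched as a simple weighted scoring pass. The criterion names, weights, and thresholds below are illustrative assumptions, not the skill's actual schema or API:

```typescript
// Hypothetical sketch of threshold-based evaluation; the criterion
// names and values here are examples, not the skill's real interface.
interface Criterion {
  name: string;
  weight: number;     // relative importance
  threshold: number;  // minimum acceptable score, 0..1
}

interface Assessment {
  passed: boolean;       // true only if every criterion meets its threshold
  weightedScore: number; // weighted average across criteria
  failures: string[];    // criteria that fell below their threshold
}

function assess(scores: Record<string, number>, criteria: Criterion[]): Assessment {
  const failures = criteria
    .filter((c) => (scores[c.name] ?? 0) < c.threshold)
    .map((c) => c.name);
  const totalWeight = criteria.reduce((sum, c) => sum + c.weight, 0);
  const weightedScore =
    criteria.reduce((sum, c) => sum + c.weight * (scores[c.name] ?? 0), 0) /
    totalWeight;
  return { passed: failures.length === 0, weightedScore, failures };
}

// Example: the output scores well on helpfulness but misses the accuracy bar.
const criteria: Criterion[] = [
  { name: "helpfulness", weight: 0.5, threshold: 0.7 },
  { name: "accuracy", weight: 0.5, threshold: 0.8 },
];
const result = assess({ helpfulness: 0.9, accuracy: 0.6 }, criteria);
console.log(result.passed, result.failures); // false [ 'accuracy' ]
```

Separating per-criterion thresholds from the aggregate score lets a response fail on a single hard requirement even when its overall average looks acceptable.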
Example prompts
- "Evaluate the following response: [agent output] using criteria for helpfulness and accuracy."
- "Assess this code generation result: [code] against these requirements: [requirements]."
- "Compare agent A's summary of this article to agent B's, based on completeness and conciseness."
Tips & gotchas
The effectiveness of this skill depends on clearly defined evaluation criteria. Vague or subjective metrics lead to inconsistent results, so make your criteria specific and measurable.
TrustedSkills Verification
Unlike other registries that point to live repositories, TrustedSkills pins every skill to a verified commit hash. This protects you from malicious updates — what you install today is exactly what was reviewed and verified.
Security Audits
| Auditor | Result |
| --- | --- |
| Gen Agent Trust Hub | Pass |
| Socket | Pass |
| Snyk | Pass |