Evaluation Criteria

🌐 Community
by masanao-ohba · vlatest · Repository

This skill generates tailored evaluation criteria for projects or tasks, ensuring objective assessments and improved outcomes.

Install on your platform


1. Run in terminal (recommended):

claude mcp add evaluation-criteria npx -- -y @trustedskills/evaluation-criteria
2. Or manually add to ~/.claude/settings.json:
{
  "mcpServers": {
    "evaluation-criteria": {
      "command": "npx",
      "args": [
        "-y",
        "@trustedskills/evaluation-criteria"
      ]
    }
  }
}

Requires Claude Code (claude CLI). Run claude --version to verify your install.

About This Skill

What it does

This skill provides evaluation criteria for assessing the performance of AI agents. It allows users to define specific metrics and thresholds, enabling a structured approach to evaluating agent behavior and outputs. The skill can be used to determine if an agent meets predefined standards or identify areas needing improvement.
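The skill's actual configuration schema is not documented on this page, but the idea of defining metrics with thresholds and checking an output against them can be sketched as follows. The metric names, threshold values, and function names below are illustrative assumptions, not the skill's real API.

```python
# Hypothetical sketch of metric/threshold evaluation.
# Metric names and threshold values are made up for illustration.

criteria = {
    "helpfulness": {"threshold": 0.7},
    "accuracy":    {"threshold": 0.8},
}

def meets_standards(scores: dict[str, float], criteria: dict) -> bool:
    """Return True only if every metric meets or exceeds its threshold."""
    return all(
        scores.get(name, 0.0) >= rule["threshold"]
        for name, rule in criteria.items()
    )

ok = meets_standards({"helpfulness": 0.9, "accuracy": 0.85}, criteria)  # True
bad = meets_standards({"helpfulness": 0.9, "accuracy": 0.5}, criteria)  # False
```

An all-metrics-must-pass rule like this is one common policy; a weighted aggregate with a single overall threshold is another.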

When to use it

  • Performance Review: Evaluate an agent's response quality against established benchmarks for customer service chatbots.
  • A/B Testing: Compare the effectiveness of two different AI agents by applying consistent evaluation criteria.
  • Task Completion Assessment: Determine if an agent successfully completed a complex task, such as summarizing a document or generating code.
  • Bias Detection: Assess whether an agent's responses exhibit unintended biases across demographic groups.

Key capabilities

  • Defines custom evaluation metrics.
  • Sets thresholds for acceptable performance.
  • Provides structured assessment of AI agent outputs.
  • Facilitates comparison between different agents or versions.
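The comparison capability above can be sketched as a weighted score over shared metrics. Again, the weights, metric names, and score values are assumptions for illustration, not output of the skill itself.

```python
# Illustrative agent-vs-agent comparison using a weighted score.
# Weights and scores are hypothetical examples.

WEIGHTS = {"completeness": 0.6, "conciseness": 0.4}

def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-metric scores into one number using fixed weights."""
    return sum(WEIGHTS[m] * scores[m] for m in WEIGHTS)

agent_a = {"completeness": 0.9, "conciseness": 0.7}
agent_b = {"completeness": 0.7, "conciseness": 0.9}

better = "A" if weighted_score(agent_a) > weighted_score(agent_b) else "B"
```

Applying the same weights to both agents is what makes the comparison consistent; changing the weights changes which agent "wins," which is why the criteria should be fixed before scoring.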

Example prompts

  • "Evaluate the following response: [agent output] using criteria for helpfulness and accuracy."
  • "Assess this code generation result: [code] against these requirements: [requirements]."
  • "Compare agent A's summary of this article to agent B's, based on completeness and conciseness."

Tips & gotchas

The effectiveness of this skill depends on clearly defined evaluation criteria. Vague or subjective metrics will lead to inconsistent results; ensure your criteria are specific and measurable for optimal performance.
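One way to make "specific and measurable" concrete is to express each criterion as a checkable predicate rather than a subjective judgment. The example below is a hypothetical illustration (the limits and checks are not taken from the skill):

```python
# A vague criterion cannot be checked mechanically:
vague = "The summary should be good."

# A specific, measurable criterion can:
def meets_specific_criteria(summary: str) -> bool:
    """True if the summary is at most 50 words and ends as a sentence."""
    words = summary.split()
    return len(words) <= 50 and summary.strip().endswith(".")

meets_specific_criteria("Covers the key points in one sentence.")  # True
```

Every criterion that can be phrased this way removes a source of evaluator disagreement.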

🛡️ TrustedSkills Verification

Unlike other registries that point to live repositories, TrustedSkills pins every skill to a verified commit hash. This protects you from malicious updates — what you install today is exactly what was reviewed and verified.

Security Audits

Gen Agent Trust Hub: Pass
Socket: Pass
Snyk: Pass

Details

Version: vlatest
License:
Author: masanao-ohba
Installs: 5
