Evaluation Harness
The Evaluation Harness assesses model outputs against defined criteria, providing consistent, repeatable performance evaluations.
Install on your platform
We auto-selected Claude Code based on this skill’s supported platforms.
Run in terminal (recommended)
claude mcp add evaluation-harness npx -- -y @trustedskills/evaluation-harness
Or manually add to ~/.claude/settings.json
{
  "mcpServers": {
    "evaluation-harness": {
      "command": "npx",
      "args": [
        "-y",
        "@trustedskills/evaluation-harness"
      ]
    }
  }
}
Requires Claude Code (the claude CLI). Run claude --version to verify your install.
About This Skill
What it does
The Evaluation Harness skill enables AI agents to systematically evaluate model outputs against predefined criteria using a structured dataset and scoring methods. It allows for consistent performance assessments of LLM applications, providing detailed scores based on metrics like exact match, semantic similarity, and rubric-based evaluations judged by another LLM. This facilitates improved results through data-driven feedback and iterative refinement.
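A golden dataset entry might look like the following. This is a hypothetical sketch: the field names (test_cases, expected_output, rubric, and so on) are assumptions for illustration, since the skill's exact schema is not reproduced on this page.

```json
{
  "test_cases": [
    {
      "id": "code_gen_001",
      "category": "code_generation",
      "input": "Write a Python function that reverses a string.",
      "expected_output": "def reverse_string(s):\n    return s[::-1]",
      "rubric": [
        { "criterion": "Correctness: the function reverses its input", "weight": 0.7 },
        { "criterion": "Style: idiomatic, readable Python", "weight": 0.3 }
      ],
      "metadata": { "difficulty": "easy" }
    }
  ]
}
```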
When to use it
- Evaluating the quality of code generated by a model against expected solutions.
- Assessing the accuracy and style of text generation tasks.
- Benchmarking different models on a standardized dataset with defined scoring rubrics.
- Identifying areas where a model struggles based on category or specific criteria.
- Automating the evaluation process for LLM applications to ensure consistent performance.
Key capabilities
- Golden Dataset Format: Uses a structured JSON format for test cases, including input prompts, expected outputs, rubrics (criteria and weights), and metadata.
- Scoring Methods: Provides multiple scoring functions: exact match, semantic similarity (using embeddings), keyword presence, and LLM-based rubric evaluation.
- Test Runner: Automates the process of running evaluations against a model function, generating scores for each test case.
- Thresholds & Pass Criteria: Allows defining minimum score thresholds per category to determine if a test case "passed."
- Reporting: Generates reports summarizing evaluation results, including individual scores and overall pass/fail status.
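The scoring and test-runner pieces described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the function names (score_exact_match, score_keywords, run_evals) and dictionary keys are hypothetical and may differ from the skill's actual API.

```python
def score_exact_match(output: str, expected: str) -> float:
    """1.0 if the normalized output matches the expected text exactly, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def score_keywords(output: str, keywords: list[str]) -> float:
    """Fraction of required keywords present in the output (case-insensitive)."""
    if not keywords:
        return 1.0
    hits = sum(1 for kw in keywords if kw.lower() in output.lower())
    return hits / len(keywords)

def run_evals(model_fn, test_cases, thresholds):
    """Run each test case through model_fn and mark pass/fail per category threshold."""
    results = []
    for case in test_cases:
        output = model_fn(case["input"])
        score = score_exact_match(output, case["expected_output"])
        # Fall back to a default threshold when a category has none defined.
        threshold = thresholds.get(case["category"], 0.5)
        results.append({"id": case["id"], "score": score, "passed": score >= threshold})
    return results
```

For example, `run_evals(my_model, cases, {"code_generation": 0.8})` would score every case and flag those below the 0.8 threshold, feeding the pass/fail reporting described above.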
Example prompts
- "Run the Evaluation Harness on this model function using the dataset located at /path/to/dataset.json."
- "Evaluate the code generation model's performance against the 'code_generation' category in the evaluation dataset, focusing on correctness and style."
- "Generate a report summarizing the results of running the Evaluation Harness with the provided model function and dataset."
Tips & gotchas
- Requires a properly formatted "golden dataset" (JSON file) defining test cases, expected outputs, and rubrics.
- The score_semantic_similarity method depends on an external embedding-generation function (get_embedding), which must be defined or available for the skill to work correctly.
- For LLM-based rubric evaluation (score_with_llm), access to a functional LLM is required, since it uses an LLM to judge outputs against the rubric criteria.
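To satisfy the get_embedding dependency noted above, you supply an embedding function yourself. The sketch below shows how score_semantic_similarity might use it; the cosine-similarity implementation is an assumption, and the toy character-frequency get_embedding exists only so the example runs without an external provider.

```python
import math

def get_embedding(text: str) -> list[float]:
    # Placeholder: in practice, call your embedding provider here.
    # This toy version counts letter frequencies so the sketch is self-contained.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def score_semantic_similarity(output: str, expected: str) -> float:
    """Cosine similarity between embeddings of output and expected, clamped to [0, 1]."""
    a, b = get_embedding(output), get_embedding(expected)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 0.0
    return min(1.0, max(0.0, dot / norm))
```

In real use you would replace the body of get_embedding with a call to whatever embedding model your stack provides; the scoring logic itself is unchanged.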
TrustedSkills Verification
Unlike other registries that point to live repositories, TrustedSkills pins every skill to a verified commit hash. This protects you from malicious updates — what you install today is exactly what was reviewed and verified.
Security Audits
| Gen Agent Trust Hub | Pass |
| Socket | Pass |
| Snyk | Pass |