Evaluation Harness
The Evaluation Harness assesses model outputs against defined criteria, providing consistent, repeatable performance evaluations.
Install on your platform
We auto-selected Claude Code based on this skill’s supported platforms.
Run in terminal (recommended)
claude mcp add evaluation-harness npx -- -y @trustedskills/evaluation-harness
Or manually add to ~/.claude/settings.json
{
  "mcpServers": {
    "evaluation-harness": {
      "command": "npx",
      "args": [
        "-y",
        "@trustedskills/evaluation-harness"
      ]
    }
  }
}
Requires Claude Code (the claude CLI). Run claude --version to verify your install.
About This Skill
What it does
The Evaluation Harness skill enables AI agents to systematically evaluate model outputs against predefined criteria using a structured dataset and scoring methods. It allows for consistent performance assessments of LLM applications, providing detailed scores based on metrics like exact match, semantic similarity, and rubric-based evaluations judged by another LLM. This facilitates improved results through data-driven feedback and iterative refinement.
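A golden dataset entry might look like the following. This is a hypothetical sketch: the field names (test_cases, expected_output, rubric, and so on) are assumptions for illustration, since the skill's exact schema is not reproduced on this page.

```json
{
  "test_cases": [
    {
      "id": "code_gen_001",
      "category": "code_generation",
      "input": "Write a Python function that reverses a string.",
      "expected_output": "def reverse_string(s):\n    return s[::-1]",
      "rubric": [
        { "criterion": "Correctness: the function reverses its input", "weight": 0.7 },
        { "criterion": "Style: idiomatic, readable Python", "weight": 0.3 }
      ],
      "metadata": { "difficulty": "easy" }
    }
  ]
}
```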
When to use it
- Evaluating the quality of code generated by a model against expected solutions.
- Assessing the accuracy and style of text generation tasks.
- Benchmarking different models on a standardized dataset with defined scoring rubrics.
- Identifying areas where a model struggles based on category or specific criteria.
- Automating the evaluation process for LLM applications to ensure consistent performance.
Key capabilities
- Golden Dataset Format: Uses a structured JSON format for test cases, including input prompts, expected outputs, rubrics (criteria and weights), and metadata.
- Scoring Methods: Provides multiple scoring functions: exact match, semantic similarity (using embeddings), keyword presence, and LLM-based rubric evaluation.
- Test Runner: Automates the process of running evaluations against a model function, generating scores for each test case.
- Thresholds & Pass Criteria: Allows defining minimum score thresholds per category to determine if a test case "passed."
- Reporting: Generates reports summarizing evaluation results, including individual scores and overall pass/fail status.
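The scoring and test-runner pieces described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the function names (score_exact_match, score_keywords, run_evals) and dictionary keys are hypothetical and may differ from the skill's actual API.

```python
def score_exact_match(output: str, expected: str) -> float:
    """1.0 if the normalized output matches the expected text exactly, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def score_keywords(output: str, keywords: list[str]) -> float:
    """Fraction of required keywords present in the output (case-insensitive)."""
    if not keywords:
        return 1.0
    hits = sum(1 for kw in keywords if kw.lower() in output.lower())
    return hits / len(keywords)

def run_evals(model_fn, test_cases, thresholds):
    """Run each test case through model_fn and mark pass/fail per category threshold."""
    results = []
    for case in test_cases:
        output = model_fn(case["input"])
        score = score_exact_match(output, case["expected_output"])
        # Fall back to a default threshold when a category has none defined.
        threshold = thresholds.get(case["category"], 0.5)
        results.append({"id": case["id"], "score": score, "passed": score >= threshold})
    return results
```

For example, `run_evals(my_model, cases, {"code_generation": 0.8})` would score every case and flag those below the 0.8 threshold, feeding the pass/fail reporting described above.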
Example prompts
- "Run the Evaluation Harness on this model function using the dataset located at /path/to/dataset.json."
- "Evaluate the code generation model's performance against the 'code_generation' category in the evaluation dataset, focusing on correctness and style."
- "Generate a report summarizing the results of running the Evaluation Harness with the provided model function and dataset."
Tips & gotchas
- Requires a properly formatted "golden dataset" (JSON file) defining test cases, expected outputs, and rubrics.
- The score_semantic_similarity method depends on an external embedding-generation function (get_embedding), which must be defined or available for the skill to work correctly.
- For LLM-based rubric evaluation (score_with_llm), access to a functional LLM is required, since it uses an LLM to judge outputs against the rubric criteria.
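To satisfy the get_embedding dependency noted above, you supply an embedding function yourself. The sketch below shows how score_semantic_similarity might use it; the cosine-similarity implementation is an assumption, and the toy character-frequency get_embedding exists only so the example runs without an external provider.

```python
import math

def get_embedding(text: str) -> list[float]:
    # Placeholder: in practice, call your embedding provider here.
    # This toy version counts letter frequencies so the sketch is self-contained.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def score_semantic_similarity(output: str, expected: str) -> float:
    """Cosine similarity between embeddings of output and expected, clamped to [0, 1]."""
    a, b = get_embedding(output), get_embedding(expected)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 0.0
    return min(1.0, max(0.0, dot / norm))
```

In real use you would replace the body of get_embedding with a call to whatever embedding model your stack provides; the scoring logic itself is unchanged.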
TrustedSkills Verification
Unlike other registries that point to live repositories, TrustedSkills pins every skill to a verified commit hash. This protects you from malicious updates — what you install today is exactly what was reviewed and verified.
Security Audits
| Gen Agent Trust Hub | Pass |
| Socket | Pass |
| Snyk | Pass |