Evaluation Harness

🌐 Community · by patricio0312rev · vlatest · Repository

The Evaluation Harness scores model outputs against defined criteria, enabling consistent, repeatable performance evaluation of LLM applications.

Install on your platform

We auto-selected Claude Code based on this skill’s supported platforms.

1. Run in terminal (recommended)

```shell
claude mcp add evaluation-harness npx -- -y @trustedskills/evaluation-harness
```
2. Or manually add to ~/.claude/settings.json

```json
{
  "mcpServers": {
    "evaluation-harness": {
      "command": "npx",
      "args": [
        "-y",
        "@trustedskills/evaluation-harness"
      ]
    }
  }
}
```

Requires Claude Code (claude CLI). Run claude --version to verify your install.

About This Skill

What it does

The Evaluation Harness skill enables AI agents to systematically evaluate model outputs against predefined criteria using a structured dataset and scoring methods. It allows for consistent performance assessments of LLM applications, providing detailed scores based on metrics like exact match, semantic similarity, and rubric-based evaluations judged by another LLM. This facilitates improved results through data-driven feedback and iterative refinement.
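To make the evaluation flow concrete, here is a minimal sketch of the kind of loop such a harness runs. The names (`run_eval`, `model_fn`, the test-case fields) are illustrative assumptions, not the skill's actual API; only exact-match scoring is shown.

```python
# Hypothetical sketch of an evaluation loop: run a model function over
# a dataset of test cases and score each output by exact match.
def run_eval(model_fn, dataset, threshold=1.0):
    results = []
    for case in dataset:
        output = model_fn(case["input"])
        # Exact-match scorer: 1.0 if the trimmed output equals the
        # trimmed expected answer, else 0.0.
        score = 1.0 if output.strip() == case["expected"].strip() else 0.0
        results.append({
            "id": case["id"],
            "score": score,
            "passed": score >= threshold,
        })
    return results

# Usage: evaluate a trivial "model" against a one-case dataset.
dataset = [{"id": "t1", "input": "What is 2+2?", "expected": "4"}]
print(run_eval(lambda prompt: "4", dataset))
```

A real harness would dispatch to multiple scorers (semantic similarity, keyword presence, LLM-judged rubrics) and aggregate per-category results, as described below.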

When to use it

  • Evaluating the quality of code generated by a model against expected solutions.
  • Assessing the accuracy and style of text generation tasks.
  • Benchmarking different models on a standardized dataset with defined scoring rubrics.
  • Identifying areas where a model struggles based on category or specific criteria.
  • Automating the evaluation process for LLM applications to ensure consistent performance.

Key capabilities

  • Golden Dataset Format: Uses a structured JSON format for test cases, including input prompts, expected outputs, rubrics (criteria and weights), and metadata.
  • Scoring Methods: Provides multiple scoring functions: exact match, semantic similarity (using embeddings), keyword presence, and LLM-based rubric evaluation.
  • Test Runner: Automates the process of running evaluations against a model function, generating scores for each test case.
  • Thresholds & Pass Criteria: Allows defining minimum score thresholds per category to determine if a test case "passed."
  • Reporting: Generates reports summarizing evaluation results, including individual scores and overall pass/fail status.
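The golden dataset format described above might look roughly like the following. This is an illustrative sketch only: the exact field names and structure the skill expects are not documented here, so treat every key below as an assumption.

```json
{
  "test_cases": [
    {
      "id": "code_gen_001",
      "category": "code_generation",
      "input": "Write a function that reverses a string.",
      "expected_output": "def reverse(s):\n    return s[::-1]",
      "rubric": [
        { "criterion": "correctness", "weight": 0.7 },
        { "criterion": "style", "weight": 0.3 }
      ],
      "metadata": { "difficulty": "easy" }
    }
  ]
}
```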

Example prompts

  • "Run the Evaluation Harness on this model function using the dataset located at /path/to/dataset.json."
  • "Evaluate the code generation model's performance against the 'code_generation' category in the evaluation dataset, focusing on correctness and style."
  • "Generate a report summarizing the results of running the Evaluation Harness with the provided model function and dataset."

Tips & gotchas

  • Requires a properly formatted "golden dataset" (JSON file) defining test cases, expected outputs, and rubrics.
  • The score_semantic_similarity method depends on an external embedding function (get_embedding), which must be defined and available for that scorer to work.
  • For LLM-based rubric evaluation (score_with_llm), access to a functional LLM is required, as it utilizes an LLM to judge outputs against criteria.


TrustedSkills Verification

Unlike other registries that point to live repositories, TrustedSkills pins every skill to a verified commit hash. This protects you from malicious updates — what you install today is exactly what was reviewed and verified.

Security Audits

  • Gen Agent Trust Hub: Pass
  • Socket: Pass
  • Snyk: Pass

Details

  • Version: vlatest
  • License:
  • Author: patricio0312rev
  • Installs: 31
