Validate Evaluator
Checks evaluator outputs against ground truth data to identify biases, errors, and performance inconsistencies.
Install on your platform
We auto-selected Claude Code based on this skill’s supported platforms.
Run in terminal (recommended)
claude mcp add validate-evaluator npx -- -y @trustedskills/validate-evaluator
Or manually add to ~/.claude/settings.json
{
  "mcpServers": {
    "validate-evaluator": {
      "command": "npx",
      "args": [
        "-y",
        "@trustedskills/validate-evaluator"
      ]
    }
  }
}

Requires Claude Code (claude CLI). Run claude --version to verify your install.
About This Skill
What it does
This skill, Validate Evaluator, helps calibrate an LLM-based evaluator (judge) against human judgment. It guides you through splitting labeled data into training, development, and test sets, then iteratively refining the judge's prompt until its True Positive Rate (TPR) and True Negative Rate (TNR) exceed 90% on the development set. It then measures accuracy on a held-out test set and provides a bias correction formula to apply to production data.
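The TPR/TNR check at the heart of this loop is a simple comparison of judge verdicts against human labels, with "Pass" as the positive class. The sketch below assumes labels arrive as parallel lists of "Pass"/"Fail" strings; the skill's actual data format may differ.

```python
# Minimal sketch: measure the judge's TPR and TNR against human labels.
# Treats "Pass" as the positive class. The parallel-list layout is an
# assumption for illustration, not the skill's actual input format.

def tpr_tnr(human, judge):
    """True Positive Rate and True Negative Rate of the judge vs. humans."""
    tp = sum(1 for h, j in zip(human, judge) if h == "Pass" and j == "Pass")
    fn = sum(1 for h, j in zip(human, judge) if h == "Pass" and j == "Fail")
    tn = sum(1 for h, j in zip(human, judge) if h == "Fail" and j == "Fail")
    fp = sum(1 for h, j in zip(human, judge) if h == "Fail" and j == "Pass")
    return tp / (tp + fn), tn / (tn + fp)

human = ["Pass", "Pass", "Fail", "Fail", "Pass"]
judge = ["Pass", "Fail", "Fail", "Pass", "Pass"]
tpr, tnr = tpr_tnr(human, judge)
# tpr = 2/3 (one human Pass missed), tnr = 1/2 (one human Fail passed)
```

In the skill's workflow, both rates must exceed 0.9 on the development set before you move on to the held-out test set.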
When to use it
- You have built an LLM judge prompt and want to ensure its accuracy and reliability.
- You need to compare the performance of your LLM judge against human evaluations.
- You suspect biases or inconsistencies in your LLM evaluator's outputs.
- You are preparing a production-ready LLM evaluation system.
Key capabilities
- Data splitting into training, development, and test sets.
- Measurement of True Positive Rate (TPR) and True Negative Rate (TNR).
- Iterative refinement of the judge prompt based on development set performance.
- Final accuracy assessment using a held-out test set.
- A bias correction formula to apply to production data.
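The listing does not spell out the bias correction formula itself. One standard correction for a classifier with known TPR and TNR is the Rogan-Gladen estimator, sketched below; treat it as an illustration of the idea, not necessarily the exact formula the skill provides.

```python
def corrected_pass_rate(observed_pass_rate, tpr, tnr):
    """Rogan-Gladen estimate of the true pass rate, given the judge's
    measured TPR and TNR. Assumes tpr + tnr > 1 (judge beats chance)."""
    return (observed_pass_rate + tnr - 1.0) / (tpr + tnr - 1.0)

# A judge measured at TPR=0.95, TNR=0.92 reports 70% passes in production.
# The corrected estimate of the true pass rate:
rate = corrected_pass_rate(0.70, 0.95, 0.92)
# rate ≈ 0.7126 — the judge's false positives slightly inflated the raw 70%.
```

The correction is only as good as the TPR/TNR estimates behind it, which is why the skill insists on a held-out test set for the final measurement.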
Example prompts
- "Validate Evaluator: I have human-labeled data, split it into training, dev, and test sets."
- "Validate Evaluator: Run the judge on my development set and tell me the TPR and TNR."
- "Validate Evaluator: How do I adjust my LLM prompt to improve performance on the development set?"
Tips & gotchas
- Requires a pre-existing LLM judge prompt (created using a skill like "write-judge-prompt").
- Needs approximately 100 human-labeled traces per failure mode, with roughly equal numbers of Pass and Fail examples.
- Labels must be provided by domain experts to ensure accuracy.
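Given the ~100-traces-per-failure-mode guidance above, the split should preserve the Pass/Fail balance in every subset. A stratified-split sketch, where the 40/30/30 ratios and the trace schema (a dict with a "label" key) are assumptions for illustration:

```python
import random

def split_traces(traces, seed=0, frac=(0.4, 0.3, 0.3)):
    """Shuffle and split labeled traces into train/dev/test sets,
    stratified by label so each split keeps the Pass/Fail balance."""
    rng = random.Random(seed)
    splits = {"train": [], "dev": [], "test": []}
    for label in ("Pass", "Fail"):
        group = [t for t in traces if t["label"] == label]
        rng.shuffle(group)
        n = len(group)
        a = int(n * frac[0])            # end of train slice
        b = a + int(n * frac[1])        # end of dev slice
        splits["train"] += group[:a]
        splits["dev"] += group[a:b]
        splits["test"] += group[b:]
    return splits

# 100 traces, 50 Pass / 50 Fail, as the guidance recommends:
traces = [{"id": i, "label": "Pass" if i % 2 else "Fail"} for i in range(100)]
s = split_traces(traces)
# 40 train, 30 dev, 30 test, each with a 50/50 Pass/Fail mix
```

Using a fixed seed keeps the split reproducible across refinement iterations, so dev-set results stay comparable.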
TrustedSkills Verification
Unlike other registries that point to live repositories, TrustedSkills pins every skill to a verified commit hash. This protects you from malicious updates — what you install today is exactly what was reviewed and verified.
Security Audits
| Auditor | Result |
| --- | --- |
| Gen Agent Trust Hub | Pass |
| Socket | Pass |
| Snyk | Pass |