Validate Evaluator
Checks evaluator outputs against ground truth data to identify biases, errors, and performance inconsistencies.
Install on your platform
We auto-selected Claude Code based on this skill’s supported platforms.
Run in terminal (recommended)
claude mcp add validate-evaluator npx -- -y @trustedskills/validate-evaluator
Or manually add to ~/.claude/settings.json
{
  "mcpServers": {
    "validate-evaluator": {
      "command": "npx",
      "args": [
        "-y",
        "@trustedskills/validate-evaluator"
      ]
    }
  }
}

Requires Claude Code (claude CLI). Run claude --version to verify your install.
About This Skill
What it does
This skill, Validate Evaluator, helps calibrate an LLM-based evaluator (judge) against human judgment. It guides you through splitting labeled data into training, development, and test sets, then iteratively refining the judge's prompt until its True Positive Rate (TPR) and True Negative Rate (TNR) exceed 90% on the development set. It then measures accuracy on a held-out test set and provides a bias correction formula to apply to production data.
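The TPR/TNR check at the heart of this loop is a simple comparison of judge verdicts against human labels, with "Pass" as the positive class. The sketch below assumes labels arrive as parallel lists of "Pass"/"Fail" strings; the skill's actual data format may differ.

```python
# Minimal sketch: measure the judge's TPR and TNR against human labels.
# Treats "Pass" as the positive class. The parallel-list layout is an
# assumption for illustration, not the skill's actual input format.

def tpr_tnr(human, judge):
    """True Positive Rate and True Negative Rate of the judge vs. humans."""
    tp = sum(1 for h, j in zip(human, judge) if h == "Pass" and j == "Pass")
    fn = sum(1 for h, j in zip(human, judge) if h == "Pass" and j == "Fail")
    tn = sum(1 for h, j in zip(human, judge) if h == "Fail" and j == "Fail")
    fp = sum(1 for h, j in zip(human, judge) if h == "Fail" and j == "Pass")
    return tp / (tp + fn), tn / (tn + fp)

human = ["Pass", "Pass", "Fail", "Fail", "Pass"]
judge = ["Pass", "Fail", "Fail", "Pass", "Pass"]
tpr, tnr = tpr_tnr(human, judge)
# tpr = 2/3 (one human Pass missed), tnr = 1/2 (one human Fail passed)
```

In the skill's workflow, both rates must exceed 0.9 on the development set before you move on to the held-out test set.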
When to use it
- You have built an LLM judge prompt and want to ensure its accuracy and reliability.
- You need to compare the performance of your LLM judge against human evaluations.
- You suspect biases or inconsistencies in your LLM evaluator's outputs.
- You are preparing a production-ready LLM evaluation system.
Key capabilities
- Data splitting into training, development, and test sets.
- Measurement of True Positive Rate (TPR) and True Negative Rate (TNR).
- Iterative refinement of the judge prompt based on development set performance.
- Final accuracy assessment using a held-out test set.
- A bias correction formula to apply to production data.
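The listing does not spell out the bias correction formula itself. One standard correction for a classifier with known TPR and TNR is the Rogan-Gladen estimator, sketched below; treat it as an illustration of the idea, not necessarily the exact formula the skill provides.

```python
def corrected_pass_rate(observed_pass_rate, tpr, tnr):
    """Rogan-Gladen estimate of the true pass rate, given the judge's
    measured TPR and TNR. Assumes tpr + tnr > 1 (judge beats chance)."""
    return (observed_pass_rate + tnr - 1.0) / (tpr + tnr - 1.0)

# A judge measured at TPR=0.95, TNR=0.92 reports 70% passes in production.
# The corrected estimate of the true pass rate:
rate = corrected_pass_rate(0.70, 0.95, 0.92)
# rate ≈ 0.7126 — the judge's false positives slightly inflated the raw 70%.
```

The correction is only as good as the TPR/TNR estimates behind it, which is why the skill insists on a held-out test set for the final measurement.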
Example prompts
- "Validate Evaluator: I have human-labeled data, split it into training, dev, and test sets."
- "Validate Evaluator: Run the judge on my development set and tell me the TPR and TNR."
- "Validate Evaluator: How do I adjust my LLM prompt to improve performance on the development set?"
Tips & gotchas
- Requires a pre-existing LLM judge prompt (created using a skill like "write-judge-prompt").
- Needs approximately 100 human-labeled traces per failure mode, with roughly equal numbers of Pass and Fail examples.
- Labels must be provided by domain experts to ensure accuracy.
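Given the ~100-traces-per-failure-mode guidance above, the split should preserve the Pass/Fail balance in every subset. A stratified-split sketch, where the 40/30/30 ratios and the trace schema (a dict with a "label" key) are assumptions for illustration:

```python
import random

def split_traces(traces, seed=0, frac=(0.4, 0.3, 0.3)):
    """Shuffle and split labeled traces into train/dev/test sets,
    stratified by label so each split keeps the Pass/Fail balance."""
    rng = random.Random(seed)
    splits = {"train": [], "dev": [], "test": []}
    for label in ("Pass", "Fail"):
        group = [t for t in traces if t["label"] == label]
        rng.shuffle(group)
        n = len(group)
        a = int(n * frac[0])            # end of train slice
        b = a + int(n * frac[1])        # end of dev slice
        splits["train"] += group[:a]
        splits["dev"] += group[a:b]
        splits["test"] += group[b:]
    return splits

# 100 traces, 50 Pass / 50 Fail, as the guidance recommends:
traces = [{"id": i, "label": "Pass" if i % 2 else "Fail"} for i in range(100)]
s = split_traces(traces)
# 40 train, 30 dev, 30 test, each with a 50/50 Pass/Fail mix
```

Using a fixed seed keeps the split reproducible across refinement iterations, so dev-set results stay comparable.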
TrustedSkills Verification
Unlike other registries that point to live repositories, TrustedSkills pins every skill to a verified commit hash. This protects you from malicious updates — what you install today is exactly what was reviewed and verified.
Security Audits
| Auditor | Result |
| --- | --- |
| Gen Agent Trust Hub | Pass |
| Socket | Pass |
| Snyk | Pass |