LLM Evaluation
Provides LLMs with guidance and assistance for building AI and machine learning applications.
Install on your platform
Run in terminal (recommended)
claude mcp add rmyndharis-llm-evaluation npx -- -y @trustedskills/rmyndharis-llm-evaluation
Or manually add to ~/.claude/settings.json
{
  "mcpServers": {
    "rmyndharis-llm-evaluation": {
      "command": "npx",
      "args": [
        "-y",
        "@trustedskills/rmyndharis-llm-evaluation"
      ]
    }
  }
}

Requires Claude Code (the claude CLI). Run claude --version to verify your install.
About This Skill
What it does
This skill, rmyndharis-llm-evaluation, provides a framework for evaluating Large Language Models (LLMs). It allows users to define evaluation criteria and scoring rubrics. The tool then applies these rubrics to LLM outputs, generating structured reports that highlight strengths and weaknesses based on the specified metrics.
When to use it
- Comparing different LLMs: Evaluate several models against a common set of criteria to determine which best suits your needs.
- Assessing model performance after fine-tuning: Measure improvements in an LLM's capabilities following fine-tuning efforts.
- Identifying areas for improvement: Pinpoint specific weaknesses in an LLM’s responses based on defined evaluation metrics.
- Ensuring consistent quality: Maintain a baseline level of output quality across multiple LLMs or different use cases.
Key capabilities
- Define custom evaluation criteria.
- Create scoring rubrics with detailed guidelines.
- Apply rubrics to LLM outputs for structured assessment.
- Generate reports summarizing performance against defined metrics.
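The skill's actual rubric schema is not documented here, but to make the capabilities above concrete, a custom scoring rubric might be expressed as weighted criteria with score ranges and guidelines. This is an illustrative sketch only; every field name below is hypothetical, not the skill's real format:

```json
{
  "rubric": "creative_writing",
  "criteria": [
    {
      "name": "coherence",
      "weight": 0.4,
      "scale": [1, 5],
      "guideline": "Ideas follow logically; no internal contradictions."
    },
    {
      "name": "originality",
      "weight": 0.3,
      "scale": [1, 5],
      "guideline": "Avoids cliches; offers a novel framing of the prompt."
    },
    {
      "name": "style",
      "weight": 0.3,
      "scale": [1, 5],
      "guideline": "Tone and voice suit the intended audience."
    }
  ]
}
```

A structure like this keeps each criterion explicit and weighted, which is what lets the skill produce per-metric scores rather than a single opaque grade.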
Example prompts
- "Evaluate this response from the LLM: '[LLM Response]' using the 'creative_writing' rubric."
- "Compare the responses of Model A and Model B to the prompt '[Prompt]' based on the 'factual_accuracy' criteria."
- "Generate a scoring rubric for evaluating code generation quality, focusing on correctness and efficiency."
Tips & gotchas
The effectiveness of this skill depends heavily on well-defined evaluation criteria and clear scoring rubrics: ambiguous or vague metrics will produce inconsistent, unreliable results. Invest time up front in writing precise guidelines for each score level.
TrustedSkills Verification
Unlike other registries that point to live repositories, TrustedSkills pins every skill to a verified commit hash. This protects you from malicious updates — what you install today is exactly what was reviewed and verified.
Security Audits
| Auditor | Result |
| --- | --- |
| Gen Agent Trust Hub | Pass |
| Socket | Pass |
| Snyk | Pass |