LLM Evaluation

🌐 Community
by rmyndharis · vlatest · Repository

Provides LLMs guidance and assistance for building AI and machine learning applications.

Install on your platform


1. Run in terminal (recommended)
claude mcp add rmyndharis-llm-evaluation npx -- -y @trustedskills/rmyndharis-llm-evaluation
2. Or manually add to ~/.claude/settings.json

~/.claude/settings.json
{
  "mcpServers": {
    "rmyndharis-llm-evaluation": {
      "command": "npx",
      "args": [
        "-y",
        "@trustedskills/rmyndharis-llm-evaluation"
      ]
    }
  }
}

Requires Claude Code (claude CLI). Run claude --version to verify your install.

About This Skill

What it does

This skill, rmyndharis-llm-evaluation, provides a framework for evaluating Large Language Models (LLMs). It allows users to define evaluation criteria and scoring rubrics. The tool then applies these rubrics to LLM outputs, generating structured reports that highlight strengths and weaknesses based on the specified metrics.
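To make the workflow concrete, here is a hypothetical rubric definition in the same JSON style as the config above. The actual schema used by rmyndharis-llm-evaluation is not documented on this page, so every field name here ("rubric", "criteria", "scale", and so on) is illustrative only — a sketch of what "criteria plus scoring rubric" could look like, not the skill's real format.

```json
{
  "rubric": "creative_writing",
  "criteria": [
    {
      "name": "coherence",
      "description": "The response maintains a consistent narrative voice and logical flow.",
      "scale": { "min": 1, "max": 5 }
    },
    {
      "name": "originality",
      "description": "The response avoids clichés and offers novel ideas or imagery.",
      "scale": { "min": 1, "max": 5 }
    }
  ]
}
```

A rubric like this would then be referenced by name in a prompt (e.g. the "creative_writing" rubric) and applied to one or more LLM outputs.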

When to use it

  • Comparing different LLMs: Evaluate several models against a common set of criteria to determine which best suits your needs.
  • Assessing model performance after fine-tuning: Measure improvements in an LLM's capabilities following fine-tuning efforts.
  • Identifying areas for improvement: Pinpoint specific weaknesses in an LLM’s responses based on defined evaluation metrics.
  • Ensuring consistent quality: Maintain a baseline level of output quality across multiple LLMs or different use cases.

Key capabilities

  • Define custom evaluation criteria.
  • Create scoring rubrics with detailed guidelines.
  • Apply rubrics to LLM outputs for structured assessment.
  • Generate reports summarizing performance against defined metrics.
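The capabilities above culminate in a structured report. Since the page does not show the skill's actual output format, the shape below is an assumed example of what a per-model assessment could contain — scores keyed by criterion plus free-text strengths and weaknesses. Treat the field names as placeholders.

```json
{
  "rubric": "creative_writing",
  "model": "Model A",
  "scores": { "coherence": 4, "originality": 3 },
  "strengths": ["Consistent narrative voice throughout"],
  "weaknesses": ["Relies on familiar plot devices"]
}
```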

Example prompts

  • "Evaluate this response from the LLM: '[LLM Response]' using the 'creative_writing' rubric."
  • "Compare the responses of Model A and Model B to the prompt '[Prompt]' based on the 'factual_accuracy' criteria."
  • "Generate a scoring rubric for evaluating code generation quality, focusing on correctness and efficiency."

Tips & gotchas

The effectiveness of this skill depends heavily on well-defined evaluation criteria and clear scoring rubrics. Ambiguous or poorly defined metrics will lead to inconsistent and unreliable results.
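As a sketch of that difference (again with illustrative field names, not the skill's actual schema): a criterion like "the response is good" invites inconsistent scores, while one that names observable, countable properties does not.

```json
{
  "vague": {
    "name": "quality",
    "description": "The response is good."
  },
  "specific": {
    "name": "factual_accuracy",
    "description": "Start at 5; deduct 1 point for each verifiable claim that is unsupported or incorrect."
  }
}
```

The second form gives the evaluator a concrete procedure, so repeated runs over the same output should converge on the same score.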


TrustedSkills Verification

Unlike other registries that point to live repositories, TrustedSkills pins every skill to a verified commit hash. This protects you from malicious updates — what you install today is exactly what was reviewed and verified.

Security Audits

Gen Agent Trust Hub: Pass
Socket: Pass
Snyk: Pass

Details

Version: vlatest
License:
Author: rmyndharis
Installs: 15


Passed automated security scans.