LLM Evaluation

🌐 Community
by wshobson · latest · Repository

Provides LLMs with guidance and assistance for building AI and machine learning applications.

Install on your platform


1. Run in terminal (recommended)

claude mcp add llm-evaluation npx -- -y @trustedskills/llm-evaluation
2. Or manually add to ~/.claude/settings.json


~/.claude/settings.json
{
  "mcpServers": {
    "llm-evaluation": {
      "command": "npx",
      "args": [
        "-y",
        "@trustedskills/llm-evaluation"
      ]
    }
  }
}

Requires Claude Code (claude CLI). Run claude --version to verify your install.

About This Skill

What it does

The llm-evaluation skill enables users to assess the performance of large language models (LLMs) by defining evaluation criteria, scoring responses, and providing detailed feedback. It supports both automated and manual evaluation methods, making it useful for refining model outputs and ensuring alignment with desired outcomes.

When to use it

  • You need to evaluate the accuracy or quality of an LLM's response to a specific query.
  • You want to compare multiple models based on predefined metrics such as relevance, coherence, or factual correctness.
  • You are iterating on prompts and need structured feedback to improve model performance.

Key capabilities

  • Automated scoring based on user-defined criteria
  • Manual evaluation with customizable rubrics
  • Comparison of multiple model responses side by side
  • Detailed feedback generation for each response
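The capabilities above can be sketched in code. The skill's actual interface is not documented on this page, so the names below (`Rubric`, `score_response`) and the weighting scheme are purely illustrative, a minimal sketch of automated scoring against user-defined criteria:

```python
from dataclasses import dataclass

@dataclass
class Rubric:
    # criterion name -> weight; weights are assumed to sum to 1.0
    criteria: dict[str, float]

def score_response(rubric: Rubric, scores: dict[str, float]) -> float:
    """Combine per-criterion scores (0-10 scale) into one weighted total."""
    return sum(weight * scores[name] for name, weight in rubric.criteria.items())

# Side-by-side comparison of two model responses under the same rubric
rubric = Rubric({"accuracy": 0.5, "clarity": 0.25, "relevance": 0.25})
model_a = {"accuracy": 8.0, "clarity": 7.0, "relevance": 9.0}
model_b = {"accuracy": 6.0, "clarity": 9.0, "relevance": 8.0}

print(score_response(rubric, model_a))  # 8.0
print(score_response(rubric, model_b))  # 7.25
```

Weighting accuracy more heavily than style criteria, as here, keeps comparisons focused on factual correctness; the skill's own rubric format may differ.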

Example prompts

  • "Evaluate this LLM's response against the following criteria: accuracy, clarity, and relevance."
  • "Compare the outputs from Model A and Model B using a rubric focused on factual correctness."
  • "Provide detailed feedback on how well this model answered the question about climate change."

Tips & gotchas

  • Define clear evaluation criteria in advance to ensure consistent results.
  • Manual evaluations may be time-consuming for large datasets, so consider automating where possible.
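The second tip, automating where possible, can look like the following sketch: run a cheap automated check over the whole dataset first and reserve manual review for responses that fail it. The `keyword_coverage` heuristic and the data are hypothetical, not part of the skill:

```python
def keyword_coverage(response: str, expected_keywords: list[str]) -> float:
    """Fraction of expected keywords present in the response (a crude proxy
    for factual coverage; stricter automated checks can be swapped in)."""
    hits = sum(1 for kw in expected_keywords if kw.lower() in response.lower())
    return hits / len(expected_keywords)

dataset = [
    {"response": "Rising CO2 emissions drive global warming.", "keywords": ["CO2", "warming"]},
    {"response": "The weather is nice today.", "keywords": ["CO2", "warming"]},
]

# Only low-coverage responses go to (slow) manual evaluation
needs_manual_review = [
    item for item in dataset
    if keyword_coverage(item["response"], item["keywords"]) < 0.5
]
print(len(needs_manual_review))  # 1
```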

🛡️ TrustedSkills Verification

Unlike other registries that point to live repositories, TrustedSkills pins every skill to a verified commit hash. This protects you from malicious updates — what you install today is exactly what was reviewed and verified.

Security Audits

Gen Agent Trust Hub: Pass
Socket: Pass
Snyk: Pass

Details

Version: latest
License:
Author: wshobson
Installs: 2.8k


Passed automated security scans.