Langsmith Evaluator
Langsmith Evaluator assesses LLM outputs against predefined criteria, streamlining feedback and improving model performance through structured analysis.
Install on your platform
We auto-selected Claude Code based on this skill’s supported platforms.
Run in terminal (recommended)
claude mcp add langsmith-evaluator npx -- -y @trustedskills/langsmith-evaluator
Or manually add to ~/.claude/settings.json
{
"mcpServers": {
"langsmith-evaluator": {
"command": "npx",
"args": [
"-y",
"@trustedskills/langsmith-evaluator"
]
}
}
}
Requires Claude Code (the claude CLI). Run claude --version to verify your install.
About This Skill
What it does
The Langsmith Evaluator skill enables AI agents to assess LLM outputs against predefined criteria, providing structured feedback and facilitating model performance improvement. It supports both offline evaluation (comparing agent output to expected values in a dataset) and online evaluation (real-time quality checks on production runs). Crucially, evaluators should be written only after inspecting actual agent outputs and understanding their structure via LangSmith traces.
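The offline flow described above can be sketched as a minimal reference-based evaluator. Everything here is illustrative: the `answer` output field and the `correctness` metric name are assumptions, and a real evaluator should only be written after confirming the actual output shape in LangSmith traces.

```python
# Minimal reference-based evaluator sketch for offline evaluation.
# The "answer" field and "correctness" key are illustrative assumptions;
# verify the real output structure in traces before relying on them.
def correctness_evaluator(run, example):
    predicted = (run.outputs or {}).get("answer", "")
    expected = (example.outputs or {}).get("answer", "")
    return {
        "key": "correctness",  # evaluators return a single named metric
        "score": float(predicted.strip().lower() == expected.strip().lower()),
    }
```

The return shape (one `key`, one `score`) reflects the single-metric convention noted in the tips below.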
When to use it
This skill is useful when:
- You want to compare an AI agent's output against expected values in a dataset for evaluation purposes.
- You need real-time quality checks on production runs of your AI agent, without relying on reference data.
- You are looking for structured feedback and analysis of LLM outputs beyond simple text comparisons.
- You want to systematically improve the performance of an AI agent by identifying areas where its output deviates from desired results.
Key capabilities
- Offline Evaluation: Evaluates agent outputs against dataset examples.
- Online Evaluation: Performs real-time quality checks on production runs.
- Structured Feedback: Provides metrics based on predefined criteria, rather than freeform text.
- Trace Integration: Leverages LangSmith traces to understand output structure and debug evaluation logic.
- CLI Tooling: Includes a CLI tool for uploading evaluators.
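Wiring an evaluator into an offline run might look like the sketch below, assuming the langsmith Python SDK is available. The dataset name, target function, and evaluator are all placeholders, and the `evaluate` import is deferred inside the function so the snippet stands alone.

```python
def run_offline_eval(target_fn, dataset_name="My Dataset"):
    """Sketch: score target_fn against a LangSmith dataset. Assumes the
    langsmith SDK is installed and LANGSMITH_API_KEY is set in the env."""
    from langsmith.evaluation import evaluate  # deferred SDK import

    def exact_match(run, example):
        # One metric per evaluator: a single key/score pair.
        return {"key": "exact_match", "score": run.outputs == example.outputs}

    return evaluate(
        target_fn,               # the function or chain under test
        data=dataset_name,       # dataset of input / expected-output pairs
        evaluators=[exact_match],
    )
```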
Example prompts
While this skill doesn't directly respond to user prompts, here are examples of how you might use it after an agent has run:
- "Run the Langsmith Evaluator against this agent output and compare it to the expected answer in dataset 'My Dataset'."
- "Perform a real-time quality check on this production run using the 'Accuracy' evaluator."
- "Show me the LangSmith traces for this run so I can understand the structure of the output before writing an evaluator."
Tips & gotchas
- Inspect Before You Implement: The most important step is to examine agent outputs and their structure within LangSmith traces before writing any evaluation logic. Assume nothing about the format; always verify it first.
- Single Metric per Evaluator: Each evaluator function must return only one metric. For multiple metrics, create separate evaluator functions.
- Authentication Required: You'll need to set the LANGSMITH_API_KEY environment variable or use the --api-key flag when using the CLI tools.
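The single-metric rule in the tips above means splitting, for example, an accuracy check and a length check into two evaluator functions rather than returning both from one. A hypothetical sketch (the `answer` field name is assumed):

```python
# Two separate evaluators, one metric each, per the single-metric rule.
# The "answer" output field is an assumption for illustration.
def accuracy(run, example):
    return {
        "key": "accuracy",
        "score": float(run.outputs.get("answer") == example.outputs.get("answer")),
    }

def answer_length(run, example):
    # A second metric requires its own evaluator function.
    return {"key": "answer_length", "score": len(run.outputs.get("answer", ""))}
```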
TrustedSkills Verification
Unlike other registries that point to live repositories, TrustedSkills pins every skill to a verified commit hash. This protects you from malicious updates — what you install today is exactly what was reviewed and verified.
Security Audits
| Audit | Result |
| --- | --- |
| Gen Agent Trust Hub | Pass |
| Socket | Pass |
| Snyk | Pass |
🏢 Official
Published by the company or team that built the technology.