Langsmith Evaluator
Langsmith Evaluator assesses LLM outputs against predefined criteria, streamlining feedback and improving model performance through structured analysis.
Install on your platform
We auto-selected Claude Code based on this skill’s supported platforms.
Run in terminal (recommended)
claude mcp add langsmith-evaluator npx -- -y @trustedskills/langsmith-evaluator
Or manually add to ~/.claude/settings.json
{
"mcpServers": {
"langsmith-evaluator": {
"command": "npx",
"args": [
"-y",
"@trustedskills/langsmith-evaluator"
]
}
}
}
Requires Claude Code (the claude CLI). Run claude --version to verify your install.
About This Skill
What it does
The Langsmith Evaluator skill enables AI agents to assess LLM outputs against predefined criteria, providing structured feedback and facilitating model performance improvement. It supports both offline evaluation (comparing agent output to expected values in a dataset) and online evaluation (real-time quality checks on production runs). Crucially, evaluators should be written only after inspecting actual agent outputs and understanding their structure via LangSmith traces.
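The offline flow described above can be sketched as a minimal reference-based evaluator. Everything here is illustrative: the `answer` output field and the `correctness` metric name are assumptions, and a real evaluator should only be written after confirming the actual output shape in LangSmith traces.

```python
# Minimal reference-based evaluator sketch for offline evaluation.
# The "answer" field and "correctness" key are illustrative assumptions;
# verify the real output structure in traces before relying on them.
def correctness_evaluator(run, example):
    predicted = (run.outputs or {}).get("answer", "")
    expected = (example.outputs or {}).get("answer", "")
    return {
        "key": "correctness",  # evaluators return a single named metric
        "score": float(predicted.strip().lower() == expected.strip().lower()),
    }
```

The return shape (one `key`, one `score`) reflects the single-metric convention noted in the tips below.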
When to use it
This skill is useful when:
- You want to compare an AI agent's output against expected values in a dataset for evaluation purposes.
- You need real-time quality checks on production runs of your AI agent, without relying on reference data.
- You are looking for structured feedback and analysis of LLM outputs beyond simple text comparisons.
- You want to systematically improve the performance of an AI agent by identifying areas where its output deviates from desired results.
Key capabilities
- Offline Evaluation: Evaluates agent outputs against dataset examples.
- Online Evaluation: Performs real-time quality checks on production runs.
- Structured Feedback: Provides metrics based on predefined criteria, rather than freeform text.
- Trace Integration: Leverages LangSmith traces to understand output structure and debug evaluation logic.
- CLI Tooling: Includes a CLI tool for uploading evaluators.
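Wiring an evaluator into an offline run might look like the sketch below, assuming the langsmith Python SDK is available. The dataset name, target function, and evaluator are all placeholders, and the `evaluate` import is deferred inside the function so the snippet stands alone.

```python
def run_offline_eval(target_fn, dataset_name="My Dataset"):
    """Sketch: score target_fn against a LangSmith dataset. Assumes the
    langsmith SDK is installed and LANGSMITH_API_KEY is set in the env."""
    from langsmith.evaluation import evaluate  # deferred SDK import

    def exact_match(run, example):
        # One metric per evaluator: a single key/score pair.
        return {"key": "exact_match", "score": run.outputs == example.outputs}

    return evaluate(
        target_fn,               # the function or chain under test
        data=dataset_name,       # dataset of input / expected-output pairs
        evaluators=[exact_match],
    )
```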
Example prompts
While this skill doesn't directly respond to user prompts, here are examples of how you might use it after an agent has run:
- "Run the Langsmith Evaluator against this agent output and compare it to the expected answer in dataset 'My Dataset'."
- "Perform a real-time quality check on this production run using the 'Accuracy' evaluator."
- "Show me the LangSmith traces for this run so I can understand the structure of the output before writing an evaluator."
Tips & gotchas
- Inspect Before You Implement: The most important step is to examine agent outputs and their structure within LangSmith traces before writing any evaluation logic. Assume nothing about the format; always verify it first.
- Single Metric per Evaluator: Each evaluator function must return only one metric. For multiple metrics, create separate evaluator functions.
- Authentication Required: You'll need to set the LANGSMITH_API_KEY environment variable or use the --api-key flag when using the CLI tools.
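The single-metric rule in the tips above means splitting, for example, an accuracy check and a length check into two evaluator functions rather than returning both from one. A hypothetical sketch (the `answer` field name is assumed):

```python
# Two separate evaluators, one metric each, per the single-metric rule.
# The "answer" output field is an assumption for illustration.
def accuracy(run, example):
    return {
        "key": "accuracy",
        "score": float(run.outputs.get("answer") == example.outputs.get("answer")),
    }

def answer_length(run, example):
    # A second metric requires its own evaluator function.
    return {"key": "answer_length", "score": len(run.outputs.get("answer", ""))}
```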
TrustedSkills Verification
Unlike other registries that point to live repositories, TrustedSkills pins every skill to a verified commit hash. This protects you from malicious updates — what you install today is exactly what was reviewed and verified.
Security Audits
| Audit | Result |
| --- | --- |
| Gen Agent Trust Hub | Pass |
| Socket | Pass |
| Snyk | Pass |
🏢 Official
Published by the company or team that built the technology.