Bedrock AgentCore Evaluations
This skill assesses LLM outputs using the AgentCore framework, producing structured evaluations that support improved reasoning and alignment.
Install on your platform
Run in terminal (recommended)
```shell
claude mcp add bedrock-agentcore-evaluations npx -- -y @trustedskills/bedrock-agentcore-evaluations
```
Or manually add to ~/.claude/settings.json
```json
{
  "mcpServers": {
    "bedrock-agentcore-evaluations": {
      "command": "npx",
      "args": [
        "-y",
        "@trustedskills/bedrock-agentcore-evaluations"
      ]
    }
  }
}
```
Requires Claude Code (`claude` CLI). Run `claude --version` to verify your install.
About This Skill
What it does
This skill provides evaluation capabilities for Bedrock agents. It allows you to assess agent performance based on predefined metrics and criteria, providing feedback loops for improvement. The evaluations can be used to track progress, identify areas of weakness, and optimize agent behavior over time.
When to use it
- Debugging Agent Behavior: Use this skill when an agent is not performing as expected to pinpoint the root cause through structured evaluation.
- Measuring Improvement: Track changes made to an agent's configuration or tools by evaluating its performance before and after modifications.
- Benchmarking Different Agents: Compare the effectiveness of multiple agents on a standardized set of tasks using consistent evaluation criteria.
- Training Data Generation: Use evaluations to identify scenarios where the agent struggles, creating targeted training data for refinement.
Key capabilities
- Evaluation metric definition
- Performance tracking
- Agent feedback loops
- Standardized task assessment
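To make "performance tracking" concrete, here is a minimal sketch of recording metric scores across evaluation runs and watching the recent trend. All names here (`EvalTracker`, `record`, `trend`) are illustrative assumptions, not the skill's actual API.

```python
# Illustrative only: a minimal performance-tracking loop,
# NOT the Bedrock AgentCore Evaluations API.
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class EvalTracker:
    """Records per-run metric scores so regressions are easy to spot."""
    history: dict[str, list[float]] = field(default_factory=dict)

    def record(self, metric: str, score: float) -> None:
        # Append this run's score under the metric name.
        self.history.setdefault(metric, []).append(score)

    def trend(self, metric: str) -> float:
        """Average of the most recent three scores for a metric."""
        scores = self.history.get(metric, [])
        return mean(scores[-3:]) if scores else 0.0


tracker = EvalTracker()
for score in (0.62, 0.71, 0.78, 0.83):
    tracker.record("task_success_rate", score)

print(round(tracker.trend("task_success_rate"), 2))  # mean of the last three scores
```

Keeping a rolling window rather than a single latest score smooths out run-to-run noise when judging whether a configuration change actually helped.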
Example prompts
- "Evaluate the agent's response to this user query: 'Summarize the key findings of this research paper.'"
- "Run a performance evaluation on the agent using the 'customer service resolution rate' metric."
- "Compare the agent’s performance on Task A versus Task B, and provide a detailed report."
Tips & gotchas
This skill requires careful definition of evaluation metrics to ensure accurate and meaningful results. Ensure that your evaluation criteria are aligned with the desired agent behavior and objectives.
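One way to keep criteria aligned with desired behavior is to express each criterion as an explicit, testable check over the agent's output, and the metric as the fraction of checks that pass. The sketch below illustrates that idea only; the names and structure are hypothetical, not the skill's API.

```python
# Hypothetical sketch of defining an evaluation metric as explicit criteria;
# this is NOT the Bedrock AgentCore API, just the underlying idea.
from typing import Callable

Criterion = Callable[[str], bool]  # judges one agent response


def score(response: str, criteria: dict[str, Criterion]) -> float:
    """Fraction of criteria the response satisfies (0.0 to 1.0)."""
    if not criteria:
        return 0.0
    passed = sum(1 for check in criteria.values() if check(response))
    return passed / len(criteria)


# Example criteria for a customer-service resolution check.
criteria = {
    "mentions_refund": lambda r: "refund" in r.lower(),
    "polite_closing": lambda r: r.rstrip().lower().endswith("thank you."),
    "under_limit": lambda r: len(r.split()) <= 50,
}

reply = "Your refund has been issued. Thank you."
print(score(reply, criteria))  # all three criteria pass -> 1.0
```

Making each criterion a named, inspectable check keeps the metric auditable: when a score drops, you can see exactly which expectation the agent stopped meeting.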
TrustedSkills Verification
Unlike other registries that point to live repositories, TrustedSkills pins every skill to a verified commit hash. This protects you from malicious updates — what you install today is exactly what was reviewed and verified.
Security Audits
| Auditor | Result |
| --- | --- |
| Gen Agent Trust Hub | Pass |
| Socket | Pass |
| Snyk | Pass |