Trustworthy Experiments
This skill guides you to design and execute reliable experiments by prioritizing factual accuracy and minimizing bias for robust results.
Install on your platform
We auto-selected Claude Code based on this skill’s supported platforms.
Run in terminal (recommended)
claude mcp add trustworthy-experiments npx -- -y @trustedskills/trustworthy-experiments
Or manually add to ~/.claude/settings.json
{
"mcpServers": {
"trustworthy-experiments": {
"command": "npx",
"args": [
"-y",
"@trustedskills/trustworthy-experiments"
]
}
}
}Requires Claude Code (claude CLI). Run claude --version to verify your install.
About This Skill
What it does
Trustworthy Experiments is a framework designed to guide you through running controlled experiments (A/B tests) that yield reliable and actionable results. It addresses common pitfalls in experimentation, aiming to minimize false positives and ensure the validity of findings. The skill draws on industry best practices and insights from experts like Ronny Kohavi, who has identified significant rates of experiment failure and invalid results due to factors like sample ratio mismatch.
When to use it
- Designing A/B tests for product features or improvements.
- Calculating appropriate sample sizes and experiment runtime for statistical power.
- Validating the results of experiments before making decisions about implementation.
- Diagnosing unexpected or suspicious experimental outcomes (Twyman's Law).
- Selecting metrics that balance short-term gains with long-term value (OEC).
Key capabilities
- Provides a framework for designing and executing reliable A/B tests.
- Helps determine appropriate sample sizes and experiment duration.
- Offers guidance on validating experimental results to avoid false positives.
- Addresses the importance of choosing balanced metrics (OEC).
- Supports diagnosing issues with experiment outcomes using Twyman's Law.
Example prompts
- "Help me design an A/B test for [feature name] and determine the necessary sample size."
- "I ran an experiment on [metric], but the results seem suspicious. Can you help me diagnose what might be wrong?"
- "What metrics should I use to evaluate the success of [product change]?"
Tips & gotchas
- Requires a sufficient user base (tens of thousands minimum) for valid results.
- Controlled experiments are not appropriate for one-time decisions or situations lacking user choice.
- Experiments require time; they aren't suitable for immediate decision-making.
Tags
TrustedSkills Verification
Unlike other registries that point to live repositories, TrustedSkills pins every skill to a verified commit hash. This protects you from malicious updates — what you install today is exactly what was reviewed and verified.
Security Audits
| Gen Agent Trust Hub | Pass |
| Socket | Pass |
| Snyk | Pass |
🌐 Community
Passed automated security scans.