NeMo Curator
NeMo Curator prepares training data for large language models: it deduplicates, filters, and organizes large text and multi-modal datasets with GPU acceleration.
Install on your platform
Run in terminal (recommended)
claude mcp add nemo-curator npx -- -y @trustedskills/nemo-curator
Or manually add to ~/.claude/settings.json
{
  "mcpServers": {
    "nemo-curator": {
      "command": "npx",
      "args": [
        "-y",
        "@trustedskills/nemo-curator"
      ]
    }
  }
}
Requires Claude Code (claude CLI). Run claude --version to verify your install.
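If the settings file already holds other keys, pasting the snippet by hand risks overwriting them. The following is a minimal sketch of merging the entry programmatically instead. The helper name add_mcp_server is hypothetical and not part of any Claude Code tooling; it only assumes the file is a JSON object with an optional mcpServers key, as in the snippet above.

```python
import json
from pathlib import Path


def add_mcp_server(settings_path: Path, name: str, command: str, args: list[str]) -> dict:
    """Merge one MCP server entry into a JSON settings file, keeping other keys.

    Hypothetical helper for illustration; not part of the claude CLI.
    """
    # Start from the existing settings if the file is present, else an empty object.
    if settings_path.exists():
        settings = json.loads(settings_path.read_text(encoding="utf-8"))
    else:
        settings = {}
    # setdefault preserves any servers that are already registered.
    servers = settings.setdefault("mcpServers", {})
    servers[name] = {"command": command, "args": list(args)}
    settings_path.parent.mkdir(parents=True, exist_ok=True)
    settings_path.write_text(json.dumps(settings, indent=2) + "\n", encoding="utf-8")
    return settings
```

Calling add_mcp_server(Path.home() / ".claude" / "settings.json", "nemo-curator", "npx", ["-y", "@trustedskills/nemo-curator"]) would write the same entry as the manual snippet. Back up the file first, since the sketch rewrites it in place.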
About This Skill
What it does
NeMo Curator is an NVIDIA toolkit for preparing high-quality training data for large language models (LLMs). It provides GPU-accelerated data curation, including deduplication, filtering of low-quality or toxic content, and handling of multi-modal datasets (text, images, video, audio). GPU acceleration makes curation substantially faster than CPU-based alternatives.
When to use it
- Preparing LLM training data from web scrapes like Common Crawl.
- Performing fast deduplication of large datasets.
- Curating datasets containing various media types (text, images, video, audio).
- Filtering out low-quality or toxic content within a dataset.
- Scaling data processing across a GPU cluster for increased efficiency.
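To make the deduplication use case above concrete, here is a CPU-only sketch of near-duplicate removal using MinHash signatures over word shingles, a common technique for this kind of task. It is not NeMo Curator's API and makes no claim about how the toolkit implements deduplication. All function names, the shingle size, the number of hashes, and the 0.8 threshold are illustrative assumptions.

```python
import hashlib
import re


def shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of k-word shingles of a document (lowercased, punctuation dropped)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items: set[str], num_hashes: int = 64) -> list[int]:
    """Take the minimum hash per seeded hash function; empty input gives an empty signature."""
    if not items:
        return []
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(a: list[int], b: list[int]) -> float:
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    if not a or not b:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def fuzzy_dedup(docs: list[str], threshold: float = 0.8) -> list[str]:
    """Keep the first document of each near-duplicate group (quadratic comparison)."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(shingles(doc))
        if any(estimated_jaccard(sig, other) >= threshold for other in kept_sigs):
            continue  # too similar to a document we already kept
        kept.append(doc)
        kept_sigs.append(sig)
    return kept
```

The pairwise comparison here is quadratic in the number of documents, which is acceptable for a demonstration only. Large-scale systems typically bucket signatures with locality-sensitive hashing so that only likely duplicates are compared.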
Key capabilities
- GPU-Accelerated Deduplication: Offers 16x faster fuzzy deduplication compared to CPU alternatives.
- Multi-Modal Data Handling: Supports curation of datasets containing text, images, video, and audio.
- Quality Filtering: Includes over 30 heuristic filters (e.g., word count, repeated lines, URL ratio) for identifying and removing low-quality data.
- Scalability: Designed to scale efficiently across GPU nodes.
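The heuristic filters named above (word count, repeated lines, URL ratio) can be pictured with a short plain-Python sketch. This is not NeMo Curator's filter API. The function names and every threshold below are hypothetical values chosen for illustration.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")


def word_count(text: str) -> int:
    """Number of whitespace-separated tokens."""
    return len(text.split())


def repeated_line_fraction(text: str) -> float:
    """Share of non-empty lines that duplicate another line in the same document."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    return 1.0 - len(set(lines)) / len(lines)


def url_ratio(text: str) -> float:
    """Share of characters that belong to http(s) URLs."""
    if not text:
        return 0.0
    url_chars = sum(len(match) for match in URL_PATTERN.findall(text))
    return url_chars / len(text)


def passes_quality_filters(
    text: str,
    min_words: int = 5,          # hypothetical threshold
    max_repeated: float = 0.3,   # hypothetical threshold
    max_url_ratio: float = 0.2,  # hypothetical threshold
) -> bool:
    """A document is kept only if it passes every heuristic."""
    return (
        word_count(text) >= min_words
        and repeated_line_fraction(text) <= max_repeated
        and url_ratio(text) <= max_url_ratio
    )
```

Each heuristic is a cheap per-document score, so filters of this kind can be combined freely and tuned per dataset. In practice, thresholds are chosen by inspecting samples of what each filter removes.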
Example prompts
This skill is primarily a data preparation tool rather than a conversational one, but an agent can be asked to use it with prompts such as:
- "Run NeMo Curator to deduplicate the dataset from Common Crawl."
- "Filter the image and text dataset using NeMo Curator's quality filters."
- "Process the video dataset with NeMo Curator, removing any content flagged as toxic."
Tips & gotchas
- CUDA Requirement: Requires NVIDIA CUDA for optimal performance; CPU-only mode is available but significantly slower.
- Installation Complexity: Installation involves using uv pip install and specifying desired features (e.g., nemo-curator[text_cuda12]).
- Not a General Data Processing Tool: Consider alternatives like Ray Data if your needs extend beyond data curation for LLM training.
TrustedSkills Verification
Unlike other registries that point to live repositories, TrustedSkills pins every skill to a verified commit hash. This protects you from malicious updates — what you install today is exactly what was reviewed and verified.
Security Audits
| Auditor | Result |
| --- | --- |
| Gen Agent Trust Hub | Pass |
| Socket | Pass |
| Snyk | Pass |
🌐 Community
Passed automated security scans.