Nemo Curator

🌐 Community
by davila7 · vlatest · Repository

Nemo Curator prepares high-quality training data for Large Language Models (LLMs), using GPU acceleration to deduplicate, filter, and curate large text and multi-modal datasets.

Install on your platform


1. Run in terminal (recommended)

claude mcp add nemo-curator npx -- -y @trustedskills/nemo-curator

2. Or manually add to ~/.claude/settings.json

{
  "mcpServers": {
    "nemo-curator": {
      "command": "npx",
      "args": [
        "-y",
        "@trustedskills/nemo-curator"
      ]
    }
  }
}

Requires Claude Code (claude CLI). Run claude --version to verify your install.
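
If the settings file already holds other entries, the manual step amounts to merging one key into the existing JSON rather than overwriting the file. The following is a minimal sketch of that merge, assuming the file layout shown above; the helper name add_mcp_server is hypothetical and is not part of Claude Code or this skill.

```python
import json
from pathlib import Path


def add_mcp_server(settings_path: Path, name: str, command: str, args: list[str]) -> dict:
    """Merge one MCP server entry into a JSON settings file, keeping existing keys.

    Hypothetical helper for illustration; it only mirrors the JSON layout shown above.
    """
    settings = {}
    if settings_path.exists():
        text = settings_path.read_text(encoding="utf-8")
        if text.strip():
            settings = json.loads(text)

    # Create the "mcpServers" object if it is missing, then add or replace one entry.
    servers = settings.setdefault("mcpServers", {})
    servers[name] = {"command": command, "args": list(args)}

    settings_path.parent.mkdir(parents=True, exist_ok=True)
    settings_path.write_text(json.dumps(settings, indent=2) + "\n", encoding="utf-8")
    return settings
```

Calling add_mcp_server(Path.home() / ".claude" / "settings.json", "nemo-curator", "npx", ["-y", "@trustedskills/nemo-curator"]) would write the same entry as the JSON above while leaving unrelated settings in place. Back up the file first if it contains anything you care about.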

About This Skill

What it does

Nemo Curator is an NVIDIA toolkit for preparing high-quality training data for Large Language Models (LLMs). It provides GPU-accelerated data curation, including deduplication, filtering of low-quality or toxic content, and handling of multi-modal datasets (text, images, video, audio). GPU acceleration makes curation significantly faster than CPU-based alternatives.
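
To make "fuzzy deduplication" concrete, here is a small CPU-only sketch of one common approach: MinHash signatures compared by estimated Jaccard similarity. It uses only the Python standard library and does not call Nemo Curator; the function names, the 3-word shingle size, the 64 hash functions, and the 0.8 threshold are all illustrative assumptions.

```python
import hashlib
import re


def shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of k-word shingles of a lowercased document."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(shingle: str, seed: int) -> int:
    """64-bit hash of a shingle; the seed selects one of the hash functions."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(text: str, num_hashes: int = 64) -> list[int]:
    """Keep the minimum hash value per hash function (empty list for empty text)."""
    sh = shingles(text)
    if not sh:
        return []
    return [min(_hash(s, seed) for s in sh) for seed in range(num_hashes)]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature positions approximates Jaccard similarity."""
    if not sig_a or not sig_b:
        return 0.0
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def fuzzy_dedup(docs: list[str], threshold: float = 0.8) -> list[int]:
    """Return indices of documents kept; later near-duplicates are dropped.

    Pairwise comparison is O(n^2), which is fine for a demo but not for large corpora.
    """
    kept: list[tuple[int, list[int]]] = []
    for i, doc in enumerate(docs):
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, other) < threshold for _, other in kept):
            kept.append((i, sig))
    return [i for i, _ in kept]


docs = [
    "The quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog!",  # differs only in case/punctuation
    "GPU clusters speed up large scale data curation",
]
print(fuzzy_dedup(docs))  # [0, 2]
```

Production systems typically avoid the pairwise loop by bucketing signatures with locality-sensitive hashing, and that bucketing step is where GPU acceleration and multi-node scaling pay off.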

When to use it

  • Preparing LLM training data from web scrapes like Common Crawl.
  • Performing fast deduplication of large datasets.
  • Curating datasets containing various media types (text, images, video, audio).
  • Filtering out low-quality or toxic content within a dataset.
  • Scaling data processing across a GPU cluster for increased efficiency.

Key capabilities

  • GPU-Accelerated Deduplication: Offers 16x faster fuzzy deduplication compared to CPU alternatives.
  • Multi-Modal Data Handling: Supports curation of datasets containing text, images, video, and audio.
  • Quality Filtering: Includes over 30 heuristic filters (e.g., word count, repeated lines, URL ratio) for identifying and removing low-quality data.
  • Scalability: Designed to scale efficiently across GPU nodes.
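
The three heuristics named above (word count, repeated lines, URL ratio) can be sketched in plain Python to show what such filters measure. This is an illustration under stated assumptions, not Nemo Curator's implementation: the function names and every threshold below are hypothetical.

```python
import re
from collections import Counter

URL_PATTERN = re.compile(r"https?://\S+")


def word_count(text: str) -> int:
    """Number of whitespace-separated tokens."""
    return len(text.split())


def repeated_line_fraction(text: str) -> float:
    """Fraction of non-empty lines whose content appears more than once."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    counts = Counter(lines)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(lines)


def url_fraction(text: str) -> float:
    """Fraction of characters that belong to http(s) URLs."""
    if not text:
        return 0.0
    url_chars = sum(len(match) for match in URL_PATTERN.findall(text))
    return url_chars / len(text)


def passes_quality_filters(
    text: str,
    min_words: int = 5,          # hypothetical threshold
    max_repeated: float = 0.3,   # hypothetical threshold
    max_url: float = 0.2,        # hypothetical threshold
) -> bool:
    """A document is kept only if it passes every heuristic."""
    return (
        word_count(text) >= min_words
        and repeated_line_fraction(text) <= max_repeated
        and url_fraction(text) <= max_url
    )
```

Each heuristic returns a score and the final function applies thresholds, so the cut-offs can be tuned per dataset without changing how the scores are computed.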

Example prompts

This skill is a data-preparation tool rather than a conversational one, but an agent might invoke it in response to requests such as:

  1. "Run Nemo Curator to deduplicate the dataset from Common Crawl."
  2. "Filter the image and text dataset using Nemo Curator's quality filters."
  3. "Process the video dataset with Nemo Curator, removing any content flagged as toxic."

Tips & gotchas

  • CUDA Requirement: GPU acceleration requires NVIDIA CUDA; a CPU-only mode is available but significantly slower.
  • Installation Complexity: Install with uv pip install and select the features you need as extras (e.g., nemo-curator[text_cuda12]).
  • Not a General Data Processing Tool: Consider alternatives like Ray Data if your needs extend beyond data curation for LLM training.
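
Before choosing an install variant, it can help to check whether NVIDIA tooling is present on the machine. The sketch below makes two assumptions: that finding nvidia-smi on PATH is a reasonable proxy for a usable CUDA setup, and that installing the package without extras is the CPU-only route. Only the text_cuda12 extra comes from the tips above; both function names are hypothetical.

```python
import shutil


def has_nvidia_gpu_tools() -> bool:
    """Rough proxy for CUDA availability: is the nvidia-smi utility on PATH?"""
    return shutil.which("nvidia-smi") is not None


def suggest_install_command(gpu_available: bool) -> str:
    """Build a uv pip install command for the GPU or CPU case.

    Assumption: the bare package name (no extras) is used for CPU-only installs.
    """
    package = "nemo-curator[text_cuda12]" if gpu_available else "nemo-curator"
    return f'uv pip install "{package}"'


print(suggest_install_command(has_nvidia_gpu_tools()))
```

The package spec is quoted because some shells treat square brackets as glob characters.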

TrustedSkills Verification

Unlike other registries that point to live repositories, TrustedSkills pins every skill to a verified commit hash. This protects you from malicious updates — what you install today is exactly what was reviewed and verified.

Security Audits

Gen Agent Trust Hub: Pass
Socket: Pass
Snyk: Pass

Details

Version: vlatest
License:
Author: davila7
Installs: 162

Passed automated security scans.