ML Data Pipeline Architecture
Helps with machine learning, data processing, and pipeline automation as part of automating DevOps pipelines and CI/CD workflows.
Install on your platform
We auto-selected Claude Code based on this skill’s supported platforms.
Run in terminal (recommended)
claude mcp add ml-data-pipeline-architecture npx -- -y @trustedskills/ml-data-pipeline-architecture
Or manually add to ~/.claude/settings.json
{
  "mcpServers": {
    "ml-data-pipeline-architecture": {
      "command": "npx",
      "args": [
        "-y",
        "@trustedskills/ml-data-pipeline-architecture"
      ]
    }
  }
}

Requires Claude Code (claude CLI). Run claude --version to verify your install.
About This Skill
What it does
This skill assists in designing and optimizing machine learning data pipelines, with a strong focus on efficiency and performance. It provides guidance on choosing between Polars and Pandas for data processing, implementing zero-copy architectures using Arrow to minimize memory usage, and integrating ClickHouse databases into PyTorch workflows. The skill leverages patterns and best practices for building robust and scalable ML data pipelines within DevOps and CI/CD environments.
When to use it
- When deciding between Polars and Pandas for a data pipeline task.
- To optimize memory usage when working with large datasets using Arrow's zero-copy capabilities.
- For loading data from ClickHouse databases into PyTorch DataLoaders efficiently.
- When implementing lazy evaluation techniques to handle very large datasets.
- If you need to migrate existing Pandas codebases to Polars for improved performance.
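The first use case, choosing between Polars and Pandas, can be sketched as a tiny helper. This is an illustrative heuristic only: the function name and the rough one-million-row threshold are assumptions, not the skill's actual decision tree.

```python
def choose_library(n_rows: int, needs_pandas_only_api: bool = False) -> str:
    """Illustrative heuristic: prefer Polars except where pandas is required."""
    if needs_pandas_only_api:
        # e.g. a downstream library that only accepts pandas DataFrames
        return "pandas"
    # Above roughly a million rows, Polars' multithreaded, Arrow-backed
    # engine usually outperforms pandas by a wide margin.
    return "polars" if n_rows > 1_000_000 else "either"
```

For a 50-million-row dataset, this heuristic would answer "polars"; below the threshold, either library is typically fine and ecosystem fit decides.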
Key capabilities
- Polars vs. Pandas Decision Guidance: Provides a decision tree based on dataset size and operations to help choose the appropriate library.
- Zero-Copy Architecture Implementation: Demonstrates how to minimize memory copies when moving data between databases, pandas DataFrames, NumPy arrays, and PyTorch tensors using Arrow.
- ClickHouse Integration Patterns: Offers multiple patterns for integrating ClickHouse databases with Polars and PyTorch, including Arrow streaming and native Polars support.
- Lazy Evaluation Support: Guidance on utilizing lazy evaluation techniques within Polars for large datasets.
Example prompts
- "I have a dataset of 50 million rows. Should I use Pandas or Polars?"
- "How can I load data from ClickHouse into PyTorch without creating unnecessary copies in memory?"
- "Show me an example of using Arrow to efficiently process data from ClickHouse with Polars."
Tips & gotchas
- Polars Preference: This skill defaults to Polars. To use Pandas instead, you must add "# polars-exception: <reason>" at the top of the file.
- Self-Evolving Skill: The skill is designed to improve through usage; fix any incorrect instructions or workarounds directly within this file.
- Version Requirements: When using Polars' native ClickHouse support, ensure compatibility by checking version requirements (see pola.rs).
TrustedSkills Verification
Unlike other registries that point to live repositories, TrustedSkills pins every skill to a verified commit hash. This protects you from malicious updates — what you install today is exactly what was reviewed and verified.
Security Audits
| Auditor | Result |
| --- | --- |
| Gen Agent Trust Hub | Pass |
| Socket | Pass |
| Snyk | Pass |