TensorRT-LLM
Provides LLMs with guidance and assistance for building AI and machine learning applications.
Install on your platform
We auto-selected Claude Code based on this skill’s supported platforms.
Run in terminal (recommended)
claude mcp add tensorrt-llm npx -- -y @trustedskills/tensorrt-llm
Or manually add to ~/.claude/settings.json
{
  "mcpServers": {
    "tensorrt-llm": {
      "command": "npx",
      "args": [
        "-y",
        "@trustedskills/tensorrt-llm"
      ]
    }
  }
}

Requires Claude Code (claude CLI). Run claude --version to verify your install.
About This Skill
What it does
This skill enables AI agents to deploy and manage Large Language Models using NVIDIA TensorRT-LLM, a high-performance inference engine. It optimizes model execution through quantization, kernel fusion, and in-flight batching to drastically reduce latency and raise throughput on GPU hardware.
When to use it
- Deploying open-source LLMs like Llama or Mistral in production environments requiring low-latency responses.
- Optimizing inference throughput for high-concurrency chatbot applications or API services.
- Reducing memory footprint and computational cost by applying post-training quantization techniques.
- Serving models on NVIDIA GPUs to leverage hardware-specific accelerations not available in standard Python runtimes.
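To make the quantization use case above concrete, here is a minimal pure-Python sketch of symmetric per-tensor INT8 quantization, the basic idea behind the weight quantization an engine build applies. This is an illustration of the technique only, not TensorRT-LLM's actual implementation.

```python
# Illustrative sketch: symmetric per-tensor INT8 quantization.
# A single scale maps floats onto the int8 range [-127, 127]; dequantizing
# multiplies back by the scale, recovering each value to within half a step.

def quantize_int8(values):
    """Quantize a list of floats to int8 codes plus a per-tensor scale."""
    scale = max(abs(v) for v in values) / 127.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Map int8 codes back to approximate float values."""
    return [c * scale for c in codes]

weights = [0.02, -1.27, 0.64, 0.9]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
print(codes)  # [2, -127, 64, 90]
# Round-trip error is bounded by half a quantization step (scale / 2).
assert all(abs(w - r) <= scale / 2 + 1e-12 for w, r in zip(weights, restored))
```

The memory saving follows directly: each weight shrinks from 4 bytes (FP32) or 2 bytes (FP16) to 1 byte, at the cost of the bounded rounding error shown above.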
Key capabilities
- Model Optimization: Automatically applies quantization (INT8, INT4) and reduced-precision formats (FP16, FP8) to reduce model size and speed up inference.
- Kernel Fusion: Combines multiple operations into single GPU kernels to minimize overhead.
- In-Flight Batching: Continuously slots incoming requests into the running batch so finished sequences free GPU capacity immediately, maximizing utilization during inference.
- Engine Building: Compiles PyTorch models into optimized TensorRT engines for deployment.
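The batching capability above is easiest to see in a toy simulation. The sketch below is not TensorRT-LLM's scheduler; it is a hypothetical pure-Python model where each request needs a fixed number of decode steps, comparing static batching (the whole batch waits for its slowest member) with in-flight batching (finished requests leave immediately and queued ones take their slots).

```python
# Toy comparison of static vs. in-flight batching. One "step" is one decode
# iteration applied to every request currently in the batch.
from collections import deque

def static_batching_steps(lengths, max_batch_size):
    """Whole batch is held until its longest request finishes."""
    queue = deque(lengths)
    steps = 0
    while queue:
        batch = [queue.popleft() for _ in range(min(max_batch_size, len(queue)))]
        steps += max(batch)  # slowest request gates the entire batch
    return steps

def inflight_batching_steps(lengths, max_batch_size):
    """Finished requests exit each step; queued requests refill free slots."""
    queue = deque(lengths)
    active = []
    steps = 0
    while queue or active:
        while queue and len(active) < max_batch_size:
            active.append(queue.popleft())
        steps += 1
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps

lengths = [8, 1, 1, 1]  # one long request, three short ones
print(static_batching_steps(lengths, 2))    # 9
print(inflight_batching_steps(lengths, 2))  # 8
```

The gap widens as request lengths grow more uneven: short requests no longer wait behind long ones, which is why continuous batching lifts throughput under high concurrency.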
Example prompts
- "Build a TensorRT-LLM engine from the Llama-2-7b model using FP16 precision and configure it for 8-GPU parallel serving."
- "Optimize this Hugging Face model by applying INT4 quantization and generate an inference engine compatible with A100 GPUs."
- "Set up an in-flight batching configuration for the Mistral-7B model to handle 50 concurrent requests with minimal latency."
Tips & gotchas
Ensure your environment has compatible NVIDIA drivers and a matching CUDA version installed before attempting to build engines. Quantization can significantly improve speed but may slightly reduce output quality, so test different precision levels (e.g., FP16 vs. INT8) against your accuracy requirements.
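One quick way to check the driver/CUDA pairing mentioned above is to read the banner line that nvidia-smi prints. The helper below is a hedged sketch: the sample line reflects nvidia-smi's usual banner format, but you would feed it the real output of running `nvidia-smi` on your machine.

```python
# Hedged sketch: extract driver and CUDA versions from nvidia-smi's banner
# line, so a build script can fail fast on an incompatible environment.
import re

def parse_nvidia_smi_banner(text):
    """Return (driver_version, cuda_version) from nvidia-smi output, or None."""
    m = re.search(r"Driver Version:\s*([\d.]+).*CUDA Version:\s*([\d.]+)", text)
    return m.groups() if m else None

# Sample banner line in nvidia-smi's usual format (values are illustrative).
sample = "| NVIDIA-SMI 535.104.05   Driver Version: 535.104.05   CUDA Version: 12.2 |"
print(parse_nvidia_smi_banner(sample))  # ('535.104.05', '12.2')
```

In practice you would capture the text with `subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout` and compare the parsed CUDA version against what your TensorRT-LLM build requires.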
TrustedSkills Verification
Unlike other registries that point to live repositories, TrustedSkills pins every skill to a verified commit hash. This protects you from malicious updates — what you install today is exactly what was reviewed and verified.
Security Audits
| Auditor | Result |
| --- | --- |
| Gen Agent Trust Hub | Pass |
| Socket | Pass |
| Snyk | Pass |