TensorRT-LLM
Provides LLMs with guidance and assistance for building AI and machine learning applications.
Install on your platform
We auto-selected Claude Code based on this skill’s supported platforms.
Run in terminal (recommended)
claude mcp add tensorrt-llm npx -- -y @trustedskills/tensorrt-llm
Or manually add to ~/.claude/settings.json
{
  "mcpServers": {
    "tensorrt-llm": {
      "command": "npx",
      "args": [
        "-y",
        "@trustedskills/tensorrt-llm"
      ]
    }
  }
}

Requires Claude Code (claude CLI). Run claude --version to verify your install.
About This Skill
What it does
This skill enables AI agents to deploy and manage Large Language Models using NVIDIA TensorRT-LLM, a high-performance inference engine. It optimizes model execution through quantization, kernel fusion, and in-flight batching to drastically reduce latency and raise throughput on GPU hardware.
When to use it
- Deploying open-source LLMs like Llama or Mistral in production environments requiring low-latency responses.
- Optimizing inference throughput for high-concurrency chatbot applications or API services.
- Reducing memory footprint and computational cost by applying post-training quantization techniques.
- Serving models on NVIDIA GPUs to leverage hardware-specific accelerations not available in standard Python runtimes.
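To make the quantization use case above concrete, here is a minimal pure-Python sketch of symmetric per-tensor INT8 quantization, the basic idea behind the weight quantization an engine build applies. This is an illustration of the technique only, not TensorRT-LLM's actual implementation.

```python
# Illustrative sketch: symmetric per-tensor INT8 quantization.
# A single scale maps floats onto the int8 range [-127, 127]; dequantizing
# multiplies back by the scale, recovering each value to within half a step.

def quantize_int8(values):
    """Quantize a list of floats to int8 codes plus a per-tensor scale."""
    scale = max(abs(v) for v in values) / 127.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Map int8 codes back to approximate float values."""
    return [c * scale for c in codes]

weights = [0.02, -1.27, 0.64, 0.9]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
print(codes)  # [2, -127, 64, 90]
# Round-trip error is bounded by half a quantization step (scale / 2).
assert all(abs(w - r) <= scale / 2 + 1e-12 for w, r in zip(weights, restored))
```

The memory saving follows directly: each weight shrinks from 4 bytes (FP32) or 2 bytes (FP16) to 1 byte, at the cost of the bounded rounding error shown above.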
Key capabilities
- Model Optimization: Automatically applies quantization (INT8, INT4) and reduced-precision formats (FP16, FP8) to reduce model size and speed up inference.
- Kernel Fusion: Combines multiple operations into single GPU kernels to minimize overhead.
- In-Flight Batching: Continuously slots incoming requests into the running batch so finished sequences free GPU capacity immediately, maximizing utilization during inference.
- Engine Building: Compiles PyTorch models into optimized TensorRT engines for deployment.
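The batching capability above is easiest to see in a toy simulation. The sketch below is not TensorRT-LLM's scheduler; it is a hypothetical pure-Python model where each request needs a fixed number of decode steps, comparing static batching (the whole batch waits for its slowest member) with in-flight batching (finished requests leave immediately and queued ones take their slots).

```python
# Toy comparison of static vs. in-flight batching. One "step" is one decode
# iteration applied to every request currently in the batch.
from collections import deque

def static_batching_steps(lengths, max_batch_size):
    """Whole batch is held until its longest request finishes."""
    queue = deque(lengths)
    steps = 0
    while queue:
        batch = [queue.popleft() for _ in range(min(max_batch_size, len(queue)))]
        steps += max(batch)  # slowest request gates the entire batch
    return steps

def inflight_batching_steps(lengths, max_batch_size):
    """Finished requests exit each step; queued requests refill free slots."""
    queue = deque(lengths)
    active = []
    steps = 0
    while queue or active:
        while queue and len(active) < max_batch_size:
            active.append(queue.popleft())
        steps += 1
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps

lengths = [8, 1, 1, 1]  # one long request, three short ones
print(static_batching_steps(lengths, 2))    # 9
print(inflight_batching_steps(lengths, 2))  # 8
```

The gap widens as request lengths grow more uneven: short requests no longer wait behind long ones, which is why continuous batching lifts throughput under high concurrency.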
Example prompts
- "Build a TensorRT-LLM engine from the Llama-2-7b model using FP16 precision and configure it for 8-GPU parallel serving."
- "Optimize this Hugging Face model by applying INT4 quantization and generate an inference engine compatible with A100 GPUs."
- "Set up an in-flight batching configuration for the Mistral-7B model to handle 50 concurrent requests with minimal latency."
Tips & gotchas
Ensure your environment has compatible NVIDIA drivers and a matching CUDA version installed before attempting to build engines. Quantization can significantly improve speed but may slightly reduce output quality, so test different precision levels (e.g., FP16 vs. INT8) against your accuracy requirements.
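One quick way to check the driver/CUDA pairing mentioned above is to read the banner line that nvidia-smi prints. The helper below is a hedged sketch: the sample line reflects nvidia-smi's usual banner format, but you would feed it the real output of running `nvidia-smi` on your machine.

```python
# Hedged sketch: extract driver and CUDA versions from nvidia-smi's banner
# line, so a build script can fail fast on an incompatible environment.
import re

def parse_nvidia_smi_banner(text):
    """Return (driver_version, cuda_version) from nvidia-smi output, or None."""
    m = re.search(r"Driver Version:\s*([\d.]+).*CUDA Version:\s*([\d.]+)", text)
    return m.groups() if m else None

# Sample banner line in nvidia-smi's usual format (values are illustrative).
sample = "| NVIDIA-SMI 535.104.05   Driver Version: 535.104.05   CUDA Version: 12.2 |"
print(parse_nvidia_smi_banner(sample))  # ('535.104.05', '12.2')
```

In practice you would capture the text with `subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout` and compare the parsed CUDA version against what your TensorRT-LLM build requires.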
TrustedSkills Verification
Unlike other registries that point to live repositories, TrustedSkills pins every skill to a verified commit hash. This protects you from malicious updates — what you install today is exactly what was reviewed and verified.
Security Audits
| Auditor | Result |
| --- | --- |
| Gen Agent Trust Hub | Pass |
| Socket | Pass |
| Snyk | Pass |