Multimodal LLM
Gives LLMs guidance and assistance for building AI and machine learning applications.
Install on your platform
These instructions target Claude Code, selected from this skill's supported platforms.
Run in terminal (recommended)
claude mcp add multimodal-llm npx -- -y @trustedskills/multimodal-llm
Or manually add to ~/.claude/settings.json
{
  "mcpServers": {
    "multimodal-llm": {
      "command": "npx",
      "args": [
        "-y",
        "@trustedskills/multimodal-llm"
      ]
    }
  }
}
Requires Claude Code (claude CLI). Run claude --version to verify your install.
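If the settings file already holds other entries, pasting the JSON above by hand can overwrite them. The following is a minimal sketch, not part of the skill itself, of merging the entry into an existing settings dictionary. The function names add_mcp_server and update_settings_file are hypothetical helpers introduced for illustration; the server name and arguments are copied from the configuration above.

```python
import json
from pathlib import Path

# Entry copied from the manual configuration shown above.
SERVER_NAME = "multimodal-llm"
SERVER_ENTRY = {"command": "npx", "args": ["-y", "@trustedskills/multimodal-llm"]}


def add_mcp_server(settings: dict) -> dict:
    """Return a copy of settings with the multimodal-llm server registered.

    Hypothetical helper: existing keys and other servers are left untouched.
    """
    merged = dict(settings)
    servers = dict(merged.get("mcpServers", {}))
    servers[SERVER_NAME] = SERVER_ENTRY
    merged["mcpServers"] = servers
    return merged


def update_settings_file(path: Path) -> None:
    """Read the settings file if it exists, merge the entry, and write it back."""
    current = json.loads(path.read_text()) if path.exists() else {}
    path.write_text(json.dumps(add_mcp_server(current), indent=2) + "\n")
```

Calling update_settings_file(Path.home() / ".claude" / "settings.json") would apply the merge to the file named above; back up the file first if it contains settings you care about.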
About This Skill
What it does
This skill adds vision, audio, and video generation capabilities to AI agents using leading multimodal models. It covers image captioning and visual question answering, document understanding (including OCR and table extraction), speech-to-text transcription, text-to-speech synthesis, and AI video generation with Kling, Sora, Veo, and Runway. The skill references specific model IDs from providers including Anthropic, OpenAI, and Google (whose models include Veo).
When to use it
- Image Analysis: You need an agent to analyze images, identify objects, or answer questions about visual content.
- Document Processing: Your workflow requires extracting data from PDFs, charts, or diagrams.
- Audio Transcription: You want to automatically transcribe audio recordings into text.
- Video Generation: You require assistance in generating videos using AI models like Kling, Sora, Veo, or Runway.
- Voice Agent Development: You are building a voice agent that needs both speech-to-text and text-to-speech capabilities.
Key capabilities
- Image captioning and visual question answering (VQA)
- Document Optical Character Recognition (OCR) and table extraction
- Speech-to-Text transcription with speaker diarization
- Text-to-Speech voice synthesis
- AI video generation using Kling v3, Sora 2, Veo 3.1 (std/lite/fast tiers), and Runway Gen-4.5 (gen4_turbo)
- Model selection based on cost optimization and image size limits
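The last capability, choosing a model by cost and image size limit, can be pictured as a filter followed by a minimum. This is only an illustrative sketch: the model names, costs, and size limits below are placeholders invented for the example, not values taken from the skill or from any provider.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelOption:
    name: str              # placeholder name, not a real model ID
    cost_per_image: float  # placeholder cost in arbitrary units
    max_image_bytes: int   # placeholder upload limit in bytes


def pick_model(options: list[ModelOption], image_bytes: int) -> Optional[ModelOption]:
    """Return the cheapest option that accepts an image of this size, or None."""
    eligible = [o for o in options if image_bytes <= o.max_image_bytes]
    return min(eligible, key=lambda o: o.cost_per_image, default=None)


# Placeholder catalogue used only to exercise the function.
OPTIONS = [
    ModelOption("small-model", cost_per_image=1.0, max_image_bytes=1_000_000),
    ModelOption("large-model", cost_per_image=4.0, max_image_bytes=20_000_000),
]
```

With this catalogue, a 500 KB image resolves to the cheaper option, a 5 MB image falls through to the larger one, and an image above every limit returns None so the caller can resize it.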
Example prompts
- "Analyze this image [base64 encoded image data] and describe the objects present."
- "Transcribe this audio file [audio file path/URL]."
- "Generate a short video of [video description] using Veo in fast mode."
- "Extract all tables from this PDF document [PDF file path]."
Tips & gotchas
- Image Encoding: Base64-encode images before sending them for analysis.
- Image Resizing: Resize images before encoding to reduce payload size and stay within each model's image size limits.
- Model Selection: Carefully consider the provider based on accuracy requirements, cost constraints, and context window needs.
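The encoding and size tips above can be sketched with Python's standard base64 module. The 5 MiB limit is an assumption chosen for illustration, and encode_image and encoded_size are hypothetical helper names; check your provider's documentation for the real limit. Actual resizing needs an image library and is left out of this sketch.

```python
import base64

# Assumed limit for illustration only; real limits vary by provider and model.
MAX_ENCODED_BYTES = 5 * 1024 * 1024


def encoded_size(raw_size: int) -> int:
    """Length in characters of the padded base64 encoding of raw_size bytes."""
    return 4 * ((raw_size + 2) // 3)


def encode_image(data: bytes, max_encoded: int = MAX_ENCODED_BYTES) -> str:
    """Base64-encode image bytes, refusing payloads over the limit."""
    if encoded_size(len(data)) > max_encoded:
        raise ValueError("image too large after encoding; resize it first")
    return base64.b64encode(data).decode("ascii")
```

Because base64 turns every 3 bytes into 4 characters, the payload grows by roughly a third, which is why the check is applied to the encoded size and not the raw file size.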
TrustedSkills Verification
Unlike other registries that point to live repositories, TrustedSkills pins every skill to a verified commit hash. This protects you from malicious updates — what you install today is exactly what was reviewed and verified.
Security Audits
| Gen Agent Trust Hub | Pass |
| Socket | Pass |
| Snyk | Pass |
Community: Passed automated security scans.