Multimodal Llm

Name: Multimodal Llm
Author: yonatangross

🌐Community

Provides guidance and assistance to LLMs for building AI and machine learning applications.

Install on your platform

Run in terminal (recommended)

claude mcp add multimodal-llm npx -- -y @trustedskills/multimodal-llm

Or manually add to ~/.claude/settings.json

{
  "mcpServers": {
    "multimodal-llm": {
      "command": "npx",
      "args": [
        "-y",
        "@trustedskills/multimodal-llm"
      ]
    }
  }
}
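
If settings.json already holds other settings, the entry can be merged in rather than pasted over the file. The following is a minimal Python sketch of that merge, assuming the file is plain JSON with a top-level "mcpServers" object as shown above; the helper name add_mcp_server is hypothetical and not part of the claude CLI.

```python
import json
from pathlib import Path

# Entry copied from the JSON shown above.
SERVER_NAME = "multimodal-llm"
SERVER_ENTRY = {"command": "npx", "args": ["-y", "@trustedskills/multimodal-llm"]}


def add_mcp_server(settings_path: Path) -> dict:
    """Merge the server entry into settings_path, keeping existing keys.

    Hypothetical helper: assumes the file is plain JSON (or absent/empty).
    """
    settings = {}
    if settings_path.exists():
        text = settings_path.read_text(encoding="utf-8")
        if text.strip():
            settings = json.loads(text)

    # Create "mcpServers" if missing, then add or overwrite only our entry.
    servers = settings.setdefault("mcpServers", {})
    servers[SERVER_NAME] = SERVER_ENTRY

    settings_path.parent.mkdir(parents=True, exist_ok=True)
    settings_path.write_text(json.dumps(settings, indent=2) + "\n", encoding="utf-8")
    return settings
```

It could be called as add_mcp_server(Path.home() / ".claude" / "settings.json"); backing up the file first is a sensible precaution.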

Requires Claude Code (claude CLI). Run claude --version to verify your install.

About This Skill

What it does

This skill adds vision, audio, and video generation capabilities to AI agents using leading multimodal models. It covers image captioning and visual question answering, document understanding (including OCR and table extraction), speech-to-text transcription, text-to-speech synthesis, and AI video generation with Kling, Sora, Veo, and Runway. The skill relies on specific model IDs from Anthropic, OpenAI, and Google (including Veo) for these functions.

When to use it

Image Analysis: You need an agent to analyze images, identify objects, or answer questions about visual content.
Document Processing: Your workflow requires extracting data from PDFs, charts, or diagrams.
Audio Transcription: You want to automatically transcribe audio recordings into text.
Video Generation: You require assistance in generating videos using AI models like Kling, Sora, Veo, or Runway.
Voice Agent Development: You are building a voice agent that needs both speech-to-text and text-to-speech capabilities.

Key capabilities

Image captioning and visual question answering (VQA)
Document Optical Character Recognition (OCR) and table extraction
Speech-to-Text transcription with speaker diarization
Text-to-Speech voice synthesis
AI video generation using Kling v3, Sora 2, Veo 3.1 (std/lite/fast tiers), and Runway Gen-4.5 (gen4_turbo)
Model selection based on cost optimization and image size limits
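
The last capability, model selection by cost and image size limit, can be pictured as a small filter-then-minimize step. The Python sketch below is only an illustration of that idea: the model names, costs, and size limits are placeholders invented for the example, not real pricing or limits, and the skill's own selection logic may differ.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ModelOption:
    name: str
    cost_per_image: float   # hypothetical relative cost
    max_image_side_px: int  # hypothetical limit on the longest image side


# Placeholder catalogue: names and numbers are illustrative only.
CATALOGUE = [
    ModelOption("model-small", 1.0, 1024),
    ModelOption("model-medium", 3.0, 2048),
    ModelOption("model-large", 10.0, 8000),
]


def pick_model(width: int, height: int,
               options: Sequence[ModelOption] = CATALOGUE) -> Optional[ModelOption]:
    """Return the cheapest option whose size limit fits the image, or None."""
    longest = max(width, height)
    fitting = [m for m in options if m.max_image_side_px >= longest]
    if not fitting:
        return None  # caller should resize the image first
    return min(fitting, key=lambda m: m.cost_per_image)
```

With this placeholder catalogue, an 800x600 image selects "model-small", while an image larger than every limit returns None, signalling that it needs resizing first.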

Example prompts

"Analyze this image [base64 encoded image data] and describe the objects present."
"Transcribe this audio file [audio file path/URL]."
"Generate a short video of [video description] using Veo in fast mode."
"Extract all tables from this PDF document [PDF file path]."

Tips & gotchas

Image Encoding: When sending images for analysis, ensure they are base64 encoded.
Image Resizing: Resize images before encoding to optimize performance and adhere to model limitations.
Model Selection: Carefully consider the provider based on accuracy requirements, cost constraints, and context window needs.
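
The encoding and resizing tips can be sketched in Python with the standard library alone. This is a minimal sketch under stated assumptions: fit_within only computes target dimensions (the actual resize would be done with an imaging library of your choice), the max_side value is whatever limit your chosen model documents, and image_block follows the base64 image shape used by Anthropic's Messages API; other providers expect different request shapes.

```python
import base64
from typing import Tuple


def fit_within(width: int, height: int, max_side: int) -> Tuple[int, int]:
    """Scale (width, height) down so the longest side is at most max_side.

    Aspect ratio is preserved; images already within the limit are unchanged.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def encode_image(data: bytes) -> str:
    """Base64-encode raw image bytes as an ASCII string."""
    return base64.b64encode(data).decode("ascii")


def image_block(data: bytes, media_type: str = "image/png") -> dict:
    """Build an image content block in the Anthropic Messages API shape."""
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": encode_image(data),
        },
    }
```

Resizing before encoding matters because base64 inflates the payload by roughly a third, so a smaller image reduces both upload size and the chance of hitting a provider limit.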

TrustedSkills Verification

Unlike other registries that point to live repositories, TrustedSkills pins every skill to a verified commit hash. This protects you from malicious updates — what you install today is exactly what was reviewed and verified.

Security Audits

Gen Agent Trust Hub	Pass
Socket	Pass
Snyk	Pass

Details

Version: vlatest
License
Author: yonatangross
Installs: 42

Passed automated security scans.