AI Multimodal

🌐 Community
by mrgoonie · vlatest · Repository

Analyzes images, audio, and text together to provide comprehensive insights and contextually relevant responses.

Install on your platform

1. Run in terminal (recommended)

claude mcp add ai-multimodal npx -- -y @trustedskills/ai-multimodal

2. Or manually add to ~/.claude/settings.json

{
  "mcpServers": {
    "ai-multimodal": {
      "command": "npx",
      "args": [
        "-y",
        "@trustedskills/ai-multimodal"
      ]
    }
  }
}

Requires Claude Code (claude CLI). Run claude --version to verify your install.
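
If the settings file already holds other keys, the manual step amounts to merging one entry into `mcpServers` rather than overwriting the file. The following is a minimal sketch of that merge, assuming the file is valid JSON; the helper name `add_mcp_server` is hypothetical and not part of the skill or of the claude CLI.

```python
import json
from pathlib import Path


def add_mcp_server(settings_path: Path) -> dict:
    """Merge the ai-multimodal entry into a settings file, keeping other keys.

    Hypothetical helper for illustration; it is not shipped with the skill.
    """
    # Start from the existing settings if the file is present, else from empty.
    if settings_path.exists():
        settings = json.loads(settings_path.read_text())
    else:
        settings = {}

    # Reuse the existing "mcpServers" mapping or create it.
    servers = settings.setdefault("mcpServers", {})
    servers["ai-multimodal"] = {
        "command": "npx",
        "args": ["-y", "@trustedskills/ai-multimodal"],
    }

    # Write the merged result back, creating the parent directory if needed.
    settings_path.parent.mkdir(parents=True, exist_ok=True)
    settings_path.write_text(json.dumps(settings, indent=2) + "\n")
    return settings
```

Calling `add_mcp_server(Path.home() / ".claude" / "settings.json")` would apply the entry shown above; back up the file first if it contains settings you care about.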

About This Skill

What it does

This skill lets AI agents analyze audio, images, videos, and documents, and generate new images, using Google Gemini's multimodal API. It provides a unified interface for understanding several media types and responding based on their combined context. Supported tasks include transcription, summarization, object detection, visual question answering, document extraction, and text-to-image generation.

When to use it

  • Analyze YouTube videos to understand the content and key events.
  • Extract data from PDFs, including tables, charts, and diagrams.
  • Generate images based on textual descriptions, with control over style and aspect ratio.
  • Summarize long audio recordings or video files.
  • Compare multiple images to identify similarities or differences.

Key capabilities

  • Audio Processing: Transcription with timestamps (up to 9.5 hours of audio), summarization, speech understanding, speaker identification, text-to-speech generation.
  • Image Understanding: Captioning, object detection, pixel-level segmentation, visual question answering, OCR and text extraction.
  • Video Analysis: Scene detection, video Q&A, transcription with visual descriptions, YouTube URL support.
  • Document Extraction: PDF processing (up to 1,000 pages), table/form extraction, chart analysis, structured data output (JSON schema).
  • Image Generation: Text-to-image generation, image editing, multi-image composition, iterative refinement.
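
This page does not show how the skill talks to Gemini internally. As a rough illustration of what a multimodal request can look like, the sketch below builds a request body in the shape used by the Gemini REST `generateContent` endpoint, with one text part and one inline media part. The function name `build_request` is hypothetical, and the body is only constructed here, not sent.

```python
import base64


def build_request(prompt: str, media: bytes, mime_type: str) -> dict:
    """Build a generateContent-style body: one text part plus one inline media part.

    Hypothetical helper for illustration; the skill's own request code may differ.
    """
    return {
        "contents": [
            {
                "parts": [
                    {"text": prompt},
                    {
                        "inline_data": {
                            "mime_type": mime_type,
                            # Inline media is carried as base64-encoded text.
                            "data": base64.b64encode(media).decode("ascii"),
                        }
                    },
                ]
            }
        ]
    }
```

Inline data suits small files; larger audio, video, or PDF inputs are usually uploaded separately and referenced from the request instead.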

Example prompts

  • "Summarize this YouTube video: [YouTube URL]"
  • "Extract the key information from this PDF document."
  • "Generate an image of a cat wearing a hat in a 16:9 aspect ratio."
  • "What objects are present in this image?" (accompanied by an uploaded image)

Tips & gotchas

  • The skill relies on Google Gemini's multimodal API, so latency and output quality may vary with the size and complexity of the input.
  • For best results, use the recommended "gemini-2.5-pro" model.
  • Document processing is limited to 1,000 pages.
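
The limits stated on this page (1,000 PDF pages, 9.5 hours of audio) can be checked before a request is made. The sketch below is one way to do that, assuming the caller already knows the page count and audio duration; `check_limits` and the constant names are hypothetical.

```python
MAX_PDF_PAGES = 1_000    # document limit stated on this page
MAX_AUDIO_HOURS = 9.5    # audio transcription limit stated on this page


def check_limits(pdf_pages: int = 0, audio_hours: float = 0.0) -> list[str]:
    """Return a list of problems; an empty list means the input is within limits.

    Hypothetical pre-flight check; the skill itself may enforce limits differently.
    """
    problems = []
    if pdf_pages > MAX_PDF_PAGES:
        problems.append(f"PDF has {pdf_pages} pages; limit is {MAX_PDF_PAGES}")
    if audio_hours > MAX_AUDIO_HOURS:
        problems.append(f"Audio is {audio_hours} h; limit is {MAX_AUDIO_HOURS} h")
    return problems
```

An input exactly at a limit passes; anything above it is reported so the file can be split before submission.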

TrustedSkills Verification

Unlike other registries that point to live repositories, TrustedSkills pins every skill to a verified commit hash. This protects you from malicious updates: what you install today is exactly what was reviewed and verified.

Security Audits

Gen Agent Trust Hub: Pass
Socket: Pass
Snyk: Pass

Details

Version: vlatest
License: not listed
Author: mrgoonie
Installs: 172

Passed automated security scans.