Ai Multimodal

Name: Ai Multimodal
Author: mrgoonie

🌐Community

Analyzes images, audio, and text together to provide comprehensive insights and contextually relevant responses.

Install on your platform

Run in terminal (recommended)

claude mcp add ai-multimodal npx -- -y @trustedskills/ai-multimodal

Or manually add to ~/.claude/settings.json

{
  "mcpServers": {
    "ai-multimodal": {
      "command": "npx",
      "args": [
        "-y",
        "@trustedskills/ai-multimodal"
      ]
    }
  }
}
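
If settings.json already contains other entries, the manual step amounts to merging one key into the existing "mcpServers" object rather than overwriting the file. A minimal Python sketch of that merge, assuming the file layout shown above; the helper names add_mcp_server and update_settings_file are hypothetical and not part of the Claude CLI:

```python
import json
from pathlib import Path

# Entry copied from the settings.json snippet above.
AI_MULTIMODAL_ENTRY = {
    "command": "npx",
    "args": ["-y", "@trustedskills/ai-multimodal"],
}


def add_mcp_server(settings: dict, name: str, entry: dict) -> dict:
    """Return a copy of `settings` with `entry` stored under mcpServers[name].

    Hypothetical helper: other servers and unrelated keys are preserved,
    and the input dict is not modified.
    """
    merged = dict(settings)
    servers = dict(merged.get("mcpServers", {}))
    servers[name] = entry
    merged["mcpServers"] = servers
    return merged


def update_settings_file(path: Path) -> None:
    """Read settings from `path` if it exists, merge the entry, write it back."""
    current = json.loads(path.read_text()) if path.exists() else {}
    updated = add_mcp_server(current, "ai-multimodal", AI_MULTIMODAL_ENTRY)
    path.write_text(json.dumps(updated, indent=2) + "\n")
```

Calling update_settings_file(Path.home() / ".claude" / "settings.json") would apply the change in place; backing up the file first is a sensible precaution.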

Requires Claude Code (claude CLI). Run claude --version to verify your install.

About This Skill

What it does

This skill lets AI agents analyze audio, images, videos, and documents, and generate new images, using Google Gemini's multimodal API. It provides a unified interface for understanding different media types and responding based on their combined context. Supported tasks include transcription, summarization, object detection, visual question answering, document extraction, and text-to-image generation.

When to use it

Analyze YouTube videos to understand the content and key events.
Extract data from PDFs, including tables, charts, and diagrams.
Generate images based on textual descriptions, with control over style and aspect ratio.
Summarize long audio recordings or video files.
Compare multiple images to identify similarities or differences.

Key capabilities

Audio Processing: Transcription with timestamps (audio up to 9.5 hours), summarization, speech understanding, speaker identification, text-to-speech generation.
Image Understanding: Captioning, object detection, pixel-level segmentation, visual question answering, OCR and text extraction.
Video Analysis: Scene detection, video Q&A, transcription with visual descriptions, YouTube URL support.
Document Extraction: PDF processing (up to 1,000 pages), table/form extraction, chart analysis, structured data output (JSON schema).
Image Generation: Text-to-image generation, image editing, multi-image composition, iterative refinement.

Example prompts

"Summarize this YouTube video: [YouTube URL]"
"Extract the key information from this PDF document."
"Generate an image of a cat wearing a hat in a 16:9 aspect ratio."
"What objects are present in this image?" (accompanied by an uploaded image)
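
How the skill packages a prompt together with an uploaded image is not documented on this page. As a rough illustration, the sketch below builds a request body in the shape the Gemini REST generateContent endpoint accepts, with one text part and one base64-encoded inline image part. The function name build_image_question is hypothetical, and the assumption that the skill sends anything resembling this body is exactly that, an assumption:

```python
import base64
import json


def build_image_question(prompt: str, image_bytes: bytes,
                         mime_type: str = "image/png") -> dict:
    """Build a request body with one text part and one inline image part."""
    return {
        "contents": [
            {
                "parts": [
                    {"text": prompt},
                    {
                        "inline_data": {
                            "mime_type": mime_type,
                            # Binary data travels base64-encoded inside JSON.
                            "data": base64.b64encode(image_bytes).decode("ascii"),
                        }
                    },
                ]
            }
        ]
    }


# Placeholder bytes stand in for a real image file read from disk.
body = build_image_question("What objects are present in this image?",
                            b"placeholder image bytes")
payload = json.dumps(body)  # the body is plain JSON-serializable data
```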

Tips & gotchas

The skill relies on Google Gemini's multimodal API, so performance may vary with the complexity of the input.
For best results, use the recommended "gemini-2.5-pro" model.
Document processing is limited to 1,000 pages.
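
The limits listed on this page (1,000 pages per document, 9.5 hours of audio) can be checked before a file is submitted. The sketch below is a hypothetical pre-flight helper, not part of the skill; the function names and the idea of splitting an oversized PDF into page ranges are assumptions for illustration:

```python
MAX_PDF_PAGES = 1_000    # document limit stated on this page
MAX_AUDIO_HOURS = 9.5    # audio limit stated under key capabilities


def check_limits(pdf_pages: int = 0, audio_seconds: float = 0.0) -> list:
    """Return a list of human-readable problems; empty means within limits."""
    problems = []
    if pdf_pages > MAX_PDF_PAGES:
        problems.append(f"PDF has {pdf_pages} pages; limit is {MAX_PDF_PAGES}")
    if audio_seconds > MAX_AUDIO_HOURS * 3600:
        problems.append(
            f"audio is {audio_seconds / 3600:.2f} h; limit is {MAX_AUDIO_HOURS} h"
        )
    return problems


def split_pages(total_pages: int, chunk: int = MAX_PDF_PAGES) -> list:
    """Split a page count into 1-based inclusive (start, end) ranges."""
    return [
        (start, min(start + chunk - 1, total_pages))
        for start in range(1, total_pages + 1, chunk)
    ]
```

A document over the limit could then be processed one range at a time, for example the three ranges returned by split_pages(2500).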

TrustedSkills Verification

Unlike other registries that point to live repositories, TrustedSkills pins every skill to a verified commit hash. This protects you from malicious updates — what you install today is exactly what was reviewed and verified.
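
The difference between a pinned and a live reference can be shown in a few lines. This page does not describe how TrustedSkills implements pinning, so the sketch below only illustrates the general idea: a full Git commit hash names one immutable snapshot, while a branch or tag name can later point somewhere else. The function name is_pinned is hypothetical, and the check assumes SHA-1 repositories (40 hexadecimal characters):

```python
import re

# A full SHA-1 commit hash is exactly 40 lowercase hexadecimal characters.
FULL_SHA1 = re.compile(r"[0-9a-f]{40}")


def is_pinned(ref: str) -> bool:
    """Return True if `ref` is a full commit hash rather than a movable name."""
    return FULL_SHA1.fullmatch(ref) is not None
```

Under this check, names such as "main" or "v1.2.0" and abbreviated hashes are all treated as unpinned, because none of them is guaranteed to keep resolving to the same commit.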

Security Audits

Gen Agent Trust Hub	Pass
Socket	Pass
Snyk	Pass

Details

Version: latest
License
Installs: 172

Passed automated security scans.