Run a Multimodal Model Locally with Ollama

This example shows how to run a multimodal large language model (LLM) locally with Ollama and use it from Python to describe an image.

The model used is Qwen 2.5 VL (Vision Language).

Install it with Ollama (ollama run pulls the model on first use and starts an interactive session, which you can exit right away):

$ ollama run qwen2.5vl:latest

The image is a regular JPEG file stored locally.
We assume it sits in the same directory as the script and is named “duck.jpeg”.

Ensure the Ollama Python library is installed:

$ pip install ollama
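
Optionally, you can confirm that the model has been pulled by listing what the local Ollama server has available. This is a minimal sketch assuming a recent version of the library; older versions return plain dictionaries with slightly different field names:

import ollama

# Print the names of all models available to the local Ollama server.
# Expect "qwen2.5vl:latest" to appear in this list after the pull above.
for m in ollama.list().models:
    print(m.model)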

We call the model with the chat() function.

The response contains the generated description in its “message” object, specifically in the “content” field.

Complete script:

from ollama import chat

# Send the image to the model together with a text prompt.
response = chat(
    model='qwen2.5vl',
    messages=[
        {
            'role': 'user',
            'content': 'Describe this image',
            'images': ['./duck.jpeg'],  # local path to the image file
        }
    ]
)

# The generated description is in the "content" field of the "message" object.
result_description = response['message']['content']

print(result_description)

Example output:

The image shows a yellow rubber duck floating in a pool of water. The rubber duck is wearing black sunglasses and has a red beak. The water around the duck is clear and blue, with gentle ripples reflecting the duck and its sunglasses. The overall scene conveys a playful and summery atmosphere, often associated with leisure and fun in a pool setting.
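
If you prefer to print the description as it is generated instead of waiting for the full reply, the chat() function also accepts a stream=True argument and then yields partial chunks. A minimal sketch, using the same message structure as above:

from ollama import chat

# Stream the answer chunk by chunk instead of waiting for the full response.
stream = chat(
    model='qwen2.5vl',
    messages=[
        {
            'role': 'user',
            'content': 'Describe this image',
            'images': ['./duck.jpeg'],
        }
    ],
    stream=True,
)

for chunk in stream:
    # Each chunk carries a partial piece of the description.
    print(chunk['message']['content'], end='', flush=True)
print()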

 
