This example shows how to run a multimodal large language model (LLM) locally with Ollama and use it from Python to describe an image.
The model used is Qwen 2.5 VL (Vision Language).
Install it with Ollama:
$ ollama run qwen2.5vl:latest
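Once the download completes, you can optionally confirm the model is available locally:
$ ollama list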
The image file is a regular JPEG file stored locally.
We assume the file is stored in the same directory as the script and is named “duck.jpeg”.
Make sure the Ollama Python library is installed:
$ pip install ollama
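If the installation succeeded, importing the package from the command line should work without errors (a quick optional check):
$ python -c "import ollama"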
We call the model with the chat() function.
The response object contains the generated description in its “message” object, under the “content” field.
Complete script:
from ollama import chat

# Send the prompt and the local image to the model
response = chat(
    model='qwen2.5vl',
    messages=[
        {
            'role': 'user',
            'content': 'Describe this image',
            'images': ['./duck.jpeg'],
        }
    ],
)

# The generated description is in the 'content' field of the 'message' object
result_description = response['message']['content']
print(result_description)
Example output:
The image shows a yellow rubber duck floating in a pool of water. The rubber duck is wearing black sunglasses and has a red beak. The water around the duck is clear and blue, with gentle ripples reflecting the duck and its sunglasses. The overall scene conveys a playful and summery atmosphere, often associated with leisure and fun in a pool setting.
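For longer descriptions you may prefer to print the text as it is generated rather than waiting for the full response. The chat() function also supports streaming; a minimal sketch, assuming the same model and image file as above:

from ollama import chat

# stream=True yields partial responses as they are generated
stream = chat(
    model='qwen2.5vl',
    messages=[
        {
            'role': 'user',
            'content': 'Describe this image',
            'images': ['./duck.jpeg'],
        }
    ],
    stream=True,
)

# Print each chunk of the description as it arrives
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
print()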
