
Generate image descriptions with Llama 3.2 Vision
The MAX framework simplifies the process of creating an endpoint for multimodal models that handle both text and images, such as Llama 3.2 11B Vision Instruct, which excels at tasks like image captioning and visual question answering. This tutorial walks you through installing the necessary tools, configuring access, and serving the model locally behind an OpenAI-compatible endpoint.
System requirements: Mac, Linux, WSL, GPU
Set up your environment
Create a Python project to install our APIs and CLI tools:
Use one of the following package managers: pixi, uv, pip, or conda.

pixi

- If you don't have it, install pixi:

  curl -fsSL https://pixi.sh/install.sh | sh

  Then restart your terminal for the changes to take effect.
- Create a project:

  pixi init vision-tutorial \
    -c https://conda.modular.com/max-nightly/ -c conda-forge \
    && cd vision-tutorial
- Install the modular conda package (nightly or stable):

  Nightly:
  pixi add modular

  Stable:
  pixi add "modular=25.4"
- Start the virtual environment:

  pixi shell

uv

- If you don't have it, install uv:

  curl -LsSf https://astral.sh/uv/install.sh | sh

  Then restart your terminal to make uv accessible.
- Create a project:

  uv init vision-tutorial && cd vision-tutorial
- Create and start a virtual environment:

  uv venv && source .venv/bin/activate
- Install the modular Python package (nightly or stable):

  Nightly:
  uv pip install modular \
    --extra-index-url https://download.pytorch.org/whl/cpu \
    --index-url https://dl.modular.com/public/nightly/python/simple/ \
    --index-strategy unsafe-best-match --prerelease allow

  Stable:
  uv pip install modular \
    --extra-index-url https://download.pytorch.org/whl/cpu \
    --extra-index-url https://modular.gateway.scarf.sh/simple/ \
    --index-strategy unsafe-best-match

pip

- Create a project folder:

  mkdir vision-tutorial && cd vision-tutorial
- Create and activate a virtual environment:

  python3 -m venv .venv/vision-tutorial \
    && source .venv/vision-tutorial/bin/activate
- Install the modular Python package (nightly or stable):

  Nightly:
  pip install --pre modular \
    --extra-index-url https://download.pytorch.org/whl/cpu \
    --index-url https://dl.modular.com/public/nightly/python/simple/

  Stable:
  pip install modular \
    --extra-index-url https://download.pytorch.org/whl/cpu \
    --extra-index-url https://modular.gateway.scarf.sh/simple/

conda

- If you don't have it, install conda. A common choice is with brew:

  brew install miniconda
- Initialize conda for shell interaction:

  conda init

  If you're on a Mac, instead use:

  conda init zsh

  Then restart your terminal for the changes to take effect.
- Create a project:

  conda create -n vision-tutorial
- Start the virtual environment:

  conda activate vision-tutorial
- Install the modular conda package (nightly or stable):

  Nightly:
  conda install -c conda-forge -c https://conda.modular.com/max-nightly/ modular

  Stable:
  conda install -c conda-forge -c https://conda.modular.com/max/ modular
Serve your model
To get the model used in this tutorial, you must have a Hugging Face user access token and approved access to the Llama 3.2 11B Vision Instruct Hugging Face repo.
To create a Hugging Face user access token, see Access Tokens. Within your local environment, save your access token as an environment variable:
export HF_TOKEN="hf_..."
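Before downloading the model weights, you can optionally confirm that the token is valid with the huggingface_hub Python package. This is a minimal sketch, not part of the official tutorial flow; it assumes huggingface_hub is installed in your environment (install it with your package manager if it isn't):

# Optional sanity check: verify that the token in HF_TOKEN is valid.
# Assumes the huggingface_hub package is available in your environment.
import os
from huggingface_hub import whoami

# whoami() raises an error if the token is missing or invalid.
info = whoami(token=os.environ["HF_TOKEN"])
print(f"Authenticated to Hugging Face as: {info['name']}")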
Use the max serve command to start a local model server with the Llama 3.2 Vision model:
max serve \
  --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
  --max-length 108172 \
  --max-batch-size 1
This starts a server running the Llama-3.2-11B-Vision-Instruct vision-language model on http://localhost:8000/v1/chat/completions, an OpenAI-compatible endpoint.
While this example uses the Llama 3.2 Vision model, you can replace it with any of the models listed on the MAX Builds site.
The endpoint is ready when you see this message printed in your terminal:
Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit)
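If you want to confirm from another terminal that the endpoint is reachable, one option is to query the model list route that OpenAI-compatible servers typically expose. This is a small sketch using only the Python standard library, and it assumes the server follows the standard /v1/models convention:

# Quick reachability check against the local OpenAI-compatible server.
# Assumes the standard /v1/models route is available on the endpoint.
import json
from urllib.request import urlopen

with urlopen("http://localhost:8000/v1/models") as response:
    models = json.load(response)

# Print the IDs of the models the server reports.
for model in models.get("data", []):
    print(model["id"])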
For a complete list of max CLI commands and options, refer to the MAX CLI reference.
Interact with your model
MAX supports OpenAI's REST APIs, so you can interact with the model using either the OpenAI Python SDK or curl:
You can use OpenAI's Python client to interact with the vision model. First, install the OpenAI Python package:
- pixi: pixi add openai
- uv: uv add openai
- pip: pip install openai
- conda: conda install openai
Then, create a client and make a request to the model:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is in this image?"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
                    }
                }
            ]
        }
    ],
    max_tokens=300
)
print(response.choices[0].message.content)
In this example, you're using the OpenAI Python client to interact with the MAX endpoint running on localhost port 8000. The client object is initialized with the base URL http://localhost:8000/v1, and the API key is ignored.
Save this code as generate-image-description.py, then run it. The model should respond with information about the image:

python generate-image-description.py
A rabbit is sitting in a field. It has long ears and a white belly. It is looking at the camera.
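If you want tokens to appear as they are generated rather than waiting for the full description, you can also request a streaming response. The following is a minimal sketch with the same OpenAI client; it assumes the endpoint supports the standard stream=True option of the chat completions API:

# Stream the description token by token instead of waiting for the full reply.
# Assumes the endpoint supports the standard OpenAI streaming option.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
                    },
                },
            ],
        }
    ],
    max_tokens=300,
    stream=True,
)

# Each chunk carries a small delta of the generated text.
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()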
You can also send requests to the local endpoint using curl. The following request includes an image URL and a question to answer about the provided image:
curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What is in this image?"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
            }
          }
        ]
      }
    ],
    "max_tokens": 300
  }' | grep -o '"content":"[^"]*"' | sed 's/"content":"//g' | sed 's/"//g' | tr -d '\n' | sed 's/\\n/\n/g'
This sends the image's URL along with a text prompt to the model, and you should receive a response similar to this:
A rabbit is sitting in a field. It has long ears and a white belly. It is looking at the camera.
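The examples above reference an image by URL, but you can also send a local image file by embedding it as a base64 data URL in the image_url field. The following is a minimal sketch with the OpenAI Python client; the file name my-image.jpg is a placeholder, and it assumes the endpoint accepts data URLs the way the OpenAI vision API does:

# Send a local image by embedding it as a base64 data URL.
# Assumes the endpoint accepts data URLs in the image_url field,
# as the OpenAI vision API does. Replace my-image.jpg with your own file.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("my-image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)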
For complete details on all available API endpoints and options, see the MAX Serve API documentation.
Next steps
Now that you have successfully set up MAX with an OpenAI-compatible endpoint, check out these other tutorials: