1. Usage Scenarios
Vision-Language Models (VLMs) are large language models capable of processing both visual (image) and textual (text) input modalities. With a VLM, you can submit images together with text, and the model will understand the image content and the surrounding context while following your instructions. For example:
- Visual Content Interpretation: The model can interpret and describe the information in an image, such as objects, text, spatial relationships, colors, and atmosphere.
- Multi-turn Conversations Combining Visual Content and Context.
- Partial Replacement of Traditional Machine Vision Models like OCR.
- Future Applications: With continuous improvements in model capabilities, VLMs can be applied to areas such as visual agents and robotics.
2. Usage Method
For VLM models, you can invoke the /chat/completions API by constructing a message containing either an image URL or a base64-encoded image. The detail parameter can be used to control how the image is preprocessed.
2.1 Explanation of Image Detail Control Parameters
SiliconFlow provides three options for the detail parameter: low, high, and auto.
For currently supported models, if detail is not specified or is set to high, the model will use the high (“high resolution”) mode. If set to low or auto, the model will use the low (“low resolution”) mode.
2.2 Example Formats for a message Containing Images
2.2.1 Using Image URLs
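The request below is a minimal Python sketch assuming SiliconFlow's OpenAI-compatible /chat/completions endpoint; the base URL, API key placeholder, and sample image URL are illustrative, and the model name is taken from the supported list in Section 3.

```python
import requests

# Assumed endpoint and model; replace with your own values.
url = "https://api.siliconflow.cn/v1/chat/completions"
headers = {"Authorization": "Bearer <YOUR_API_KEY>"}

payload = {
    "model": "Qwen/Qwen2-VL-72B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/sample.jpg",
                        "detail": "high"  # low / high / auto
                    }
                },
                {"type": "text", "text": "Describe this image."}
            ]
        }
    ]
}

response = requests.post(url, headers=headers, json=payload)
print(response.json()["choices"][0]["message"]["content"])
```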
2.2.2 Base64 Format
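A sketch of the same request with a locally read, base64-encoded image; the data-URL prefix follows the OpenAI-compatible convention, and the file name is illustrative. The resulting messages list is sent exactly as in the example above.

```python
import base64

with open("sample.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    # Embed the image as a data URL instead of a remote link.
                    "url": f"data:image/jpeg;base64,{image_b64}",
                    "detail": "low"
                }
            },
            {"type": "text", "text": "What text appears in this image?"}
        ]
    }
]
```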
2.2.3 Multiple Images, Each in Either Format
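Each image is its own content item, and URL and base64 entries can be mixed in one message; a sketch (file and URL names are illustrative, sent with the same request pattern as in 2.2.1):

```python
import base64

with open("second.png", "rb") as f:
    second_b64 = base64.b64encode(f.read()).decode("utf-8")

messages = [
    {
        "role": "user",
        "content": [
            # First image referenced by URL, second embedded as base64.
            {"type": "image_url",
             "image_url": {"url": "https://example.com/first.jpg"}},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{second_b64}"}},
            {"type": "text", "text": "Compare these two images."}
        ]
    }
]
```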
Please note that the DeepseekVL2 series models are suited to short contexts; it is recommended to input no more than 2 images. If more than 2 images are provided, the model automatically resizes them to 384x384 and ignores the specified detail parameter.
3. Supported Models
Currently supported VLM models:
- THUDM series:
- THUDM/GLM-4.1V-9B-Thinking
- Qwen Series:
- Qwen/Qwen2-VL-72B-Instruct
- DeepseekVL2 Series:
- deepseek-ai/deepseek-vl2
Note: The list of supported VLM models may change. Please filter by the "Visual" tag on the Models page to check the currently supported models.
4. Billing for Visual Input Content
For visual inputs such as images, the model converts them into tokens, which are combined with the text as part of the model's context, so visual inputs are billed as well. Different models convert visual content in different ways, as outlined below.
4.1 Qwen Series
Rules: Qwen supports a maximum pixel area of 3584 * 3584 = 12845056 and a minimum pixel area of 56 * 56 = 3136. Each image's height and width are first scaled to multiples of 28, i.e. (h * 28) * (w * 28). If the resulting area falls outside the minimum/maximum pixel range, the image is proportionally resized to fit within it.
- When detail=low, all images are resized to 448 * 448 and consume 256 tokens.
- When detail=high, the image is scaled proportionally: its dimensions are first rounded up to the nearest multiple of 28, then resized to fit within the pixel range (3136, 12845056), with both dimensions kept as multiples of 28.
- Images with dimensions 224 * 448, 1024 * 1024, and 3172 * 4096 each consume 256 tokens when detail=low.
- An image with dimensions 224 * 448 consumes (224/28) * (448/28) = 8 * 16 = 128 tokens when detail=high.
- An image with dimensions 1024 * 1024 is rounded to 1036 * 1036 and consumes (1036/28) * (1036/28) = 1369 tokens when detail=high.
- An image with dimensions 3172 * 4096 is resized to 3136 * 4060 and consumes (3136/28) * (4060/28) = 16240 tokens when detail=high.
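As a rough illustration of this accounting, here is a Python sketch (not the model's actual preprocessing code) that rounds each side to the nearest multiple of 28 and then rescales proportionally to stay within the pixel range; it reproduces the Qwen examples above, though the exact rounding inside the model may differ in edge cases.

```python
import math

def patch_tokens(height: int, width: int, detail: str = "high",
                 patch: int = 28,
                 min_pixels: int = 56 * 56,              # 3136
                 max_pixels: int = 3584 * 3584) -> int:  # 12845056
    """Estimate visual tokens for models that split images into 28x28 patches."""
    if detail == "low":
        # Low-detail mode resizes every image to 448 * 448.
        return (448 // patch) * (448 // patch)  # 256

    # Round each side to a multiple of the patch size (at least one patch).
    h = max(patch, round(height / patch) * patch)
    w = max(patch, round(width / patch) * patch)

    # Rescale proportionally if the area falls outside the allowed range.
    if h * w > max_pixels:
        scale = math.sqrt(max_pixels / (h * w))
        h = math.floor(h * scale / patch) * patch
        w = math.floor(w * scale / patch) * patch
    elif h * w < min_pixels:
        scale = math.sqrt(min_pixels / (h * w))
        h = math.ceil(h * scale / patch) * patch
        w = math.ceil(w * scale / patch) * patch

    return (h // patch) * (w // patch)

print(patch_tokens(224, 448))    # 128
print(patch_tokens(1024, 1024))  # 1369
print(patch_tokens(3172, 4096))  # 16240
```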
4.2 DeepseekVL2 Series
Rules: DeepseekVL2 processes each image in two parts: a global_view and a local_view. The global_view resizes the original image to 384*384 pixels, while the local_view divides the image into multiple 384*384 blocks. Additional tokens are added to connect the blocks, based on the width.
- When detail=low, all images are resized to 384*384 pixels.
- When detail=high, images are resized to dimensions that are multiples of 384 (OpenAI uses 512), i.e. (h*384) * (w*384) with 1 <= h*w <= 9.
- The scaling dimensions (h, w) are chosen according to the following rules:
  - Both h and w are integers; traverse all combinations of (h, w) satisfying 1 <= h*w <= 9.
  - For each candidate, resize the image to (h*384, w*384) pixels and compare it with the original image. Take the minimum of the new image's pixel count and the original image's pixel count as the effective pixel value, and the difference between the original pixel count and the effective pixel value as the invalid pixel value. If the effective pixel value exceeds the best value found so far, or if it is equal but the invalid pixel value is smaller, choose the current (h*384, w*384) combination.
- Token consumption is calculated as: (h*w + 1) * 196 + (w+1) * 14 + 1 tokens.
- Images with dimensions 224 x 448, 1024 x 1024, and 2048 x 4096 each consume 421 tokens when detail=low.
- An image with dimensions 384 x 768, with detail=high, has an aspect ratio of 1:2 and is kept at 384 x 768. At this point h=1, w=2, consuming (1*2 + 1) * 196 + (2+1) * 14 + 1 = 631 tokens.
- An image with dimensions 1024 x 1024, with detail=high, is resized to 1152*1152 (h=w=3), consuming (3*3 + 1) * 196 + (3+1) * 14 + 1 = 2017 tokens.
- An image with dimensions 2048 x 4096, with detail=high, has an aspect ratio of 1:2 and is resized to 768*1536 (h=2, w=4), consuming (2*4 + 1) * 196 + (4+1) * 14 + 1 = 1835 tokens.
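A minimal Python sketch of one way to implement this grid selection and token formula; it reproduces the examples above. Note the tie-break here minimizes unused canvas area (a slight variation on the wording above), and the model's actual preprocessing may differ in detail.

```python
def best_grid(h_px: int, w_px: int, max_blocks: int = 9, block: int = 384):
    """Pick the (h, w) block grid: maximize effective pixels, then minimize waste."""
    best, best_eff, best_waste = (1, 1), -1, float("inf")
    for h in range(1, max_blocks + 1):
        for w in range(1, max_blocks + 1):
            if h * w > max_blocks:
                continue
            # Fit the original image into the (h*384, w*384) canvas, keeping aspect ratio.
            scale = min(h * block / h_px, w * block / w_px)
            fit_h = min(int(h_px * scale), h * block)
            fit_w = min(int(w_px * scale), w * block)
            effective = min(fit_h * fit_w, h_px * w_px)
            waste = h * block * w * block - effective
            if effective > best_eff or (effective == best_eff and waste < best_waste):
                best, best_eff, best_waste = (h, w), effective, waste
    return best

def deepseek_vl2_tokens(h_px: int, w_px: int, detail: str = "high") -> int:
    if detail == "low":
        h, w = 1, 1                      # everything is resized to 384*384
    else:
        h, w = best_grid(h_px, w_px)
    return (h * w + 1) * 196 + (w + 1) * 14 + 1

print(deepseek_vl2_tokens(224, 448, "low"))    # 421
print(deepseek_vl2_tokens(384, 768))           # 631
print(deepseek_vl2_tokens(1024, 1024))         # 2017
print(deepseek_vl2_tokens(2048, 4096))         # 1835
```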
4.3 GLM-4.1V-9B-Thinking
Rules: GLM-4.1V uses a minimum pixel unit of 28 * 28, scaling image dimensions proportionally to the nearest integer multiple of 28 pixels. If the scaled pixel area is smaller than 112 * 112 = 12544 or larger than 4816894, the dimensions are adjusted proportionally to fit within that range while remaining multiples of 28.
- detail=low: All images are resized to 448*448 pixels, resulting in 256 tokens.
- detail=high: Images are scaled proportionally by first rounding the dimensions to the nearest 28-pixel multiple, then adjusting to fit within the pixel range (12544, 4816894) while keeping both dimensions multiples of 28.
- Images with dimensions 224 x 448, 1024 x 1024, and 3172 x 4096 each consume 256 tokens with detail=low.
- 224 x 448: with detail=high, the dimensions are already within range and multiples of 28, so tokens = (224//28) * (448//28) = 8 * 16 = 128 tokens.
- 1024 x 1024: with detail=high, the dimensions are rounded to 1036*1036 (within range), so tokens = (1036//28) * (1036//28) = 1369 tokens.
- 3172 x 4096: with detail=high, the dimensions are rounded to 3192 x 4088 (exceeding the maximum), then scaled proportionally to 1932 x 2464, so tokens = (1932//28) * (2464//28) = 6072 tokens.
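The patch_tokens sketch from Section 4.1 can be reused for GLM-4.1V by swapping in its pixel range; it reproduces the first two high-detail examples, though the exact rounding for very large images (such as the 3172 x 4096 case) may differ slightly from the worked example above.

```python
# GLM-4.1V uses the same 28-pixel patches but a different pixel range.
print(patch_tokens(224, 448,   min_pixels=112 * 112, max_pixels=4816894))  # 128
print(patch_tokens(1024, 1024, min_pixels=112 * 112, max_pixels=4816894))  # 1369
print(patch_tokens(224, 448, detail="low"))                                # 256
```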