CLIP

CLIP (Contrastive Language-Image Pre-training) is a general-purpose multimodal pre-trained model developed by OpenAI. Trained with contrastive learning on hundreds of millions of image-text pairs collected from the Internet, it removes the dependence of traditional vision models on a fixed set of manually labeled categories and lets a model relate images to natural-language descriptions directly.
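To make the image-to-text matching concrete, here is a minimal sketch of how a CLIP-style model can score one image against several candidate captions: take the cosine similarity between the image embedding and each text embedding, scale it, and apply a softmax. The three-dimensional embeddings and the function names are hypothetical and chosen for illustration only. The scale of 100 is an assumption that mirrors the temperature commonly used with the released CLIP weights. Real embeddings would come from the visual and text encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_probs(image_emb, text_embs, scale=100.0):
    """Turn image-to-caption similarities into a probability per caption."""
    logits = [scale * cosine(image_emb, t) for t in text_embs]
    return softmax(logits)

# Toy embeddings (hypothetical, not produced by a real encoder).
labels = ["a person", "a dog", "a bird"]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_emb = [0.1, 0.9, 0.2]

probs = zero_shot_probs(image_emb, text_embs)
best = labels[probs.index(max(probs))]
print(best, probs)
```

Because the captions are ordinary text, changing the candidate categories only means changing the strings that are encoded. No retraining of the model is involved.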

Key features: Strong cross-modal alignment and zero-shot transfer. Without task-specific fine-tuning, it can recognize object categories it was never explicitly trained to classify. It is widely used for semantic image-text retrieval and automatic prompt generation, and its text encoder serves as the core text encoder in generative models such as Stable Diffusion.
Version notes: This example uses the CLIP-ViT-B/32 model, a baseline that balances accuracy and deployment efficiency. It uses a Vision Transformer (ViT) as the visual backbone and splits each input image into 32x32-pixel patches. It keeps strong semantic alignment accuracy while having fewer parameters and faster inference than the larger CLIP variants, which makes it a common choice for real-world multimodal applications.
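As a rough illustration of what the "/32" in the model name means, the sketch below counts how many patches the visual backbone sees. It assumes the 224x224 input resolution of the original CLIP ViT-B/32 release and one extra class token. The function name is hypothetical, and the input size of the converted model in this example may differ.

```python
def vit_token_count(image_size=224, patch_size=32):
    """Return (number of patches, sequence length including the class token)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size   # patches along one edge
    patches = per_side * per_side         # patches in the whole image
    return patches, patches + 1           # +1 for the class token

patches, tokens = vit_token_count()
print(patches, tokens)  # 49 50
```

A larger patch size means a shorter token sequence, which is one reason ViT-B/32 is cheaper to run than variants with smaller patches.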

Environment setup

You need to set up the environment in advance.

Quick start

Download model files

O6 / O6N

cd ai_model_hub_25_Q3/models/Generative_AI/Image_to_Text/onnx_clip
wget https://www.modelscope.cn/models/cix/ai_model_hub_25_Q3/resolve/master/models/Generative_AI/Image_to_Text/onnx_clip/clip_txt.cix
wget https://www.modelscope.cn/models/cix/ai_model_hub_25_Q3/resolve/master/models/Generative_AI/Image_to_Text/onnx_clip/clip_visual.cix

Test the model

Note: Activate the virtual environment before running.

O6 / O6N

python3 inference_npu.py

Full conversion workflow

Download model files

Linux PC

cd ai_model_hub_25_Q3/models/Generative_AI/Image_to_Text/onnx_clip/model
wget https://www.modelscope.cn/models/cix/ai_model_hub_25_Q3/resolve/master/models/Generative_AI/Image_to_Text/onnx_clip/model/clip_text_model_vitb32.onnx
wget https://www.modelscope.cn/models/cix/ai_model_hub_25_Q3/resolve/master/models/Generative_AI/Image_to_Text/onnx_clip/model/clip_visual.onnx

Project structure

├── cfg
├── clip_visual.cix
├── clip_txt.cix
├── datasets
├── inference_npu.py
├── inference_onnx.py
├── model
├── ReadMe.md
└── test_data

Quantize and convert the model

Convert the image module

Linux PC

cd ..
cixbuild cfg/clip_visualbuild.cfg

Convert the text module

Linux PC

cixbuild cfg/clip_text_model_vitb32build.cfg

Copy to device

After conversion, copy the .cix model files to the device.
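Before copying, it can help to confirm that both converted files are actually present. The following is a minimal sketch, assuming the build writes clip_txt.cix and clip_visual.cix to the project root as shown in the project structure. The helper name is hypothetical, and the output location depends on the settings in the cfg files.

```python
from pathlib import Path

# File names taken from the project structure; adjust if your cfg differs.
EXPECTED_MODELS = ("clip_txt.cix", "clip_visual.cix")

def missing_models(project_dir="."):
    """Return the expected .cix files that are not present in project_dir."""
    root = Path(project_dir)
    return [name for name in EXPECTED_MODELS if not (root / name).is_file()]

missing = missing_models()
if missing:
    print("Missing model files:", ", ".join(missing))
else:
    print("Both .cix files are present and ready to copy.")
```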

Test inference on the host

Run the inference script

Linux PC

python3 inference_onnx.py

Inference output

Linux PC

$ python3 inference_onnx.py
[[0.03632354 0.96057177 0.00310465]]
test_data/000000464522.jpg, max similarity: a dog
[[0.03074941 0.00429748 0.9649532 ]]
test_data/000000032811.jpg, max similarity: a bird
[[0.8280978  0.08798673 0.08391542]]
test_data/000000010698.jpg, max similarity: a person
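Each bracketed row in the output above is a probability per candidate caption, and the reported caption is the one with the highest value. The sketch below reproduces that mapping. The caption order (person, dog, bird) is inferred from the printed results rather than read from inference_onnx.py, and the helper name is hypothetical.

```python
# Caption order inferred from the printed output (an assumption).
LABELS = ["a person", "a dog", "a bird"]

def top_label(probs, labels=LABELS):
    """Return (label, probability) for the highest entry of a 1xN output."""
    row = probs[0]
    best = max(range(len(row)), key=row.__getitem__)
    return labels[best], row[best]

# Probabilities copied from the host inference output.
onnx_outputs = {
    "test_data/000000464522.jpg": [[0.03632354, 0.96057177, 0.00310465]],
    "test_data/000000032811.jpg": [[0.03074941, 0.00429748, 0.9649532]],
    "test_data/000000010698.jpg": [[0.8280978, 0.08798673, 0.08391542]],
}

for path, probs in onnx_outputs.items():
    label, score = top_label(probs)
    print(f"{path}, max similarity: {label} ({score:.3f})")
```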

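Test images

The three test images are test_data/000000464522.jpg, test_data/000000032811.jpg, and test_data/000000010698.jpg.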

Deploy on NPU

Run the inference script

O6 / O6N

python3 inference_npu.py

Runtime output

O6 / O6N

$ python3 inference_npu.py
npu: noe_init_context success
npu: noe_load_graph success
Input tensor count is 1.
Output tensor count is 1.
npu: noe_create_job success
npu: noe_init_context success
npu: noe_load_graph success
Input tensor count is 1.
Output tensor count is 1.
npu: noe_create_job success
[[0.09763492 0.00929287 0.89307225]]
test_data/000000032811.jpg, max similarity: a bird
[[0.02777621 0.9682566  0.00396715]]
test_data/000000464522.jpg, max similarity: a dog
[[0.8495277  0.08247717 0.06799505]]
test_data/000000010698.jpg, max similarity: a person
npu: noe_clean_job success
npu: noe_unload_graph success
npu: noe_deinit_context success
npu: noe_clean_job success
npu: noe_unload_graph success
npu: noe_deinit_context success
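The NPU probabilities differ slightly from the host ONNX results, which is expected after quantization. A simple way to check the quantized model is to confirm that the top caption is unchanged and to look at the largest per-image difference. The sketch below does this with the values printed in the two runs. The function name is hypothetical, and what counts as an acceptable difference is left to the reader.

```python
# Probabilities copied from the host (ONNX) and device (NPU) outputs.
ONNX = {
    "000000464522.jpg": [0.03632354, 0.96057177, 0.00310465],
    "000000032811.jpg": [0.03074941, 0.00429748, 0.9649532],
    "000000010698.jpg": [0.8280978, 0.08798673, 0.08391542],
}
NPU = {
    "000000464522.jpg": [0.02777621, 0.9682566, 0.00396715],
    "000000032811.jpg": [0.09763492, 0.00929287, 0.89307225],
    "000000010698.jpg": [0.8495277, 0.08247717, 0.06799505],
}

def compare(ref, test):
    """Return (max absolute difference, whether the top index matches)."""
    max_diff = max(abs(a - b) for a, b in zip(ref, test))
    same_top = ref.index(max(ref)) == test.index(max(test))
    return max_diff, same_top

for name in ONNX:
    max_diff, same_top = compare(ONNX[name], NPU[name])
    print(f"{name}: max abs diff = {max_diff:.4f}, same top caption = {same_top}")
```

For these three images the top caption is the same in both runs, and the bird image shows the largest shift in probability.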

Test images

Same as above.
