CLIP

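CLIP is a general-purpose multimodal pretrained model developed by OpenAI. It is trained with contrastive learning on hundreds of millions of image-text pairs collected from the internet. This removes the reliance of traditional vision models on manually labeled categories and lets the model relate visual content directly to natural language.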

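Key features: strong cross-modal alignment and zero-shot transfer, so the model can recognize object categories it was never explicitly trained on, without task-specific fine-tuning. It is widely used for semantic image-text retrieval and automatic prompt generation, and its text encoder is a core component of generative models such as Stable Diffusion.
Version: this example uses the CLIP-ViT-B/32 model. As the baseline version of the series that balances accuracy and deployment efficiency, it uses a Vision Transformer (ViT) as the visual backbone and splits each image into 32x32-pixel patches. It keeps good semantic alignment accuracy with a smaller parameter count and faster inference than the larger variants, which makes it a common choice for deploying multimodal applications.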
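The zero-shot behaviour described above can be sketched in a few lines of plain Python. This is only an illustration of the general CLIP scoring scheme (cosine similarity between the image embedding and each text embedding, scaled and passed through softmax), not the code in this example's inference scripts. The 3-dimensional embeddings are made-up toy values, and the `logit_scale` of 100 is an assumption based on the upper bound CLIP places on its learned temperature.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Return one probability per text prompt for a single image embedding."""
    img = l2_normalize(image_emb)
    logits = []
    for text_emb in text_embs:
        txt = l2_normalize(text_emb)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(logit_scale * cosine)  # logit_scale is an assumed value
    return softmax(logits)


# Toy embeddings, invented for illustration only (real ones come from the encoders).
labels = ["a person", "a dog", "a bird"]
text_embs = [[1.0, 0.1, 0.0], [0.1, 1.0, 0.0], [0.0, 0.1, 1.0]]
image_emb = [0.2, 0.9, 0.1]

probs = zero_shot_probs(image_emb, text_embs)
best = labels[probs.index(max(probs))]
print(best, probs)
```

Because the class names enter only as text prompts, changing the candidate categories means changing the `labels` list and re-encoding the prompts; no retraining is involved.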

Environment setup

Set up the required environment before you begin.

Quick start

Download the model files

O6 / O6N

cd ai_model_hub_25_Q3/models/Generative_AI/Image_to_Text/onnx_clip
wget -O clip_txt.cix https://www.modelscope.cn/models/cix/ai_model_hub_25_Q3/resolve/master/models/Generative_AI/Image_to_Text/onnx_clip/clip_txt.cix
wget -O clip_visual.cix https://www.modelscope.cn/models/cix/ai_model_hub_25_Q3/resolve/master/models/Generative_AI/Image_to_Text/onnx_clip/clip_visual.cix

Test the model

Info

Activate the virtual environment before running!

O6 / O6N

python3 inference_npu.py

Full conversion workflow

Download the model files

Linux PC

cd ai_model_hub_25_Q3/models/Generative_AI/Image_to_Text/onnx_clip/model
wget -O clip_text_model_vitb32.onnx https://www.modelscope.cn/models/cix/ai_model_hub_25_Q3/resolve/master/models/Generative_AI/Image_to_Text/onnx_clip/model/clip_text_model_vitb32.onnx
wget -O clip_visual.onnx https://www.modelscope.cn/models/cix/ai_model_hub_25_Q3/resolve/master/models/Generative_AI/Image_to_Text/onnx_clip/model/clip_visual.onnx

Project structure

├── cfg
├── clip_visual.cix
├── clip_txt.cix
├── datasets
├── inference_npu.py
├── inference_onnx.py
├── model
├── ReadMe.md
└── test_data

Quantize and convert the models

Convert the image module

Linux PC

cd ..
cixbuild cfg/clip_visualbuild.cfg

Convert the text module

Linux PC

cixbuild cfg/clip_text_model_vitb32build.cfg

Push to the board

After the conversion finishes, push the .cix model files to the board.
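Before copying anything to the board, it can help to confirm that both converted models are present. The sketch below is a hypothetical helper, not part of the repository. It assumes the converted files use the names shown in the project structure (`clip_visual.cix` and `clip_txt.cix`) and sit in the project root; where `cixbuild` actually writes its output depends on the cfg files.

```python
from pathlib import Path

# File names taken from the project structure listing; the location is an assumption.
EXPECTED_MODELS = ("clip_visual.cix", "clip_txt.cix")


def missing_models(project_dir, expected=EXPECTED_MODELS):
    """Return the expected model files that are not present in project_dir."""
    root = Path(project_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_models(".")
    if missing:
        print("Missing converted models:", ", ".join(missing))
    else:
        print("Both .cix files found; ready to copy to the board.")
```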

Test inference on the host

Run the inference script

Linux PC

python3 inference_onnx.py

Inference results

Linux PC

$ python3 inference_onnx.py
[[0.03632354 0.96057177 0.00310465]]
test_data/000000464522.jpg, max similarity: a dog
[[0.03074941 0.00429748 0.9649532 ]]
test_data/000000032811.jpg, max similarity: a bird
[[0.8280978  0.08798673 0.08391542]]
test_data/000000010698.jpg, max similarity: a person
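Each printed row holds one probability per candidate prompt, and the reported label is the prompt with the highest value. The sketch below reproduces that mapping from the numbers printed above. The label order `["a person", "a dog", "a bird"]` is inferred from those results rather than read from `inference_onnx.py`, so treat it as an assumption.

```python
# Label order inferred from the printed results, not taken from the script itself.
LABELS = ["a person", "a dog", "a bird"]

# Probabilities copied from the host (ONNX) output above.
host_results = {
    "test_data/000000464522.jpg": [0.03632354, 0.96057177, 0.00310465],
    "test_data/000000032811.jpg": [0.03074941, 0.00429748, 0.9649532],
    "test_data/000000010698.jpg": [0.8280978, 0.08798673, 0.08391542],
}


def top_label(probs, labels=LABELS):
    """Return the label with the highest probability and that probability."""
    best_index = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best_index], probs[best_index]


for path, probs in host_results.items():
    label, score = top_label(probs)
    print(f"{path}, max similarity: {label} ({score:.4f})")
```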

Test images

Deploy on the NPU

Run the inference script

O6 / O6N

python3 inference_npu.py

Run results

O6 / O6N

$ python3 inference_npu.py
npu: noe_init_context success
npu: noe_load_graph success
Input tensor count is 1.
Output tensor count is 1.
npu: noe_create_job success
npu: noe_init_context success
npu: noe_load_graph success
Input tensor count is 1.
Output tensor count is 1.
npu: noe_create_job success
[[0.09763492 0.00929287 0.89307225]]
test_data/000000032811.jpg, max similarity: a bird
[[0.02777621 0.9682566  0.00396715]]
test_data/000000464522.jpg, max similarity: a dog
[[0.8495277  0.08247717 0.06799505]]
test_data/000000010698.jpg, max similarity: a person
npu: noe_clean_job success
npu: noe_unload_graph success
npu: noe_deinit_context success
npu: noe_clean_job success
npu: noe_unload_graph success
npu: noe_deinit_context success
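The NPU probabilities differ slightly from the host ONNX run, which is expected after quantization. One simple way to check that the converted models still behave the same is to compare the top-1 prediction and the largest per-class difference for each image. The sketch below does this with the values printed in the two runs; the helper names are illustrative and not part of the repository.

```python
# Probabilities copied from the host (ONNX) and NPU outputs shown in this guide.
host = {
    "000000464522.jpg": [0.03632354, 0.96057177, 0.00310465],
    "000000032811.jpg": [0.03074941, 0.00429748, 0.9649532],
    "000000010698.jpg": [0.8280978, 0.08798673, 0.08391542],
}
npu = {
    "000000464522.jpg": [0.02777621, 0.9682566, 0.00396715],
    "000000032811.jpg": [0.09763492, 0.00929287, 0.89307225],
    "000000010698.jpg": [0.8495277, 0.08247717, 0.06799505],
}


def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=lambda i: values[i])


def max_abs_diff(a, b):
    """Largest absolute element-wise difference between two equal-length lists."""
    return max(abs(x - y) for x, y in zip(a, b))


report = {}
for name, host_probs in host.items():
    npu_probs = npu[name]
    same_top1 = argmax(host_probs) == argmax(npu_probs)
    report[name] = (same_top1, max_abs_diff(host_probs, npu_probs))
    print(f"{name}: top-1 match={same_top1}, max |diff|={report[name][1]:.4f}")
```

For these three images the top-1 prediction is the same on both backends. The largest deviation is on the bird image, where the NPU assigns about 0.07 less probability to the winning class.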

Test images

Same as above.
