CLIP
Environment Setup
info
Follow the RKNN Installation guide to set up the environment.
Follow the RKNN Model Zoo guide to download the example files.
Model Download
Download the ONNX model file.
X64 Linux PC
cd rknn_model_zoo/examples/clip/model/
bash download_model.sh
Model Conversion
Select the target platform and run the matching command.
rk3588
X64 Linux PC
export TARGET_PLATFORM=rk3588
rk356x
X64 Linux PC
export TARGET_PLATFORM=rk356x
rk3576
X64 Linux PC
export TARGET_PLATFORM=rk3576
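Every later command reads TARGET_PLATFORM, so a typo here only shows up as a confusing failure further on. The following is a minimal sketch of a pre-flight check, assuming the three platforms listed on this page are the only ones of interest. The names SUPPORTED_PLATFORMS and resolve_target_platform are hypothetical and are not part of RKNN Model Zoo.

```python
import os

# Platforms listed on this page; extend the tuple if your setup differs.
SUPPORTED_PLATFORMS = ("rk3588", "rk356x", "rk3576")


def resolve_target_platform(environ=None):
    """Return TARGET_PLATFORM from the environment, or raise if it is missing or unknown.

    Hypothetical helper for illustration only.
    """
    environ = os.environ if environ is None else environ
    platform = environ.get("TARGET_PLATFORM", "").strip()
    if platform not in SUPPORTED_PLATFORMS:
        raise ValueError(
            f"TARGET_PLATFORM={platform!r} is not one of {SUPPORTED_PLATFORMS}"
        )
    return platform


# Example with an explicit mapping instead of the real environment.
print(resolve_target_platform({"TARGET_PLATFORM": "rk3588"}))
```

Passing a plain dictionary keeps the check easy to try out; calling resolve_target_platform() with no argument reads the real shell environment instead.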
Convert the ONNX model to an RKNN model.
X64 Linux PC
cd ../python/images/
python convert.py ../../model/clip_images.onnx ${TARGET_PLATFORM}
cd ../text/
python convert.py ../../model/clip_text.onnx ${TARGET_PLATFORM}
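For orientation, the sketch below shows the usual RKNN-Toolkit2 conversion sequence (config, load_onnx, build, export_rknn). It is not the contents of convert.py. The helper names, the output path next to the ONNX file, and the choice to skip quantization are assumptions; the last one is inferred only from the FP16 tensors shown in the demo log on this page. The platform string is passed through unchanged, and which values the toolkit accepts is not checked here.

```python
import os


def default_rknn_path(onnx_path):
    """Swap the file extension for .rknn (hypothetical helper, assumed output location)."""
    stem, _ext = os.path.splitext(onnx_path)
    return stem + ".rknn"


def convert_onnx_to_rknn(onnx_path, platform, rknn_path=None):
    """Convert one ONNX file to an RKNN file without quantization (assumed settings)."""
    # Imported lazily so default_rknn_path works without the toolkit installed.
    from rknn.api import RKNN

    rknn_path = rknn_path or default_rknn_path(onnx_path)
    rknn = RKNN()
    try:
        rknn.config(target_platform=platform)
        # The toolkit calls below return 0 on success.
        if rknn.load_onnx(model=onnx_path) != 0:
            raise RuntimeError(f"failed to load {onnx_path}")
        # Assumption: keep floating-point weights instead of quantizing.
        if rknn.build(do_quantization=False) != 0:
            raise RuntimeError("failed to build the RKNN model")
        if rknn.export_rknn(rknn_path) != 0:
            raise RuntimeError(f"failed to export {rknn_path}")
    finally:
        # Free toolkit resources even when a step fails.
        rknn.release()
    return rknn_path


print(default_rknn_path("../../model/clip_images.onnx"))
```

Calling convert_onnx_to_rknn("../../model/clip_images.onnx", "rk3588") would need RKNN-Toolkit2 installed on the X64 Linux PC; the print line only exercises the path helper.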
C API
Build the Example
Go to the rknn_model_zoo root directory and run build-linux.sh to build the example.
X64 Linux PC
cd ../../../..
bash build-linux.sh -t ${TARGET_PLATFORM} -a aarch64 -d clip
Sync Files to the Device
Copy the rknn_clip_demo directory from the install folder to the device.
X64 Linux PC
cd install/${TARGET_PLATFORM}_linux_aarch64/
scp -r rknn_clip_demo/ user@your_device_ip:target_directory
Run the Example
Add the runtime libraries in ./lib to the LD_LIBRARY_PATH environment variable.
Device
cd rknn_clip_demo/
export LD_LIBRARY_PATH=./lib
Run the example.
Device
./rknn_clip_demo ./model/clip_images.rknn ./model/dog_224x224.jpg ./model/clip_text.rknn ./model/text.txt
$ ./rknn_clip_demo ./model/clip_images.rknn ./model/dog_224x224.jpg ./model/clip_text.rknn ./model/text.txt
--> init clip image model
model input num: 1, output num: 1
input tensors:
index=0, name=pixel_values, n_dims=4, dims=[1, 224, 224, 3], n_elems=150528, size=301056, fmt=NHWC, type=FP16, qnt_type=AFFINE, zp=0, scale=1.000000
output tensors:
index=0, name=image_embeds, n_dims=2, dims=[1, 512], n_elems=512, size=1024, fmt=UNDEFINED, type=FP16, qnt_type=AFFINE, zp=0, scale=1.000000
model is NHWC input fmt
input image height=224, input image width=224, input image channel=3
--> init clip text model
model input num: 1, output num: 1
input tensors:
index=0, name=input_ids, n_dims=2, dims=[1, 20], n_elems=20, size=160, fmt=UNDEFINED, type=INT64, qnt_type=AFFINE, zp=0, scale=1.000000
output tensors:
index=0, name=text_embeds, n_dims=2, dims=[1, 512], n_elems=512, size=1024, fmt=UNDEFINED, type=FP16, qnt_type=AFFINE, zp=0, scale=1.000000
model is UNDEFINED input fmt
input text batch size=1, input sequence length=20
origin size=224x224 crop size=224x224
input image: 224 x 224, subsampling: 4:2:0, colorspace: YCbCr, orientation: 1
num_lines=2
--> inference clip image model
rga_api version 1.10.1_[0]
rknn_run
--> inference clip text model
rknn_run
rknn_run
--> rknn clip demo result
images: ./model/dog_224x224.jpg
text : a photo of a dog
score : 0.989
Test Image
Python API
Activate the virtual environment
Device
conda activate rknn
Run the Example
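Copy the related files to the device and run the following command. TARGET_PLATFORM was exported on the PC, so set it on the device as well or pass the platform name directly (for example rk3588).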
Device
python clip.py --img_model ../model/clip_images.rknn --text_model ../model/clip_text.rknn --target ${TARGET_PLATFORM}
$ python clip.py --img_model ../model/clip_images.rknn --text_model ../model/clip_text.rknn --target rk3588
/home/radxa/miniforge3/envs/rknn/lib/python3.12/site-packages/rknn/api/rknn.py:51: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
self.rknn_base = RKNNBase(cur_path, verbose)
I rknn-toolkit2 version: 2.3.2
I target set by user is: rk3588
W inference: The 'data_format' is not set, and its default value is 'nhwc'!
W inference: The 'data_format' is not set, and its default value is 'nhwc'!
I rknn-toolkit2 version: 2.3.2
I target set by user is: rk3588
W inference: The 'data_format' is not set, and its default value is 'nhwc'!
--> rknn clip demo result:
images: ../model/dog_224x224.jpg
text : a photo of dog
score : 0.990
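Both demos print one score for the best-matching text line. The sketch below shows how such a score is commonly derived from the two 512-dimensional embeddings in the standard CLIP recipe: L2-normalize both embeddings, take the cosine similarity, scale it, and apply a softmax across the candidate texts. Whether the demos follow exactly this recipe is an assumption, as is the logit scale of 100. The function names and the toy 3-dimensional vectors are illustrative only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def clip_scores(image_embed, text_embeds, logit_scale=100.0):
    """Return softmax probabilities of one image against several texts.

    logit_scale=100.0 is an assumed value, not read from the demo.
    """
    img = l2_normalize(image_embed)
    logits = []
    for text_embed in text_embeds:
        txt = l2_normalize(text_embed)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(logit_scale * cosine)
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy embeddings standing in for the real 512-dimensional outputs.
image = [0.9, 0.1, 0.0]
texts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
scores = clip_scores(image, texts)
best = max(range(len(scores)), key=scores.__getitem__)
print("best text index:", best)
```

With two candidate texts, as in the num_lines=2 line of the C API log, the softmax yields two probabilities that sum to one, and the larger one plays the role of the reported score.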