Whisper

Environment Setup

info

Follow the RKNN Installation guide to set up the environment.

Follow the RKNN Model Zoo guide to download the example files.

Model Download

Download the ONNX model files (encoder and decoder).

X64 Linux PC

cd rknn_model_zoo/examples/whisper/model/
bash download_model.sh
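
Before moving on, it can help to confirm that the ONNX files used by the conversion step are actually present. The sketch below assumes the two file names that appear in the conversion commands of this guide; `missing_models` is a hypothetical helper written for illustration and is not part of the model zoo.

```python
from pathlib import Path

# File names taken from the conversion commands in this guide.
EXPECTED_ONNX = (
    "whisper_encoder_base_20s.onnx",
    "whisper_decoder_base_20s.onnx",
)


def missing_models(model_dir="."):
    """Return the expected ONNX files that are absent or empty in model_dir."""
    model_dir = Path(model_dir)
    missing = []
    for name in EXPECTED_ONNX:
        path = model_dir / name
        # Treat a zero-byte file as missing: it usually means a failed download.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Run from rknn_model_zoo/examples/whisper/model/ after download_model.sh.
    result = missing_models()
    print("missing:", result if result else "none")
```

An empty list means both files were found and are non-empty.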

Model Conversion

Select the target platform and run the matching command.

rk3588

X64 Linux PC

export TARGET_PLATFORM=rk3588

rk356x

X64 Linux PC

export TARGET_PLATFORM=rk356x

rk3576

X64 Linux PC

export TARGET_PLATFORM=rk3576
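
Every later command reads `${TARGET_PLATFORM}`, so a typo or an unset variable only shows up several steps later. The following sketch checks the variable up front. It assumes the three platform names listed in this guide are the only ones of interest; `resolve_target_platform` is a hypothetical helper, not a model zoo function.

```python
import os

# Platform names offered by this guide's platform selector.
SUPPORTED_PLATFORMS = ("rk3588", "rk356x", "rk3576")


def resolve_target_platform(env=None):
    """Return TARGET_PLATFORM from env, or raise if it is unset or unsupported."""
    env = os.environ if env is None else env
    value = env.get("TARGET_PLATFORM", "").strip()
    if value not in SUPPORTED_PLATFORMS:
        raise ValueError(
            f"TARGET_PLATFORM={value!r} is not one of {SUPPORTED_PLATFORMS}"
        )
    return value


if __name__ == "__main__":
    try:
        print("target platform:", resolve_target_platform())
    except ValueError as err:
        print("error:", err)
```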

Convert the ONNX model to an RKNN model.

X64 Linux PC

cd ../python/
python convert.py ../model/whisper_encoder_base_20s.onnx ${TARGET_PLATFORM}
python convert.py ../model/whisper_decoder_base_20s.onnx ${TARGET_PLATFORM}
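
For orientation, the sketch below shows the usual RKNN-Toolkit2 conversion flow (`config`, `load_onnx`, `build`, `export_rknn`). It is an assumption about what a script such as convert.py typically does, not a copy of it. The output naming rule and the choice to skip quantization are illustrative; the latter is only suggested by the FP16 tensors reported in the run logs of this guide.

```python
from pathlib import Path


def rknn_output_path(onnx_path):
    """Hypothetical naming rule: same directory and stem, with a .rknn suffix."""
    return Path(onnx_path).with_suffix(".rknn")


def convert_onnx_to_rknn(onnx_path, target_platform):
    """Convert one ONNX model to RKNN. Requires rknn-toolkit2 on the x64 Linux PC."""
    # Imported lazily so rknn_output_path() can be used without the toolkit.
    from rknn.api import RKNN

    rknn = RKNN(verbose=False)
    try:
        rknn.config(target_platform=target_platform)
        if rknn.load_onnx(model=str(onnx_path)) != 0:
            raise RuntimeError(f"load_onnx failed for {onnx_path}")
        # Assumption: no quantization, so the model keeps floating-point weights.
        if rknn.build(do_quantization=False) != 0:
            raise RuntimeError(f"build failed for {onnx_path}")
        out_path = rknn_output_path(onnx_path)
        if rknn.export_rknn(str(out_path)) != 0:
            raise RuntimeError(f"export_rknn failed for {out_path}")
        return out_path
    finally:
        # Always free toolkit resources, even when a step fails.
        rknn.release()
```

Calling `convert_onnx_to_rknn("../model/whisper_encoder_base_20s.onnx", "rk3588")` would, under these assumptions, write the RKNN file next to the ONNX file.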

C API

Build the Example

From the rknn_model_zoo root directory, run build-linux.sh to build the demo.

X64 Linux PC

cd ../../..
bash build-linux.sh -t ${TARGET_PLATFORM} -a aarch64 -d whisper

Sync Files to the Device

Copy the rknn_whisper_demo directory from the install folder to the device.

X64 Linux PC

cd install/${TARGET_PLATFORM}_linux_aarch64/
scp -r rknn_whisper_demo/ user@your_device_ip:target_directory

Run the Example

Add the bundled runtime libraries to the library search path (LD_LIBRARY_PATH).

Device

cd rknn_whisper_demo/
export LD_LIBRARY_PATH=./lib

Run the example.

Device

# Chinese audio
./rknn_whisper_demo ./model/whisper_encoder_base_20s.rknn ./model/whisper_decoder_base_20s.rknn zh ./model/test_zh.wav
# English audio
./rknn_whisper_demo ./model/whisper_encoder_base_20s.rknn ./model/whisper_decoder_base_20s.rknn en ./model/test_en.wav

Sample output for the Chinese audio:

$ ./rknn_whisper_demo ./model/whisper_encoder_base_20s.rknn ./model/whisper_decoder_base_20s.rknn zh ./model/test_zh.wav
-- read_audio & convert_channels & resample_audio use: 6.659000 ms
-- read_mel_filters & read_vocab use: 54.120998 ms
model input num: 1, output num: 1
input tensors:
  index=0, name=x, n_dims=3, dims=[1, 80, 2000], n_elems=160000, size=320000, fmt=UNDEFINED, type=FP16, qnt_type=AFFINE, zp=0, scale=1.000000
output tensors:
  index=0, name=out, n_dims=3, dims=[1, 1000, 512], n_elems=512000, size=1024000, fmt=UNDEFINED, type=FP16, qnt_type=AFFINE, zp=0, scale=1.000000
-- init_whisper_encoder_model use: 199.550995 ms
model input num: 2, output num: 1
input tensors:
  index=0, name=tokens, n_dims=2, dims=[1, 12], n_elems=12, size=96, fmt=UNDEFINED, type=INT64, qnt_type=AFFINE, zp=0, scale=1.000000
  index=1, name=audio, n_dims=3, dims=[1, 1000, 512], n_elems=512000, size=1024000, fmt=UNDEFINED, type=FP16, qnt_type=AFFINE, zp=0, scale=1.000000
output tensors:
  index=0, name=out, n_dims=3, dims=[1, 12, 51865], n_elems=622380, size=1244760, fmt=UNDEFINED, type=FP16, qnt_type=AFFINE, zp=0, scale=1.000000
-- init_whisper_decoder_model use: 282.627014 ms
-- inference_whisper_model use: 1656.614014 ms

Whisper output: He introduced me, and I want to say that if you are interested in my research

Real Time Factor (RTF): 1.657 / 5.611 = 0.295
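
The tensor lines in the log can be cross-checked by hand: `n_elems` is the product of `dims`, and `size` is `n_elems` multiplied by the element width (2 bytes for FP16, 8 bytes for INT64). The sketch below reproduces the values printed above; `tensor_size` is a hypothetical helper written for this illustration.

```python
from math import prod

# Element widths in bytes for the two data types that appear in the log.
BYTES_PER_ELEMENT = {"FP16": 2, "INT64": 8}


def tensor_size(dims, dtype):
    """Return (n_elems, size_in_bytes) for a tensor with the given dims and type."""
    n_elems = prod(dims)
    return n_elems, n_elems * BYTES_PER_ELEMENT[dtype]


# (name, dims, type) triples copied from the log above.
LOGGED_TENSORS = [
    ("encoder input x", [1, 80, 2000], "FP16"),
    ("encoder output", [1, 1000, 512], "FP16"),
    ("decoder input tokens", [1, 12], "INT64"),
    ("decoder output", [1, 12, 51865], "FP16"),
]

if __name__ == "__main__":
    for name, dims, dtype in LOGGED_TENSORS:
        n_elems, size = tensor_size(dims, dtype)
        print(f"{name}: n_elems={n_elems}, size={size}")
```

The 2000-frame encoder input is consistent with the 20 s window in the model name, assuming Whisper's usual 10 ms frame hop.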

Sample output for the English audio:

$ ./rknn_whisper_demo ./model/whisper_encoder_base_20s.rknn ./model/whisper_decoder_base_20s.rknn en ./model/test_en.wav
-- read_audio & convert_channels & resample_audio use: 2.198000 ms
-- read_mel_filters & read_vocab use: 60.438000 ms
model input num: 1, output num: 1
input tensors:
  index=0, name=x, n_dims=3, dims=[1, 80, 2000], n_elems=160000, size=320000, fmt=UNDEFINED, type=FP16, qnt_type=AFFINE, zp=0, scale=1.000000
output tensors:
  index=0, name=out, n_dims=3, dims=[1, 1000, 512], n_elems=512000, size=1024000, fmt=UNDEFINED, type=FP16, qnt_type=AFFINE, zp=0, scale=1.000000
-- init_whisper_encoder_model use: 121.598999 ms
model input num: 2, output num: 1
input tensors:
  index=0, name=tokens, n_dims=2, dims=[1, 12], n_elems=12, size=96, fmt=UNDEFINED, type=INT64, qnt_type=AFFINE, zp=0, scale=1.000000
  index=1, name=audio, n_dims=3, dims=[1, 1000, 512], n_elems=512000, size=1024000, fmt=UNDEFINED, type=FP16, qnt_type=AFFINE, zp=0, scale=1.000000
output tensors:
  index=0, name=out, n_dims=3, dims=[1, 12, 51865], n_elems=622380, size=1244760, fmt=UNDEFINED, type=FP16, qnt_type=AFFINE, zp=0, scale=1.000000
-- init_whisper_decoder_model use: 222.567993 ms
-- inference_whisper_model use: 1372.854980 ms

Whisper output:  Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.

Real Time Factor (RTF): 1.373 / 5.855 = 0.234
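
The Real Time Factor printed at the end of each run is the inference time divided by the audio duration, so values below 1 mean the clip is transcribed faster than it plays. A minimal sketch using the figures from the two logs above (`real_time_factor` is a name chosen for this illustration):

```python
def real_time_factor(inference_ms, audio_seconds):
    """Return inference time divided by audio duration (both in seconds)."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return (inference_ms / 1000.0) / audio_seconds


if __name__ == "__main__":
    # Inference times (ms) and audio durations (s) copied from the logs above.
    print(f"zh: {real_time_factor(1656.614014, 5.611):.3f}")  # 0.295
    print(f"en: {real_time_factor(1372.854980, 5.855):.3f}")  # 0.234
```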

Python API

Activate the Virtual Environment

Device

conda activate rknn

Run the Example

info

Install the soundfile dependency before running the example:

pip install soundfile

Copy the example's python and model directories (including the converted RKNN models) to the device, then run the following commands from the python directory.

Device

# Chinese audio
python whisper.py --encoder_model_path ../model/whisper_encoder_base_20s.rknn --decoder_model_path ../model/whisper_decoder_base_20s.rknn --task zh --audio_path ../model/test_zh.wav --target ${TARGET_PLATFORM}
# English audio
python whisper.py --encoder_model_path ../model/whisper_encoder_base_20s.rknn --decoder_model_path ../model/whisper_decoder_base_20s.rknn --task en --audio_path ../model/test_en.wav --target ${TARGET_PLATFORM}
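
To run both sample clips in one go, the two commands can be generated from a small helper. This is a convenience sketch, not part of the model zoo: `whisper_command` and `run_samples` are hypothetical helpers, and they assume the flags and relative paths shown in the commands above and that they are called from the directory containing whisper.py.

```python
import subprocess
import sys

# Relative model directory, as used in the commands above.
MODEL_DIR = "../model"


def whisper_command(task, target):
    """Build the whisper.py argument list for one language task ("zh" or "en")."""
    return [
        sys.executable, "whisper.py",
        "--encoder_model_path", f"{MODEL_DIR}/whisper_encoder_base_20s.rknn",
        "--decoder_model_path", f"{MODEL_DIR}/whisper_decoder_base_20s.rknn",
        "--task", task,
        "--audio_path", f"{MODEL_DIR}/test_{task}.wav",
        "--target", target,
    ]


def run_samples(target, tasks=("zh", "en")):
    """Run whisper.py once per task and stop at the first failure."""
    for task in tasks:
        subprocess.run(whisper_command(task, target), check=True)
```

For example, `run_samples("rk3588")` would run the Chinese clip first and then the English one.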

Sample output for the Chinese audio:

$ python whisper.py --encoder_model_path ../model/whisper_encoder_base_20s.rknn --decoder_model_path ../model/whisper_decoder_base_20s.rknn --task zh --audio_path ../model/test_zh.wav --target rk3588
2026-01-16 08:54:55.503119681 [W:onnxruntime:Default, device_discovery.cc:164 DiscoverDevicesForPlatform] GPU device discovery failed: device_discovery.cc:89 ReadFileContents Failed to open file: "/sys/class/drm/card1/device/vendor"
/home/radxa/miniforge3/envs/rknn/lib/python3.12/site-packages/rknn/api/rknn.py:51: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  self.rknn_base = RKNNBase(cur_path, verbose)
I rknn-toolkit2 version: 2.3.2
--> Loading model
done
--> Init runtime environment
I target set by user is: rk3588
done
I rknn-toolkit2 version: 2.3.2
--> Loading model
done
--> Init runtime environment
I target set by user is: rk3588
done
W inference: Inputs should be placed in a list, like [img1, img2], both the img1 and img2 are ndarray.
W inference: The 'data_format' is not set, and its default value is 'nhwc'!
W inference: The 'data_format' is not set, and its default value is 'nhwc'!
W inference: The 'data_format' is not set, and its default value is 'nhwc'!
W inference: The 'data_format' is not set, and its default value is 'nhwc'!

Whisper output: He introduced me, and I want to say that if you are interested in my research

Sample output for the English audio:

$ python whisper.py --encoder_model_path ../model/whisper_encoder_base_20s.rknn --decoder_model_path ../model/whisper_decoder_base_20s.rknn --task en --audio_path ../model/test_en.wav --target rk3588
2026-01-16 08:54:35.451693658 [W:onnxruntime:Default, device_discovery.cc:164 DiscoverDevicesForPlatform] GPU device discovery failed: device_discovery.cc:89 ReadFileContents Failed to open file: "/sys/class/drm/card1/device/vendor"
/home/radxa/miniforge3/envs/rknn/lib/python3.12/site-packages/rknn/api/rknn.py:51: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  self.rknn_base = RKNNBase(cur_path, verbose)
I rknn-toolkit2 version: 2.3.2
--> Loading model
done
--> Init runtime environment
I target set by user is: rk3588
done
I rknn-toolkit2 version: 2.3.2
--> Loading model
done
--> Init runtime environment
I target set by user is: rk3588
done
W inference: Inputs should be placed in a list, like [img1, img2], both the img1 and img2 are ndarray.
W inference: The 'data_format' is not set, and its default value is 'nhwc'!
W inference: The 'data_format' is not set, and its default value is 'nhwc'!
W inference: The 'data_format' is not set, and its default value is 'nhwc'!
W inference: The 'data_format' is not set, and its default value is 'nhwc'!

Whisper output:  Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.
