RKLLM Usage: Deploying LLMs
This document explains how to use RKLLM to deploy large language models in Hugging Face format to an RK3588 board and run hardware-accelerated inference on its NPU.
Currently Supported Models
- LLAMA models
- TinyLLAMA models
- Qwen models
- Phi models
- ChatGLM3-6B
- Gemma models
- InternLM2 models
- MiniCPM models
- TeleChat models
- Qwen2-VL
- MiniCPM-V
This guide uses Qwen2.5-1.5B-Instruct as an example to show how to deploy a large language model from scratch on a development board equipped with the RK3588 chip and use the NPU for hardware-accelerated inference.
Tip: If the RKLLM environment is not installed and configured, please refer to RKLLM Installation.
Model Conversion
The steps below use Qwen2.5-1.5B-Instruct; any model from the supported list above can be converted the same way.
- Download all files of Qwen2.5-1.5B-Instruct on an x86 PC workstation. If git-lfs is not installed, please install it.
git clone https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct
- Activate the rkllm conda environment. Refer to RKLLM conda Installation if needed.
conda activate rkllm
- Modify the model path, quantization dataset path, and rkllm export path in rknn-llm/rkllm-toolkit/examples/test.py. Three lines need changing (the numbers below are line numbers in test.py); a sketch of the full conversion flow follows this list.
15 modelpath = 'Your Huggingface LLM model'
29 dataset = None # defaults to "./data_quant.json"; set to None if you have no quantization dataset
83 ret = llm.export_rkllm("./Your_Huggingface_LLM_model.rkllm")
- Run the model conversion script. After a successful conversion, you will get an rkllm model.
cd rknn-llm/rkllm-toolkit/examples/
python3 test.py
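For orientation, here is a minimal sketch of what test.py does with the rkllm-toolkit Python API. The RKLLM call names follow the toolkit's example scripts, but exact parameter names and the quantization-sample schema vary between toolkit versions, so treat this as a sketch rather than a drop-in replacement for test.py.

```python
import json
from rkllm.api import RKLLM  # provided by the rkllm conda environment

# Optional quantization dataset: a JSON list of input/target pairs.
# This schema is an assumption based on the toolkit's example data --
# verify it against the data_quant.json shipped with your toolkit
# version, or pass dataset=None to skip calibration, as in the edit above.
samples = [
    {"input": "Human: What does an NPU accelerate?\nAssistant: ",
     "target": "An NPU accelerates neural-network inference workloads."},
]
with open("./data_quant.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)

modelpath = "Qwen2.5-1.5B-Instruct"  # the directory cloned earlier
llm = RKLLM()

# Load the Hugging Face checkpoint, quantize it for the RK3588 NPU,
# and export the .rkllm artifact; each call returns 0 on success.
if llm.load_huggingface(model=modelpath) != 0:
    raise SystemExit("model load failed")
if llm.build(do_quantization=True, quantized_dtype="w8a8",
             target_platform="rk3588", dataset="./data_quant.json") != 0:
    raise SystemExit("build failed")
if llm.export_rkllm("./Qwen2.5-1.5B-Instruct.rkllm") != 0:
    raise SystemExit("export failed")
```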
Compile Executable File
- Download the cross-compilation toolchain gcc-arm-10.2-2020.11-x86_64-aarch64-none-linux-gnu.
- Modify the main program rknn-llm/examples/rkllm_api_demo/src/llm_demo.cpp in two places (the numbers below are line numbers in llm_demo.cpp) so that the chat prompt template is applied to the user input; a sketch of the template these macros build follows this list.
184 text = PROMPT_TEXT_PREFIX + input_str + PROMPT_TEXT_POSTFIX;
185 // text = input_str;
- Modify GCC_COMPILER_PATH in the compilation script
rknn-llm/examples/rkllm_api_demo/build-linux.sh
GCC_COMPILER_PATH=gcc-arm-10.2-2020.11-x86_64-aarch64-none-linux-gnu/bin/aarch64-none-linux-gnu
- Run the compilation script.
cd rknn-llm/examples/rkllm_api_demo/
bash build-linux.sh
The generated executable is located at build/build_linux_aarch64_Release/llm_demo.
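The PROMPT_TEXT_PREFIX and PROMPT_TEXT_POSTFIX macros wrap each user input in the chat template the model was trained on. Below is a minimal sketch of the ChatML-style template used by Qwen-family models; the exact system prompt text is an assumption, so check the macro values in llm_demo.cpp for your model.

```python
# Sketch of the ChatML-style prompt Qwen-family models expect; the
# PROMPT_TEXT_PREFIX / PROMPT_TEXT_POSTFIX macros in llm_demo.cpp play
# the same role. The system prompt below is an assumed example.
PROMPT_TEXT_PREFIX = (
    "<|im_start|>system\n"
    "You are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n"
)
PROMPT_TEXT_POSTFIX = "<|im_end|>\n<|im_start|>assistant\n"

def build_prompt(input_str: str) -> str:
    # Mirrors line 184 of llm_demo.cpp: wrap the raw user input in the
    # model's chat template before handing it to the runtime.
    return PROMPT_TEXT_PREFIX + input_str + PROMPT_TEXT_POSTFIX

print(build_prompt("What can the RK3588 NPU do?"))
```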
Board Deployment
Terminal Mode
- Copy the converted rkllm model and the compiled binary file llm_demo to the board.
- Set the runtime library path.
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:rknn-llm/rkllm-runtime/Linux/librkllm_api/aarch64
- Run llm_demo and type exit to quit. Setting RKLLM_LOG_LEVEL=1 makes the runtime print inference statistics; the two numeric arguments set the maximum number of new tokens and the maximum context length. A wrapper sketch follows this list.
export RKLLM_LOG_LEVEL=1
./llm_demo your_rkllm_path 10000 10000
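If you prefer to drive the demo from a script on the board, the following sketch reproduces the two commands above from Python. The relative paths are assumptions; point them at wherever you copied the runtime library, the llm_demo binary, and the converted model.

```python
import os
import subprocess

# Reproduce the manual invocation: extend LD_LIBRARY_PATH so llm_demo
# can find the RKLLM runtime library, enable runtime statistics, then
# launch the demo.
env = os.environ.copy()
env["LD_LIBRARY_PATH"] = (
    env.get("LD_LIBRARY_PATH", "")
    + ":rknn-llm/rkllm-runtime/Linux/librkllm_api/aarch64"  # assumed location
)
env["RKLLM_LOG_LEVEL"] = "1"

# Same arguments as above: model path, max new tokens, max context length.
subprocess.run(
    ["./llm_demo", "Qwen2.5-1.5B-Instruct.rkllm", "10000", "10000"],
    env=env,
)
```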
Performance Comparison of Models
| Model | Parameter Size | Chip | Chip Count | Inference Speed |
|---|---|---|---|---|
| TinyLlama | 1.1B | RK3588 | 1 | 15.03 token/s |
| Qwen | 1.8B | RK3588 | 1 | 14.18 token/s |
| Phi3 | 3.8B | RK3588 | 1 | 6.46 token/s |
| ChatGLM3 | 6B | RK3588 | 1 | 3.67 token/s |
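The inference speed column is throughput: tokens generated divided by wall-clock generation time (llm_demo reports comparable statistics when RKLLM_LOG_LEVEL=1). A minimal sketch of the calculation, using a hypothetical generate function:

```python
import time

def tokens_per_second(generate, prompt: str) -> float:
    # 'generate' is a hypothetical function returning the list of
    # generated tokens; throughput = token count / elapsed seconds.
    start = time.perf_counter()
    tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    return len(tokens) / elapsed

# Example: 512 tokens generated in 34.1 s -> ~15.0 token/s,
# on the order of the TinyLlama row above.
```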