Deploying LLMs with RKLLM
This document explains how to use RKLLM to deploy large language models in Huggingface format to the RK3588 NPU for hardware-accelerated inference.
Currently Supported Models
- TinyLlama 1.1B
- Qwen 1.8B
- Qwen2 0.5B
- Phi-2 2.7B
- Phi-3 3.8B
- ChatGLM3 6B
- Gemma 2B
- InternLM2 1.8B
- MiniCPM 2B
This guide uses TinyLlama 1.1B as an example to show how to deploy a large language model from scratch on a development board equipped with the RK3588 chip, using the NPU for hardware-accelerated inference.
If the RKLLM environment is not installed and configured, please refer to RKLLM Installation.
Model Conversion
TinyLlama 1.1B is used as the example below; users can also choose any model from the currently supported models list above.
- Download all files of TinyLlama 1.1B on an x86 PC workstation. If git-lfs is not installed, please install it first.

  ```bash
  git clone https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0
  ```
- Activate the rkllm conda environment. Refer to RKLLM conda Installation if needed.

  ```bash
  conda activate rkllm
  ```
- Change the model path and the rkllm export path in rknn-llm/rkllm-toolkit/examples/huggingface/test.py.

  ```python
  modelpath = 'Your Huggingface LLM model'
  ret = llm.export_rkllm("./Your_Huggingface_LLM_model.rkllm")
  ```

- Run the model conversion script. After successful conversion, you will get an rkllm model. (A sketch of the conversion flow inside test.py is shown after these steps.)

  ```bash
  cd rknn-llm/rkllm-toolkit/examples/huggingface
  python3 test.py
  ```
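For reference, the conversion flow inside test.py roughly follows the rkllm-toolkit Python API sketched below. This is a minimal sketch, not the authoritative script: the build parameters (quantization type, optimization level, target platform) and exact argument names vary between toolkit versions, so follow the test.py shipped with your rknn-llm checkout.

```python
from rkllm.api import RKLLM

# Path to the Hugging Face model downloaded above
modelpath = './TinyLlama-1.1B-Chat-v1.0'

llm = RKLLM()

# Load the Hugging Face model
ret = llm.load_huggingface(model=modelpath)
assert ret == 0, 'load model failed'

# Build (quantize/optimize) for the RK3588 NPU.
# Parameter names here are assumptions based on the toolkit examples;
# check the test.py in your rknn-llm version for the exact arguments.
ret = llm.build(do_quantization=True, quantized_dtype='w8a8',
                target_platform='rk3588')
assert ret == 0, 'build model failed'

# Export the .rkllm model used by the runtime on the board
ret = llm.export_rkllm('./TinyLlama-1.1B-Chat-v1.0.rkllm')
assert ret == 0, 'export model failed'
```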
Compile Executable File
- Download the cross-compilation toolchain gcc-arm-10.2-2020.11-x86_64-aarch64-none-linux-gnu.
- Modify the main program code. There are two changes:

  ```cpp
  74  param.num_npu_core = 3; // The value range for rk3588 num_npu_core is [1,3]

  118 string text = PROMPT_TEXT_PREFIX + input_str + PROMPT_TEXT_POSTFIX;
  119 // string text = input_str;
  ```

- Modify the GCC path in the compilation script rknn-llm/rkllm-runtime/examples/rkllm_api_demo/build-linux.sh.

  ```bash
  GCC_COMPILER_PATH=gcc-arm-10.2-2020.11-x86_64-aarch64-none-linux-gnu/bin/aarch64-none-linux-gnu
  ```
- Run the compilation script. The generated executable file is located at build/build_linux_aarch64_Release/llm_demo.

  ```bash
  cd rknn-llm/rkllm-runtime/examples/rkllm_api_demo
  bash build-linux.sh
  ```
Board Deployment
Local Terminal Mode
- Copy the converted rkllm model and the compiled binary file llm_demo to the board.
- Import environment variables.

  ```bash
  ulimit -n 102400
  export LD_LIBRARY_PATH=rknn-llm/rkllm-runtime/runtime/Linux/librkllm_api/aarch64:$LD_LIBRARY_PATH
  ```

- Run llm_demo and enter exit to quit.

  ```bash
  taskset f0 ./llm_demo your_rkllm_path
  ```
Gradio Mode
Server Side
- Install gradio.

  ```bash
  pip3 install gradio
  ```
- Copy librkllmrt.so to rkllm_server/lib.

  ```bash
  cd rknn-llm/rkllm-runtime
  cp ./runtime/Linux/librkllm_api/aarch64/librkllmrt.so ./examples/rkllm_server_demo/rkllm_server/lib
  ```

- Modify gradio_server.py to disable GPU for prefill acceleration.

  ```python
  rknnllm_param.use_gpu = False
  ```
- Start the gradio server.

  ```bash
  cd examples/rkllm_server_demo/rkllm_server
  python3 gradio_server.py --target_platform rk3588 --rkllm_model_path your_model_path
  ```

- Access port 8080 of the development board's IP address in your browser.
Client Side
After the gradio server is started on the development board, other devices in the same network environment can call the LLM gradio server through the Gradio API; a minimal client sketch is shown after the steps below.
- Install gradio_client.

  ```bash
  pip3 install gradio_client
  ```
- Modify the IP address in chat_api_gradio.py. Users need to adjust this according to their deployment's specific address.

  ```python
  # Users need to modify according to their deployment's specific IP
  client = Client("http://192.168.2.209:8080/")
  ```

- Run chat_api_gradio.py.

  ```bash
  cd rknn-llm/rkllm-runtime/examples/rkllm_server_demo
  python3 chat_api_gradio.py
  ```
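If you want to call the server from your own code rather than the demo script, the sketch below shows the general shape of a gradio_client call. It is a minimal sketch under assumptions, not the demo's exact code: the endpoint name and argument layout depend on how gradio_server.py defines its interface, so use client.view_api() (or read chat_api_gradio.py) to find the real api_name and parameters.

```python
from gradio_client import Client

# Address of the board running gradio_server.py -- adjust to your deployment
client = Client("http://192.168.2.209:8080/")

# Print the endpoints exposed by the server so you can see the real
# api_name and argument layout defined in gradio_server.py
client.view_api()

# Hypothetical call: replace the api_name and arguments with what
# view_api() / chat_api_gradio.py report for your server
result = client.predict("Hello, who are you?", api_name="/chat")
print(result)
```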
Flask Mode
Server Side
- Install flask.

  ```bash
  pip3 install flask==2.2.2 Werkzeug==2.2.2
  ```
- Copy librkllmrt.so to rkllm_server/lib.

  ```bash
  cd rknn-llm/rkllm-runtime
  cp ./runtime/Linux/librkllm_api/aarch64/librkllmrt.so ./examples/rkllm_server_demo/rkllm_server/lib
  ```

- Modify flask_server.py to disable GPU for prefill acceleration.

  ```python
  rknnllm_param.use_gpu = False
  ```
- Start the flask server on port 8080.

  ```bash
  cd examples/rkllm_server_demo/rkllm_server
  python3 flask_server.py --target_platform rk3588 --rkllm_model_path your_model_path
  ```
Client Side
After the flask server is started on the development board, other devices in the same network environment can call it through its flask API. Users can refer to the chat_api_flask.py example to develop custom functions, using the corresponding send/receive structures for data packaging and parsing; a minimal request sketch is shown after the steps below.
- Modify the IP address in chat_api_flask.py. Users need to adjust this according to their deployment's specific address.

  ```python
  # Users need to modify according to their deployment's specific IP
  server_url = 'http://192.168.2.209:8080/rkllm_chat'
  ```

- Run chat_api_flask.py.

  ```bash
  cd rknn-llm/rkllm-runtime/examples/rkllm_server_demo
  python3 chat_api_flask.py
  ```
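As a reference for custom clients, the sketch below posts a chat request to the /rkllm_chat endpoint with the Python requests library. The payload shape (an OpenAI-style messages list) is an assumption based on the demo client; flask_server.py and chat_api_flask.py define the authoritative request and response structures, so adapt the fields to match them.

```python
import requests

# Address of the board running flask_server.py -- adjust to your deployment
server_url = 'http://192.168.2.209:8080/rkllm_chat'

# Assumed request body; check flask_server.py / chat_api_flask.py for the
# exact field names your server version expects
payload = {
    "model": "your_rkllm_model",
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
    "stream": False,
}

response = requests.post(server_url,
                         json=payload,
                         headers={"Content-Type": "application/json"})
response.raise_for_status()

# The response structure is defined by flask_server.py; print it raw first
print(response.json())
```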
Performance Comparison of Some Models
| Model | Parameter Size | Chip | Chip Count | Inference Speed |
|---|---|---|---|---|
| TinyLlama | 1.1B | RK3588 | 1 | 15.03 token/s |
| Qwen | 1.8B | RK3588 | 1 | 14.18 token/s |
| Phi3 | 3.8B | RK3588 | 1 | 6.46 token/s |
| ChatGLM3 | 6B | RK3588 | 1 | 3.67 token/s |