Deploying LLMs with RKLLM
This document explains how to use RKLLM to deploy large language models in Huggingface format to the RK3588 NPU for hardware-accelerated inference.
Currently Supported Models
- TinyLlama 1.1B
- Qwen 1.8B
- Qwen2 0.5B
- Phi-2 2.7B
- Phi-3 3.8B
- ChatGLM3 6B
- Gemma 2B
- InternLM2 1.8B
- MiniCPM 2B
This guide uses TinyLlama 1.1B as an example to show how to deploy a large language model from scratch on a development board equipped with the RK3588 chip, using the NPU for hardware-accelerated inference.
If the RKLLM environment is not installed and configured, please refer to RKLLM Installation.
Model Conversion
TinyLlama 1.1B is used as the example below; users can also choose any model from the currently supported models list above.
- Download all files of TinyLlama 1.1B on an x86 PC workstation. If git-lfs is not installed, please install it first.

  ```bash
  git clone https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0
  ```
- Activate the rkllm conda environment. Refer to RKLLM conda Installation if needed.

  ```bash
  conda activate rkllm
  ```
- Change the model path and the rkllm export path in rknn-llm/rkllm-toolkit/examples/huggingface/test.py.

  ```python
  modelpath = 'Your Huggingface LLM model'
  ret = llm.export_rkllm("./Your_Huggingface_LLM_model.rkllm")
  ```

- Run the model conversion script. After successful conversion, you will get an rkllm model. (A sketch of the conversion flow inside test.py is shown after these steps.)

  ```bash
  cd rknn-llm/rkllm-toolkit/examples/huggingface
  python3 test.py
  ```
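For reference, the conversion flow inside test.py roughly follows the rkllm-toolkit Python API sketched below. This is a minimal sketch, not the authoritative script: the build parameters (quantization type, optimization level, target platform) and exact argument names vary between toolkit versions, so follow the test.py shipped with your rknn-llm checkout.

```python
from rkllm.api import RKLLM

# Path to the Hugging Face model downloaded above
modelpath = './TinyLlama-1.1B-Chat-v1.0'

llm = RKLLM()

# Load the Hugging Face model
ret = llm.load_huggingface(model=modelpath)
assert ret == 0, 'load model failed'

# Build (quantize/optimize) for the RK3588 NPU.
# Parameter names here are assumptions based on the toolkit examples;
# check the test.py in your rknn-llm version for the exact arguments.
ret = llm.build(do_quantization=True, quantized_dtype='w8a8',
                target_platform='rk3588')
assert ret == 0, 'build model failed'

# Export the .rkllm model used by the runtime on the board
ret = llm.export_rkllm('./TinyLlama-1.1B-Chat-v1.0.rkllm')
assert ret == 0, 'export model failed'
```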
Compile Executable File
- Download the cross-compilation toolchain gcc-arm-10.2-2020.11-x86_64-aarch64-none-linux-gnu.
- Modify the main program code. There are two changes:

  ```cpp
  74  param.num_npu_core = 3; // The value range for rk3588 num_npu_core is [1,3]

  118 string text = PROMPT_TEXT_PREFIX + input_str + PROMPT_TEXT_POSTFIX;
  119 // string text = input_str;
  ```

- Modify the GCC path in the compilation script rknn-llm/rkllm-runtime/examples/rkllm_api_demo/build-linux.sh.

  ```bash
  GCC_COMPILER_PATH=gcc-arm-10.2-2020.11-x86_64-aarch64-none-linux-gnu/bin/aarch64-none-linux-gnu
  ```
- Run the compilation script. The generated executable file is located at build/build_linux_aarch64_Release/llm_demo.

  ```bash
  cd rknn-llm/rkllm-runtime/examples/rkllm_api_demo
  bash build-linux.sh
  ```
Board Deployment
Local Terminal Mode
- Copy the converted rkllm model and the compiled binary file llm_demo to the board.
- Import environment variables.

  ```bash
  ulimit -n 102400
  export LD_LIBRARY_PATH=rknn-llm/rkllm-runtime/runtime/Linux/librkllm_api/aarch64:$LD_LIBRARY_PATH
  ```

- Run llm_demo and enter exit to quit.

  ```bash
  taskset f0 ./llm_demo your_rkllm_path
  ```
Gradio Mode
Server Side
- Install gradio.

  ```bash
  pip3 install gradio
  ```
- Copy librkllmrt.so to rkllm_server/lib.

  ```bash
  cd rknn-llm/rkllm-runtime
  cp ./runtime/Linux/librkllm_api/aarch64/librkllmrt.so ./examples/rkllm_server_demo/rkllm_server/lib
  ```

- Modify gradio_server.py to disable GPU for prefill acceleration.

  ```python
  rknnllm_param.use_gpu = False
  ```
- Start the gradio server.

  ```bash
  cd examples/rkllm_server_demo/rkllm_server
  python3 gradio_server.py --target_platform rk3588 --rkllm_model_path your_model_path
  ```

- Access port 8080 of the development board's IP address in your browser.
Client Side
After the gradio server is started on the development board, other devices in the same network environment can call the LLM gradio server through the Gradio API; a minimal client sketch is shown after the steps below.
- Install gradio_client.

  ```bash
  pip3 install gradio_client
  ```
- Modify the IP address in chat_api_gradio.py. Users need to adjust this according to their deployment's specific address.

  ```python
  # Users need to modify according to their deployment's specific IP
  client = Client("http://192.168.2.209:8080/")
  ```

- Run chat_api_gradio.py.

  ```bash
  cd rknn-llm/rkllm-runtime/examples/rkllm_server_demo
  python3 chat_api_gradio.py
  ```
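If you want to call the server from your own code rather than the demo script, the sketch below shows the general shape of a gradio_client call. It is a minimal sketch under assumptions, not the demo's exact code: the endpoint name and argument layout depend on how gradio_server.py defines its interface, so use client.view_api() (or read chat_api_gradio.py) to find the real api_name and parameters.

```python
from gradio_client import Client

# Address of the board running gradio_server.py -- adjust to your deployment
client = Client("http://192.168.2.209:8080/")

# Print the endpoints exposed by the server so you can see the real
# api_name and argument layout defined in gradio_server.py
client.view_api()

# Hypothetical call: replace the api_name and arguments with what
# view_api() / chat_api_gradio.py report for your server
result = client.predict("Hello, who are you?", api_name="/chat")
print(result)
```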
Flask Mode
Server Side
- Install flask.

  ```bash
  pip3 install flask==2.2.2 Werkzeug==2.2.2
  ```
- Copy librkllmrt.so to rkllm_server/lib.

  ```bash
  cd rknn-llm/rkllm-runtime
  cp ./runtime/Linux/librkllm_api/aarch64/librkllmrt.so ./examples/rkllm_server_demo/rkllm_server/lib
  ```

- Modify flask_server.py to disable GPU for prefill acceleration.

  ```python
  rknnllm_param.use_gpu = False
  ```
- Start the flask server on port 8080.

  ```bash
  cd examples/rkllm_server_demo/rkllm_server
  python3 flask_server.py --target_platform rk3588 --rkllm_model_path your_model_path
  ```
Client Side
After the flask server is started on the development board, other devices in the same network environment can call it through its flask API. Users can refer to the chat_api_flask.py example to develop custom functions, using the corresponding send/receive structures for data packaging and parsing; a minimal request sketch is shown after the steps below.
- Modify the IP address in chat_api_flask.py. Users need to adjust this according to their deployment's specific address.

  ```python
  # Users need to modify according to their deployment's specific IP
  server_url = 'http://192.168.2.209:8080/rkllm_chat'
  ```

- Run chat_api_flask.py.

  ```bash
  cd rknn-llm/rkllm-runtime/examples/rkllm_server_demo
  python3 chat_api_flask.py
  ```
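As a reference for custom clients, the sketch below posts a chat request to the /rkllm_chat endpoint with the Python requests library. The payload shape (an OpenAI-style messages list) is an assumption based on the demo client; flask_server.py and chat_api_flask.py define the authoritative request and response structures, so adapt the fields to match them.

```python
import requests

# Address of the board running flask_server.py -- adjust to your deployment
server_url = 'http://192.168.2.209:8080/rkllm_chat'

# Assumed request body; check flask_server.py / chat_api_flask.py for the
# exact field names your server version expects
payload = {
    "model": "your_rkllm_model",
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
    "stream": False,
}

response = requests.post(server_url,
                         json=payload,
                         headers={"Content-Type": "application/json"})
response.raise_for_status()

# The response structure is defined by flask_server.py; print it raw first
print(response.json())
```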
Performance Comparison of Some Models
| Model | Parameter Size | Chip | Chip Count | Inference Speed |
|---|---|---|---|---|
| TinyLlama | 1.1B | RK3588 | 1 | 15.03 token/s |
| Qwen | 1.8B | RK3588 | 1 | 14.18 token/s |
| Phi3 | 3.8B | RK3588 | 1 | 6.46 token/s |
| ChatGLM3 | 6B | RK3588 | 1 | 3.67 token/s |