
RKLLM Usage and LLM Deployment

This document explains how to use RKLLM to deploy large language models in Huggingface format to boards with the RK3588 NPU for hardware-accelerated inference.

Currently Supported Models

This guide uses TinyLLAMA 1.1B as an example to show how to deploy a large language model from scratch on a development board equipped with the RK3588 chip and use the NPU for hardware-accelerated inference.

tip

If the RKLLM environment is not installed and configured, please refer to RKLLM Installation.

Model Conversion

This section uses TinyLLAMA 1.1B as an example; users can also choose any model from the currently supported models list.

  • Download all files of TinyLLAMA 1.1B on an x86 PC workstation. If git-lfs is not installed, please install it.
    git clone https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0
  • Activate the rkllm conda environment. Refer to RKLLM conda Installation if needed.
    conda activate rkllm
  • Change the model path and rkllm export path in rknn-llm/rkllm-toolkit/examples/huggingface/test.py (a sketch of this script follows the list below).
    modelpath = 'Your Huggingface LLM model'
    ret = llm.export_rkllm("./Your_Huggingface_LLM_model.rkllm")
  • Run the model conversion script.
    cd rknn-llm/rkllm-toolkit/examples/huggingface
    python3 test.py
    After successful conversion, you will get an rkllm model.
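
For reference, the conversion script generally follows the pattern sketched below. This is a minimal sketch assuming the rkllm-toolkit Python API (RKLLM, load_huggingface, build, export_rkllm); the actual parameters (quantization type, target platform, etc.) should be taken from the test.py shipped with your rkllm-toolkit version.

    from rkllm.api import RKLLM

    modelpath = 'Your Huggingface LLM model'  # path to the downloaded Huggingface model
    llm = RKLLM()

    # Load the Huggingface model from disk
    if llm.load_huggingface(model=modelpath) != 0:
        raise SystemExit('Load model failed!')

    # Quantize and build for the RK3588 NPU (illustrative parameters)
    if llm.build(do_quantization=True, quantized_dtype='w8a8', target_platform='rk3588') != 0:
        raise SystemExit('Build model failed!')

    # Export the converted .rkllm model
    if llm.export_rkllm('./Your_Huggingface_LLM_model.rkllm') != 0:
        raise SystemExit('Export model failed!')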

Compile Executable File

  • Download the cross-compilation toolchain gcc-arm-10.2-2020.11-x86_64-aarch64-none-linux-gnu.
  • Modify the main program code. Two changes are needed (the leading numbers are line references in the demo source):
    74 param.num_npu_core = 3; // The value range for rk3588 num_npu_core is [1,3]
    118 string text = PROMPT_TEXT_PREFIX + input_str + PROMPT_TEXT_POSTFIX;
    119 // string text = input_str;
  • Modify the gcc path in the rknn-llm/rkllm-runtime/examples/rkllm_api_demo/build-linux.sh compilation script.
    GCC_COMPILER_PATH=gcc-arm-10.2-2020.11-x86_64-aarch64-none-linux-gnu/bin/aarch64-none-linux-gnu
  • Run the compilation script.
    cd rknn-llm/rkllm-runtime/examples/rkllm_api_demo
    bash build-linux.sh
    The generated executable file is located in build/build_linux_aarch64_Release/llm_demo.

Board Deployment

Local Terminal Mode

  • Copy the converted rkllm model and the compiled binary file llm_demo to the board.
  • Raise the open-file limit and export the library path.
    ulimit -n 102400
    export LD_LIBRARY_PATH=rknn-llm/rkllm-runtime/runtime/Linux/librkllm_api/aarch64:$LD_LIBRARY_PATH
  • Run llm_demo and enter exit to quit. The taskset f0 mask pins the process to CPUs 4-7, the Cortex-A76 big cores of the RK3588.
    taskset f0 ./llm_demo your_rkllm_path
    (Screenshot: rkllm_2.webp)

Gradio Mode

Server Side
  • Install gradio.
    pip3 install gradio
  • Copy librkllmrt.so to rkllm_server/lib.
    cd rknn-llm/rkllm-runtime
    cp ./runtime/Linux/librkllm_api/aarch64/librkllmrt.so ./examples/rkllm_server_demo/rkllm_server/lib
  • Modify gradio_server.py to disable GPU for prefill acceleration.
    rknnllm_param.use_gpu = False
  • Start the gradio server.
    cd examples/rkllm_server_demo/rkllm_server
    python3 gradio_server.py --target_platform rk3588 --rkllm_model_path your_model_path
  • In your browser, access port 8080 at the development board's IP address.
    (Screenshot: rkllm_3.webp)
Client Side

After starting the gradio server on the development board, users on other devices in the same network environment can call the LLM gradio server through the Gradio API.

  • Install gradio_client.
    pip3 install gradio_client
  • Modify the IP address in chat_api_gradio.py. Users need to adjust this according to their deployment's specific address.
    # Users need to modify according to their deployment's specific IP
    client = Client("http://192.168.2.209:8080/")
  • Run chat_api_gradio.py.
    cd rknn-llm/rkllm-runtime/examples/rkllm_server_demo
    python3 chat_api_gradio.py
    (Screenshot: rkllm_4.webp)
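
Beyond running the provided script, a call from any Python client on the same network follows the same pattern; the sketch below is illustrative only, and the api_name value is hypothetical and must match the endpoint actually exposed by gradio_server.py.

    from gradio_client import Client

    # Point the client at the board's gradio server (adjust the IP for your deployment)
    client = Client("http://192.168.2.209:8080/")

    # '/chat' is a hypothetical endpoint name; check the API exposed by the demo server
    result = client.predict("Hello, who are you?", api_name="/chat")
    print(result)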

Flask Mode

Server Side
  • Install flask.
    pip3 install flask==2.2.2 Werkzeug==2.2.2
  • Copy librkllmrt.so to rkllm_server/lib.
    cd rknn-llm/rkllm-runtime
    cp ./runtime/Linux/librkllm_api/aarch64/librkllmrt.so ./examples/rkllm_server_demo/rkllm_server/lib
  • Modify flask_server.py to disable GPU for prefill acceleration.
    rknnllm_param.use_gpu = False
  • Start the flask server on port 8080.
    cd examples/rkllm_server_demo/rkllm_server
    python3 flask_server.py --target_platform rk3588 --rkllm_model_path your_model_path
    (Screenshot: rkllm_5.webp)
Client Side

After starting the flask server on the development board, users on other devices in the same network environment can call it through the flask API. Users can refer to the API access example chat_api_flask.py to develop custom functions, using the corresponding send/receive structures for packaging and parsing data.

  • Modify the IP address in chat_api_flask.py. Users need to adjust this according to their deployment's specific address.
    # Users need to modify according to their deployment's specific IP
    server_url = 'http://192.168.2.209:8080/rkllm_chat'
  • Run chat_api_flask.py.
    cd rknn-llm/rkllm-runtime/examples/rkllm_server_demo
    python3 chat_api_flask.py
    (Screenshot: rkllm_6.webp)
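
As a rough illustration of the send/receive structures mentioned above, a request to the flask server can be assembled as sketched below. The JSON fields shown here are assumptions; the exact payload and response format should be taken from chat_api_flask.py and flask_server.py in the demo.

    import requests

    # Adjust the IP address for your deployment
    server_url = 'http://192.168.2.209:8080/rkllm_chat'
    headers = {'Content-Type': 'application/json'}

    # Illustrative payload; mirror the send structure used in chat_api_flask.py
    payload = {
        "model": "your_model",
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": False,
    }

    response = requests.post(server_url, json=payload, headers=headers)
    print(response.status_code)
    print(response.text)  # parse according to the demo's receive structure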

Performance Comparison of Some Models

Model       Parameter Size   Chip     Chip Count   Inference Speed
TinyLlama   1.1B             RK3588   1            15.03 token/s
Qwen        1.8B             RK3588   1            14.18 token/s
Phi3        3.8B             RK3588   1            6.46 token/s
ChatGLM3    6B               RK3588   1            3.67 token/s