Qwen2 Chatbot-TPU

Qwen2 ChatBot-TPU uses the Sophon SDK to port the Alibaba open-source Qwen2 model to the SG2300X series chips, enabling hardware-accelerated inference using local TPUs. The model is designed as a chatbot using Gradio, allowing users to ask practical questions.

Qwen2 Deployment

Clone the repository

git clone https://github.com/zifeng-radxa/LLM-TPU.git

Open the Qwen2 project path
```
cd LLM-TPU/models/Qwen2/python_demo
```
This example provides the Qwen2-7B-Instruct 4-bit quantized model qwen2-7b_int4_seq512_1dev.bmodel and the C++ precompiled file download

Users can refer to Qwen2 Model Conversion to convert different quantization methods of the Qwen2 model

Users can refer to Qwen2 cpython File Compilation to compile the cpython interface binding file
```
# qwen2-7b_int4_seq512_1dev.bmodel
wget https://github.com/radxa-edge/TPU-Edge-AI/releases/download/qwen2/tar_downloader.sh
bash tar_downloader.sh
tar -xvf qwen2-7b_int4_seq512_1dev.tar.gz
```
Set up the environment

You must create a virtual environment to avoid affecting other applications, please refer to here for virtual environment usage
```
python3 -m virtualenv .venv
source .venv/bin/activate
```

Install dependencies

pip3 install --upgrade pip
pip3 install gradio transformers

Import environment variables

Please use the ldd command to check if the path linked by chat.cpython-38-aarch64-linux-gnu.so to libbmlib.so is LLM-TPU/support/lib_soc/libbmlib.so

If the link path of libbmlib.so is incorrect, you can run the following command, please specify the LLM-TPU path correctly
```
export LD_LIBRARY_PATH=LLM-TPU/support/lib_soc:$LD_LIBRARY_PATH
```
Start Qwen2

(Optional) Modify Qwen2 output language or role-playing, please refer to Modify Qwen2 Background Information

Terminal mode
```
python3 pipeline.py --model_path your_bmodel_path --tokenizer_path ../support/token_config/
```
-m; --model_path: Specify the model path

-t; --tokenizer_path: Specify the token_config folder path, default is ../support/token_config

Gradio mode
```
python3 web_demo.py -m your_bmodel_path -t ../support/token_config/
```
-m; --model_path: Specify the model path

-t; --tokenizer_path: Specify the token_config folder path, default is ../support/token_config

Access the Airbox IP address on port 8003 in your browser

Qwen2 Model Conversion

Users can refer to this document to convert different quantization types of the Qwen2-7B-Instruct model to bmodel

Prepare the environment on an X86 workstation

Please refer to TPU-MLIR Installation to configure the TPU-MLIR environment

Clone the repository
```
git clone https://github.com/zifeng-radxa/LLM-TPU.git
```
Download the Qwen2 open-source model from Huggingface

tip
Ensure you have installed git lfs
```
git clone https://huggingface.co/Qwen/Qwen2-7B-Instruct
```

Create a virtual environment in the working directory LLM-TPU/models/Qwen2

Refer to here for virtual environment usage

python3 -m virtualenv .venv
source .venv/bin/activate
pip3 install --upgrade pip
pip3 install transformers_stream_generator einops tiktoken accelerate torch==2.0.1+cpu torchvision==0.15.2 transformers==4.41.2

Align the model environment

Copy LLM-TPU/models/Qwen2/compile/files/Qwen2-7B-Instruct/modeling_llama.py to the transformers library, note that the transformers library should be in .venv
```
cp ./compile/files/Qwen2-7B-Instruct/modeling_llama.py .venv/lib/python3.10/site-packages/transformers/models/qwen2
```
Generate the onnx file
```
cd compile
python export_onnx.py --model_path your_model_path --seq_length 512
```
--model_path: The path to the downloaded Qwen2 folder

--seq_length: The fixed sequence length for export, choose 512, 1024, 2048, etc., as needed
Generate the bmodel file

Exit the virtual environment before generating the bmodel
```
deactivate
```
Compile the model
```
./compile.sh --mode int4 --name qwen2-7b --addr_mode io_alone --seq_length 512 # same as int8
```
--mode: Quantization mode, choose int4, int8

--seq_length: Sequence length, must match the seq_length specified when generating the onnx file

--name: Model name, must be qwen2-7b

--addr_mode: Network address allocation mode, choose io_alone mode

tip
Generating bmodel takes more than 1 hour. It is recommended to have 64G memory and more than 100G disk space to avoid OOM or no space left

Qwen2 cpython File Compilation

Compile the executable file in Airbox, the precompiled file is included in the qwen2-7b_int4_seq512_1dev.tar.gz download package, no need to compile if already downloaded

cd python_demo
mkdir build
cd build && cmake .. && make && cp *cpython* .. && cd ..

Modify Qwen2 Background Information

Users can modify Qwen2's initial background information to change Qwen2 into different role-playing or different language outputs, the default is English AI assistant You are a helpful assistant.

Users can modify the system_prompt in LLM-TPU/models/Qwen2/python_demo/pipeline.py to initialize Qwen2

For example, to change Qwen2 into an interesting role-playing:

self.system_prompt = 'Vous êtes Qwen2, un chatbot pirate qui répond toujours en français Piratespeak''

Qwen2 Chatbot-TPU

Qwen2 Deployment​

Qwen2 Model Conversion​

Qwen2 cpython File Compilation​

Modify Qwen2 Background Information​

Qwen2 Deployment

Qwen2 Model Conversion

Qwen2 cpython File Compilation

Modify Qwen2 Background Information