Qwen2 Chatbot-TPU
Qwen2 ChatBot-TPU uses the Sophon SDK to port the Alibaba open-source Qwen2 model to the SG2300X series chips, enabling hardware-accelerated inference using local TPUs. The model is designed as a chatbot using Gradio, allowing users to ask practical questions.
Qwen2 Deployment
-
Clone the repository
git clone https://github.com/zifeng-radxa/LLM-TPU.git
-
Open the Qwen2 project path
cd LLM-TPU/models/Qwen2/python_demo
-
This example provides the Qwen2-7B-Instruct 4-bit quantized model qwen2-7b_int4_seq512_1dev.bmodel and the C++ precompiled file download
Users can refer to Qwen2 Model Conversion to convert different quantization methods of the Qwen2 model
Users can refer to Qwen2 cpython File Compilation to compile the cpython interface binding file
# qwen2-7b_int4_seq512_1dev.bmodel
wget https://github.com/radxa-edge/TPU-Edge-AI/releases/download/qwen2/tar_downloader.sh
bash tar_downloader.sh
tar -xvf qwen2-7b_int4_seq512_1dev.tar.gz -
Set up the environment
You must create a virtual environment to avoid affecting other applications, please refer to here for virtual environment usage
python3 -m virtualenv .venv
source .venv/bin/activate -
Install dependencies
pip3 install --upgrade pip
pip3 install gradio transformers -
Import environment variables
Please use the ldd command to check if the path linked by chat.cpython-38-aarch64-linux-gnu.so to libbmlib.so is
LLM-TPU/support/lib_soc/libbmlib.so
If the link path of
libbmlib.so
is incorrect, you can run the following command, please specify the LLM-TPU path correctlyexport LD_LIBRARY_PATH=LLM-TPU/support/lib_soc:$LD_LIBRARY_PATH
-
Start Qwen2
(Optional) Modify Qwen2 output language or role-playing, please refer to Modify Qwen2 Background Information
Terminal mode
python3 pipeline.py --model_path your_bmodel_path --tokenizer_path ../support/token_config/
-m; --model_path
: Specify the model path-t; --tokenizer_path
: Specify the token_config folder path, default is ../support/token_configGradio mode
python3 web_demo.py -m your_bmodel_path -t ../support/token_config/
-m; --model_path
: Specify the model path-t; --tokenizer_path
: Specify the token_config folder path, default is ../support/token_configAccess the Airbox IP address on port 8003 in your browser
Qwen2 Model Conversion
Users can refer to this document to convert different quantization types of the Qwen2-7B-Instruct model to bmodel
-
Prepare the environment on an X86 workstation
Please refer to TPU-MLIR Installation to configure the TPU-MLIR environment
Clone the repository
git clone https://github.com/zifeng-radxa/LLM-TPU.git
-
Download the Qwen2 open-source model from Huggingface
tipEnsure you have installed git lfs
git clone https://huggingface.co/Qwen/Qwen2-7B-Instruct
-
Create a virtual environment in the working directory
LLM-TPU/models/Qwen2
Refer to here for virtual environment usage
python3 -m virtualenv .venv
source .venv/bin/activate
pip3 install --upgrade pip
pip3 install transformers_stream_generator einops tiktoken accelerate torch==2.0.1+cpu torchvision==0.15.2 transformers==4.41.2 -
Align the model environment
Copy
LLM-TPU/models/Qwen2/compile/files/Qwen2-7B-Instruct/modeling_llama.py
to the transformers library, note that the transformers library should be in .venvcp ./compile/files/Qwen2-7B-Instruct/modeling_llama.py .venv/lib/python3.10/site-packages/transformers/models/qwen2
-
Generate the onnx file
cd compile
python export_onnx.py --model_path your_model_path --seq_length 512--model_path
: The path to the downloaded Qwen2 folder--seq_length
: The fixed sequence length for export, choose 512, 1024, 2048, etc., as needed -
Generate the bmodel file
Exit the virtual environment before generating the bmodel
deactivate
Compile the model
./compile.sh --mode int4 --name qwen2-7b --addr_mode io_alone --seq_length 512 # same as int8
--mode
: Quantization mode, choose int4, int8--seq_length
: Sequence length, must match the seq_length specified when generating the onnx file--name
: Model name, must be qwen2-7b--addr_mode
: Network address allocation mode, choose io_alone modetipGenerating bmodel takes more than 1 hour. It is recommended to have 64G memory and more than 100G disk space to avoid OOM or no space left
Qwen2 cpython File Compilation
Compile the executable file in Airbox, the precompiled file is included in the qwen2-7b_int4_seq512_1dev.tar.gz download package, no need to compile if already downloaded
cd python_demo
mkdir build
cd build && cmake .. && make && cp *cpython* .. && cd ..
Modify Qwen2 Background Information
Users can modify Qwen2's initial background information to change Qwen2 into different role-playing or different language outputs, the default is English AI assistant You are a helpful assistant.
Users can modify the system_prompt
in LLM-TPU/models/Qwen2/python_demo/pipeline.py
to initialize Qwen2
For example, to change Qwen2 into an interesting role-playing:
self.system_prompt = 'Vous êtes Qwen2, un chatbot pirate qui répond toujours en français Piratespeak''