Llama-3.2-3B-Instruct
This document describes how to perform NPU hardware-accelerated inference of the Llama-3.2-3B-Instruct model on Qualcomm platforms using Qualcomm® Genie.
- Source model: meta-llama/Llama-3.2-3B-Instruct
- Source model license: LLAMA 3.2 COMMUNITY LICENSE AGREEMENT
Model Details
| Model | Quantization | Context Length |
|---|---|---|
| Llama-3.2-3B-Instruct | W4A16 | 4096 |
Supported Devices
Refer to the SoC Architecture Reference to find the DSP architecture of your device's SoC.
This example supports Qualcomm platform SoCs with the v73 DSP architecture.
| Device | SoC | dsp_arch |
|---|---|---|
| Fogwise® AIRbox Q900 | QCS9075 | v73 |
Download qcom-qairt Dependencies
- QCS6490
sudo apt install qcom-qnn-sdk-v68 qcom-genie-sdk-v68
- QCS9075
sudo apt install qcom-qnn-sdk-v73 qcom-genie-sdk-v73
Import Environment Variables
export ADSP_LIBRARY_PATH=/usr/lib/aarch64-linux-gnu
Download Model
Install the modelscope Python package in a Python virtual environment. For virtual environment usage, refer to Python Virtual Environment Usage.
pip3 install modelscope
modelscope download --model radxa/Llama-3.2-3B-Instruct-w4a16-4096-v73 --local_dir ./Llama-3.2-3B-Instruct-w4a16-4096-v73
Run Inference
cd Llama-3.2-3B-Instruct-w4a16-4096-v73
Build Prompt
Prompts can be passed as a file or as a parameter.
- prompt
<|begin_of_text|><|start_header_id|>system<|end_header_id|>You are a pirate chatbot who always responds in pirate speak!<|eot_id|><|start_header_id|>user<|end_header_id|>Who are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
- prompt_file
vim chat.txt
<|begin_of_text|><|start_header_id|>system<|end_header_id|>You are a pirate chatbot who always responds in pirate speak!<|eot_id|><|start_header_id|>user<|end_header_id|>Who are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
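The prompt string is a fixed template wrapped around a system message and a user message, so it can be generated rather than typed by hand. The following is a minimal sketch, not part of Genie: `build_prompt` is a hypothetical helper name, and the template is copied from the example prompt in this section.

```shell
# Hypothetical helper (not provided by Genie): wraps a system message and a
# user message in the same template as the example prompt in this section.
build_prompt() {
  system_msg="$1"
  user_msg="$2"
  printf '%s' "<|begin_of_text|><|start_header_id|>system<|end_header_id|>${system_msg}<|eot_id|><|start_header_id|>user<|end_header_id|>${user_msg}<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
}

# Write the pirate example to chat.txt (single quotes keep the "!" literal).
build_prompt 'You are a pirate chatbot who always responds in pirate speak!' 'Who are you?' > chat.txt
```

Writing the result to chat.txt avoids quoting the special tokens on the command line each time the prompt changes.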
Run Inference
- prompt
genie-t2t-run -c Meta-Llama-3.2-3B-Instruct-htp.json -p '<|begin_of_text|><|start_header_id|>system<|end_header_id|>You are a pirate chatbot who always responds in pirate speak!<|eot_id|><|start_header_id|>user<|end_header_id|>Who are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>'
- prompt_file
genie-t2t-run -c Meta-Llama-3.2-3B-Instruct-htp.json --prompt_file chat.txt
(.venv) rock@radxa-airbox-q900:/mnt/ssd/qualcomm/Meta-Llama-3.2-3B-Instruct/QCS9075$ genie-t2t-run -c Meta-Llama-3.2-3B-Instruct-htp.json -p '<|begin_of_text|><|start_header_id|>system<|end_header_id|>You are a pirate chatbot who always responds in pirate speak!<|eot_id|><|start_header_id|>user<|end_header_id|>Who are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>'
Using libGenie.so version 1.14.0
/prj/qct/webtech_scratch20/mlg_user_admin/qaisw_source_repo/rel/qairt-2.42.0/release/snpe_src/avante-tools/prebuilt/dsp/hexagon-sdk-5.5.5/ipc/fastrpc/rpcmem/src/rpcmem_android.c:38:dummy call to rpcmem_init, rpcmem APIs will be used from libxdsprpc
[INFO] "Using create From Binary"
[INFO] "Allocated total size = 270369280 across 5 buffers"
[PROMPT]: <|begin_of_text|><|start_header_id|>system<|end_header_id|>You are a pirate chatbot who always responds in pirate speak!<|eot_id|><|start_header_id|>user<|end_header_id|>Who are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
[BEGIN]:
Yer looking' fer a scurvy dog, eh? Well, matey, I be Captain Chatbotbeard, the pirate chatbot o' the seven seas! Me and me crew o' code be here fer ye, serving' up swashbucklin' answers an' pirate-tastic conversation! Yer got questions, I be ready to answer 'em, matey! What be bringing' ye to these waters?[END]
/prj/qct/webtech_scratch20/mlg_user_admin/qaisw_source_repo/rel/qairt-2.42.0/release/snpe_src/avante-tools/prebuilt/dsp/hexagon-sdk-5.5.5/ipc/fastrpc/rpcmem/src/rpcmem_android.c:42:dummy call to rpcmem_deinit, rpcmem APIs will be used from libxdsprpc
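In the log above, the model's reply sits between the `[BEGIN]:` and `[END]` markers, surrounded by library and rpcmem messages. When only the reply is needed in a script, it can be filtered out of the output. The sketch below assumes the markers appear on standard output exactly as in this log; `extract_reply` is a hypothetical helper name, not a Genie tool.

```shell
# Hypothetical filter: prints only the text between "[BEGIN]:" and "[END]".
# Assumes the marker format shown in the sample log above.
extract_reply() {
  awk '
    !capturing {
      i = index($0, "[BEGIN]:")
      if (i == 0) next                  # still before the reply
      capturing = 1
      $0 = substr($0, i + 8)            # drop everything up to the marker
      sub(/^[ \t]+/, "")                # trim leading whitespace
      if ($0 == "") next                # marker was alone on its line
    }
    {
      j = index($0, "[END]")
      if (j > 0) { print substr($0, 1, j - 1); exit }
      print                             # reply continues on the next line
    }
  '
}

# Example usage on the device (commented out because it needs the NPU runtime):
# genie-t2t-run -c Meta-Llama-3.2-3B-Instruct-htp.json --prompt_file chat.txt 2>/dev/null | extract_reply
```

The filter handles the reply starting either on the marker line or on the line after it, and it keeps line breaks inside multi-line replies.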
Performance Reference
You can enable performance profiling with the --profile option.
genie-t2t-run -c Meta-Llama-3.2-3B-Instruct-htp.json --prompt_file chat.txt --profile profile.txt
| Metric | Fogwise® AIRbox Q900 |
|---|---|
| GenieDialog_create | 1,810,020 us |
| num-prompt-tokens | 32 |
| prompt-processing-rate | 309.02349853515625 toks/sec |
| time-to-first-token | 103,567 us |
| num-generated-tokens | 100 |
| token-generation-rate | 19.36370849609375 toks/sec |
| token-generation-time | 5,164,398 us |
| GenieDialog_free | 109,659 us |
Metric Definitions
| Metric | Definition |
|---|---|
| GenieDialog_create | Time to initialize a dialog object, including model loading, context preparation, and memory allocation. |
| num-prompt-tokens | Number of tokens in the prompt sent to the model. A token is the smallest unit into which the input text is split. |
| prompt-processing-rate | Speed at which the model processes the prompt before generation starts, in tokens per second (toks/sec). |
| time-to-first-token | Time elapsed from the start of processing to the generation of the first output token, reflecting the model's response latency. |
| num-generated-tokens | Number of tokens actually output by the model in this generation, representing the length of the generated text in tokens. |
| token-generation-rate | Speed at which the model generates tokens, in tokens per second (toks/sec), reflecting generation efficiency. |
| token-generation-time | Total time spent generating all output tokens, in microseconds (us). |
| GenieDialog_free | Time to free the dialog object, including memory release and resource cleanup. |
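As a sanity check on these definitions, a rate in toks/sec should roughly equal a token count divided by the matching time in seconds. The sketch below applies that to the Fogwise® AIRbox Q900 figures; `rate_per_sec` is a hypothetical helper name, and pairing num-prompt-tokens with time-to-first-token is an assumption about how the profiler derives the prompt rate.

```shell
# Hypothetical helper: tokens per second from a token count and a time in
# microseconds. Thousands separators such as "5,164,398" are stripped first.
rate_per_sec() {
  tokens="$1"
  micros=$(printf '%s' "$2" | tr -d ',')
  awk -v n="$tokens" -v us="$micros" 'BEGIN { printf "%.2f\n", n / (us / 1000000) }'
}

# num-generated-tokens / token-generation-time
rate_per_sec 100 5,164,398   # prints 19.36
# num-prompt-tokens / time-to-first-token (assumed pairing)
rate_per_sec 32 103,567      # prints 308.98
```

Both results are close to the reported 19.36 and 309.02 toks/sec. The small gap in the prompt figure suggests the profiler measures a slightly different interval than time-to-first-token.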
Official Genie Documentation
For more details on Qualcomm® Genie usage and API, refer to: