Llama3.2-1B LLM
This document describes how to run inference for the Llama3.2-1B large language model on the NPU of the Radxa Dragon Q8B.
Download Example
Download precompiled models using modelscope. Two context-length options are provided:
- 1024
- 4096
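Set CTX_LENGTH to match the option you chose. Run only one of the two commands below.
Device (1024-token context)
export CTX_LENGTH=1024
Device (4096-token context)
export CTX_LENGTH=4096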
Device
pip3 install modelscope
modelscope download --model radxa/Llama3.2-1B-${CTX_LENGTH}-qairt-v68 --local ./Llama3.2-1B-${CTX_LENGTH}-qairt-v68
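Because the model name is built from CTX_LENGTH, a mistyped or unset value only shows up as a failed download. The sketch below guards against that by checking the value before the name is assembled. The function name `model_dir_for_ctx` is a hypothetical helper introduced here for illustration and is not part of the published package.

```shell
# Hypothetical helper: map a context length to the model directory name
# used in this document. Only the two published options are accepted.
model_dir_for_ctx() {
  case "$1" in
    1024|4096)
      printf 'Llama3.2-1B-%s-qairt-v68\n' "$1"
      ;;
    *)
      echo "unsupported CTX_LENGTH: '$1' (expected 1024 or 4096)" >&2
      return 1
      ;;
  esac
}

# Possible usage with the download command shown above:
#   MODEL_DIR=$(model_dir_for_ctx "$CTX_LENGTH") &&
#     modelscope download --model "radxa/${MODEL_DIR}" --local "./${MODEL_DIR}"
```

The `&&` in the usage comment keeps the download from starting when the check fails.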
Model Inference
Construct Prompt
Prompts for Llama3.2 must follow the format below:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nPlease give a brief introduction to relativity.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n
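Typing the template by hand is error-prone, so a small helper can assemble it from a system message and a user message. The sketch below is a minimal illustration. `build_prompt` is a hypothetical name, and it emits `\n` as the literal two-character sequence (backslash, n), mirroring the example above rather than inserting real newlines.

```shell
# Hypothetical helper: build_prompt SYSTEM_MESSAGE USER_MESSAGE
# printf '%s' prints its arguments verbatim, so the "\n" sequences stay
# literal (backslash + n), exactly as they appear in the template.
build_prompt() {
  printf '%s' \
    '<|begin_of_text|>' \
    '<|start_header_id|>system<|end_header_id|>\n\n' "$1" '<|eot_id|>' \
    '<|start_header_id|>user<|end_header_id|>\n\n' "$2" '<|eot_id|>' \
    '<|start_header_id|>assistant<|end_header_id|>\n\n'
}

# Possible usage:
#   PROMPT=$(build_prompt 'You are a helpful assistant' \
#                         'Please give a brief introduction to relativity.')
```

The resulting string can then be passed to the `-p` option in place of the hand-written prompt.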
LLM Inference
Configure Environment
Device
cd Llama3.2-1B-${CTX_LENGTH}-qairt-v68
export LD_LIBRARY_PATH=$(pwd)
chmod +x genie-t2t-run
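Before running inference, it can help to confirm that the download actually produced the files the later commands rely on. The sketch below checks only the two file names that appear in this document. `check_runtime_files` is a hypothetical helper, and the model directory may contain other required files that it does not verify.

```shell
# Hypothetical helper: check_runtime_files [DIR]
# Verifies that the runner binary and the config file named in this
# document exist in DIR (defaults to the current directory).
check_runtime_files() {
  crf_dir=${1:-.}
  crf_status=0
  for crf_file in genie-t2t-run htp-model-config-llama32-1b-gqa.json; do
    if [ ! -f "$crf_dir/$crf_file" ]; then
      echo "missing: $crf_dir/$crf_file" >&2
      crf_status=1
    fi
  done
  return "$crf_status"
}

# Possible usage from inside the model directory:
#   check_runtime_files . && echo "ready to run"
```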
Execute Inference
Device
./genie-t2t-run -c ./htp-model-config-llama32-1b-gqa.json -p '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nPlease give a brief introduction to relativity.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n'
rock@radxa-dragon-q6a:~/ssd/qualcomm/701/Llama3.2-1B/to_device$ ./genie-t2t-run -c ./htp-model-config-llama32-1b-gqa.json -p '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nPlease give a brief introduction to relativity.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n'
Using libGenie.so version 1.13.0
/prj/qct/webtech_scratch20/mlg_user_admin/qaisw_source_repo/rel/qairt-2.40.1/point_release/SNPE_SRC/avante-tools/prebuilt/dsp/hexagon-sdk-5.5.5/ipc/fastrpc/rpcmem/src/rpcmem_android.c:38:dummy call to rpcmem_init, rpcmem APIs will be used from libxdsprpc
[INFO] "Using create From Binary List Async"
[INFO] "Allocated total size = 84058624 across 1 buffers"
[PROMPT]: <|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nPlease give a brief introduction to relativity.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n
[BEGIN]: \nRelativity is a fundamental concept in physics that describes the relationship between space and time. It was first introduced by Albert Einstein in 1905, and later developed by Albert Einstein in 1915. The theory of relativity revolutionized our understanding of the universe and had a profound impact on the development of modern physics.[END]
/prj/qct/webtech_scratch20/mlg_user_admin/qaisw_source_repo/rel/qairt-2.40.1/point_release/SNPE_SRC/avante-tools/prebuilt/dsp/hexagon-sdk-5.5.5/ipc/fastrpc/rpcmem/src/rpcmem_android.c:42:dummy call to rpcmem_deinit, rpcmem APIs will be used from libxdsprpc
Performance Analysis
To enable performance profiling, pass the --profile option followed by an output file name (profile.txt in the example below):
./genie-t2t-run -c ./htp-model-config-llama32-1b-gqa.json -p '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nPlease give a brief introduction to relativity.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' --profile profile.txt
The tables below report results for two SoCs, QCS6490 and SC8280XP, in that order.
| Dragon Q6A | CTX_LENGTH 1024 | CTX_LENGTH 4096 |
|---|---|---|
| duration | 5,290,527 us | 7,399,282 us |
| num-prompt-tokens | 33 | 33 |
| prompt-processing-rate | 171.67381286621094 toks/sec | 110.03520965576172 toks/sec |
| time-to-first-token | 192,227 us | 299,934 us |
| num-generated-tokens | 62 | 66 |
| token-generation-rate | 12.16189956665039 toks/sec | 9.297395706176758 toks/sec |
| token-generation-time | 5,097,917 us | 7,098,792 us |
| Dragon Q8B | CTX_LENGTH 1024 | CTX_LENGTH 4096 |
|---|---|---|
| duration | 3,105,622 us | 4,481,219 us |
| num-prompt-tokens | 33 | 33 |
| prompt-processing-rate | 261.8486633300781 toks/sec | 159.362548828125 toks/sec |
| time-to-first-token | 126,027 us | 207,097 us |
| num-generated-tokens | 62 | 66 |
| token-generation-rate | 20.82032012939453 toks/sec | 15.443061828613281 toks/sec |
| token-generation-time | 2,977,871 us | 4,273,821 us |
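The reported rates appear to be token counts divided by elapsed time: token-generation-rate matches num-generated-tokens / token-generation-time, and prompt-processing-rate matches num-prompt-tokens / time-to-first-token. This relationship is inferred from the numbers in the tables, not from documentation of the profiler. The sketch below reproduces two Dragon Q6A values to two decimal places. `tokens_per_second` is a hypothetical helper.

```shell
# Hypothetical helper: tokens_per_second TOKENS MICROSECONDS
# Converts a token count and a duration in microseconds to tokens/sec.
tokens_per_second() {
  # Strip thousands separators such as those in "5,097,917".
  tps_us=$(printf '%s' "$2" | tr -d ',')
  awk -v n="$1" -v us="$tps_us" 'BEGIN { printf "%.2f\n", n / (us / 1000000) }'
}

# Dragon Q6A, CTX_LENGTH 1024, values taken from the table above:
tokens_per_second 62 5,097,917   # token generation   -> 12.16
tokens_per_second 33 192,227     # prompt processing  -> 171.67
```

The same calculation can be applied to the other columns to cross-check the remaining entries.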