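Llama3.2-1B Large Language Model
This document describes how to run inference for the Llama3.2-1B large language model on the NPU of the Radxa Dragon Q8B.
Download the Example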
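Use modelscope to download the precompiled model. Two context lengths are provided; pick one and set CTX_LENGTH accordingly:
- 1024
- 4096
Device
export CTX_LENGTH=1024  # option 1: 1024-token context
Device
export CTX_LENGTH=4096  # option 2: 4096-token context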
Device
pip3 install modelscope
modelscope download --model radxa/Llama3.2-1B-${CTX_LENGTH}-qairt-v68 --local_dir ./Llama3.2-1B-${CTX_LENGTH}-qairt-v68
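Because the repository name is derived from CTX_LENGTH, a typo in that variable only shows up as a failed download. The following is a minimal sketch of a guard that accepts just the two context lengths listed above; `model_repo` is a hypothetical helper name, not part of modelscope.

```shell
# Hypothetical helper: print the repository name for a supported context length.
model_repo() {
  case "$1" in
    1024|4096)
      printf 'radxa/Llama3.2-1B-%s-qairt-v68\n' "$1"
      ;;
    *)
      # Anything other than the two published lengths is rejected.
      echo "unsupported CTX_LENGTH: '$1' (expected 1024 or 4096)" >&2
      return 1
      ;;
  esac
}
```

It could be used as `repo=$(model_repo "$CTX_LENGTH") && modelscope download --model "$repo" --local_dir "./${repo#radxa/}"`.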
Model Inference
Constructing the Prompt
Prompts for Llama3.2 must follow this format:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nPlease give a brief introduction to relativity.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n
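Typing the template by hand is error-prone, so the format above can be sketched as a small shell function. `build_prompt` is a hypothetical helper name. It assumes, as in the example in this document, that `\n` is passed as two literal characters (a backslash followed by `n`) rather than as a real newline.

```shell
# Hypothetical helper: assemble a Llama3.2 prompt from a system and a user message.
build_prompt() {
  local system_msg="$1"
  local user_msg="$2"
  # Single quotes keep "\n" as two literal characters, matching the template.
  local hdr_open='<|start_header_id|>'
  local hdr_close='<|end_header_id|>\n\n'
  local eot='<|eot_id|>'
  # printf '%s' prints its argument verbatim, so the backslashes are not interpreted.
  printf '%s' "<|begin_of_text|>${hdr_open}system${hdr_close}${system_msg}${eot}${hdr_open}user${hdr_close}${user_msg}${eot}${hdr_open}assistant${hdr_close}"
}

PROMPT=$(build_prompt "You are a helpful assistant" "Please give a brief introduction to relativity.")
```

The resulting `$PROMPT` can then be passed to the runner with `-p "$PROMPT"`.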
LLM Inference
Set up the environment
Device
cd Llama3.2-1B-${CTX_LENGTH}-qairt-v68
export LD_LIBRARY_PATH=$(pwd)
chmod +x genie-t2t-run
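The export above replaces any existing LD_LIBRARY_PATH. If other software on the board relies on that variable, one possible variant is to prepend the model directory instead. This is a sketch only; `prepend_lib_path` is a hypothetical helper name.

```shell
# Hypothetical helper: put a directory first in LD_LIBRARY_PATH, keeping existing entries.
prepend_lib_path() {
  # ${VAR:+...} expands to ":<old value>" only when the variable is already set and non-empty,
  # which avoids leaving a stray trailing colon.
  export LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}
```

From inside the model directory it would be called as `prepend_lib_path "$(pwd)"`.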
Run inference
Device
./genie-t2t-run -c ./htp-model-config-llama32-1b-gqa.json -p '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nPlease give a brief introduction to relativity.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n'
rock@radxa-dragon-q6a:~/ssd/qualcomm/701/Llama3.2-1B/to_device$ ./genie-t2t-run -c ./htp-model-config-llama32-1b-gqa.json -p '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nPlease give a brief introduction to relativity.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n'
Using libGenie.so version 1.13.0
/prj/qct/webtech_scratch20/mlg_user_admin/qaisw_source_repo/rel/qairt-2.40.1/point_release/SNPE_SRC/avante-tools/prebuilt/dsp/hexagon-sdk-5.5.5/ipc/fastrpc/rpcmem/src/rpcmem_android.c:38:dummy call to rpcmem_init, rpcmem APIs will be used from libxdsprpc
[INFO] "Using create From Binary List Async"
[INFO] "Allocated total size = 84058624 across 1 buffers"
[PROMPT]: <|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nPlease give a brief introduction to relativity.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n
[BEGIN]: \nRelativity is a fundamental concept in physics that describes the relationship between space and time. It was first introduced by Albert Einstein in 1905, and later developed by Albert Einstein in 1915. The theory of relativity revolutionized our understanding of the universe and had a profound impact on the development of modern physics.[END]
/prj/qct/webtech_scratch20/mlg_user_admin/qaisw_source_repo/rel/qairt-2.40.1/point_release/SNPE_SRC/avante-tools/prebuilt/dsp/hexagon-sdk-5.5.5/ipc/fastrpc/rpcmem/src/rpcmem_android.c:42:dummy call to rpcmem_deinit, rpcmem APIs will be used from libxdsprpc
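In the log above, the model's reply sits between the `[BEGIN]:` and `[END]` markers, surrounded by diagnostic lines. A minimal sketch for pulling out just the reply follows. It assumes the reply is printed on a single line, as in this sample, and that it arrives on standard output. `extract_answer` is a hypothetical helper name.

```shell
# Hypothetical helper: print only the text between "[BEGIN]: " and "[END]".
# Lines that do not match the pattern (logs, the echoed prompt) are dropped.
extract_answer() {
  sed -n 's/^\[BEGIN]: \(.*\)\[END]$/\1/p'
}

# Stand-in for real runner output, used here so the snippet runs anywhere.
printf '%s\n' '[INFO] "Using create From Binary List Async"' \
              '[BEGIN]: Relativity is a theory of space and time.[END]' | extract_answer
```

On the device, the same filter could be attached to the real command: `./genie-t2t-run -c ./htp-model-config-llama32-1b-gqa.json -p "$PROMPT" | extract_answer`. A reply that spans several lines would need a different filter.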
Performance Profiling
Use the --profile option to enable profiling; the example below passes profile.txt as the output file:
./genie-t2t-run -c ./htp-model-config-llama32-1b-gqa.json -p '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nPlease give a brief introduction to relativity.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' --profile profile.txt
Results are listed for two platforms: QCS6490 (Dragon Q6A) and SC8280XP (Dragon Q8B).
| Dragon Q6A | CTX_LENGTH 1024 | CTX_LENGTH 4096 |
|---|---|---|
| duration | 5,290,527 us | 7,399,282 us |
| num-prompt-tokens | 33 | 33 |
| prompt-processing-rate | 171.67381286621094 toks/sec | 110.03520965576172 toks/sec |
| time-to-first-token | 192,227 us | 299,934 us |
| num-generated-tokens | 62 | 66 |
| token-generation-rate | 12.16189956665039 toks/sec | 9.297395706176758 toks/sec |
| token-generation-time | 5,097,917 us | 7,098,792 us |

| Dragon Q8B | CTX_LENGTH 1024 | CTX_LENGTH 4096 |
|---|---|---|
| duration | 3,105,622 us | 4,481,219 us |
| num-prompt-tokens | 33 | 33 |
| prompt-processing-rate | 261.8486633300781 toks/sec | 159.362548828125 toks/sec |
| time-to-first-token | 126,027 us | 207,097 us |
| num-generated-tokens | 62 | 66 |
| token-generation-rate | 20.82032012939453 toks/sec | 15.443061828613281 toks/sec |
| token-generation-time | 2,977,871 us | 4,273,821 us |
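
The rate rows in the tables appear consistent with dividing a token count by the matching time in microseconds. For example, 62 generated tokens over 5,097,917 us comes to about 12.16 toks/sec. A minimal sketch for recomputing such a rate is shown here; `toks_per_sec` is a hypothetical helper name and not part of the Genie tooling.

```shell
# Hypothetical helper: tokens per second from a token count and a duration in microseconds.
toks_per_sec() {
  awk -v tokens="$1" -v us="$2" 'BEGIN { printf "%.2f\n", tokens * 1000000 / us }'
}

toks_per_sec 62 5097917   # Dragon Q6A, CTX_LENGTH 1024, generation: 12.16
toks_per_sec 33 192227    # Dragon Q6A, CTX_LENGTH 1024, prompt processing: 171.67
```

The same helper applied to the Dragon Q8B figures (62 tokens over 2,977,871 us) gives about 20.82 toks/sec, in line with the table.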