
Llama3.2-1B Large Language Model

This document describes how to run inference for the Llama3.2-1B large language model on the NPU of the Radxa Dragon Q8B.

Download the example

Use modelscope to download the precompiled model. Two context lengths are provided, 1024 and 4096; the commands below select 1024:

Device
export CTX_LENGTH=1024
Device
pip3 install modelscope
modelscope download --model radxa/Llama3.2-1B-${CTX_LENGTH}-qairt-v68 --local ./Llama3.2-1B-${CTX_LENGTH}-qairt-v68
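
Since only two context lengths are offered, it can help to validate CTX_LENGTH before starting a download. The sketch below is a hypothetical helper (model_repo is not part of modelscope or the example package). It assumes the 4096 repository follows the same naming pattern as the 1024 one, as the ${CTX_LENGTH} template in the command above suggests.

```shell
# Hypothetical helper: map a context length to the model repository name.
# Assumption: both variants follow the radxa/Llama3.2-1B-<ctx>-qairt-v68 pattern.
model_repo() {
  case "$1" in
    1024|4096)
      printf 'radxa/Llama3.2-1B-%s-qairt-v68\n' "$1"
      ;;
    *)
      echo "unsupported CTX_LENGTH: $1 (expected 1024 or 4096)" >&2
      return 1
      ;;
  esac
}
```

A possible use is `modelscope download --model "$(model_repo "$CTX_LENGTH")" --local "./Llama3.2-1B-${CTX_LENGTH}-qairt-v68"`, which fails early when CTX_LENGTH holds an unexpected value.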

Model inference

Build the prompt

A Llama3.2 prompt must follow the format below:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nPlease give a brief introduction to relativity.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n
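
Writing this string by hand is error-prone, so a small shell function can assemble it from a system message and a user message. This is a minimal sketch for the single-turn case shown above; build_prompt is a hypothetical name, and the \n sequences are kept as literal backslash-n characters, exactly as they appear in the format string.

```shell
# Hypothetical helper: assemble a single-turn Llama3.2 prompt.
#   $1 = system message, $2 = user message
# Single quotes keep \n as two literal characters (backslash, n), and
# printf '%s' prints its arguments without interpreting backslashes.
build_prompt() {
  printf '%s' '<|begin_of_text|>'
  printf '%s' '<|start_header_id|>system<|end_header_id|>\n\n' "$1" '<|eot_id|>'
  printf '%s' '<|start_header_id|>user<|end_header_id|>\n\n' "$2" '<|eot_id|>'
  printf '%s' '<|start_header_id|>assistant<|end_header_id|>\n\n'
}
```

For example, `PROMPT="$(build_prompt 'You are a helpful assistant' 'Please give a brief introduction to relativity.')"` reproduces the string above, and the result can be passed to the inference command as `-p "$PROMPT"`.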

LLM inference

Set up the environment

Device
cd Llama3.2-1B-${CTX_LENGTH}-qairt-v68
export LD_LIBRARY_PATH=$(pwd)
chmod +x genie-t2t-run

Run inference

Device
./genie-t2t-run -c ./htp-model-config-llama32-1b-gqa.json -p '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nPlease give a brief introduction to relativity.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n'
rock@radxa-dragon-q6a:~/ssd/qualcomm/701/Llama3.2-1B/to_device$ ./genie-t2t-run -c ./htp-model-config-llama32-1b-gqa.json -p '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nPlease give a brief introduction to relativity.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n'
Using libGenie.so version 1.13.0

/prj/qct/webtech_scratch20/mlg_user_admin/qaisw_source_repo/rel/qairt-2.40.1/point_release/SNPE_SRC/avante-tools/prebuilt/dsp/hexagon-sdk-5.5.5/ipc/fastrpc/rpcmem/src/rpcmem_android.c:38:dummy call to rpcmem_init, rpcmem APIs will be used from libxdsprpc
[INFO] "Using create From Binary List Async"
[INFO] "Allocated total size = 84058624 across 1 buffers"
[PROMPT]: <|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nPlease give a brief introduction to relativity.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n

[BEGIN]: \nRelativity is a fundamental concept in physics that describes the relationship between space and time. It was first introduced by Albert Einstein in 1905, and later developed by Albert Einstein in 1915. The theory of relativity revolutionized our understanding of the universe and had a profound impact on the development of modern physics.[END]
/prj/qct/webtech_scratch20/mlg_user_admin/qaisw_source_repo/rel/qairt-2.40.1/point_release/SNPE_SRC/avante-tools/prebuilt/dsp/hexagon-sdk-5.5.5/ipc/fastrpc/rpcmem/src/rpcmem_android.c:42:dummy call to rpcmem_deinit, rpcmem APIs will be used from libxdsprpc
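
In the output above, the generated text sits between the [BEGIN]: and [END] markers, surrounded by log lines. The following sketch filters a captured log down to just the reply. extract_reply is a hypothetical helper, and it assumes the markers look exactly as in the sample output; it also tolerates replies that span several lines.

```shell
# Hypothetical helper: read genie-t2t-run output on stdin and print only
# the text between "[BEGIN]: " and "[END]".
extract_reply() {
  awk '
    # Start capturing at the [BEGIN]: marker and drop the 9-character marker.
    !capturing && index($0, "[BEGIN]: ") == 1 {
      capturing = 1
      $0 = substr($0, 10)
    }
    capturing {
      end = index($0, "[END]")
      if (end > 0) {
        # Print everything before [END], then stop reading.
        print substr($0, 1, end - 1)
        exit
      }
      print
    }
  '
}
```

It can be applied to a saved log, for example `extract_reply < output.log`, where output.log is a placeholder name for wherever the tool's output was captured.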

Performance profiling

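The --profile option enables performance profiling. The command below passes profile.txt as the file that receives the results: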

./genie-t2t-run -c ./htp-model-config-llama32-1b-gqa.json -p '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nPlease give a brief introduction to relativity.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' --profile profile.txt
| Dragon Q8B | CTX_LENGTH 1024 | CTX_LENGTH 4096 |
| --- | --- | --- |
| duration | 3,105,622 us | 4,481,219 us |
| num-prompt-tokens | 33 | 33 |
| prompt-processing-rate | 261.8486633300781 toks/sec | 159.362548828125 toks/sec |
| time-to-first-token | 126,027 us | 207,097 us |
| num-generated-tokens | 62 | 66 |
| token-generation-rate | 20.82032012939453 toks/sec | 15.443061828613281 toks/sec |
| token-generation-time | 2,977,871 us | 4,273,821 us |
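
The rate figures above appear to follow from the token counts and the times: dividing num-prompt-tokens by time-to-first-token, or num-generated-tokens by token-generation-time, gives values close to the reported rates. This relation is an observation from the numbers in this document, not a documented definition. The sketch below redoes the arithmetic for the CTX_LENGTH 1024 column; rate_toks_per_sec is a hypothetical helper.

```shell
# Hypothetical helper: tokens per second from a token count and a time in
# microseconds.
#   $1 = number of tokens, $2 = elapsed time in us
rate_toks_per_sec() {
  awk -v tokens="$1" -v us="$2" \
    'BEGIN { printf "%.2f\n", tokens / (us / 1000000) }'
}

# CTX_LENGTH 1024 figures from the table above.
rate_toks_per_sec 33 126027     # compare with prompt-processing-rate
rate_toks_per_sec 62 2977871    # compare with token-generation-rate
```

The same function can be used to compare runs with other context lengths or prompts without opening the profile file each time.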


    Radxa-docs © 2026 by Radxa Computer (Shenzhen) Co.,Ltd. is licensed under CC BY 4.0