MiniCPM-1B-sft

This document describes how to run inference on the MiniCPM-1B-sft model with NPU hardware acceleration on Qualcomm platforms using Qualcomm® Genie.

Source model: openbmb/MiniCPM-1B-sft-bf16
Source model license: OpenBMB-General-Model-License

Model details

Model	Quantization	Context length
MiniCPM-1B-sft	W4A16	1024
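
The context length bounds how much text a single run can hold. The sketch below estimates how many tokens remain for generation once the prompt is counted. It assumes that prompt tokens and generated tokens share the same 1024-token window, which this page does not state explicitly; `remaining_tokens` is a hypothetical helper name.

```shell
# Context length of this model build, taken from the table above.
CONTEXT_LENGTH=1024

# remaining_tokens PROMPT_TOKENS
# Prints how many tokens are left for generation, assuming the prompt and
# the generated tokens share one context window (an assumption).
remaining_tokens() {
    prompt_tokens=$1
    if [ "$prompt_tokens" -gt "$CONTEXT_LENGTH" ]; then
        echo "prompt exceeds context length" >&2
        return 1
    fi
    echo $((CONTEXT_LENGTH - prompt_tokens))
}
```

For example, `remaining_tokens 17` prints the budget left after a 17-token prompt.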

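Supported devices

Tip

Refer to the SoC architecture reference table to look up the DSP architecture of your device's SoC.

This example supports Qualcomm platform SoCs with the v73 DSP architecture.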

dsp_arch: v73

Tested devices

Device	SoC	dsp_arch
Fogwise® AIRbox Q900	QCS9075	v73
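
A small lookup can make the compatibility check explicit before anything is installed. This is a minimal sketch that covers only the SoCs named on this page; the QCS6490 to v68 pairing is inferred from the package names in the install step and is an assumption, and both function names are hypothetical.

```shell
# soc_dsp_arch SOC
# Prints the DSP architecture for a SoC name. Only SoCs named in this
# document are listed; QCS6490 -> v68 is inferred from the package names
# used in the install step (an assumption).
soc_dsp_arch() {
    case "$1" in
        QCS9075) echo v73 ;;
        QCS6490) echo v68 ;;
        *) echo "unknown SoC: $1" >&2; return 1 ;;
    esac
}

# supports_example SOC
# Succeeds only when the SoC uses the v73 DSP architecture.
supports_example() {
    [ "$(soc_dsp_arch "$1" 2>/dev/null)" = "v73" ]
}
```

Usage: `supports_example QCS9075 && echo "supported"`.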

Download the qcom-qairt dependencies

Run the command that matches your SoC.

QCS6490

Device

sudo apt install qcom-qnn-sdk-v68 qcom-genie-sdk-v68

QCS9075

Device

sudo apt install qcom-qnn-sdk-v73 qcom-genie-sdk-v73
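
The two install commands differ only in the DSP architecture suffix, so the package names can be derived from it. The helper below is a sketch that follows the naming pattern of the commands above; `qairt_packages` is a hypothetical name, and only the v68 and v73 suffixes shown on this page are accepted.

```shell
# qairt_packages DSP_ARCH
# Prints the two package names for a DSP architecture, following the
# naming pattern of the install commands in this document.
qairt_packages() {
    case "$1" in
        v68|v73) echo "qcom-qnn-sdk-$1 qcom-genie-sdk-$1" ;;
        *) echo "unsupported dsp_arch: $1" >&2; return 1 ;;
    esac
}

# Example (not run here):
#   sudo apt install $(qairt_packages v73)
```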

Set the environment variable

Device

export ADSP_LIBRARY_PATH=/usr/lib/aarch64-linux-gnu
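
A mistyped path here would only surface later, so it can help to check the directory before exporting it. The following is a minimal sketch; `set_adsp_path` is a hypothetical helper, and the default is the path used above.

```shell
# set_adsp_path [DIR]
# Exports ADSP_LIBRARY_PATH only if DIR exists.
# Defaults to the path used in this document.
set_adsp_path() {
    dir=${1:-/usr/lib/aarch64-linux-gnu}
    if [ ! -d "$dir" ]; then
        echo "directory not found: $dir" >&2
        return 1
    fi
    export ADSP_LIBRARY_PATH="$dir"
}
```

Because `export` only affects the current shell, the function needs to be defined and called in the same session that later runs the inference command.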

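Download the model

Tip

Install the modelscope Python package inside a Python virtual environment. For details on using virtual environments, see Python Virtual Environment Usage.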

Device

pip3 install modelscope
modelscope download --model radxa/MiniCPM-1B-sft-w4a16-1024-v73 --local_dir ./MiniCPM-1B-sft-w4a16-1024-v73
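
After the download finishes, it may be worth confirming that the Genie config file referenced by the inference commands is present. This sketch assumes the config sits at the top level of the downloaded directory, which the later `cd` and `-c` steps suggest but do not state; `check_model_dir` is a hypothetical helper.

```shell
# check_model_dir [DIR]
# Prints the path of the Genie config file if it exists in DIR.
# Assumes the config lives at the top level of the download directory.
check_model_dir() {
    dir=${1:-./MiniCPM-1B-sft-w4a16-1024-v73}
    config=$dir/minicpm-1b-htp-228.json
    if [ ! -f "$config" ]; then
        echo "missing config: $config" >&2
        return 1
    fi
    echo "$config"
}
```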

Model inference

Device

cd MiniCPM-1B-sft-w4a16-1024-v73

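Build the prompt

The prompt can be passed either as a file or as a command-line argument.

prompt: pass the following string directly as an argument

<s><user>What is the most popular cookie in the world?</user><assistant>

prompt_file: save the same string to a file, for example chat.txt

Device

vim chat.txt

<s><user>What is the most popular cookie in the world?</user><assistant>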
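
The prompt is the user question wrapped in fixed chat markers, so it can be generated rather than typed by hand. The sketch below reproduces the single-turn format shown above; `build_prompt` is a hypothetical helper, and multi-turn prompts are not covered because this page does not show their format.

```shell
# build_prompt QUESTION
# Wraps a single user question in the chat markers used in this document.
build_prompt() {
    printf '<s><user>%s</user><assistant>' "$1"
}

# Example: write the prompt to chat.txt without opening an editor.
#   build_prompt 'What is the most popular cookie in the world?' > chat.txt
```

Using `printf` with `%s` keeps characters such as `%` or backslashes in the question from being interpreted.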

Run inference

prompt

Device

genie-t2t-run -c minicpm-1b-htp-228.json -p '<s><user>What is the most popular cookie in the world?</user><assistant>'

prompt_file

Device

genie-t2t-run -c minicpm-1b-htp-228.json --prompt_file chat.txt
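
Both invocations can be wrapped in one function that picks the flag based on its argument. This is a sketch only: `run_prompt`, `GENIE_BIN`, and `GENIE_CONFIG` are hypothetical names introduced here, and the function uses just the `-c`, `-p`, and `--prompt_file` options shown above.

```shell
# GENIE_BIN can be overridden (for example with "echo") to preview the command.
GENIE_BIN=${GENIE_BIN:-genie-t2t-run}
GENIE_CONFIG=${GENIE_CONFIG:-minicpm-1b-htp-228.json}

# run_prompt PROMPT_OR_FILE
# Uses --prompt_file when the argument is a readable file, otherwise -p.
run_prompt() {
    if [ -f "$1" ] && [ -r "$1" ]; then
        "$GENIE_BIN" -c "$GENIE_CONFIG" --prompt_file "$1"
    else
        "$GENIE_BIN" -c "$GENIE_CONFIG" -p "$1"
    fi
}
```

Usage: `run_prompt chat.txt` or `run_prompt '<s><user>Hello</user><assistant>'`, run from the model directory.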

(.venv) rock@radxa-airbox-q900:/mnt/ssd/qualcomm/MiniCPM-1B-sft$ genie-t2t-run -c minicpm-1b-htp-228.json -p '<s><user>What is the most popular cookie in the world?</user><assistant>'
Using libGenie.so version 1.14.0

/prj/qct/webtech_scratch20/mlg_user_admin/qaisw_source_repo/rel/qairt-2.42.0/release/snpe_src/avante-tools/prebuilt/dsp/hexagon-sdk-5.5.5/ipc/fastrpc/rpcmem/src/rpcmem_android.c:38:dummy call to rpcmem_init, rpcmem APIs will be used from libxdsprpc
[INFO]  "Using create From Binary"
[INFO]  "Allocated total size = 207163392 across 1 buffers"
[PROMPT]: <s><user>What is the most popular cookie in the world?</user><assistant>

[BEGIN]: Themostpopularcookieintheworldislikelytobechocolatechipcookies,whichoriginatedintheUnitedStatesinthe1950s.[END]
/prj/qct/webtech_scratch20/mlg_user_admin/qaisw_source_repo/rel/qairt-2.42.0/release/snpe_src/avante-tools/prebuilt/dsp/hexagon-sdk-5.5.5/ipc/fastrpc/rpcmem/src/rpcmem_android.c:42:dummy call to rpcmem_deinit, rpcmem APIs will be used from libxdsprpc
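
The answer is surrounded by loader messages, so scripts usually want only the text between the `[BEGIN]:` and `[END]` markers. The filter below is a minimal sketch that assumes the answer sits on a single line, as in the log above; multi-line answers are not handled, and `extract_answer` is a hypothetical name.

```shell
# extract_answer
# Reads genie-t2t-run output on stdin and prints the text between
# "[BEGIN]: " and "[END]". Assumes the answer is on one line.
extract_answer() {
    sed -n 's/^\[BEGIN]: \(.*\)\[END]$/\1/p'
}

# Example (not run here):
#   genie-t2t-run -c minicpm-1b-htp-228.json --prompt_file chat.txt | extract_answer
```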

Performance reference

Use the --profile option to enable performance profiling.

genie-t2t-run -c minicpm-1b-htp-228.json --prompt_file chat.txt --profile profile.txt

Fogwise® AIRbox Q900
GenieDialog_create	977,912 us
num-prompt-tokens	17
prompt-processing-rate	47.047752380371094 toks/sec
time-to-first-token	361,345 us
num-generated-tokens	14
token-generation-rate	41.98681640625 toks/sec
token-generation-time	333,439 us
GenieDialog_free	90,825 us
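
The two rates in the table appear consistent with dividing a token count by the matching time: 17 prompt tokens over 361,345 us, and 14 generated tokens over 333,439 us. The sketch below reproduces that arithmetic as a sanity check. That Genie computes its rates exactly this way is an assumption, and the reported values differ slightly in the later decimals; `toks_per_sec` is a hypothetical helper.

```shell
# toks_per_sec TOKENS MICROSECONDS
# Prints tokens per second, rounded to two decimals.
toks_per_sec() {
    awk -v n="$1" -v us="$2" 'BEGIN {
        if (us <= 0) exit 1          # avoid division by zero
        printf "%.2f\n", n / (us / 1000000)
    }'
}
```

Usage: `toks_per_sec 17 361345` for prompt processing and `toks_per_sec 14 333439` for generation.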

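Metric definitions

Metric	Meaning
GenieDialog_create	Time to initialize a dialog object, including model loading, context preparation, and memory allocation.
num-prompt-tokens	Number of tokens in the prompt passed to the model, i.e. the number of smallest units the input text is split into.
prompt-processing-rate	Speed at which the model processes the input prompt, in tokens per second (toks/sec); indicates how efficiently the model ingests the prompt before generating output.
time-to-first-token	Time from the start of processing until the first output token is produced; reflects response latency.
num-generated-tokens	Number of tokens the model actually produced in this run, i.e. the length of the generated text in tokens.
token-generation-rate	Speed at which the model generates tokens, in tokens per second (toks/sec); reflects generation throughput.
token-generation-time	Total time spent generating all output tokens, typically in microseconds (us).
GenieDialog_free	Time to release the dialog object, including freeing memory and cleaning up resources.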
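
For a rough end-to-end figure, the four timed phases can be added together. This sketch assumes the phases run back to back and do not overlap, which this page does not confirm, so the sum is an estimate only; `total_us` is a hypothetical helper that also strips the thousands separators used in the table.

```shell
# total_us VALUE...
# Sums microsecond values that may contain thousands separators.
total_us() {
    total=0
    for v in "$@"; do
        v=$(printf '%s' "$v" | tr -d ',')   # "977,912" -> "977912"
        total=$((total + v))
    done
    echo "$total"
}
```

Usage: `total_us 977,912 361,345 333,439 90,825` sums create, time-to-first-token, generation, and free times from the reference run.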

Genie official documentation

For a deeper look at how to use Qualcomm® Genie and its detailed API, see:

Genie official documentation
