Qwen1.5-1.8B-Chat
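This document describes how to run inference on the Qwen1.5-1.8B-Chat model on Qualcomm platforms, using Qualcomm® Genie for NPU hardware acceleration.
Model Details
| Model | Quantization | Context Length |
|---|---|---|
| Qwen1.5-1.8B-Chat | W4A16 | 1024 |
Supported Devices
Tip
Refer to the SoC architecture reference table to look up the DSP architecture of your device's SoC.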
This example supports Qualcomm SoCs with the v73 DSP architecture (dsp_arch: v73).
Target Device
| Device | SoC | dsp_arch |
|---|---|---|
| Fogwise® AIRbox Q900 | QCS9075 | v73 |
Install the qcom-qairt Dependencies
- QCS6490
Device
sudo apt install qcom-qnn-sdk-v68 qcom-genie-sdk-v68
- QCS9075
Device
sudo apt install qcom-qnn-sdk-v73 qcom-genie-sdk-v73
Set Environment Variables
Device
export ADSP_LIBRARY_PATH=/usr/lib/aarch64-linux-gnu
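Before moving on, it can help to confirm that the environment is set up as this guide expects. The following is a minimal sketch, not part of the official tooling: the helper names `check_dir_var` and `check_command` are hypothetical. It only checks that `ADSP_LIBRARY_PATH` points to an existing directory and that `genie-t2t-run` is on `PATH`.

```shell
# Preflight sketch (hypothetical helpers, not part of the Qualcomm SDK).

# Returns 0 when the variable named in $1 is set and points to a directory.
check_dir_var() {
  # Indirect lookup of the variable named in $1 (POSIX-compatible).
  eval "_value=\${$1:-}"
  if [ -z "$_value" ]; then
    echo "missing: $1 is not set" >&2
    return 1
  fi
  if [ ! -d "$_value" ]; then
    echo "missing: $1=$_value is not a directory" >&2
    return 1
  fi
  echo "ok: $1=$_value"
}

# Returns 0 when the command named in $1 is found on PATH.
check_command() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1 found"
  else
    echo "missing: $1 not found on PATH" >&2
    return 1
  fi
}

# Report both checks without aborting the shell on failure.
check_dir_var ADSP_LIBRARY_PATH || true
check_command genie-t2t-run || true
```

If either check reports `missing`, revisit the dependency installation and the `export` step above before running inference.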
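Download the Model
Tip
Install the modelscope Python package inside a Python virtual environment. For details on using virtual environments, see Python Virtual Environment Usage.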
Device
pip3 install modelscope
modelscope download --model radxa/Qwen1.5-1.8B-Chat-w4a16-1024-v73 --local_dir ./Qwen1.5-1.8B-Chat-w4a16-1024-v73
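A quick way to confirm the download is complete is to check for the Genie config file that the inference commands in this guide pass with `-c`. This is a small sketch: the helper name `has_genie_config` is hypothetical, and it assumes the config sits at the top level of the downloaded directory, as the later `cd` and `-c` steps suggest.

```shell
# Directory and config names are taken from the commands in this guide.
model_dir="./Qwen1.5-1.8B-Chat-w4a16-1024-v73"
config_name="qwen1.5-1.8b-chat-htp.json"

# Succeeds when the Genie config file exists inside the model directory.
has_genie_config() {
  [ -f "$1/$2" ]
}

if has_genie_config "$model_dir" "$config_name"; then
  echo "ok: found $model_dir/$config_name"
else
  echo "not found: $model_dir/$config_name (re-run the download?)" >&2
fi
```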
Model Inference
Device
cd Qwen1.5-1.8B-Chat-w4a16-1024-v73
Build the Prompt
The prompt can be passed either as a command-line argument or as a file.
- prompt
<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nGive me a short introduction to large language model.<|im_end|>\n<|im_start|>assistant
- prompt_file
Device
vim chat.txt
<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nGive me a short introduction to large language model.<|im_end|>\n<|im_start|>assistant
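Typing the chat template by hand is error-prone, so a small helper can assemble it from a system message and a user message. The sketch below uses a hypothetical function name, `build_prompt`, and reproduces the single-turn string shown above. It keeps each `\n` as a literal backslash followed by `n`, exactly as written in this guide, and makes no assumption about how Genie interprets it.

```shell
# Build a single-turn prompt in the template used by this guide.
# $1 = system message, $2 = user message.
build_prompt() {
  system_msg="$1"
  user_msg="$2"
  # "\\n" inside double quotes yields a literal backslash-n, and
  # printf '%s' prints its argument without interpreting escapes.
  printf '%s' "<|im_start|>system\\n${system_msg}<|im_end|>\\n<|im_start|>user\\n${user_msg}<|im_end|>\\n<|im_start|>assistant"
}

# Write the prompt file used in the prompt_file variant.
build_prompt "You are a helpful assistant." \
  "Give me a short introduction to large language model." > chat.txt
```

The resulting `chat.txt` matches the file content shown above, without a trailing newline.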
Run Inference
- prompt
Device
genie-t2t-run -c qwen1.5-1.8b-chat-htp.json -p '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nGive me a short introduction to large language model.<|im_end|>\n<|im_start|>assistant'
- prompt_file
Device
genie-t2t-run -c qwen1.5-1.8b-chat-htp.json --prompt_file chat.txt
(.venv) rock@radxa-airbox-q900:/mnt/ssd/qualcomm/Qwen1.5-1.8B-Chat$ genie-t2t-run -c qwen1.5-1.8b-chat-htp.json -p '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nGive me a short introduction to large language model.<|im_end|>\n<|im_start|>assistant'
Using libGenie.so version 1.14.0
/prj/qct/webtech_scratch20/mlg_user_admin/qaisw_source_repo/rel/qairt-2.42.0/release/snpe_src/avante-tools/prebuilt/dsp/hexagon-sdk-5.5.5/ipc/fastrpc/rpcmem/src/rpcmem_android.c:38:dummy call to rpcmem_init, rpcmem APIs will be used from libxdsprpc
[INFO] "Using create From Binary"
[INFO] "Allocated total size = 426774528 across 8 buffers"
[PROMPT]: <|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nGive me a short introduction to large language model.<|im_end|>\n<|im_start|>assistant
[BEGIN]:
Large language model (LLM) is a type of artificial intelligence (AI) that is designed to process and generate human-like language. These models are typically large, complex systems that are trained on vast amounts of text data, including books, articles, and other written materials. They use advanced algorithms and statistical techniques to analyze and understand the structure and meaning of language, allowing them to generate text that is both coherent and contextually appropriate.
The main purpose of an LLM is to process natural language inputs from users, such as questions, statements, or commands, and generate human-like responses. These responses can be in the form of text, speech, or other forms of output, and they are often used in a wide range of applications, including chatbots, virtual assistants, language translation, text summarization, and more.
There are several different types of LLMs, each with its own strengths and weaknesses. Some examples include:
1. Recurrent Neural Networks (RNNs): These are a type of neural network that are designed to process sequences of data, such as sentences or words. They use a feedback loop to process the previous words in a sentence and use this information to generate the next word. RNNs are particularly effective at processing natural language, and they have been used in a wide range of LLM applications, including language translation and text generation.
2. Transformer Models: These are a type of neural network that are designed to process sequences of data, but they are particularly effective at processing long sequences of words. They use a self-attention mechanism to process the information in each word and generate a sequence of words that is coherent and contextually appropriate. Transformer models have been used in a wide range of LLM applications, including language translation and text generation.
3. Seq2Seq Models: These are a type of LLM that are designed to process sequences of data, such as sentences or paragraphs. They use a two-layer feedforward neural network to process the input sequence and generate the output sequence. Seq2Seq models are often used in tasks that involve processing large amounts of text, such as sentiment analysis or text summarization.
4. Seq3Seq Models: These are a type of LLM that are designed to process sequences of data, such as sentences or paragraphs. They use a three-layer feedforward neural network to process the input sequence and generate the output sequence. Seq3Seq models are often used in tasks that involve processing large amounts of text, such as text summarization or question answering.
Overall, the key to developing an effective LLM is to use advanced algorithms and statistical techniques to analyze and understand the structure and meaning of language. By training an LLM on large amounts of text data, developers can create models that are capable of generating text that is both coherent and contextually appropriate, and that can be used in a wide range of applications.[END]
/prj/qct/webtech_scratch20/mlg_user_admin/qaisw_source_repo/rel/qairt-2.42.0/release/snpe_src/avante-tools/prebuilt/dsp/hexagon-sdk-5.5.5/ipc/fastrpc/rpcmem/src/rpcmem_android.c:42:dummy call to rpcmem_deinit, rpcmem APIs will be used from libxdsprpc
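In the output above, the generated text sits between the `[BEGIN]:` and `[END]` markers, surrounded by log lines. When only the answer is needed, a small filter can pull it out. This is a sketch based solely on the markers visible in this sample output; the function name `extract_response` is hypothetical, and other Genie versions may format their output differently.

```shell
# Print only the text between the [BEGIN]: and [END] markers read from stdin.
extract_response() {
  awk '
    # Skip lines until the [BEGIN]: marker, then keep any text after it.
    !capturing {
      idx = index($0, "[BEGIN]:")
      if (idx == 0) next
      capturing = 1
      $0 = substr($0, idx + length("[BEGIN]:"))
      sub(/^ /, "")
      if ($0 == "") next
    }
    # Stop at the [END] marker, printing any text that precedes it.
    {
      idx = index($0, "[END]")
      if (idx > 0) {
        if (idx > 1) print substr($0, 1, idx - 1)
        exit
      }
      print
    }
  '
}

# Demo on a short stand-in for real output; prints: Hello.
printf '%s\n' '[PROMPT]: hi' '[BEGIN]:' 'Hello.[END]' | extract_response
```

On the device, the same filter could be applied by piping the output of the `genie-t2t-run` command through `extract_response`.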
Performance Reference
Use the --profile option to enable performance profiling.
genie-t2t-run -c qwen1.5-1.8b-chat-htp.json --prompt_file chat.txt --profile profile.txt
| Metric | Fogwise® AIRbox Q900 |
|---|---|
| GenieDialog_create | 1,636,745 us |
| num-prompt-tokens | 29 |
| prompt-processing-rate | 62.19292068481445 toks/sec |
| time-to-first-token | 466,319 us |
| num-generated-tokens | 349 |
| token-generation-rate | 33.35779571533203 toks/sec |
| token-generation-time | 10,462,572 us |
| GenieDialog_free | 202,542 us |
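Metric Definitions
| Metric | Meaning |
|---|---|
| GenieDialog_create | Time to initialize a dialog object, including model loading, context preparation, and memory allocation. |
| num-prompt-tokens | Number of tokens in the prompt passed to the model for this run, i.e. the number of smallest units the input text is split into. |
| prompt-processing-rate | Speed at which the model processes the input prompt, in tokens per second (toks/sec). It indicates how efficiently the model analyzes the prompt before generating output. |
| time-to-first-token | Time from the start of processing until the first output token is generated. It reflects the model's response latency. |
| num-generated-tokens | Number of tokens the model actually output in this run, i.e. the length of the generated text in tokens. |
| token-generation-rate | Speed at which the model generates tokens, in tokens per second (toks/sec). It reflects generation efficiency. |
| token-generation-time | Total time spent generating all output tokens, typically in microseconds (us). |
| GenieDialog_free | Time to release the dialog object, including freeing memory and cleaning up resources. |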
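The two rates can be cross-checked against the token counts and times in the profile table. The sketch below assumes that prompt-processing-rate is roughly num-prompt-tokens divided by time-to-first-token, and that token-generation-rate is num-generated-tokens divided by token-generation-time. These formulas are inferred from the metric descriptions, not taken from the Genie documentation, and `rate` is a hypothetical helper name.

```shell
# Values copied from the Fogwise® AIRbox Q900 profile table (times in us).
prompt_tokens=29
ttft_us=466319
generated_tokens=349
generation_us=10462572

# rate (toks/sec) = tokens / (time in microseconds / 1,000,000)
rate() {
  awk -v tokens="$1" -v us="$2" 'BEGIN { printf "%.2f", tokens / (us / 1000000) }'
}

echo "prompt rate:     $(rate "$prompt_tokens" "$ttft_us") toks/sec"
echo "generation rate: $(rate "$generated_tokens" "$generation_us") toks/sec"
```

This prints 62.19 and 33.36 toks/sec, which agrees with the reported 62.19 and 33.36 toks/sec to two decimal places. The small differences in later digits suggest the tool measures slightly different intervals internally.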
Official Genie Documentation
To learn more about using Qualcomm® Genie and its detailed API, refer to the official Qualcomm® Genie documentation.