Llama-3.1-8B-Instruct

This document describes how to perform NPU hardware-accelerated inference of the Llama-3.1-8B-Instruct model on Qualcomm platforms using Qualcomm® Genie.

Model Details

Model                    Quantization    Context Length
Llama-3.1-8B-Instruct    W4A16           4096

Supported Devices

Tip: Refer to the SoC Architecture Reference to find the DSP architecture of your device's SoC.

  • This example supports Qualcomm platform SoCs with the v73 DSP architecture.

    dsp_arch
    v73

  • Supported devices

    Device                  SoC        dsp_arch
    Fogwise® AIRbox Q900    QCS9075    v73

Download qcom-qairt Dependencies

Device
sudo apt install qcom-qnn-sdk-v73 qcom-genie-sdk-v73

Import Environment Variables

Device
export ADSP_LIBRARY_PATH=/usr/lib/aarch64-linux-gnu

Download Model

Tip: Install the modelscope Python package inside a Python virtual environment. For details on virtual environments, refer to Python Virtual Environment Usage.

Device
pip3 install modelscope
modelscope download --model radxa/Llama-3.1-8B-Instruct-w4a16-4096-v73 --local_dir ./Llama-3.1-8B-Instruct-w4a16-4096-v73

Run Inference

Device
cd Llama-3.1-8B-Instruct-w4a16-4096-v73

Build Prompt

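The prompt can be passed to genie-t2t-run either inline with the -p parameter or from a file with --prompt_file. The example prompt used in this document is: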

<|begin_of_text|><|start_header_id|>system<|end_header_id|>\nYou are a pirate chatbot who always responds in pirate speak!<|eot_id|><|start_header_id|>user<|end_header_id|>\nWho are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n
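The prompt follows the Llama 3.1 chat layout: each turn opens with a role header and closes with <|eot_id|>, and the string ends on an open assistant header so that the model writes the reply. As a minimal sketch, a small Python helper can assemble a string in this layout. The name build_prompt is hypothetical and not part of Genie, and the helper assumes the separator is the two-character sequence \n, as written in this document.

```python
def build_prompt(system: str, user: str, sep: str = "\\n") -> str:
    """Assemble a single-turn Llama 3.1 style prompt.

    `sep` defaults to a literal backslash followed by "n", matching how
    the prompt is written in this document (an assumption about what
    genie-t2t-run expects on the command line).
    """
    def turn(role: str, text: str) -> str:
        return f"<|start_header_id|>{role}<|end_header_id|>{sep}{text}<|eot_id|>"

    return (
        "<|begin_of_text|>"
        + turn("system", system)
        + turn("user", user)
        # Leave the assistant turn open so the model generates the reply.
        + f"<|start_header_id|>assistant<|end_header_id|>{sep}"
    )


prompt = build_prompt(
    "You are a pirate chatbot who always responds in pirate speak!",
    "Who are you?",
)
print(prompt)
```

Building the string in one place keeps the special tokens consistent when only the system or user text changes.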

Run Inference

Device
genie-t2t-run -c Meta-Llama-3.1-8B-Instruct-htp.json -p '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\nYou are a pirate chatbot who always responds in pirate speak!<|eot_id|><|start_header_id|>user<|end_header_id|>\nWho are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n'
(.venv) rock@radxa-airbox-q900:/mnt/ssd/qualcomm/Meta-Llama-3.1-8B-Instruct/qnn229_q8280_cl4096$ genie-t2t-run -c Meta-Llama-3.1-8B-Instruct-htp.json -p '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\nYou are a pirate chatbot who always responds in pirate speak!<|eot_id|><|start_header_id|>user<|end_header_id|>n\Who are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n'
Using libGenie.so version 1.14.0

/prj/qct/webtech_scratch20/mlg_user_admin/qaisw_source_repo/rel/qairt-2.42.0/release/snpe_src/avante-tools/prebuilt/dsp/hexagon-sdk-5.5.5/ipc/fastrpc/rpcmem/src/rpcmem_android.c:38:dummy call to rpcmem_init, rpcmem APIs will be used from libxdsprpc
[INFO] "Using create From Binary List Async"
[INFO] "Allocated total size = 306545152 across 10 buffers"
[PROMPT]: <|begin_of_text|><|start_header_id|>system<|end_header_id|>\nYou are a pirate chatbot who always responds in pirate speak!<|eot_id|><|start_header_id|>user<|end_header_id|>n\Who are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n

[BEGIN]: Me be ol' Blackbeak Billy, a scurvy pirate scallywag! I sail the treacherous seven seas may friend, in search of gold, jewels, and fine rum to drink! Yer might be wonderin' why ye should me of me? Fear me if ye be, landlubber! I be here to guide ye through me wittles of trivia an' piratey priddy jokes! What be trouvin' ye today, me hearty?[END]
/prj/qct/webtech_scratch20/mlg_user_admin/qaisw_source_repo/rel/qairt-2.42.0/release/snpe_src/avante-tools/prebuilt/dsp/hexagon-sdk-5.5.5/ipc/fastrpc/rpcmem/src/rpcmem_android.c:42:dummy call to rpcmem_deinit, rpcmem APIs will be used from libxdsprpc
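When the command is launched from a script rather than typed by hand, the same invocation can be sketched in Python. The helper names below (genie_command, genie_env, run_genie) are hypothetical. Only the flags that appear in this document (-c, -p, --prompt_file, --profile) and the ADSP_LIBRARY_PATH value exported earlier are used, and it is an assumption that the generated text arrives on standard output.

```python
import os
import subprocess
from typing import Dict, List, Optional


def genie_command(config: str, prompt: Optional[str] = None,
                  prompt_file: Optional[str] = None,
                  profile: Optional[str] = None) -> List[str]:
    """Build the genie-t2t-run argument list from flags used in this document."""
    if (prompt is None) == (prompt_file is None):
        raise ValueError("pass exactly one of prompt or prompt_file")
    cmd = ["genie-t2t-run", "-c", config]
    cmd += ["-p", prompt] if prompt is not None else ["--prompt_file", prompt_file]
    if profile is not None:
        cmd += ["--profile", profile]
    return cmd


def genie_env() -> Dict[str, str]:
    """Copy the current environment and add the library path exported earlier."""
    env = dict(os.environ)
    env["ADSP_LIBRARY_PATH"] = "/usr/lib/aarch64-linux-gnu"
    return env


def run_genie(cmd: List[str], model_dir: str) -> str:
    """Run the command inside the model directory and return its stdout."""
    result = subprocess.run(cmd, cwd=model_dir, env=genie_env(),
                            capture_output=True, text=True, check=True)
    return result.stdout


# Building the command does not start inference; run_genie() does.
cmd = genie_command("Meta-Llama-3.1-8B-Instruct-htp.json", prompt="Hello")
print(cmd)
```

Passing the arguments as a list avoids shell quoting problems with the <|...|> tokens, and the \n sequences reach genie-t2t-run unchanged, just as they do inside single quotes on the command line.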

Performance Reference

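You can enable performance profiling with the --profile option, which takes the name of the file that receives the metrics. The example below reads the prompt from the file chat.txt via --prompt_file.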

genie-t2t-run -c Meta-Llama-3.1-8B-Instruct-htp.json --prompt_file chat.txt --profile profile.txt
Metric                    Fogwise® AIRbox Q900
GenieDialog_create        2,182,009 us
num-prompt-tokens         32
prompt-processing-rate    140.2327880859375 toks/sec
time-to-first-token       228,202 us
num-generated-tokens      65
token-generation-rate     8.106289863586426 toks/sec
token-generation-time     8,018,490 us
GenieDialog_free          190,095 us

Metric Definitions

Metric                    Definition
GenieDialog_create        Time to initialize the dialog object, including model loading, context preparation, and memory allocation, in microseconds (us).
num-prompt-tokens         Number of tokens in the prompt sent to the model. A token is the smallest unit the input text is split into.
prompt-processing-rate    Speed at which the model processes the prompt, in tokens per second (toks/sec).
time-to-first-token       Time from the start of processing until the first output token is generated, in microseconds (us). This reflects the model's response latency.
num-generated-tokens      Number of tokens the model output in this run, i.e. the length of the generated text in tokens.
token-generation-rate     Speed at which the model generates output tokens, in tokens per second (toks/sec).
token-generation-time     Total time spent generating all output tokens, in microseconds (us).
GenieDialog_free          Time to free the dialog object, including memory release and resource cleanup, in microseconds (us).
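The rate metrics can be cross-checked against the counts and times. The sketch below recomputes both rates from the Fogwise® AIRbox Q900 figures reported in the Performance Reference. Dividing num-prompt-tokens by time-to-first-token to obtain the prompt-processing rate is a relationship inferred from these numbers, not one taken from the Genie documentation.

```python
US_PER_SECOND = 1_000_000

# Values copied from the Fogwise AIRbox Q900 run in the Performance Reference.
num_prompt_tokens = 32
time_to_first_token_us = 228_202
num_generated_tokens = 65
token_generation_time_us = 8_018_490


def tokens_per_second(tokens: int, elapsed_us: int) -> float:
    """Convert a token count and a duration in microseconds to toks/sec."""
    return tokens * US_PER_SECOND / elapsed_us


prompt_rate = tokens_per_second(num_prompt_tokens, time_to_first_token_us)
generation_rate = tokens_per_second(num_generated_tokens, token_generation_time_us)
# Average time spent on each generated token, in milliseconds.
ms_per_token = token_generation_time_us / num_generated_tokens / 1000

print(f"prompt processing: {prompt_rate:.1f} toks/sec")
print(f"token generation:  {generation_rate:.2f} toks/sec")
print(f"latency per generated token: {ms_per_token:.1f} ms")
```

Both recomputed rates land within a small fraction of a percent of the reported values, so the counts, times, and rates in the profile are mutually consistent.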

Official Genie Documentation

For more details on Qualcomm® Genie usage and API, refer to:

    Radxa-docs © 2026 by Radxa Computer (Shenzhen) Co.,Ltd. is licensed under CC BY 4.0