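Changelog

1.0.0 (2026-05-18)

Added on-device, offline LLM inference for Android and iOS
Supports GGUF model loading, Chat, Completion, Embedding, and streaming output
Supports loading local models from a static path or an absolute path
Built on native llama.cpp inference, with Metal GPU offload on iOS and native arm64 execution on Android
Provides device info, runtime info, memory estimation, and Benchmark APIs to help evaluate on-device performance
Ships with a UniApp X sample project that uses Qwen2.5-0.5B-Instruct-Q4_K_M for testing by default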

Platform compatibility

uni-app x (5.07)

Chrome	Safari	Android	iOS	HarmonyOS	WeChat Mini Program
×	×	√	√	-	×

em-llama UTS usage guide

em-llama is a UniApp X plugin for local, offline inference built on llama.cpp. It loads GGUF models on Android and iOS and runs Chat, Completion, Embedding, and streaming output on-device.

Basic usage

import {
  loadModel,
  createSession,
  chat,
  chatStream,
  getRuntimeInfo,
  getStreamEventChannel,
  releaseAll,
} from "@/uni_modules/em-llama";

The other APIs used in later examples are also imported from @/uni_modules/em-llama as needed.

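Model download links

The plugin does not download models; the host app should implement downloading itself. The plugin only loads local GGUF files from a static path or an absolute path.

The demo uses Qwen2.5-0.5B-Instruct-Q4_K_M.gguf by default. After downloading, you can rename the file to the name the demo uses:

Qwen2.5-0.5B-Instruct-Q4_K_M.gguf

For static loading, place the file under the UniApp X project directory:

static/assets/models/Qwen2.5-0.5B-Instruct-Q4_K_M.gguf

If the app downloads the model to its sandbox or to external storage, call loadModel with source: "absolute" and pass the absolute path.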
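As a minimal sketch of choosing between the two path types described above, the helper below derives `source` from the shape of the path. The `ModelLocation` type and the `resolveModelLocation` name are hypothetical and not part of the plugin; only the `source` and `modelPath` fields mirror the loadModel options in this guide. It assumes that absolute paths start with `/`, which holds on Android and iOS.

```typescript
// Hypothetical helper, not part of em-llama: only `source` and `modelPath`
// mirror the loadModel options described in this guide.
type ModelSource = "static" | "absolute";

interface ModelLocation {
  source: ModelSource;
  modelPath: string;
}

// Assumption: on Android and iOS an absolute path starts with "/";
// anything else is treated as a path inside the project's static directory.
function resolveModelLocation(path: string): ModelLocation {
  const trimmed = path.trim();
  if (trimmed.length === 0) {
    throw new Error("model path must not be empty");
  }
  const source: ModelSource = trimmed.charAt(0) === "/" ? "absolute" : "static";
  return { source, modelPath: trimmed };
}
```

The result can be spread into the options passed to loadModel, for example `loadModel({ ...resolveModelLocation(path), contextSize: 4096 })`.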

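Which models can be used

Any GGUF text model supported by the current llama.cpp can usually be loaded by this plugin. Prefer common quantizations such as Q4_K_M, Q5_K_M, and Q8_0.

Commonly used model families:

Qwen / Qwen2 / Qwen2.5 / Qwen3
Llama 2 / Llama 3 / Llama 3.1 / Llama 3.2
Mistral / Ministral / Mixtral
Phi-3 / Phi-3.5 / Phi-4 mini
Text GGUF builds of Gemma / Gemma 2 / Gemma 3
DeepSeek-R1 Distill series GGUF
TinyLlama, SmolLM, Yi, and other GGUF text models supported by llama.cpp

Models that cannot be used directly:

Original Hugging Face safetensors, bin, or pt files.
Non-GGUF formats such as ONNX, TFLite, CoreML, and MLX.
An Ollama Modelfile by itself; use the GGUF model file it actually references.
Image input for multimodal models is currently outside the plugin API; if only the text GGUF is loaded, the model can be used as a text model only.

Below are 5 llama.cpp / GGUF models that are good first tests. On phones, start with 0.5B to 3B models; larger models need more memory, generate more heat, and take longer to produce the first token.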

#	Model	Recommended file	Download URL
1	Qwen2.5-0.5B-Instruct	`qwen2.5-0.5b-instruct-q4_k_m.gguf`	`https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF/resolve/main/qwen2.5-0.5b-instruct-q4_k_m.gguf`
2	Qwen2.5-1.5B-Instruct	`qwen2.5-1.5b-instruct-q4_k_m.gguf`	`https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF/resolve/main/qwen2.5-1.5b-instruct-q4_k_m.gguf`
3	Qwen2.5-3B-Instruct	`qwen2.5-3b-instruct-q4_k_m.gguf`	`https://huggingface.co/Qwen/Qwen2.5-3B-Instruct-GGUF/resolve/main/qwen2.5-3b-instruct-q4_k_m.gguf`
4	Llama-3.2-1B-Instruct	`Llama-3.2-1B-Instruct-Q4_K_M.gguf`	`https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_K_M.gguf`
5	Phi-3.5-mini-instruct	`Phi-3.5-mini-instruct-Q4_K_M.gguf`	`https://huggingface.co/bartowski/Phi-3.5-mini-instruct-GGUF/resolve/main/Phi-3.5-mini-instruct-Q4_K_M.gguf`
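To make the size guidance above more concrete, here is a back-of-the-envelope sketch that estimates a GGUF file size from the parameter count and the average bits per weight. It is not a plugin API, and the function and constant names are hypothetical. The Q8_0 figure follows from its block layout; the Q4_K_M figure is only an approximate assumption and varies by model.

```typescript
// Rough GGUF size estimate: parameters x bits-per-weight / 8.
// Real files also contain metadata and keep some tensors at higher precision,
// so treat the result as a ballpark only.
function estimateModelBytes(parameterCount: number, bitsPerWeight: number): number {
  if (parameterCount <= 0 || bitsPerWeight <= 0) {
    throw new Error("parameterCount and bitsPerWeight must be positive");
  }
  return Math.round((parameterCount * bitsPerWeight) / 8);
}

function toMiB(bytes: number): number {
  return bytes / (1024 * 1024);
}

// Q8_0 stores 32 int8 weights plus one 16-bit scale per block: 8.5 bits/weight.
const Q8_0_BITS_PER_WEIGHT = 8.5;
// Assumption: Q4_K_M averages a little under 5 bits per weight; the exact
// figure depends on the model, so this constant is only illustrative.
const Q4_K_M_BITS_PER_WEIGHT_APPROX = 4.85;

// Example: a 3B-parameter model at the approximate Q4_K_M rate.
const approx3bQ4MiB = toMiB(estimateModelBytes(3e9, Q4_K_M_BITS_PER_WEIGHT_APPROX));
```

The KV cache and runtime buffers come on top of the file size, so the estimateMemory call covered later in this guide should take precedence over this arithmetic.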

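How to download:

# Option 1: open one of the download URLs above in a browser, then place the downloaded file in static/assets/models/

# Option 2: download with curl (-o saves the file under the name the demo uses)
curl -L -o Qwen2.5-0.5B-Instruct-Q4_K_M.gguf \
  "https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF/resolve/main/qwen2.5-0.5b-instruct-q4_k_m.gguf"

# Option 3: download with huggingface-cli
# The file keeps its repository name (qwen2.5-0.5b-instruct-q4_k_m.gguf); rename it to match the demo.
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct-GGUF \
  qwen2.5-0.5b-instruct-q4_k_m.gguf \
  --local-dir static/assets/models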

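Loading a model

The plugin only loads models from local paths and has no built-in download logic. A model can come from a static path or an absolute path.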

const model = await loadModel({
  source: "static",
  modelPath: "static/assets/models/Qwen2.5-0.5B-Instruct-Q4_K_M.gguf",
  contextSize: 4096,
  threads: 4,
  gpuLayers: 0,
});

On iOS, Metal is recommended:

const model = await loadModel({
  source: "static",
  modelPath: "static/assets/models/Qwen2.5-0.5B-Instruct-Q4_K_M.gguf",
  contextSize: 4096,
  threads: 4,
  gpuLayers: -1,
});

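Parameters:

Parameter	Type	Description
`source`	`'static' | 'absolute'`	Model path type
`modelPath`	`string`	Path to the GGUF model
`modelId`	`string?`	Optional custom model ID
`contextSize`	`number?`	Context length, default 4096
`batchSize`	`number?`	Batch size, default 512
`threads`	`number?`	Number of CPU threads
`gpuLayers`	`number?`	Number of layers to offload to the GPU; on iOS, `-1` offloads as many layers as possible
`useMmap`	`boolean?`	Whether to load with mmap, default true
`useMlock`	`boolean?`	Whether to lock the model in memory, default false
`embedding`	`boolean?`	Whether to load the model in embedding mode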
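The defaults in the table above can be gathered into one place. The sketch below is a hypothetical option builder, not part of the plugin: field names and the documented defaults come from the table, while the platform switch for `gpuLayers`, the `threads` value (taken from this guide's examples), and `embedding: false` are assumptions of this sketch.

```typescript
// Hypothetical option builder, not part of em-llama. The plugin may apply
// its own defaults internally; this only makes them explicit in app code.
type TargetPlatform = "ios" | "android";

interface LoadOptions {
  source: "static" | "absolute";
  modelPath: string;
  contextSize: number;
  batchSize: number;
  threads: number;
  gpuLayers: number;
  useMmap: boolean;
  useMlock: boolean;
  embedding: boolean;
}

function buildLoadOptions(
  platform: TargetPlatform,
  base: { source: "static" | "absolute"; modelPath: string },
  overrides: Partial<LoadOptions> = {},
): LoadOptions {
  const defaults: LoadOptions = {
    source: base.source,
    modelPath: base.modelPath,
    contextSize: 4096, // documented default
    batchSize: 512, // documented default
    threads: 4, // value used in this guide's examples, not a documented default
    gpuLayers: platform === "ios" ? -1 : 0, // -1: offload as much as possible via Metal
    useMmap: true, // documented default
    useMlock: false, // documented default
    embedding: false, // assumption: off unless requested
  };
  // Caller-supplied values win over the defaults.
  return { ...defaults, ...overrides };
}
```

Keeping the platform decision in one function avoids scattering `gpuLayers` values across call sites; the result can be passed directly to loadModel.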

Absolute path example:

const model = await loadModel({
  source: "absolute",
  modelPath: "/storage/emulated/0/models/Qwen2.5-0.5B-Instruct-Q4_K_M.gguf",
});

Sessions

Create a session before generating:

const session = await createSession({
  modelId: model.modelId,
  systemPrompt: "You are an AI assistant running locally on a phone.",
});

Release the session:

await releaseSession(session.sessionId);

Chat

const result = await chat({
  sessionId: session.sessionId,
  messages: [{ role: "user", content: "Describe on-device LLMs in one sentence." }],
  maxTokens: 128,
  temperature: 0.3,
  repeatPenalty: 1.1,
  stop: ["<|im_end|>", "<|endoftext|>"],
});

console.log(result.text);
console.log(result.usage.tokensPerSecond);

messages supports the system, user, assistant, and tool roles.
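A small guard can catch malformed message lists before they reach the native layer. The sketch below is hypothetical and not part of the plugin; only the four role names come from this guide. Whether a session already keeps earlier turns is not stated here, so `appendUserTurn` is only useful if the app manages the history itself.

```typescript
// Hypothetical message helpers; only the four role names come from this guide.
type ChatRole = "system" | "user" | "assistant" | "tool";

interface ChatMessage {
  role: ChatRole;
  content: string;
}

const CHAT_ROLES: ChatRole[] = ["system", "user", "assistant", "tool"];

// Returns a list of problems; an empty list means the roles look well-formed.
function validateMessages(messages: Array<{ role: string; content: string }>): string[] {
  const problems: string[] = [];
  if (messages.length === 0) {
    problems.push("messages must not be empty");
  }
  messages.forEach((message, index) => {
    if (CHAT_ROLES.indexOf(message.role as ChatRole) === -1) {
      problems.push(`messages[${index}]: unknown role "${message.role}"`);
    }
  });
  return problems;
}

// Appends one user turn without mutating the existing history.
function appendUserTurn(history: ChatMessage[], text: string): ChatMessage[] {
  return [...history, { role: "user", content: text }];
}
```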

Completion

const result = await complete({
  sessionId: session.sessionId,
  prompt: "The advantages of on-device inference on phones are",
  maxTokens: 128,
  temperature: 0.7,
});

Streaming output

Streaming APIs return task information; tokens are pushed through a single shared event channel.

const channel = getStreamEventChannel();

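uni.$on(channel, (raw: any) => {
  // Events arrive as JSON strings; see the stream events table for the types.
  const event = JSON.parse(`${raw}`);
  if (event.type == "token") {
    console.log(event.token);
  }
  if (event.type == "done") {
    console.log(event.usage);
    uni.$off(channel);
  }
  // Also stop listening when the task is cancelled, so the listener does not leak.
  if (event.type == "cancelled") {
    uni.$off(channel);
  }
  if (event.type == "error") {
    console.error(event.message);
    uni.$off(channel);
  }
});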

const task = await chatStream({
  sessionId: session.sessionId,
  messages: [{ role: "user", content: "Write a short line of poetry." }],
  maxTokens: 128,
  temperature: 0.7,
  stop: ["<|im_end|>", "<|endoftext|>"],
});

console.log(task.taskId);

Stream events:

`type`	Description
`start`	Task started
`token`	New token; fields are `token` / `text`
`done`	Finished; includes `text` and `usage`
`cancelled`	Cancelled
`error`	Failed; includes `code` and `message`
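The event types above map naturally onto a small state reducer that UI code can consume. This is a sketch under stated assumptions: the event and field names follow the table, but the `StreamState` shape and `reduceStreamEvent` name are hypothetical, and the table does not say whether `text` on a `token` event is the same fragment as `token`, so the sketch reads `token` first and only falls back to `text`.

```typescript
// Hypothetical reducer for the documented stream events; not part of em-llama.
interface StreamState {
  text: string;
  phase: "idle" | "running" | "done" | "cancelled" | "error";
  errorMessage: string | null;
}

const initialStreamState: StreamState = { text: "", phase: "idle", errorMessage: null };

function reduceStreamEvent(state: StreamState, raw: string): StreamState {
  let parsed: any;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { ...state, phase: "error", errorMessage: "malformed event payload" };
  }
  switch (parsed.type) {
    case "start":
      return { text: "", phase: "running", errorMessage: null };
    case "token":
      // Assumption: `token` holds the new fragment; `text` is only a fallback.
      return { ...state, phase: "running", text: state.text + String(parsed.token ?? parsed.text ?? "") };
    case "done":
      // `done` carries the full text; fall back to the accumulated tokens.
      return { ...state, phase: "done", text: typeof parsed.text === "string" ? parsed.text : state.text };
    case "cancelled":
      return { ...state, phase: "cancelled" };
    case "error":
      return { ...state, phase: "error", errorMessage: String(parsed.message ?? "unknown error") };
    default:
      return state; // ignore event types this sketch does not know
  }
}
```

Inside the `uni.$on` callback the reducer would be applied as `state = reduceStreamEvent(state, `${raw}`)`, and the listener removed once `phase` is `done`, `cancelled`, or `error`.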

Cancel a task:

await cancelTask(task.taskId);

Query task status:

const status = await getTaskStatus(task.taskId);

JSON / structured output

const result = await generateJson({
  sessionId: session.sessionId,
  messages: [
    { role: "user", content: "Return a user info object with the name Alice and age 18." },
  ],
  jsonSchema: {
    type: "object",
    properties: {
      name: { type: "string" },
      age: { type: "number" },
    },
    required: ["name", "age"],
  },
  maxTokens: 128,
  temperature: 0.1,
});

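When strict constraints are needed, prefer passing grammar / grammarRoot. The jsonSchema field is reserved for the upper layer to convert a schema to GBNF. On Android, generateJson falls back to a generic JSON grammar when no grammar is passed; on iOS, the app should currently pass a grammar itself or parse and validate the JSON after generation.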
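For the parse-and-validate route mentioned above, a minimal post-generation check might look like the following. It is a hypothetical helper, not part of the plugin, and it only covers the flat "required field with a primitive type" case from the example; it is not a JSON Schema validator.

```typescript
// Hypothetical post-generation check; not part of em-llama.
type PrimitiveType = "string" | "number" | "boolean";

function parseJsonObject(
  text: string,
  required: Record<string, PrimitiveType>,
): { ok: true; value: Record<string, unknown> } | { ok: false; reason: string } {
  // Models often wrap JSON in extra prose; keep the outermost {...} span.
  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  if (start === -1 || end <= start) {
    return { ok: false, reason: "no JSON object found" };
  }

  let value: unknown;
  try {
    value = JSON.parse(text.slice(start, end + 1));
  } catch {
    return { ok: false, reason: "invalid JSON" };
  }
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    return { ok: false, reason: "top-level value is not an object" };
  }

  // Check that every required field exists with the expected primitive type.
  const record = value as Record<string, unknown>;
  for (const key of Object.keys(required)) {
    if (typeof record[key] !== required[key]) {
      return { ok: false, reason: `field "${key}" is missing or not a ${required[key]}` };
    }
  }
  return { ok: true, value: record };
}
```

With the example above this would be called as `parseJsonObject(result.text, { name: "string", age: "number" })`; on failure the app can retry generation or surface the reason.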

Embedding

Enabling embedding when loading the model is recommended:

const model = await loadModel({
  source: "static",
  modelPath: "static/assets/models/Qwen2.5-0.5B-Instruct-Q4_K_M.gguf",
  embedding: true,
});

Single text:

const result = await embedding({
  modelId: model.modelId,
  text: "on-device LLM",
  normalize: true,
});

Batch of texts:

const result = await embeddingBatch({
  modelId: model.modelId,
  texts: ["on-device inference", "cloud inference"],
  normalize: true,
});
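Embeddings are usually compared with cosine similarity. The helpers below are plain vector math, not plugin APIs, and they take `number[]` arguments because the field of the result that holds the vector is not stated in this guide. If `normalize: true` yields unit-length vectors (the usual meaning, though not confirmed here), the dot product alone already equals the cosine similarity.

```typescript
// Plain vector helpers for comparing embeddings; not part of em-llama.
function dot(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}

function cosineSimilarity(a: number[], b: number[]): number {
  const denom = Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b));
  if (denom === 0) {
    throw new Error("cannot compare a zero vector");
  }
  return dot(a, b) / denom;
}

// Returns the index of the candidate most similar to the query.
function mostSimilar(query: number[], candidates: number[][]): number {
  if (candidates.length === 0) {
    throw new Error("no candidates");
  }
  let best = 0;
  let bestScore = -Infinity;
  candidates.forEach((candidate, index) => {
    const score = cosineSimilarity(query, candidate);
    if (score > bestScore) {
      bestScore = score;
      best = index;
    }
  });
  return best;
}
```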

Tokenize / Detokenize

const encoded = await tokenize({
  modelId: model.modelId,
  text: "你好",
  addSpecial: true,
  parseSpecial: true,
});

const decoded = await detokenize({
  modelId: model.modelId,
  tokens: encoded.tokens,
});
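One practical use of tokenize is checking that a prompt plus the requested output fits into `contextSize`. The helper names below are hypothetical and not part of the plugin; the sketch assumes that prompt tokens and generated tokens share one context window, which is how llama.cpp contexts work.

```typescript
// Hypothetical budgeting helpers; not part of em-llama.
// Tokens left in the context window after the prompt (never negative).
function remainingTokens(contextSize: number, promptTokenCount: number): number {
  return Math.max(0, contextSize - promptTokenCount);
}

// Clamps the requested maxTokens so that prompt + output fits into the context.
function clampMaxTokens(contextSize: number, promptTokenCount: number, requested: number): number {
  return Math.min(requested, remainingTokens(contextSize, promptTokenCount));
}
```

Assuming `encoded.tokens` is an array, a call could look like `clampMaxTokens(4096, encoded.tokens.length, 128)`. Chat templates and system prompts add tokens beyond the raw text, so leaving some margin is sensible.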

LoRA

const lora = await loadLora({
  modelId: model.modelId,
  loraPath: "/absolute/path/to/adapter.gguf",
  scale: 0.8,
});

await setLoraScale({
  loraId: lora.loraId,
  scale: 1.0,
});

await unloadLora(lora.loraId);

KV Cache / Session State

Save state:

const state = await saveSessionState({
  sessionId: session.sessionId,
  path: "/absolute/path/to/session.bin",
});

Restore state:

await loadSessionState({
  sessionId: session.sessionId,
  path: state.path,
});

Delete the state file:

await deleteSessionState(state.path);

Benchmark / memory estimation / device info

const runtime = getRuntimeInfo();
const device = await getDeviceInfo();

const estimate = await estimateMemory({
  source: "static",
  modelPath: "static/assets/models/Qwen2.5-0.5B-Instruct-Q4_K_M.gguf",
  contextSize: 4096,
  gpuLayers: -1,
});

const bench = await benchmark({
  modelId: model.modelId,
  prompt: "Describe on-device LLMs",
  generateTokens: 128,
  rounds: 1,
});

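Commonly used getRuntimeInfo() fields:

Field	Description
`backend`	Current backend, e.g. `cpu` / `metal`
`gpuLayers`	Current number of GPU offload layers
`modelLoaded`	Whether a model is loaded
`supportsMetal`	Whether Metal is supported (iOS)
`supportsVulkan`	Whether Vulkan is supported (Android)
`availableBackends`	Backends visible to the native runtime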
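The fields above are enough for a quick sanity check after loading a model. The sketch below is hypothetical: the field names follow the table, but the `RuntimeInfoLike` interface and the hint texts are assumptions and not the plugin's actual typings.

```typescript
// Hypothetical diagnostics over the documented getRuntimeInfo() fields.
interface RuntimeInfoLike {
  backend: string;
  modelLoaded: boolean;
  supportsMetal: boolean;
}

function runtimeHints(info: RuntimeInfoLike, platform: "ios" | "android"): string[] {
  const hints: string[] = [];
  if (!info.modelLoaded) {
    hints.push("no model is loaded; call loadModel first");
  }
  if (platform === "ios" && info.supportsMetal && info.backend !== "metal") {
    hints.push("Metal is available but unused; try gpuLayers: -1");
  }
  if (platform === "ios" && !info.supportsMetal) {
    hints.push("Metal is not supported on this device; expect CPU speeds");
  }
  return hints;
}
```

An empty result means none of the checks in this sketch fired; it does not guarantee good performance, which is what benchmark is for.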

Releasing resources

await releaseAll();

Before switching models, release the old model first, or simply call loadModel again.

FAQ

`model is not loaded`

First check that:

loadModel did not return an error.
createSession uses the modelId returned by loadModel.
Chat / Completion use the sessionId returned by createSession.
getRuntimeInfo().modelLoaded is true.

Slow token speed on iOS

Make sure gpuLayers is set to -1 and that backend in getRuntimeInfo() is metal.

Static model path on Android

Place the model in the static directory of the UniApp X project, for example:

static/assets/models/Qwen2.5-0.5B-Instruct-Q4_K_M.gguf

Load it with:

{
  source: 'static',
  modelPath: 'static/assets/models/Qwen2.5-0.5B-Instruct-Q4_K_M.gguf'
}

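Streaming output produces no callbacks

Register uni.$on(getStreamEventChannel(), callback) first, then call chatStream or completeStream. After the stream finishes or fails, call uni.$off(channel) to remove the listener.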
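The ordering advice above can be illustrated with a tiny in-memory event bus. Everything here is hypothetical: `TinyBus` merely stands in for `uni.$on` / `uni.$emit`, and `startFakeStream` simulates a producer that emits as soon as the stream starts; it does not describe how the plugin is implemented.

```typescript
// Minimal in-memory event bus used only to illustrate listener ordering.
type StreamListener = (payload: string) => void;

class TinyBus {
  private listeners: Record<string, StreamListener[]> = {};

  on(channel: string, listener: StreamListener): void {
    const current = this.listeners[channel] || [];
    current.push(listener);
    this.listeners[channel] = current;
  }

  off(channel: string): void {
    delete this.listeners[channel];
  }

  emit(channel: string, payload: string): void {
    const current = this.listeners[channel] || [];
    current.forEach((listener) => listener(payload));
  }
}

// Simulates a producer that emits immediately when the stream starts.
function startFakeStream(bus: TinyBus, channel: string): void {
  ["start", "token", "done"].forEach((type) => {
    bus.emit(channel, JSON.stringify({ type }));
  });
}

// Returns the event types a listener sees, depending on when it registers.
function collectTypes(registerFirst: boolean): string[] {
  const bus = new TinyBus();
  const seen: string[] = [];
  const listen = (): void => {
    bus.on("stream", (raw) => {
      seen.push(JSON.parse(raw).type);
    });
  };
  if (registerFirst) listen();
  startFakeStream(bus, "stream");
  if (!registerFirst) listen(); // too late: earlier events are already gone
  return seen;
}
```

Events on a bus like this are not buffered, so a listener registered after the stream has started misses everything emitted before it attached.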
