Run QwQ-32B effectively + Bug Fixes

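Table of contents

- QwQ-32B bug fixes
- ⚙️ Official recommended settings
- 👍 Recommended settings for llama.cpp
- 📖 Tutorial: how to run the fixed QwQ-32B
  - 1. For llama.cpp and engines that use llama.cpp
  - 2. Download the model + test
  - 3. Testing / evaluation
  - 4. Try without our fixes
- 💡 `<think>` token not shown?
- 🧪 Experimental results + notes
- 🦥 Dynamic 4-bit quants
- 🛠️ Finetuning QwQ-32B
- Performance benchmarks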

This article is translated and adapted from: Run QwQ-32B effectively + Bug Fixes (Mar 7, 2025 • By Daniel & Michael)
https://unsloth.ai/blog/qwq-32b

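Qwen released QwQ-32B, a powerful reasoning model with performance comparable to DeepSeek-R1. You may have run into issues such as infinite loops, repetitions, token errors, and finetuning challenges, none of which reflect the model's true quality. We hope this blog helps you debug and fix most of them! [See the tutorial](https://unsloth.ai/blog/qwq-32b#Tutorial QwQ)
Our model uploads include bug fixes and work for finetuning, vLLM, and Transformers. However, if you use llama.cpp or an engine that uses llama.cpp as its backend, you may already have run into issues. To solve them, follow the tutorial below or read the detailed guide and analysis in our docs.
See all Unsloth-fixed QwQ-32B uploads, including GGUF and dynamic 4-bit, here.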

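QwQ-32B bug fixes

We found a few issues, especially ones that affect finetuning! The EOS token is correct, but the PAD token should probably be "<|vision_pad|>" instead. We have updated it here. For reference, the values before the change: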

"eos_token": "<|im_end|>",
"pad_token": "<|endoftext|>",

⚙️ Official recommended settings

According to Qwen, these are the recommended settings for inference:

Temperature of 0.6
Top_K of 40 (or 20 to 40)
Min_P of 0.0
Top_P of 0.95
Repetition penalty of 1.0 (1.0 means disabled in llama.cpp and transformers)
Chat template: <|im_start|>user\nCreate a Flappy Bird game in Python.<|im_end|>\n<|im_start|>assistant\n<think>\n
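
As a minimal sketch, the settings and chat template above can be captured in Python. The dictionary keys and the helper name build_prompt are illustrative assumptions, not the argument names of any particular library; map them to whatever your inference engine expects.

```python
# Sampling values copied from the list above. The key names are
# illustrative only and do not belong to any specific library's API.
RECOMMENDED_SETTINGS = {
    "temperature": 0.6,
    "top_k": 40,
    "min_p": 0.0,
    "top_p": 0.95,
    "repetition_penalty": 1.0,  # 1.0 means the penalty is disabled
}


def build_prompt(user_message: str) -> str:
    """Wrap a single user turn in the chat template and open the <think> block."""
    return (
        "<|im_start|>user\n"
        f"{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
        "<think>\n"
    )


if __name__ == "__main__":
    # repr() makes the newline characters visible.
    print(repr(build_prompt("Create a Flappy Bird game in Python.")))
```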

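👍 Recommended settings for llama.cpp

We noticed that many people use a repetition penalty greater than 1.0, for example 1.1 to 1.5. This actually interferes with llama.cpp's sampling mechanisms. The goal of a repetition penalty is to penalize repeated generations, but we found that it does not work as expected.

Turning the repetition penalty off (that is, setting it to 1.0) also works, but we found it useful for penalizing endless generations.

To use it, we found that you must also edit the ordering of samplers in llama.cpp so that they are applied before the repetition penalty, otherwise there will be endless generations. So add this: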

--samplers "top_k;top_p;min_p;temperature;dry;typ_p;xtc"

By default, llama.cpp uses this ordering:

--samplers "dry;top_k;typ_p;top_p;min_p;xtc;temperature"

We essentially reorder temperature and dry, and move min_p forward. This means we apply the samplers in this order:

top_k=40
top_p=0.95
min_p=0.0
temperature=0.6
dry
typ_p
xtc
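
To see why sampler order can matter at all, here is a toy illustration in plain Python. It is not llama.cpp's implementation, and the logits are made up; it only shows that a top_p filter keeps a different candidate set depending on whether temperature has already been applied.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_keep(probs, top_p):
    """Return indices of the smallest high-probability set whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return kept


logits = [3.0, 2.5, 2.0, 1.0, 0.0]  # made-up values for illustration

# top_p runs before temperature: it sees the unscaled distribution.
filter_first = top_p_keep(softmax(logits), top_p=0.95)

# temperature runs first: top_p sees a distribution sharpened by T = 0.6.
temperature_first = top_p_keep(softmax(logits, temperature=0.6), top_p=0.95)

print(len(filter_first), len(temperature_first))
```

With these made-up logits, top_p keeps four candidates when it runs before temperature and three when it runs after, so reordering the chain changes what the later samplers get to see.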

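📖 Tutorial: how to run the fixed QwQ-32B

1. For llama.cpp and engines that use llama.cpp:

You can read our full guide here. Get the latest llama.cpp at: github.com/ggml-org/llama.cpp.

You can also follow the build instructions below. If you do not have a GPU or only want CPU inference, change -DGGML_CUDA=ON to -DGGML_CUDA=OFF.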

apt-get update
apt-get install build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggerganov/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=ON -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp

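2. Download the model + test

Download the model with the snippet below (after running pip install huggingface_hub hf_transfer). You can choose Q4_K_M or another quantized version (or the BF16 full-precision version). Other variants: huggingface.co/unsloth/QwQ-32B-GGUF
Then run Unsloth's Flappy Bird test, which saves the output to Q4_K_M_yes_samplers.txt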

# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = "unsloth/QwQ-32B-GGUF",
    local_dir = "unsloth-QwQ-32B-GGUF",
    allow_patterns = ["*Q4_K_M*"], # For Q4_K_M
)
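
To get a feel for which files a pattern such as "*Q4_K_M*" selects, the following sketch applies ordinary glob matching to a made-up file listing. It assumes allow_patterns follows standard glob semantics; apart from QwQ-32B-Q4_K_M.gguf, the file names are hypothetical.

```python
from fnmatch import fnmatch

# Hypothetical repository listing; only the Q4_K_M file name appears in this guide.
example_files = [
    "QwQ-32B-Q4_K_M.gguf",
    "QwQ-32B-Q8_0.gguf",
    "README.md",
]


def select(files, patterns):
    """Keep the files that match at least one glob pattern."""
    return [name for name in files if any(fnmatch(name, p) for p in patterns)]


selected = select(example_files, ["*Q4_K_M*"])
print(selected)
```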

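3. Testing / evaluation

Edit --threads 32 for the number of CPU threads, --ctx-size 16384 for the context length, and --n-gpu-layers 99 for how many layers to offload to the GPU.

If your GPU runs out of memory, try lowering --n-gpu-layers. Remove it as well if you only have CPU inference.
We use --repeat-penalty 1.1 and --dry-multiplier 0.5, both of which you can adjust.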

./llama.cpp/llama-cli \
    --model unsloth-QwQ-32B-GGUF/QwQ-32B-Q4_K_M.gguf \
    --threads 32 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    --seed 3407 \
    --prio 2 \
    --temp 0.6 \
    --repeat-penalty 1.1 \
    --dry-multiplier 0.5 \
    --min-p 0.0 \
    --top-k 40 \
    --top-p 0.95 \
    -no-cnv \
    --samplers "top_k;top_p;min_p;temperature;dry;typ_p;xtc" \
    --prompt "<|im_start|>user\nCreate a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<|im_end|>\n<|im_start|>assistant\n<think>\n"  \
        2>&1 | tee Q4_K_M_yes_samplers.txt

See an example of the final Python output here. The full input is:

<|im_start|>user
Create a Flappy Bird game in Python. You must include these things:
1. You must use pygame.
2. The background color should be randomly chosen and is a light shade. Start with a light blue color.
3. Pressing SPACE multiple times will accelerate the bird.
4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.
5. Place on the bottom some land colored as dark brown or yellow chosen randomly.
6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.
7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.
8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.
The final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<|im_end|>
<|im_start|>assistant
<think>

When we run it, we get a runnable game!
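
If you prefer to drive the test from Python, the command above could be assembled roughly as follows. This is only a sketch: the helper name build_llama_cli_args and its use_fixed_sampler_order switch are hypothetical, and the flags are simply the ones from the command above. The prompt is passed with real newline characters.

```python
import shlex


def build_llama_cli_args(model_path, prompt, use_fixed_sampler_order=True):
    """Return the llama-cli argument list used in this tutorial."""
    args = [
        "./llama.cpp/llama-cli",
        "--model", model_path,
        "--threads", "32",
        "--ctx-size", "16384",
        "--n-gpu-layers", "99",
        "--seed", "3407",
        "--prio", "2",
        "--temp", "0.6",
        "--repeat-penalty", "1.1",
        "--dry-multiplier", "0.5",
        "--min-p", "0.0",
        "--top-k", "40",
        "--top-p", "0.95",
        "-no-cnv",
    ]
    if use_fixed_sampler_order:
        # The reordered sampler chain recommended earlier in this guide.
        args += ["--samplers", "top_k;top_p;min_p;temperature;dry;typ_p;xtc"]
    args += ["--prompt", prompt]
    return args


if __name__ == "__main__":
    demo_prompt = "<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n<think>\n"
    args = build_llama_cli_args("unsloth-QwQ-32B-GGUF/QwQ-32B-Q4_K_M.gguf", demo_prompt)
    print(shlex.join(args))  # inspect the command before running it
    # To actually run it: import subprocess; subprocess.run(args, check=True)
```

Passing use_fixed_sampler_order=False gives the argument list for the comparison run without the fix.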


4. Try without our fixes:

Now try the same thing without our fix: remove --samplers "top_k;top_p;min_p;temperature;dry;typ_p;xtc". This saves the output to Q4_K_M_no_samplers.txt

./llama.cpp/llama-cli \
    --model unsloth-QwQ-32B-GGUF/QwQ-32B-Q4_K_M.gguf \
    --threads 32 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    --seed 3407 \
    --prio 2 \
    --temp 0.6 \
    --repeat-penalty 1.1 \
    --dry-multiplier 0.5 \
    --min-p 0.0 \
    --top-k 40 \
    --top-p 0.95 \
    -no-cnv \
    --prompt "<|im_start|>user\nCreate a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<|im_end|>\n<|im_start|>assistant\n<think>\n"  \
        2>&1 | tee Q4_K_M_no_samplers.txt

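You will get some looping, but more problematically, incorrect Python syntax and many other issues. Some of the output looks correct but is wrong!

For example, line 39, pipes.clear(), throws the error: NameError: name 'pipes' is not defined. Did you forget to import 'pipes'? See our example, which shows completely incorrect results, here.

If you use --repeat-penalty 1.5, it gets even worse and more obvious, with syntax that is actually totally incorrect.

You might wonder whether Q4_K_M is to blame, and whether BF16, i.e. full precision, should work fine. It does not: if we use a repetition penalty without our fix of --samplers "top_k;top_p;min_p;temperature;dry;typ_p;xtc", the outputs fail again.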
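
One way to compare the two saved outputs is to pull out the final Python block from each file and check whether it at least parses. The sketch below uses hypothetical helper names and assumes the model wraps its final answer in a fenced python block, as the prompt requests. A syntax check catches only syntax errors; a runtime failure such as the NameError above still requires running the game.

```python
import ast
import re

FENCE = "`" * 3  # built this way so the example contains no literal code fence


def extract_last_python_block(text):
    """Return the body of the last fenced python block in text, or None."""
    pattern = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)
    blocks = pattern.findall(text)
    return blocks[-1] if blocks else None


def syntax_error(code):
    """Return a short description of the first syntax error, or None if it parses."""
    try:
        ast.parse(code)
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"
    return None


if __name__ == "__main__":
    for path in ("Q4_K_M_yes_samplers.txt", "Q4_K_M_no_samplers.txt"):
        try:
            with open(path, encoding="utf-8", errors="replace") as handle:
                code = extract_last_python_block(handle.read())
        except FileNotFoundError:
            print(path, "not found")
            continue
        if code is None:
            print(path, "contains no python block")
        else:
            print(path, syntax_error(code) or "parses cleanly")
```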

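💡 `<think>` token not shown?

Some people report that because <think> is added by default in the chat template, some systems do not output the thinking traces correctly. You will need to manually edit the Jinja template from: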

{%- if tools %} {{- '<|im_start|>system\n' }} {%- if messages[0]['role'] == 'system' %} {{- messages[0]['content'] }} {%- else %} {{- '' }} {%- endif %} {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }} {%- for tool in tools %} {{- "\n" }} {{- tool | tojson }} {%- endfor %} {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }} {%- else %} {%- if messages[0]['role'] == 'system' %} {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }} {%- endif %} {%- endif %} {%- for message in messages %} {%- if (message.role == "user") or (message.role == "system" and not loop.first) %} {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }} {%- elif message.role == "assistant" and not message.tool_calls %} {%- set content = message.content.split('</think>')[-1].lstrip('\n') %} {{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }} {%- elif message.role == "assistant" %} {%- set content = message.content.split('</think>')[-1].lstrip('\n') %} {{- '<|im_start|>' + message.role }} {%- if message.content %} {{- '\n' + content }} {%- endif %} {%- for tool_call in message.tool_calls %} {%- if tool_call.function is defined %} {%- set tool_call = tool_call.function %} {%- endif %} {{- '\n<tool_call>\n{"name": "' }} {{- tool_call.name }} {{- '", "arguments": ' }} {{- tool_call.arguments | tojson }} {{- '}\n</tool_call>' }} {%- endfor %} {{- '<|im_end|>\n' }} {%- elif message.role == "tool" %} {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %} {{- '<|im_start|>user' }} {%- endif %} {{- '\n<tool_response>\n' }} {{- message.content }} {{- '\n</tool_response>' }} {%- if loop.last or 
(messages[loop.index0 + 1].role != "tool") %} {{- '<|im_end|>\n' }} {%- endif %} {%- endif %} {%- endfor %} {%- if add_generation_prompt %} {{- '<|im_start|>assistant\n<think>\n' }} {%- endif %}


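to one without the <think>\n at the end. The model will now have to manually add <think>\n during inference, which might not always succeed.

DeepSeek also edited all of its models to add a <think> token by default, to force the model into reasoning mode.

So change {%- if add_generation_prompt %} {{- '<|im_start|>assistant\n<think>\n' }} {%- endif %} to {%- if add_generation_prompt %} {{- '<|im_start|>assistant\n' }} {%- endif %}, i.e. remove <think>\n.

See the full Jinja template with the <think> part removed here.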
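
The same edit can be scripted. The sketch below works on the template as a plain string and uses a hypothetical helper name; it assumes the generation prompt appears exactly as shown above, with the two-character sequence backslash-n inside the Jinja string literal.

```python
# The Jinja source contains a literal backslash followed by "n",
# so the Python strings escape the backslash.
WITH_THINK = "{{- '<|im_start|>assistant\\n<think>\\n' }}"
WITHOUT_THINK = "{{- '<|im_start|>assistant\\n' }}"


def remove_think_from_generation_prompt(template: str) -> str:
    """Drop the trailing <think>\\n from the generation prompt of a chat template."""
    if WITH_THINK not in template:
        raise ValueError("generation prompt with <think> not found")
    return template.replace(WITH_THINK, WITHOUT_THINK)


if __name__ == "__main__":
    tail = "{%- if add_generation_prompt %} " + WITH_THINK + " {%- endif %}"
    print(remove_think_from_generation_prompt(tail))
```

Only the generation prompt is touched; the places where the template splits on the closing think tag are left as they are.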

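🧪 Experimental results + notes

Our first thoughts were:

1. QwQ's context length is not natively 128K; it is 32K, extended with YaRN. We tried overriding llama.cpp's YaRN handling, but nothing changed. For example, in the QwQ-32B readme we see the following: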

{
  ...,
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}

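2. We also thought the RMS Layernorm epsilon might be wrong: perhaps 1e-6 rather than 1e-5. For example, this has rms_norm_eps=1e-06, whilst this has rms_norm_eps=1e-05. We overrode it as well, but it did not help.

3. We also tested whether tokenizer IDs match between llama.cpp and normal Transformers, courtesy of @kalomaze. They matched, so this was not the culprit.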
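
As a quick check of the numbers in item 1, the 128K figure follows from the rope_scaling values in the readme: the original window multiplied by the YaRN factor. The helper name below is just for illustration.

```python
# Values copied from the rope_scaling block in item 1.
rope_scaling = {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn",
}


def extended_context_length(cfg):
    """Native context window multiplied by the scaling factor."""
    return int(cfg["original_max_position_embeddings"] * cfg["factor"])


print(extended_context_length(rope_scaling))  # 131072 tokens, i.e. 128K
```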

We provide our experimental results in our docs.

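🦥 Dynamic 4-bit quants

We also uploaded dynamic 4-bit quants, which increase accuracy compared with naive 4-bit quantization! We uploaded the dynamic 4-bit quants here. The QwQ quantization error analysis plot, covering both activation and weight quantization errors, is attached below.
Since vLLM 0.7.3 (February 20, 2025), vLLM supports loading Unsloth dynamic 4-bit quants!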

[Figure: QwQ quantization error analysis (activation and weight quantization errors)]

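🛠️ Finetuning QwQ-32B

QwQ-32B finetuning fits with Unsloth in under 20GB of VRAM! It is also 2x faster, and it defaults to our dynamic 4-bit quants for better QLoRA accuracy.
Because of the model's size, it unfortunately cannot fit on the free Google Colab GPUs with 16GB of VRAM, so you will need a GPU with at least 20GB of VRAM. To view our other notebooks and model uploads, please visit our docs.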
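
A rough back-of-the-envelope estimate shows why 16GB is too tight. It assumes about 32 billion parameters (reading "32B" literally) and counts only the weights; dynamic 4-bit quants keep some layers in higher precision, and finetuning also needs memory for activations, LoRA adapters, and optimizer state, so real usage is higher than this lower bound.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Memory needed for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


approx_params = 32e9  # assumption: "32B" read as roughly 32 billion parameters

four_bit = weight_memory_gib(approx_params, 4)
sixteen_bit = weight_memory_gib(approx_params, 16)

print(f"4-bit weights:  about {four_bit:.1f} GiB")
print(f"16-bit weights: about {sixteen_bit:.1f} GiB")
```

Under these assumptions the 4-bit weights alone take close to 15 GiB, which leaves almost nothing on a 16GB GPU for the rest of the training state.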