Llama-Factory + Ollama 打造属于自己的中文版 Llama3

Meta 推出 Llama3 也有一小段时间了。Llama3 包含 8B 和 70B 两种参数规模，涵盖预训练和指令调优的变体。Llama 3 支持多种商业和研究用途，并已在多个行业标准测试中展示了其卓越的性能（关于Llama3的具体介绍可以参考本站另外一篇博文：尝鲜 Meta Llama 3）。和以往的原始 Llama 模型一样，Llama 3 对中文的支持效果欠佳，经常会出现你用中文提问，它用英文或中文+英文回复的现象。

例如当我问它：”你是谁？“，会得到如下的回答：

>>> 你是谁？
I am LLaMA, an AI assistant developed by Meta AI that can understand and respond to human input in a conversational manner. I'm not a human, but a computer program designed to simulate conversation and answer questions to the best of my ability based on my training data. I can generate text on a wide range of topics, from science and history to entertainment and culture.

然后让它”介绍一下上海的旅游景点“，则会得到如下的回答：

>>> 介绍一下上海的旅游景点
Shanghai, a city with a rich history and culture, has numerous tourist attractions that showcase its modernity, traditional charm, and natural beauty. Here are some of the most popular tourist spots in Shanghai:

**Landmarks:**

1. **Oriental Pearl Tower**: A iconic TV tower with panoramic views of the city.
2. **Shanghai Tower**: The second-tallest building in the world, offering stunning views from its observation deck.
3. **Jin Mao Tower**: A 421-meter-tall skyscraper and home to the luxurious Mandarin Oriental Hotel.
4. **The Bund**: A famous waterfront promenade lined with colonial-era buildings and stunning views of the Huangpu River.

**Historical Sites:**

1. **Yu Garden**: A classic Chinese garden with beautiful landscaping, pavilions, and koi ponds.
2. **Shanghai Museum**: A museum showcasing Chinese art, history, and culture.
3. **Xintiandi**: A charming neighborhood with traditional Shikumen architecture and modern boutiques.
4. **French Concession**: A historic neighborhood with colonial-era buildings, cafes, and restaurants.

**Nature Reserves:**

1. **Changfeng Park**: A large public park with gardens, lakes, and walking trails.
2. **Fuxing Island**: A tranquil oasis in the Huangpu River with beautiful landscaping and walking paths.
3. **Shanghai Botanical Garden**: A peaceful garden featuring various plant species and a peaceful atmosphere.

**Other Attractions:**

1. **Disney Resort Shanghai**: An amusement park with Disney-themed rides, shows, and attractions.
2. **Shanghai Grand Opera House**: A stunning opera house with performances ranging from music to dance.
3. **Shanghai Zoo**: A popular zoo with various animal species and a panda enclosure.
4. **The Shanghai Tower's Observation Deck**: For panoramic views of the city.

These are just a few of the many tourist attractions in Shanghai. Depending on your interests, there are plenty more hidden gems to explore! 😊

回答得还是很简洁而全面的。缺点就是全是英文答案，对于英文不太好的人有点吃力。对 Llama3 进行微调来支持中文是一项必须的工作。

Llama3 中文模型的现状

实际上，当 Llama3 推出以后，很多开发者在中文支持方面做了大量的工作，一大批微调的中文模型如雨后春笋般纷纷涌现。目前比较知名的中文模型有：

Unichat-llama3-Chinese（地址）：中国联通AI创新中心发布业界第一个llama3中文指令微调模型，全参数微调(非lora)，以Meta Llama 3为基础,增加中文数据进行训练,实现llama3模型高质量中文问答，模型上下文保持原生长度8K，支持长度64K版本将于后续发布。
OpenBuddy – Open Multilingual Chatbot（地址）：OpenBuddy是一个面向全球用户的强大的开放式多语言聊天机器人模型，强调对话式AI以及对英语、中文和其他语言的无缝多语言支持。
Llama3-Chinese（地址）：在500k高质量中文多轮SFT数据、100k英语多轮SFT数据和2k单轮自我认知数据上训练的大型模型，采用基于Meta的DORA和LORA+的训练方法。以Llama-3-8B为基础。
Llama3-8B-Chinese-Chat（地址）：是一个针对中文和英文用户的指令调整语言模型，具有角色扮演和工具使用等各种能力，建立在 Meta-Llama-3-8B-Instruct 模型的基础上。
Llama-3-8B-Instruct-Chinese-chat（地址）：使用 firefly-train-1.1M，moss-003-sft-data，school_math_0.25M，ruozhiba 等数据集微调的模型，基于Llama-3-8B-Instruct。
Bunny-Llama-3-8B-V（地址）：Bunny 是一系列轻量但功能强大的多模式模型。它提供多种即插即用视觉编码器，如 EVA-CLIP、SigLIP 和语言主干，包括 Llama-3-8B、Phi-1.5、StableLM-2、Qwen1.5、MiniCPM 和 Phi-2。为了弥补模型大小的减小，通过从更广泛的数据源中进行精选来构建信息更丰富的训练数据。
llava-llama-3-8b-v1_1（地址）：llava-llama-3-8b-v1_1 是一个 LLaVA 模型，由XTuner使用ShareGPT4V-PT和InternVL-SFT从meta-llama/Meta-Llama-3-8B-Instruct和CLIP-ViT-Large-patch14-336进行微调。
。。。

还有很多新的模型就不一一列举了，想要尝试这些模型可以直接部署试用。本文则探讨如何使用 Llama-Factory 对 Llama3 进行中文微调的具体过程，并通过 Ollama 本地部署中文微调的 Llama3 模型，打造属于自己的个性化的 Llama3 LLM 。

使用 Llama-Factory 对 Llama3 进行中文微调

LLaMA-Factory 是一个开源项目，它提供了一套全面的工具和脚本，用于微调、服务和基准测试 LLaMA 模型。LLaMA（大型语言模型自适应）是 Meta AI 开发的一组基础语言模型，在各种自然语言任务上表现出色。

LLaMA-Factory 存储库提供以下内容，让您轻松开始使用 LLaMA 模型：

数据预处理和标记化的脚本
用于微调 LLaMA 模型的训练流程
使用经过训练的模型生成文本的推理脚本
评估模型性能的基准测试工具
用于交互式测试的 Gradio Web UI

关于 Llama-Factory 的具体介绍可以参考本站的另外一篇博文：LLaMA-Factory 简介。

安装 Llama-Factory

首先从 github 拉取 Llama-Factory：

git clone https://github.com/hiyouga/LLaMA-Factory.git

为了方便今后的调试和部署，我选择了使用 docker 的方式来运行 Llama-Factory。它提供了一个参考的 Dockerfile：

FROM nvcr.io/nvidia/pytorch:24.01-py3

WORKDIR /app

COPY requirements.txt /app/
RUN pip install -r requirements.txt

COPY . /app/
RUN pip install -e .[deepspeed,metrics,bitsandbytes,qwen]

VOLUME [ "/root/.cache/huggingface/", "/app/data", "/app/output" ]
EXPOSE 7860

CMD [ "llamafactory-cli webui" ]

可以根据自己的实际情况进行修改。我家里的网络自己搭建了 proxy server（为了访问 github, huggingface 等站点），所以更改 Dockerfile 如下。该 Dockerfile 从 docker buildx 命令行获取 http_proxy 和 https_proxy 变量，并设置 docker buildx 环境里的相应环境变量，这样编译 docker 镜像的过程中就能使用代理服务器了。

FROM nvcr.io/nvidia/pytorch:24.01-py3


# 使用构建参数设置环境变量
ARG http_proxy
ARG https_proxy

# 设置环境变量
ENV HTTP_PROXY=$http_proxy
ENV HTTPS_PROXY=$https_proxy
ENV http_proxy=$http_proxy
ENV https_proxy=$https_proxy

WORKDIR /app

COPY requirements.txt /app/
RUN pip install -r requirements.txt

COPY . /app/
RUN pip install -e .[deepspeed,metrics,bitsandbytes,qwen]

# unset环境变量。Container运行过程中需要代理服务器的话通过-e 传入参数
ENV HTTP_PROXY=""
ENV HTTPS_PROXY=""
ENV http_proxy=""
ENV https_proxy=""

VOLUME [ "/root/.cache/huggingface/", "/app/data", "/app/output" ]
EXPOSE 7860

CMD [ "python", "src/train_web.py" ]

运行如下脚本生成自己的 Llama-Factory docker 镜像：

docker buildx build --build-arg http_proxy=http://proxy_ip:port --build-arg https_proxy=http://proxy_ip:port -t llama-factory:v0.00 .

Docker 镜像编译成功后，运行 docker image list 就可以看到编译出来的 docker 镜像了：

docker image list

接下来我们需要运行 Llama-Factory container。Llama-Factory 提供了一个参考的 docker-compose.yml 文件来运行 docker，我们可以按照自己的实际情况进行修改。我这边修改的版本如下，修改的部分参考注释。

version: '3.8'

services:
  llama-factory:
    #build:
    #  dockerfile: Dockerfile
    #  context: .
    image: llama-factory:v0.00 # 修改为编译出来的 docker image 名称/版本
    container_name: llama_factory # container 名称
    volumes:
      - ./volumes/huggingface:/root/.cache/huggingface/
      - ./volumes/data:/app/data
      - ./volumes/output:/app/output
      - /mnt/dev/myprojects/llm-webui-docker/models:/app/models # 映射自己的models目录
    environment:
      - CUDA_VISIBLE_DEVICES=0
      - GRADIO_SERVER_PORT=7864 # webui跑在7864端口上，7860被comfyui占用了
    ports:
      - "7864:7864" # webui跑在7864端口上，7860被comfyui占用了
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
          - driver: nvidia
            count: "all"
            capabilities: [gpu]
    restart: unless-stopped

运行如下命令启动 Llama-Factory 容器：

docker-compose up --detach

现在可以看到 llama_factory container 已经在正常运行了：

访问 http://server_ip:7864 则可以看到如下的 Llama-Factory WebUI 界面：

使用 Llama-Factory 为 Llama3 训练中文 LoRA

1. 模型名称与路径

进入 Llama-Factory WebUI 后，先选择 Model name 为 LLaMA-8B，这时候 Model path 会自动变成 meta-llama/Meta-Llama-3-8B 并自动从 huggingface 拉取 Meta-Llama-3-8B 模型。我这里因为已经下载了 Meta-Llama-3-8B 并映射到了 Llama-Factory container 的 /app/models/llama3-hf/Meta-Llama-3-8B 目录，所以我这里的 Model path 也设置为该路径。

2. 设置 Advanced configurations

点右边的箭头展开 Advance configurations。量化位数（Quantization bit）可以选择 4 ，减小模型的体积并提高速度。Prompt template 则选择 llama3。如果有安装 flashattn2 或者 unsloth 的话可以在 boost 里选择，我这里没有安装所以选择 none。

3. 设置训练参数

在 Train 标签页里设置训练相关的参数。主要的参数有：

Stage：设置为 Supervised Fine-Tuning
Data dir：我这里设置为 data ，因为 Llama-Factory 项目自身也带了一些中文数据集，我打算直接使用。如果你自己下载了别的中文数据集，请设置相应的数据集所在的目录地址
Dataset：我选择了 Llama-Factory 项目里自带的 alpaca_zh，alpaca_gpt4_zh 和 oaast_sft_zh 数据集
Learning rate：安装自己的需要设置。我采用了缺省的 5e-5
Epochs：按照自己的需要设置。我采用了缺省的 3.0
Cutoff length：按照自己的需要设置。数字越大，对 GPU 和显存的要求越高；数字越小，则可能对长句的语义理解不够充分。我这里选择缺省的 1024
Batch size：按照自己的需要甚至。数字越大，对 GPU 和显存的要求越高。我这里选择缺省的 2
Output dir / Config path：按照自己的需要设置

设置完成后，可以点击 Preview dataset 来查看一下数据集内容。

点击 Preview command 可以查看训练过程的命令行参数。如果不希望使用 WebUI 进行训练，则可以直接执行命令行，这样也有助于进一步编程和自动化：

点击 Save arguments 则将目前的训练设置保存到指定的 json 文件。点击 Load arguments 则可以加载以前保存好的训练设置。

4. 开始训练

参数设置好后，就可以点击 Start 开始训练。

Llama-Factory 训练脚本开始解析数据。

整个训练的过程预计 8 个小时不到一点。

训练完成后，在 /app/output/llama3_cn_train_2024-04-27-16-32-46 目录下可以看到如下的文件目录结构：

这些就是 Llama-Factory 训练出来的 LoRA。可以在自动生成的 Readme.md 文件查看 LoRA 的信息：

license: other
library_name: peft
tags:
- llama-factory
- lora
- generated_from_trainer
base_model: /app/models/llama3-hf/Meta-Llama-3-8B
model-index:
- name: llama3_cn_train_2024-04-27-16-32-46

。。。后面的内容省略

在 Ollama 中打造自己的中文版 Llama3

接下来，我们要在 Ollama 中运行 llama3 和我们训练出来的 LoRA，打造属于自己的中文版 Llama3。

Ollama 是一个开源的大模型管理工具，它提供了丰富的功能，包括模型的训练、部署、监控等。通过Ollama，你可以轻松地管理本地的大模型，提高模型的训练速度和部署效率。此外，Ollama还支持多种机器学习框架，如TensorFlow、PyTorch等，使得你可以根据自己的需求选择合适的框架进行模型的训练。

运行 Ollama docker

从 github 拉取 ollama 代码：

git clone https://github.com/ollama/ollama.git

Ollama github项目提供了参考的 Dockerfile，可以编译自己的 ollama 镜像并运行。也可以直接运行 ollama 官方发布的 docker 镜像：

docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

转换 LoRA 格式

按照 Ollama modelfile ADAPTER 的说明，Ollama 支持 ggml 格式的 LoRA，所以我们需要把刚才生成的 LoRA 转换成为 ggml 格式。为此，我们需要使用到 Llama.cpp 的某些脚本。有关 Llama.cpp 开源项目的介绍请参考本站另外一篇博文：Llama.cpp 上手实战指南。

ADAPTER
The ADAPTER instruction is an optional instruction that specifies any LoRA adapter that should apply to the base model. The value of this instruction should be an absolute path or a path relative to the Modelfile and the file must be in a GGML file format. The adapter should be tuned from the base model otherwise the behaviour is undefined.

在 llama.cpp 项目中，有如下几个用于转换格式的 python 脚本。我们将使用 conver-lora-to-ggml.py 脚本来转换格式。

运行如下的命令（其中 /app/output/llama3_cn_train_2024-04-27-16-32-46 是 Llama-Factory 生成 LoRA 的路径）：

./conver-lora-to-ggml.py /app/output/llama3_cn_train_2024-04-27-16-32-46 llama

运行完这个命令后，将在 /app/output/llama3_cn_train_2024-04-27-16-32-46 下生成 ggml-adapter-model.bin 文件。这个文件就是 Ollama 需要的 ggml 格式的 LoRA 文件。

在 ollama 中创建自己的 llama3 中文模型

我们使用 ollama 的 modelfile 来创建自己的 llama3 中文模型。我自己使用的参考 llama3.modelfile 内容如下：

# set the base model
FROM llama3:8b

# set custom parameter values
PARAMETER temperature 1
PARAMETER num_keep 24
PARAMETER stop <|start_header_id|>
PARAMETER stop <|end_header_id|>
PARAMETER stop <|eot_id|>
PARAMETER stop <|reserved_special_token

# set the model template
TEMPLATE """
{{ if .System }}<|start_header_id|>system<|end_header_id|>
{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>
{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>
{{ .Response }}<|eot_id|>
"""

# set the system message
SYSTEM You are llama3 from Meta, customized and hosted @ HY's Blog (https://blog.yanghong.dev).

# set Chinese lora support
ADAPTER /root/.ollama/models/lora/ggml-adapter-model.bin

首先，通过命令行进入 ollama container：

docker exec -it ollama /bin/bash

然后使用我们刚才生成的 llama3.modelfile 来创建自己的中文版 Llama3，命名为 llama3:hy。

ollama create llama3:hy -f llama3.modelfile

创建成功后，运行如下的命令，则可以看到刚才创建的 llama3:hy 模型。

ollama list

执行如下的命令来运行 llama3:hy 模型，接下来我们就可以愉快地跟它进行中文对话了。

ollama run llama3:hy

我们同样问它本文开头的那两个问题：”你是谁？“，”介绍一下中国的首都北京“，看看它的表现如何。

你是谁？

>>> 你是谁？
我是一個 AI 型Chatbot，来自 Meta，目前被Customized 和 Hosted 在 HY 的博客（https://blog.yanghong.dev）。简单地说，我是一個能與人互動、回答問題和生成文本的 AI 机器人。

介绍一下中上海的旅游景点：

>>> 介绍一下上海的旅游景点
上海是一座流行的旅游目的地，拥有悠久历史、多样文化和现代建筑。以下是上海一些最受欢迎的旅游景点：

1. 外滨路（The Bund）：黄浦江沿岸的一带，提供城市天际线的壮丽景象。此外，这里也有一些历史古楼、餐厅和酒吧。
2. 上海塔（Shanghai Tower）：这是世界最高建筑之一，高度632米。从124层观光台可以欣赏到绕城全景。
3. 豫园（Yu Garden）：明代的中国古典花园，known for its beautiful architecture, lush greenery and peaceful atmosphere.
4. 法租区（French Concession）：一带拥有法国影响的街区，featuring colonial-era buildings, trendy boutiques和hip restaurants。
5. 新天地（Xintiandi）：历史区域被转换为时尚购物和美食街区，known for its cobblestone streets, traditional Shikumen houses和vibrant nightlife.
6. 上海博物馆（Shanghai Museum）：展示城市历史、艺术和文化的综合博物馆，有超过10,000件展品。
7. 人民广场（People's Square）：一个大型公共广场，surrounded by iconic buildings包括上海大剧院、上海市政厅和中国革命博物馆。
8. 东方明珠塔（Oriental Pearl Tower）：一座电视塔，提供城市全景观从263层观光台。
9. 朱家角水乡（Zhujiajiao Water Town）：一个古老的水镇位于上海外围，known for its canals, bridges和traditional architecture.
10. 上海迪士尼度假 resort（Disneyland Resort Shanghai）：一个世界一流的主题公园，featuring Disney, Pixar, Marvel和Star Wars attractions.

这些只是上海旅游景点之一。无论你是否感兴趣历史、文化、美食或娱乐，这座城市都有足够的选择！

可以看到，它基本上能以中文回答问题，我们的 LoRA fine tuning 过程成功完成了。

为中文 Llama3 添加 WebUI

Ollama 提供了 REST API 来跟 LLM 模型进行交互，比如最常用的 Generate，Chat 等方法。

Generate a response

curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt":"Why is the sky blue?"
}'

Chat with a model

curl http://localhost:11434/api/chat -d '{
  "model": "llama3",
  "messages": [
    { "role": "user", "content": "why is the sky blue?" }
  ]
}'

完整的 REST API 文档可参阅 github 。

因此，我们可以开发一个简单的 gradio 程序，通过 REST API 调用 llama3:hy 模型来进行中文交互。参考代码片段如下：

        response = requests.post('http://192.168.3.204:11434/api/generate',
            json={
                'model': model2endpoint[model_name],
                'prompt': prompt,
                #'context': context,
                'options': {
                    'top_k': top_k,
                    'temperature': top_p,
                    'top_p': temperature
                }
            },
            stream=True
        )

        yield "", history, user_message, ""
        output = ""

        # Check if the request was successful
        response.raise_for_status()

        # Initialize the output and history variables
        output = ""

        # Iterate over the streamed response lines
        for idx, line in enumerate(response.iter_lines()):
            if line:
                # Parse the line as JSON
                data = json.loads(line)
                token = data.get('response', '')  # Assuming 'response' contains the text
                # Check if the token is a special token
                if data.get('special', False):
                    continue

                # Append the token to the output
                output += token

                # Update history and chat based on the index
                if idx == 0:
                    history.append(output.strip())  # Append initial output
                else:
                    history[-1] = output.strip()  # Update the last history entry

                # Convert history to chat format
                chat = [
                    (history[i], history[i + 1]) if i + 1 < len(history) else (history[i], "")
                    for i in range(0, len(history), 2)
                ]

            # Yield the current chat, history, and user message updates
            yield chat, history, user_message, ""