使用llama.cpp实现LLM大模型的格式转换、量化、推理、部署

使用llama.cpp实现LLM大模型的格式转换、量化、推理、部署

概述

llama.cpp的主要目标是能够在各种硬件上实现LLM推理,只需最少的设置,并提供最先进的性能。提供1.5位、2位、3位、4位、5位、6位和8位整数量化,以加快推理速度并减少内存使用。

GitHub:https://github.com/ggerganov/llama.cpp

克隆和编译

克隆最新版llama.cpp仓库代码

python

复制代码git clone https://github.com/ggerganov/llama.cpp

对llama.cpp项目进行编译,在目录下会生成一系列可执行文件

css复制代码main:使用模型进行推理

quantize:量化模型

server:提供模型API服务

1.编译构建CPU执行环境,安装简单,适用于没有GPU的操作系统

python复制代码cd llama.cpp

mkdir 

2.编译构建GPU执行环境,确保安装CUDA工具包,适用于有GPU的操作系统

如果CUDA设置正确,那么执行nvidia-sminvcc --version没有错误提示,则表示一切设置正确。

python

复制代码make clean &&  make LLAMA_CUDA=1

3.如果编译失败或者需要重新编译,可尝试清理并重新编译,直至编译成功

python

复制代码make clean

😝有需要的小伙伴,可以V扫描下方二维码免费领取🆓

## 环境准备

1.下载受支持的模型

要使用llamma.cpp,首先需要准备它支持的模型。在官方文档中给出了说明,这里仅仅截取其中一部分

在这里插入图片描述 2.安装依赖

llama.cpp项目下带有requirements.txt 文件,直接安装依赖即可。

python

复制代码pip install -r requirements.txt

模型格式转换

根据模型架构,可以使用convert.pyconvert-hf-to-gguf.py文件。

转换脚本读取模型配置、分词器、张量名称+数据,并将它们转换为GGUF元数据和张量。

GGUF格式

Llama-3相比其前两代显著扩充了词表大小,由32K扩充至128K,并且改为BPE词表。因此需要使用--vocab-type参数指定分词算法,默认值是spm,如果是bpe,需要显示指定

注意:

官方文档说convert.py不支持LLaMA 3,喊使用convert-hf-to-gguf.py,但它不支持--vocab-type,且出现异常:error: unrecognized arguments: --vocab-type bpe,因此使用convert.py且没出问题

使用llama.cpp项目中的convert.py脚本转换模型为GGUF格式

python复制代码root@master:~/work/llama.cpp# python3 ./convert.py  /root/work/models/Llama3-Chinese-8B-Instruct/ --outtype f16 --vocab-type bpe --outfile ./models/Llama3-FP16.gguf
INFO:convert:Loading model file /root/work/models/Llama3-Chinese-8B-Instruct/model-00001-of-00004.safetensors
INFO:convert:Loading model file /root/work/models/Llama3-Chinese-8B-Instruct/model-00001-of-00004.safetensors
INFO:convert:Loading model file /root/work/models/Llama3-Chinese-8B-Instruct/model-00002-of-00004.safetensors
INFO:convert:Loading model file /root/work/models/Llama3-Chinese-8B-Instruct/model-00003-of-00004.safetensors
INFO:convert:Loading model file /root/work/models/Llama3-Chinese-8B-Instruct/model-00004-of-00004.safetensors
INFO:convert:model parameters count : 8030261248 (8B)
INFO:convert:params = Params(n_vocab=128256, n_embd=4096, n_layer=32, n_ctx=8192, n_ff=14336, n_head=32, n_head_kv=8, n_experts=None, n_experts_used=None, f_norm_eps=1e-05, rope_scaling_type=None, f_rope_freq_base=500000.0, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=<GGMLFileType.MostlyF16: 1>, path_model=PosixPath('/root/work/models/Llama3-Chinese-8B-Instruct'))
INFO:convert:Loaded vocab file PosixPath('/root/work/models/Llama3-Chinese-8B-Instruct/tokenizer.json'), type 'bpe'
INFO:convert:Vocab info: <BpeVocab with 128000 base tokens and 256 added tokens>
INFO:convert:Special vocab info: <SpecialVocab with 280147 merges, special tokens {'bos': 128000, 'eos': 128001}, add special tokens unset>
INFO:convert:Writing models/Llama3-FP16.gguf, format 1
WARNING:convert:Ignoring added_tokens.json since model matches vocab size without it.
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:gguf.vocab:Adding 280147 merge(s).
INFO:gguf.vocab:Setting special token type bos to 128000
INFO:gguf.vocab:Setting special token type eos to 128001
INFO:gguf.vocab:Setting chat_template to {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>

'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>

' }}
INFO:convert:[  1/291] Writing tensor token_embd.weight                      | size 128256 x   4096  | type F16  | T+   1
INFO:convert:[  2/291] Writing tensor blk.0.attn_norm.weight                 | size   4096           | type F32  | T+   2
INFO:convert:[  3/291] Writing tensor blk.0.ffn_down.weight                  | size   4096 x  14336  | type F16  | T+   2
INFO:convert:[  4/291] Writing tensor blk.0.ffn_gate.weight                  | size  14336 x   4096  | type F16  | T+   2
INFO:convert:[  5/291] Writing tensor blk.0.ffn_up.weight                    | size  14336 x   4096  | type F16  | T+   2
INFO:convert:[  6/291] Writing tensor blk.0.ffn_norm.weight                  | size   4096           | type F32  | T+   2
INFO:convert:[  7/291] Writing tensor blk.0.attn_k.weight                    | size   1024 x   4096  | type F16  | T+   2
INFO:convert:[  8/291] Writing tensor blk.0.attn_output.weight               | size   4096 x   4096  | type F16  | T+   2
INFO:convert:[  9/291] Writing tensor blk.0.attn_q.weight                    | size   4096 x   4096  | type F16  | T+   3
INFO:convert:[ 10/291] Writing tensor blk.0.attn_v.weight                    | size   1024 x   4096  | type F16  | T+   3
INFO:convert:[ 11/291] Writing tensor blk.1.attn_norm.weight                 | size   4096           | type F32  | T+   3

转换为FP16的GGUF格式,模型体积大概15G。

python复制代码root@master:~/work/llama.cpp# ll models -h
-rw-r--r--  1 root root  15G May 17 07:47 Llama3-FP16.gguf

bin格式

python复制代码root@master:~/work/llama.cpp# python3 ./convert.py  /root/work/models/Llama3-Chinese-8B-Instruct/ --outtype f16 --vocab-type bpe --outfile ./models/Llama3-FP16.bin
INFO:convert:Loading model file /root/work/models/Llama3-Chinese-8B-Instruct/model-00001-of-00004.safetensors
INFO:convert:Loading model file /root/work/models/Llama3-Chinese-8B-Instruct/model-00001-of-00004.safetensors
INFO:convert:Loading model file /root/work/models/Llama3-Chinese-8B-Instruct/model-00002-of-00004.safetensors
INFO:convert:Loading model file /root/work/models/Llama3-Chinese-8B-Instruct/model-00003-of-00004.safetensors
INFO:convert:Loading model file /root/work/models/Llama3-Chinese-8B-Instruct/model-00004-of-00004.safetensors
INFO:convert:model parameters count : 8030261248 (8B)
INFO:convert:params = Params(n_vocab=128256, n_embd=4096, n_layer=32, n_ctx=8192, n_ff=14336, n_head=32, n_head_kv=8, n_experts=None, n_experts_used=None, f_norm_eps=1e-05, rope_scaling_type=None, f_rope_freq_base=500000.0, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=<GGMLFileType.MostlyF16: 1>, path_model=PosixPath('/root/work/models/Llama3-Chinese-8B-Instruct'))
INFO:convert:Loaded vocab file PosixPath('/root/work/models/Llama3-Chinese-8B-Instruct/tokenizer.json'), type 'bpe'
INFO:convert:Vocab info: <BpeVocab with 128000 base tokens and 256 added tokens>
INFO:convert:Special vocab info: <SpecialVocab with 280147 merges, special tokens {'bos': 128000, 'eos': 128001}, add special tokens unset>
INFO:convert:Writing models/Llama3-FP16.bin, format 1
WARNING:convert:Ignoring added_tokens.json since model matches vocab size without it.
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:gguf.vocab:Adding 280147 merge(s).
INFO:gguf.vocab:Setting special token type bos to 128000
INFO:gguf.vocab:Setting special token type eos to 128001
INFO:gguf.vocab:Setting chat_template to {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>

'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>

' }}
INFO:convert:[  1/291] Writing tensor token_embd.weight                      | size 128256 x   4096  | type F16  | T+   4
INFO:convert:[  2/291] Writing tensor blk.0.attn_norm.weight                 | size   4096           | type F32  | T+   4
INFO:convert:[  3/291] Writing tensor blk.0.ffn_down.weight                  | size   4096 x  14336  | type F16  | T+   4
INFO:convert:[  4/291] Writing tensor blk.0.ffn_gate.weight                  | size  14336 x   4096  | type F16  | T+   5
INFO:convert:[  5/291] Writing tensor blk.0.ffn_up.weight                    | size  14336 x   4096  | type F16  | T+   5
INFO:convert:[  6/291] Writing tensor blk.0.ffn_norm.weight                  | size   4096           | type F32  | T+   5
INFO:convert:[  7/291] Writing tensor blk.0.attn_k.weight                    | size   1024 x   4096  | type F16  | T+   5
INFO:convert:[  8/291] Writing tensor blk.0.attn_output.weight               | size   4096 x   4096  | type F16  | T+   5
INFO:convert:[  9/291] Writing tensor blk.0.attn_q.weight                    | size   4096 x   4096  | type F16  | T+   5
INFO:convert:[ 10/291] Writing tensor blk.0.attn_v.weight                    | size   1024 x   4096  | type F16  | T+   5
INFO:convert:[ 11/291] Writing tensor blk.1.attn_norm.weight                 | size   4096           | type F32  | T+   5
INFO:convert:[ 12/291] Writing tensor blk.1.ffn_down.weight                  | size   4096 x  14336  | type F16  | T+   5
INFO:convert:[ 13/291] Writing tensor blk.1.ffn_gate.weight                  | size  14336 x   4096  | type F16  | T+   5
python复制代码root@master:~/work/llama.cpp# ll models -h
-rw-r--r--  1 root root  15G May 17 07:47 Llama3-FP16.gguf
-rw-r--r--  1 root root  15G May 17 08:02 Llama3-FP16.bin

模型量化

模型量化使用quantize命令,其具体可用参数与允许量化的类型如下:

python复制代码root@master:~/work/llama.cpp# ./quantize
usage: ./quantize [--help] [--allow-requantize] [--leave-output-tensor] [--pure] [--imatrix] [--include-weights] [--exclude-weights] [--output-tensor-type] [--token-embedding-type] [--override-kv] model-f32.gguf [model-quant.gguf] type [nthreads]

  --allow-requantize: Allows requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit
  --leave-output-tensor: Will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing
  --pure: Disable k-quant mixtures and quantize all tensors to the same type
  --imatrix file_name: use data in file_name as importance matrix for quant optimizations
  --include-weights tensor_name: use importance matrix for this/these tensor(s)
  --exclude-weights tensor_name: use importance matrix for this/these tensor(s)
  --output-tensor-type ggml_type: use this ggml_type for the output.weight tensor
  --token-embedding-type ggml_type: use this ggml_type for the token embeddings tensor
  --keep-split: will generate quatized model in the same shards as input  --override-kv KEY=TYPE:VALUE
      Advanced option to override model metadata by key in the quantized model. May be specified multiple times.
Note: --include-weights and --exclude-weights cannot be used together

Allowed quantization types:
   2  or  Q4_0    :  3.56G, +0.2166 ppl @ LLaMA-v1-7B
   3  or  Q4_1    :  3.90G, +0.1585 ppl @ LLaMA-v1-7B
   8  or  Q5_0    :  4.33G, +0.0683 ppl @ LLaMA-v1-7B
   9  or  Q5_1    :  4.70G, +0.0349 ppl @ LLaMA-v1-7B
  19  or  IQ2_XXS :  2.06 bpw quantization
  20  or  IQ2_XS  :  2.31 bpw quantization
  28  or  IQ2_S   :  2.5  bpw quantization
  29  or  IQ2_M   :  2.7  bpw quantization
  24  or  IQ1_S   :  1.56 bpw quantization
  31  or  IQ1_M   :  1.75 bpw quantization
  10  or  Q2_K    :  2.63G, +0.6717 ppl @ LLaMA-v1-7B
  21  or  Q2_K_S  :  2.16G, +9.0634 ppl @ LLaMA-v1-7B
  23  or  IQ3_XXS :  3.06 bpw quantization
  26  or  IQ3_S   :  3.44 bpw quantization
  27  or  IQ3_M   :  3.66 bpw quantization mix
  12  or  Q3_K    : alias for Q3_K_M
  22  or  IQ3_XS  :  3.3 bpw quantization
  11  or  Q3_K_S  :  2.75G, +0.5551 ppl @ LLaMA-v1-7B
  12  or  Q3_K_M  :  3.07G, +0.2496 ppl @ LLaMA-v1-7B
  13  or  Q3_K_L  :  3.35G, +0.1764 ppl @ LLaMA-v1-7B
  25  or  IQ4_NL  :  4.50 bpw non-linear quantization
  30  or  IQ4_XS  :  4.25 bpw non-linear quantization
  15  or  Q4_K    : alias for Q4_K_M
  14  or  Q4_K_S  :  3.59G, +0.0992 ppl @ LLaMA-v1-7B
  15  or  Q4_K_M  :  3.80G, +0.0532 ppl @ LLaMA-v1-7B
  17  or  Q5_K    : alias for Q5_K_M
  16  or  Q5_K_S  :  4.33G, +0.0400 ppl @ LLaMA-v1-7B
  17  or  Q5_K_M  :  4.45G, +0.0122 ppl @ LLaMA-v1-7B
  18  or  Q6_K    :  5.15G, +0.0008 ppl @ LLaMA-v1-7B
   7  or  Q8_0    :  6.70G, +0.0004 ppl @ LLaMA-v1-7B
   1  or  F16     : 14.00G, -0.0020 ppl @ Mistral-7B
  32  or  BF16    : 14.00G, -0.0050 ppl @ Mistral-7B
   0  or  F32     : 26.00G              @ 7B
          COPY    : only copy tensors, no quantizing

使用quantize量化模型,它提供各种量化位数的模型:Q2、Q3、Q4、Q5、Q6、Q8、F16。

量化模型的命名方法遵循: Q + 量化比特位 + 变种。量化位数越少,对硬件资源的要求越低,但是模型的精度也越低。

模型经过量化之后,可以发现模型的大小从15G降低到8G,但模型精度从16位浮点数降低到8位整数。

python复制代码root@master:~/work/llama.cpp# ./quantize ./models/Llama3-FP16.gguf  ./models/Llama3-q8.gguf q8_0
main: build = 2908 (359cbe3f)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: quantizing '/root/work/models/Llama3-FP16.gguf' to '/root/work/models/Llama3-q8.gguf' as Q8_0
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /root/work/models/Llama3-FP16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Llama3-Chinese-8B-Instruct
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                          llama.block_count u32              = 32
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   7:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  10:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  11:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 1
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  15:                      tokenizer.ggml.scores arr[f32,128256]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type  f16:  226 tensors
[   1/ 291]                    token_embd.weight - [ 4096, 128256,     1,     1], type =    f16, converting to q8_0 .. size =  1002.00 MiB ->   532.31 MiB
[   2/ 291]               blk.0.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[   3/ 291]                blk.0.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q8_0 .. size =   112.00 MiB ->    59.50 MiB
[   4/ 291]                blk.0.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q8_0 .. size =   112.00 MiB ->    59.50 MiB
[   5/ 291]                  blk.0.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q8_0 .. size =   112.00 MiB ->    59.50 MiB
[   6/ 291]                blk.0.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[   7/ 291]                  blk.0.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[   8/ 291]             blk.0.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[   9/ 291]                  blk.0.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[  10/ 291]                  blk.0.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[  11/ 291]               blk.1.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  12/ 291]                blk.1.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q8_0 .. size =   112.00 MiB ->    59.50 MiB
[  13/ 291]                blk.1.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q8_0 .. size =   112.00 MiB ->    59.50 MiB
[  14/ 291]                  blk.1.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q8_0 .. size =   112.00 MiB ->    59.50 MiB
[  15/ 291]                blk.1.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  16/ 291]                  blk.1.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[  17/ 291]             blk.1.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[  18/ 291]                  blk.1.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[  19/ 291]                  blk.1.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
python复制代码root@master:~/work/llama.cpp# ll -h models/
-rw-r--r--  1 root root 8.0G May 17 07:54 Llama3-q8.gguf

模型加载与推理

模型加载与推理使用main命令,其支持如下可用参数:

python复制代码root@master:~/work/llama.cpp# ./main -h

usage: ./main [options]

options:
  -h, --help            show this help message and exit
  --version             show version and build info
  -i, --interactive     run in interactive mode
  --interactive-specials allow special tokens in user text, in interactive mode
  --interactive-first   run in interactive mode and wait for input right away
  -cnv, --conversation  run in conversation mode (does not print special tokens and suffix/prefix)
  -ins, --instruct      run in instruction mode (use with Alpaca models)
  -cml, --chatml        run in chatml mode (use with ChatML-compatible models)
  --multiline-input     allows you to write or paste multiple lines without ending each in '\'
  -r PROMPT, --reverse-prompt PROMPT
                        halt generation at PROMPT, return control in interactive mode
                        (can be specified more than once for multiple prompts).
  --color               colorise output to distinguish prompt and user input from generations
  -s SEED, --seed SEED  RNG seed (default: -1, use random seed for < 0)
  -t N, --threads N     number of threads to use during generation (default: 30)
  -tb N, --threads-batch N
                        number of threads to use during batch and prompt processing (default: same as --threads)
  -td N, --threads-draft N                        number of threads to use during generation (default: same as --threads)
  -tbd N, --threads-batch-draft N
                        number of threads to use during batch and prompt processing (default: same as --threads-draft)
  -p PROMPT, --prompt PROMPT
                        prompt to start generation with (default: empty)

可以加载预训练模型或者经过量化之后的模型,这里选择加载量化后的模型进行推理。

在llama.cpp项目的根目录,执行如下命令,加载模型进行推理。

python复制代码root@master:~/work/llama.cpp# ./main -m models/Llama3-q8.gguf --color -f prompts/alpaca.txt -ins -c 2048 --temp 0.2 -n 256 --repeat_penalty 1.1
Log start
main: build = 2908 (359cbe3f)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1715935175
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from models/Llama3-q8.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Llama3-Chinese-8B-Instruct
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                          llama.block_count u32              = 32
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   7:                 llama.rope.dimension_count u32              = 128

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

<|begin_of_text|>Below is an instruction that describes a task. Write a response that appropriately completes the request.
> hi
Hello! How can I help you today?<|eot_id|>

>

在提示符>之后输入prompt,使用ctrl+c中断输出,多行信息以\作为行尾。执行./main -h命令查看帮助和参数说明,以下是一些常用的参数: `

命令描述
-m指定 LLaMA 模型文件的路径
-mu指定远程 http url 来下载文件
-i以交互模式运行程序,允许直接提供输入并接收实时响应。
-ins以指令模式运行程序,这在处理羊驼模型时特别有用。
-f指定prompt模板,alpaca模型请加载prompts/alpaca.txt
-n控制回复生成的最大长度(默认:128)
-c设置提示上下文的大小,值越大越能参考更长的对话历史(默认:512)
-b控制batch size(默认:8),可适当增加
-t控制线程数量(默认:4),可适当增加
--repeat_penalty控制生成回复中对重复文本的惩罚力度
--temp温度系数,值越低回复的随机性越小,反之越大
--top_p, top_k控制解码采样的相关参数
--color区分用户输入和生成的文本

模型API服务

llama.cpp提供了完全与OpenAI API兼容的API接口,使用经过编译生成的server可执行文件启动API服务。

python复制代码root@master:~/work/llama.cpp# ./server -m models/Llama3-q8.gguf --host 0.0.0.0 --port 8000
{"tid":"140018656950080","timestamp":1715936504,"level":"INFO","function":"main","line":2942,"msg":"build info","build":2908,"commit":"359cbe3f"}
{"tid":"140018656950080","timestamp":1715936504,"level":"INFO","function":"main","line":2947,"msg":"system info","n_threads":30,"n_threads_batch":-1,"total_threads":30,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "}
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from models/Llama3-q8.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Llama3-Chinese-8B-Instruct
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                          llama.block_count u32              = 32
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 14336

启动API服务后,可以使用curl命令进行测试

python复制代码curl --request POST \
    --url http://localhost:8000/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Hi"}'

模型API服务(第三方)

在llamm.cpp项目中有提到各种语言编写的第三方工具包,可以使用这些工具包提供API服务,这里以Python为例,使用llama-cpp-python提供API服务。

安装依赖

python复制代码pip install llama-cpp-python

pip install llama-cpp-python -i https://mirrors.aliyun.com/pypi/simple/

注意:可能还需要安装以下缺失依赖,可根据启动时的异常提示分别安装。

python

复制代码pip install sse_starlette starlette_context pydantic_settings

启动API服务,默认运行在http://localhost:8000

python

复制代码python -m llama_cpp.server --model models/Llama3-q8.gguf

安装openai依赖

python

复制代码pip install openai

使用openai调用API服务

python复制代码import os
from openai import OpenAI  # 导入OpenAI库

# 设置OpenAI的BASE_URL
os.environ["OPENAI_BASE_URL"] = "http://localhost:8000/v1"

client = OpenAI()  # 创建OpenAI客户端对象

# 调用模型
completion = client.chat.completions.create(
    model="llama3", # 任意填
    messages=[
        {"role": "system", "content": "你是一个乐于助人的助手。"},
        {"role": "user", "content": "你好!"}
    ]
)

# 输出模型回复
print(completion.choices[0].message)

在这里插入图片描述

GPU推理

如果编译构建了GPU执行环境,可以使用-ngl N --n-gpu-layers N参数,指定offload层数,让模型在GPU上运行推理

例如:-ngl 40表示offload 40层模型参数到GPU

未使用-ngl N --n-gpu-layers N参数,程序默认在CPU上运行

python

复制代码root@master:~/work/llama.cpp# ./server -m models/Llama3-FP16.gguf  --host 0.0.0.0 --port 8000

可从以下关键启动日志看出,模型并没有在GPU上执行

python复制代码ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Tesla V100S-PCIE-32GB, compute capability 7.0, VMM: yes
llm_load_tensors: ggml ctx size =    0.15 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors:        CPU buffer size =  8137.64 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512

使用-ngl N --n-gpu-layers N参数,程序默认在GPU上运行

python

复制代码root@master:~/work/llama.cpp# ./server -m models/Llama3-FP16.gguf  --host 0.0.0.0 --port 8000   --n-gpu-layers 1000

可从以下关键启动日志看出,模型在GPU上执行

python复制代码ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Tesla V100S-PCIE-32GB, compute capability 7.0, VMM: yes
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =  1002.00 MiB
llm_load_tensors:      CUDA0 buffer size = 14315.02 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0

执行nvidia-smi命令,可以进一步验证模型已在GPU上运行。 在这里插入图片描述

https://juejin.cn/theme/detail/7218019389664067621?contentType=1)

那么,我们该如何学习大模型?

作为一名热心肠的互联网老兵,我决定把宝贵的AI知识分享给大家。 至于能学习到多少就看你的学习毅力和能力了 。我已将重要的AI大模型资料包括AI大模型入门学习思维导图、精品AI大模型学习书籍手册、视频教程、实战学习等录播视频免费分享出来。

一、大模型全套的学习路线

学习大型人工智能模型,如GPT-3、BERT或任何其他先进的神经网络模型,需要系统的方法和持续的努力。既然要系统的学习大模型,那么学习路线是必不可少的,下面的这份路线能帮助你快速梳理知识,形成自己的体系。

L1级别:AI大模型时代的华丽登场

L2级别:AI大模型API应用开发工程

L3级别:大模型应用架构进阶实践

L4级别:大模型微调与私有化部署

一般掌握到第四个级别,市场上大多数岗位都是可以胜任,但要还不是天花板,天花板级别要求更加严格,对于算法和实战是非常苛刻的。建议普通人掌握到L4级别即可。

以上的AI大模型学习路线,不知道为什么发出来就有点糊,高清版可以微信扫描下方CSDN官方认证二维码免费领取【保证100%免费

二、640套AI大模型报告合集

这套包含640份报告的合集,涵盖了AI大模型的理论研究、技术实现、行业应用等多个方面。无论您是科研人员、工程师,还是对AI大模型感兴趣的爱好者,这套报告合集都将为您提供宝贵的信息和启示。

img

三、大模型经典PDF籍

随着人工智能技术的飞速发展,AI大模型已经成为了当今科技领域的一大热点。这些大型预训练模型,如GPT-3、BERT、XLNet等,以其强大的语言理解和生成能力,正在改变我们对人工智能的认识。 那以下这些PDF籍就是非常不错的学习资源。

img

四、AI大模型商业化落地方案

img

作为普通人,入局大模型时代需要持续学习和实践,不断提高自己的技能和认知水平,同时也需要有责任感和伦理意识,为人工智能的健康发展贡献力量。

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:/a/711206.html

如若内容造成侵权/违法违规/事实不符,请联系我们进行投诉反馈qq邮箱809451989@qq.com,一经查实,立即删除!

相关文章

师彼长技以助己(6)递归思维

师彼长技以助己&#xff08;6&#xff09;递归思维 递归思维-小游戏 思维小游戏 思维 小游戏&#xff1a;1 玩一个从1或2开始往上加的游戏&#xff0c;谁加到20就赢 如何保证一定赢呢&#xff1f;我们倒推&#xff0c;要先到20的话&#xff0c;谁先到17就赢&#xff0c;如此…

固态u盘长期不用会丢数据吗?u盘数据丢失怎么恢复需要多久

在数字化时代&#xff0c;U盘作为便携存储设备&#xff0c;广泛应用于我们的日常生活和工作中。然而&#xff0c;关于固态U盘长期不使用是否会导致数据丢失的问题&#xff0c;以及数据丢失后如何恢复和所需的时间&#xff0c;常常让人感到困惑。本文将针对这些问题进行深入探讨…

汽车IVI中控开发入门及进阶(二十八):视频SERDES芯片

前言: SerDes不是很常见,SerDes是将Ser和Des两种产品组合在一起的名称。Ser是Serializer或“并串转换器”的缩写,Des是Deserializer或“串并转换器”的简写。 Serdes是不是必须的?上一节介绍了camera,上上节也研究了video decoder,那么带摄像头的应用应该具体选哪个方案…

建筑电工精选最新模拟试题(含答案)

一、填空题 1、我国安全生产的基本方针是 安全 第一&#xff0c;预防 为主&#xff0c;综合治理。 2、特种作业人员&#xff0c;必须积极主动参加培训与考核 。既是法律法规的规定&#xff0c;也是自身工作&#xff0c;生产及生命安全 的需要 3、触电急救&#x…

修改SubVI的LabVIEW默认搜索路径

在启动顶级VI后&#xff0c;LabVIEW可能会遇到找不到subVI的情况。这通常是由于subVI的路径发生了变化或没有被正确配置。 LabVIEW默认搜索路径 默认情况下&#xff0c;LabVIEW会按以下顺序搜索文件位置&#xff08;*表示LabVIEW将搜索子目录&#xff09;&#xff1a; <t…

锐尔15 锐尔文档扫描影像处理系统

锐尔文档扫描影像处理系统是一款全中文操作界面的文件、档案扫描及影像优化处理软件&#xff0c;是目前国内档案数字化行业里专业且优秀的影像优化处理软件&#xff0c;在档案数字化业内有广泛的影响且广受好评。 无论是从纸质文件制作高质量的影像文件&#xff0c;或是检查已经…

OpenDevin 环境配置及踩坑指南

不惧怕任何环境配置 首先 clone 项目&#xff0c;然后查看开发者文档&#xff1a;https://github.com/OpenDevin/OpenDevin/blob/main/Development.md make setup-config 自定义 LLM 配置 首先这个 devin 写的是支持自定义的 LLM 配置&#xff0c;并且提供了交互式命令供我们…

【Matlab编程学习】 | matlab语言编程基础:常用图形绘制基础学习

&#x1f3a9; 欢迎来到技术探索的奇幻世界&#x1f468;‍&#x1f4bb; &#x1f4dc; 个人主页&#xff1a;一伦明悦-CSDN博客 ✍&#x1f3fb; 作者简介&#xff1a; C软件开发、Python机器学习爱好者 &#x1f5e3;️ 互动与支持&#xff1a;&#x1f4ac;评论 &…

【仪器仪表/电源专题】浮地信号的测试的四种方案对比

接地信号和浮地信号区别 所有的电压测量都是差分测量&#xff0c;差分测量定义为两点之间的电压差。所以会分成两类&#xff1a; 1.参考地电平测量&#xff08;有时也叫接地信号&#xff09; 2.非参考地电平测量&#xff08;也称为浮地测量&#xff09; 测试信号可以分为接地信…

leetcode 56合并区间

思路 合并就是首先应该按照left左边界排序&#xff0c;排完序以后&#xff0c;如果i的左边界小于等于i-1的右边界&#xff0c;说明有重合&#xff0c;此时这两个可以合并&#xff0c;右边界应该取最大值。 代码 排序 我是定义了一个类,存储左右边界&#xff0c;先将数组转化…

【计算机视觉】人脸算法之图像处理基础知识(五)

图像的几何变换 3.图像的旋转 图像的旋转就是让图像按照某一点旋转到指定的角度。需要确定3个参数&#xff1a;图像的旋转中心、旋转角度和缩放因子。在openv中通过getRotationMatrix2D()函数来实现图像的旋转。 import cv2 import numpy as npimgpath "images/img1.j…

logTrick

贴一下灵神的题解里面的解释~ 就是一种优化策略&#xff0c;logtrick class Solution { public:int minimumDifference(vector<int>& nums, int k) {int res 0x3f3f3f3f;int n nums.size();for(int i0;i<n;i){res min(res,abs(nums[i]-k));for(int j i-1;j&g…

视频转换器在线哪个好?让视频播放不受格式限制

在日常的视频观看中&#xff0c;我们可能会遇到视频格式与设备不兼容的问题&#xff0c;导致无法顺畅播放。这就像是缺少了播放的钥匙&#xff0c;让人无法享受视频内容。 面对视频格式不兼容的挑战&#xff0c;选择合适的转换工具至关重要。但不用担心&#xff0c;本文将分享…

OpenGL3.3_C++_Windows(10)

最终演示 ​ demo演示 Assimp模型渲染 模型导入库Assimp&#xff1a;导入很多种不同的模型文件格式&#xff0c;加载至Assimp的通用数据结构&#xff08;树形&#xff09;中&#xff0c;不论导入的是什么种类的文件格式&#xff0c;用同一种方式访问我们需要的数据。 Assimp库…

从开源EPR产品Odoo学习

前言 一个先进、敏捷、经济高效、可快速扩展的Odoo免费开源企业信息化解决方案,让企业获得适应未来发展的长期创新和增长能力。 Odoo 的免费开源模式 让我们可利用无数开发人员和业务专家,在短短数年内,打造数百款应用。凭借强大的技术基础,Odoo 的框架是非常独特且优秀的…

0614,表达式,语句

题目一&#xff1a; 许多简单的交互式程序都是基于菜单的&#xff1a;它们向用户显示可供选择的命令列表&#xff1b;一旦用户选择了某条命令&#xff0c;程序就执行相应的操作&#xff0c;然后提示用户输入下一条命令&#xff1b;这个过程一直会持续到用户选择 "退出&qu…

图论(一)之概念介绍与图形#matlab

图论&#xff08;一&#xff09;之概念介绍与图形目录 前言 一、图论介绍 二、基本概念 2.1图的概念 2.2图形分类 2.3邻接矩阵 2.3.1无向图 2.3.2有向图 2.3.3有向赋权图 2.4出度&#xff08;Outdegree&#xff09; 2.5入度&#xff08;Indegree&#xff09; 3.四种…

Apollo9.0 PNC源码学习之Control模块(五)—— 基于LQR的横向控制

前面文章&#xff1a; Apollo9.0 PNC源码学习之Control模块&#xff08;一&#xff09; Apollo9.0 PNC源码学习之Control模块&#xff08;二&#xff09; Apollo9.0 PNC源码学习之Control模块&#xff08;三&#xff09; Apollo9.0 PNC源码学习之Control模块&#xff08;四&…

GenICam标准(六)

系列文章目录 GenICam标准&#xff08;一&#xff09; GenICam标准&#xff08;二&#xff09; GenICam标准&#xff08;三&#xff09; GenICam标准&#xff08;四&#xff09; GenICam标准&#xff08;五&#xff09; GenICam标准&#xff08;六&#xff09; 文章目录 系列文…

鸿蒙用 BuilderParam 实现同一个布局不同内容组件

面通过一个案例展示BuilderParam的具体用法&#xff0c;例如&#xff0c;现需要实现一个通用的卡片组件&#xff0c;如下图所示 卡片中显示的内容不固定&#xff0c;例如 具体实现代码如下&#xff1a; Entry Component struct BuildParamDemo {build() {Column(){Card() {imag…