下载源码
cd ~/Downloads/ai
git clone --depth=1 https://gitee.com/ymcui/Chinese-LLaMA-Alpaca-2
创建venv
python3 -m venv venv
source venv/bin/activate
安装依赖
pip install -r requirements.txt
已安装依赖列表
(venv) yeqiang@yeqiang-MS-7B23:~/Downloads/ai/Chinese-LLaMA-Alpaca-2$ pip list
Package Version
------------------------ ----------
accelerate 0.26.1
bitsandbytes 0.41.1
certifi 2023.11.17
charset-normalizer 3.3.2
cmake 3.28.1
filelock 3.13.1
fsspec 2023.12.2
huggingface-hub 0.17.3
idna 3.6
Jinja2 3.1.3
lit 17.0.6
MarkupSafe 2.1.3
mpmath 1.3.0
networkx 3.2.1
numpy 1.26.3
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-cupti-cu11 11.7.101
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
nvidia-cufft-cu11 10.9.0.58
nvidia-curand-cu11 10.2.10.91
nvidia-cusolver-cu11 11.4.0.1
nvidia-cusparse-cu11 11.7.4.91
nvidia-nccl-cu11 2.14.3
nvidia-nvtx-cu11 11.7.91
packaging 23.2
peft 0.3.0
pip 22.0.2
psutil 5.9.7
PyYAML 6.0.1
regex 2023.12.25
requests 2.31.0
safetensors 0.4.1
sentencepiece 0.1.99
setuptools 59.6.0
sympy 1.12
tokenizers 0.14.1
torch 2.0.1
tqdm 4.66.1
transformers 4.35.0
triton 2.0.0
typing_extensions 4.9.0
urllib3 2.1.0
wheel 0.42.0
下载编译llama.cpp
cd ~/Downloads/ai/
git clone --depth=1 https://gh.api.99988866.xyz/https://github.com/ggerganov/llama.cpp
cd llma.cpp
make -j6
编译成功
创建软链接
cd ~/Downloads/ai/Chinese-LLaMA-Alpaca-2/scripts/llama-cpp/
ln -s ~/Downloads/ai/llama.cpp/main .
下载模型
由于只有6G显存,只下载基础的对话模型chinese-alpaca-2-1.3b
浏览器打开地址:hfl/chinese-alpaca-2-1.3b at main
放到~/Downloads/ai 目录下
启动chat报错
继续折腾:
这两个文件需要手动在浏览器内下载到~/Downloads/ai/chinese-alpaca-2-1.3b
参考文档
转换模型
rm models/ -rf
mkdir models
cp ~/Downloads/ai/chinese-alpaca-2-1.3b models/ -v
python ~/Downloads/ai/llama.cpp/convert.py models/chinese-alpaca-2-1.3b/
转换日志
(venv) yeqiang@yeqiang-MS-7B23:~/Downloads/ai/Chinese-LLaMA-Alpaca-2$ python ~/Downloads/ai/llama.cpp/convert.py models/chinese-alpaca-2-1.3b/
/home/yeqiang/Downloads/ai/llama.cpp/gguf-py
Loading model file models/chinese-alpaca-2-1.3b/pytorch_model.bin
params = Params(n_vocab=55296, n_embd=4096, n_layer=4, n_ctx=4096, n_ff=11008, n_head=32, n_head_kv=32, f_norm_eps=1e-05, n_experts=None, n_experts_used=None, rope_scaling_type=None, f_rope_freq_base=10000.0, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=None, path_model=PosixPath('models/chinese-alpaca-2-1.3b'))
Loading vocab file 'models/chinese-alpaca-2-1.3b/tokenizer.model', type 'spm'
Permuting layer 0
Permuting layer 1
Permuting layer 2
Permuting layer 3
model.embed_tokens.weight -> token_embd.weight | F16 | [55296, 4096]
model.layers.0.self_attn.q_proj.weight -> blk.0.attn_q.weight | F16 | [4096, 4096]
model.layers.0.self_attn.k_proj.weight -> blk.0.attn_k.weight | F16 | [4096, 4096]
model.layers.0.self_attn.v_proj.weight -> blk.0.attn_v.weight | F16 | [4096, 4096]
model.layers.0.self_attn.o_proj.weight -> blk.0.attn_output.weight | F16 | [4096, 4096]
skipping tensor blk.0.attn_rot_embd
model.layers.0.mlp.gate_proj.weight -> blk.0.ffn_gate.weight | F16 | [11008, 4096]
model.layers.0.mlp.up_proj.weight -> blk.0.ffn_up.weight | F16 | [11008, 4096]
model.layers.0.mlp.down_proj.weight -> blk.0.ffn_down.weight | F16 | [4096, 11008]
model.layers.0.input_layernorm.weight -> blk.0.attn_norm.weight | F16 | [4096]
model.layers.0.post_attention_layernorm.weight -> blk.0.ffn_norm.weight | F16 | [4096]
model.layers.1.self_attn.q_proj.weight -> blk.1.attn_q.weight | F16 | [4096, 4096]
model.layers.1.self_attn.k_proj.weight -> blk.1.attn_k.weight | F16 | [4096, 4096]
model.layers.1.self_attn.v_proj.weight -> blk.1.attn_v.weight | F16 | [4096, 4096]
model.layers.1.self_attn.o_proj.weight -> blk.1.attn_output.weight | F16 | [4096, 4096]
skipping tensor blk.1.attn_rot_embd
model.layers.1.mlp.gate_proj.weight -> blk.1.ffn_gate.weight | F16 | [11008, 4096]
model.layers.1.mlp.up_proj.weight -> blk.1.ffn_up.weight | F16 | [11008, 4096]
model.layers.1.mlp.down_proj.weight -> blk.1.ffn_down.weight | F16 | [4096, 11008]
model.layers.1.input_layernorm.weight -> blk.1.attn_norm.weight | F16 | [4096]
model.layers.1.post_attention_layernorm.weight -> blk.1.ffn_norm.weight | F16 | [4096]
model.layers.2.self_attn.q_proj.weight -> blk.2.attn_q.weight | F16 | [4096, 4096]
model.layers.2.self_attn.k_proj.weight -> blk.2.attn_k.weight | F16 | [4096, 4096]
model.layers.2.self_attn.v_proj.weight -> blk.2.attn_v.weight | F16 | [4096, 4096]
model.layers.2.self_attn.o_proj.weight -> blk.2.attn_output.weight | F16 | [4096, 4096]
skipping tensor blk.2.attn_rot_embd
model.layers.2.mlp.gate_proj.weight -> blk.2.ffn_gate.weight | F16 | [11008, 4096]
model.layers.2.mlp.up_proj.weight -> blk.2.ffn_up.weight | F16 | [11008, 4096]
model.layers.2.mlp.down_proj.weight -> blk.2.ffn_down.weight | F16 | [4096, 11008]
model.layers.2.input_layernorm.weight -> blk.2.attn_norm.weight | F16 | [4096]
model.layers.2.post_attention_layernorm.weight -> blk.2.ffn_norm.weight | F16 | [4096]
model.layers.3.self_attn.q_proj.weight -> blk.3.attn_q.weight | F16 | [4096, 4096]
model.layers.3.self_attn.k_proj.weight -> blk.3.attn_k.weight | F16 | [4096, 4096]
model.layers.3.self_attn.v_proj.weight -> blk.3.attn_v.weight | F16 | [4096, 4096]
model.layers.3.self_attn.o_proj.weight -> blk.3.attn_output.weight | F16 | [4096, 4096]
skipping tensor blk.3.attn_rot_embd
model.layers.3.mlp.gate_proj.weight -> blk.3.ffn_gate.weight | F16 | [11008, 4096]
model.layers.3.mlp.up_proj.weight -> blk.3.ffn_up.weight | F16 | [11008, 4096]
model.layers.3.mlp.down_proj.weight -> blk.3.ffn_down.weight | F16 | [4096, 11008]
model.layers.3.input_layernorm.weight -> blk.3.attn_norm.weight | F16 | [4096]
model.layers.3.post_attention_layernorm.weight -> blk.3.ffn_norm.weight | F16 | [4096]
model.norm.weight -> output_norm.weight | F16 | [4096]
lm_head.weight -> output.weight | F16 | [55296, 4096]
Writing models/chinese-alpaca-2-1.3b/ggml-model-f16.gguf, format 1
Ignoring added_tokens.json since model matches vocab size without it.
gguf: This GGUF file is for Little Endian only
gguf: Setting special token type bos to 1
gguf: Setting special token type eos to 2
gguf: Setting special token type pad to 0
gguf: Setting add_bos_token to True
gguf: Setting add_eos_token to False
[ 1/39] Writing tensor token_embd.weight | size 55296 x 4096 | type F16 | T+ 1
[ 2/39] Writing tensor blk.0.attn_q.weight | size 4096 x 4096 | type F16 | T+ 1
[ 3/39] Writing tensor blk.0.attn_k.weight | size 4096 x 4096 | type F16 | T+ 1
[ 4/39] Writing tensor blk.0.attn_v.weight | size 4096 x 4096 | type F16 | T+ 1
[ 5/39] Writing tensor blk.0.attn_output.weight | size 4096 x 4096 | type F16 | T+ 1
[ 6/39] Writing tensor blk.0.ffn_gate.weight | size 11008 x 4096 | type F16 | T+ 1
[ 7/39] Writing tensor blk.0.ffn_up.weight | size 11008 x 4096 | type F16 | T+ 1
[ 8/39] Writing tensor blk.0.ffn_down.weight | size 4096 x 11008 | type F16 | T+ 1
[ 9/39] Writing tensor blk.0.attn_norm.weight | size 4096 | type F32 | T+ 2
[10/39] Writing tensor blk.0.ffn_norm.weight | size 4096 | type F32 | T+ 2
[11/39] Writing tensor blk.1.attn_q.weight | size 4096 x 4096 | type F16 | T+ 2
[12/39] Writing tensor blk.1.attn_k.weight | size 4096 x 4096 | type F16 | T+ 2
[13/39] Writing tensor blk.1.attn_v.weight | size 4096 x 4096 | type F16 | T+ 2
[14/39] Writing tensor blk.1.attn_output.weight | size 4096 x 4096 | type F16 | T+ 2
[15/39] Writing tensor blk.1.ffn_gate.weight | size 11008 x 4096 | type F16 | T+ 2
[16/39] Writing tensor blk.1.ffn_up.weight | size 11008 x 4096 | type F16 | T+ 2
[17/39] Writing tensor blk.1.ffn_down.weight | size 4096 x 11008 | type F16 | T+ 2
[18/39] Writing tensor blk.1.attn_norm.weight | size 4096 | type F32 | T+ 2
[19/39] Writing tensor blk.1.ffn_norm.weight | size 4096 | type F32 | T+ 2
[20/39] Writing tensor blk.2.attn_q.weight | size 4096 x 4096 | type F16 | T+ 2
[21/39] Writing tensor blk.2.attn_k.weight | size 4096 x 4096 | type F16 | T+ 2
[22/39] Writing tensor blk.2.attn_v.weight | size 4096 x 4096 | type F16 | T+ 2
[23/39] Writing tensor blk.2.attn_output.weight | size 4096 x 4096 | type F16 | T+ 2
[24/39] Writing tensor blk.2.ffn_gate.weight | size 11008 x 4096 | type F16 | T+ 2
[25/39] Writing tensor blk.2.ffn_up.weight | size 11008 x 4096 | type F16 | T+ 2
[26/39] Writing tensor blk.2.ffn_down.weight | size 4096 x 11008 | type F16 | T+ 2
[27/39] Writing tensor blk.2.attn_norm.weight | size 4096 | type F32 | T+ 2
[28/39] Writing tensor blk.2.ffn_norm.weight | size 4096 | type F32 | T+ 2
[29/39] Writing tensor blk.3.attn_q.weight | size 4096 x 4096 | type F16 | T+ 2
[30/39] Writing tensor blk.3.attn_k.weight | size 4096 x 4096 | type F16 | T+ 2
[31/39] Writing tensor blk.3.attn_v.weight | size 4096 x 4096 | type F16 | T+ 2
[32/39] Writing tensor blk.3.attn_output.weight | size 4096 x 4096 | type F16 | T+ 2
[33/39] Writing tensor blk.3.ffn_gate.weight | size 11008 x 4096 | type F16 | T+ 3
[34/39] Writing tensor blk.3.ffn_up.weight | size 11008 x 4096 | type F16 | T+ 3
[35/39] Writing tensor blk.3.ffn_down.weight | size 4096 x 11008 | type F16 | T+ 4
[36/39] Writing tensor blk.3.attn_norm.weight | size 4096 | type F32 | T+ 4
[37/39] Writing tensor blk.3.ffn_norm.weight | size 4096 | type F32 | T+ 4
[38/39] Writing tensor output_norm.weight | size 4096 | type F32 | T+ 4
[39/39] Writing tensor output.weight | size 55296 x 4096 | type F16 | T+ 4
Wrote models/chinese-alpaca-2-1.3b/ggml-model-f16.gguf
(venv) yeqiang@yeqiang-MS-7B23:~/Downloads/ai/Chinese-LLaMA-Alpaca-2$
进一步对FP16模型进行4-bit量化
~/Downloads/ai/llama.cpp/quantize models/chinese-alpaca-2-1.3b/ggml-model-f16.gguf models/chinese-alpaca-2-1.3b/ggml-model-q4_0.bin q4_0
日志
(venv) yeqiang@yeqiang-MS-7B23:~/Downloads/ai/Chinese-LLaMA-Alpaca-2$ ~/Downloads/ai/llama.cpp/quantize models/chinese-alpaca-2-1.3b/ggml-model-f16.gguf models/chinese-alpaca-2-1.3b/ggml-model-q4_0.bin q4_0
main: build = 1 (5c99960)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: quantizing 'models/chinese-alpaca-2-1.3b/ggml-model-f16.gguf' to 'models/chinese-alpaca-2-1.3b/ggml-model-q4_0.bin' as Q4_0
llama_model_loader: loaded meta data with 21 key-value pairs and 39 tensors from models/chinese-alpaca-2-1.3b/ggml-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = LLaMA v2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 4
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 11008
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 1
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,55296] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,55296] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,55296] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 20: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - type f32: 9 tensors
llama_model_loader: - type f16: 30 tensors
llama_model_quantize_internal: meta size = 1233920 bytes
[ 1/ 39] token_embd.weight - [ 4096, 55296, 1, 1], type = f16, quantizing to q4_0 .. size = 432.00 MiB -> 121.50 MiB | hist: 0.037 0.016 0.026 0.039 0.057 0.077 0.096 0.110 0.116 0.110 0.096 0.077 0.057 0.039 0.026 0.021
[ 2/ 39] blk.0.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, quantizing to q4_0 .. size = 32.00 MiB -> 9.00 MiB | hist: 0.036 0.016 0.027 0.040 0.056 0.074 0.092 0.109 0.121 0.110 0.093 0.076 0.058 0.042 0.027 0.021
[ 3/ 39] blk.0.attn_k.weight - [ 4096, 4096, 1, 1], type = f16, quantizing to q4_0 .. size = 32.00 MiB -> 9.00 MiB | hist: 0.035 0.012 0.019 0.031 0.047 0.069 0.097 0.130 0.152 0.130 0.097 0.069 0.047 0.030 0.019 0.015
[ 4/ 39] blk.0.attn_v.weight - [ 4096, 4096, 1, 1], type = f16, quantizing to q4_0 .. size = 32.00 MiB -> 9.00 MiB | hist: 0.036 0.015 0.024 0.037 0.054 0.075 0.097 0.115 0.123 0.115 0.097 0.075 0.054 0.037 0.024 0.020
[ 5/ 39] blk.0.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, quantizing to q4_0 .. size = 32.00 MiB -> 9.00 MiB | hist: 0.035 0.012 0.020 0.032 0.049 0.072 0.099 0.126 0.138 0.126 0.100 0.072 0.049 0.032 0.020 0.017
[ 6/ 39] blk.0.ffn_gate.weight - [ 4096, 11008, 1, 1], type = f16, quantizing to q4_0 .. size = 86.00 MiB -> 24.19 MiB | hist: 0.036 0.016 0.025 0.039 0.056 0.077 0.096 0.112 0.117 0.112 0.097 0.077 0.056 0.039 0.025 0.021
[ 7/ 39] blk.0.ffn_up.weight - [ 4096, 11008, 1, 1], type = f16, quantizing to q4_0 .. size = 86.00 MiB -> 24.19 MiB | hist: 0.036 0.016 0.025 0.039 0.056 0.077 0.097 0.111 0.117 0.111 0.097 0.077 0.056 0.039 0.025 0.021
[ 8/ 39] blk.0.ffn_down.weight - [11008, 4096, 1, 1], type = f16, quantizing to q4_0 .. size = 86.00 MiB -> 24.19 MiB | hist: 0.036 0.016 0.025 0.039 0.056 0.077 0.097 0.112 0.117 0.112 0.097 0.077 0.056 0.039 0.025 0.021
[ 9/ 39] blk.0.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 10/ 39] blk.0.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 11/ 39] blk.1.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, quantizing to q4_0 .. size = 32.00 MiB -> 9.00 MiB | hist: 0.036 0.013 0.021 0.033 0.050 0.072 0.098 0.123 0.137 0.123 0.098 0.072 0.050 0.033 0.021 0.017
[ 12/ 39] blk.1.attn_k.weight - [ 4096, 4096, 1, 1], type = f16, quantizing to q4_0 .. size = 32.00 MiB -> 9.00 MiB | hist: 0.036 0.013 0.021 0.033 0.050 0.073 0.098 0.123 0.136 0.123 0.099 0.073 0.051 0.033 0.021 0.017
[ 13/ 39] blk.1.attn_v.weight - [ 4096, 4096, 1, 1], type = f16, quantizing to q4_0 .. size = 32.00 MiB -> 9.00 MiB | hist: 0.036 0.015 0.024 0.037 0.055 0.076 0.097 0.114 0.122 0.114 0.097 0.076 0.055 0.038 0.024 0.020
[ 14/ 39] blk.1.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, quantizing to q4_0 .. size = 32.00 MiB -> 9.00 MiB | hist: 0.036 0.015 0.025 0.038 0.056 0.076 0.097 0.112 0.118 0.112 0.097 0.077 0.056 0.038 0.025 0.020
[ 15/ 39] blk.1.ffn_gate.weight - [ 4096, 11008, 1, 1], type = f16, quantizing to q4_0 .. size = 86.00 MiB -> 24.19 MiB | hist: 0.036 0.016 0.025 0.039 0.056 0.077 0.097 0.111 0.117 0.111 0.096 0.077 0.057 0.039 0.025 0.021
[ 16/ 39] blk.1.ffn_up.weight - [ 4096, 11008, 1, 1], type = f16, quantizing to q4_0 .. size = 86.00 MiB -> 24.19 MiB | hist: 0.036 0.016 0.025 0.039 0.056 0.077 0.096 0.111 0.117 0.112 0.097 0.077 0.056 0.039 0.025 0.021
[ 17/ 39] blk.1.ffn_down.weight - [11008, 4096, 1, 1], type = f16, quantizing to q4_0 .. size = 86.00 MiB -> 24.19 MiB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.117 0.111 0.096 0.077 0.057 0.039 0.025 0.021
[ 18/ 39] blk.1.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 19/ 39] blk.1.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 20/ 39] blk.2.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, quantizing to q4_0 .. size = 32.00 MiB -> 9.00 MiB | hist: 0.036 0.015 0.024 0.037 0.054 0.075 0.097 0.116 0.125 0.116 0.097 0.075 0.054 0.037 0.024 0.020
[ 21/ 39] blk.2.attn_k.weight - [ 4096, 4096, 1, 1], type = f16, quantizing to q4_0 .. size = 32.00 MiB -> 9.00 MiB | hist: 0.036 0.015 0.024 0.037 0.054 0.075 0.097 0.116 0.126 0.116 0.097 0.075 0.054 0.037 0.024 0.019
[ 22/ 39] blk.2.attn_v.weight - [ 4096, 4096, 1, 1], type = f16, quantizing to q4_0 .. size = 32.00 MiB -> 9.00 MiB | hist: 0.036 0.016 0.025 0.039 0.056 0.076 0.096 0.112 0.119 0.112 0.096 0.076 0.056 0.039 0.025 0.021
[ 23/ 39] blk.2.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, quantizing to q4_0 .. size = 32.00 MiB -> 9.00 MiB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021
[ 24/ 39] blk.2.ffn_gate.weight - [ 4096, 11008, 1, 1], type = f16, quantizing to q4_0 .. size = 86.00 MiB -> 24.19 MiB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021
[ 25/ 39] blk.2.ffn_up.weight - [ 4096, 11008, 1, 1], type = f16, quantizing to q4_0 .. size = 86.00 MiB -> 24.19 MiB | hist: 0.036 0.016 0.025 0.039 0.056 0.077 0.096 0.111 0.117 0.111 0.097 0.077 0.057 0.039 0.025 0.021
[ 26/ 39] blk.2.ffn_down.weight - [11008, 4096, 1, 1], type = f16, quantizing to q4_0 .. size = 86.00 MiB -> 24.19 MiB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.117 0.111 0.097 0.077 0.057 0.039 0.025 0.021
[ 27/ 39] blk.2.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 28/ 39] blk.2.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 29/ 39] blk.3.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, quantizing to q4_0 .. size = 32.00 MiB -> 9.00 MiB | hist: 0.036 0.015 0.024 0.038 0.055 0.076 0.097 0.113 0.121 0.113 0.097 0.076 0.055 0.038 0.025 0.020
[ 30/ 39] blk.3.attn_k.weight - [ 4096, 4096, 1, 1], type = f16, quantizing to q4_0 .. size = 32.00 MiB -> 9.00 MiB | hist: 0.036 0.015 0.024 0.038 0.055 0.076 0.097 0.114 0.121 0.114 0.097 0.076 0.055 0.038 0.024 0.020
[ 31/ 39] blk.3.attn_v.weight - [ 4096, 4096, 1, 1], type = f16, quantizing to q4_0 .. size = 32.00 MiB -> 9.00 MiB | hist: 0.036 0.016 0.025 0.039 0.056 0.076 0.096 0.112 0.118 0.112 0.096 0.076 0.056 0.039 0.025 0.021
[ 32/ 39] blk.3.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, quantizing to q4_0 .. size = 32.00 MiB -> 9.00 MiB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021
[ 33/ 39] blk.3.ffn_gate.weight - [ 4096, 11008, 1, 1], type = f16, quantizing to q4_0 .. size = 86.00 MiB -> 24.19 MiB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021
[ 34/ 39] blk.3.ffn_up.weight - [ 4096, 11008, 1, 1], type = f16, quantizing to q4_0 .. size = 86.00 MiB -> 24.19 MiB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021
[ 35/ 39] blk.3.ffn_down.weight - [11008, 4096, 1, 1], type = f16, quantizing to q4_0 .. size = 86.00 MiB -> 24.19 MiB | hist: 0.036 0.016 0.025 0.039 0.056 0.077 0.096 0.111 0.117 0.111 0.097 0.077 0.057 0.039 0.025 0.021
[ 36/ 39] blk.3.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 37/ 39] blk.3.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 38/ 39] output_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 39/ 39] output.weight - [ 4096, 55296, 1, 1], type = f16, quantizing to q6_K .. size = 432.00 MiB -> 177.19 MiB
llama_model_quantize_internal: model size = 2408.14 MB
llama_model_quantize_internal: quant size = 733.08 MB
llama_model_quantize_internal: hist: 0.036 0.015 0.025 0.038 0.056 0.076 0.096 0.112 0.119 0.112 0.097 0.076 0.056 0.038 0.025 0.021
main: quantize time = 5131.57 ms
main: total time = 5131.57 ms
启动chat.sh
mv scripts/llama-cpp/main .
bash scripts/llama-cpp/chat.sh models/chinese-alpaca-2-1.3b/ggml-model-q4_0.bin
启动成功了,日志
(venv) yeqiang@yeqiang-MS-7B23:~/Downloads/ai/Chinese-LLaMA-Alpaca-2$ bash scripts/llama-cpp/chat.sh models/chinese-alpaca-2-1.3b/ggml-model-q4_0.bin
Log start
main: build = 1 (5c99960)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1705481300
llama_model_loader: loaded meta data with 22 key-value pairs and 39 tensors from models/chinese-alpaca-2-1.3b/ggml-model-q4_0.bin (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = LLaMA v2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 4
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 11008
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 2
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,55296] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,55296] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,55296] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 20: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - type f32: 9 tensors
llama_model_loader: - type q4_0: 29 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: mismatch in special tokens definition ( 889/55296 vs 259/55296 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 55296
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 4
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 4096
llm_load_print_meta: n_embd_v_gqa = 4096
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 11008
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 1.26 B
llm_load_print_meta: model size = 733.08 MiB (4.87 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.01 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/5 layers to GPU
llm_load_tensors: CPU buffer size = 733.08 MiB
..............................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 256.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: graph splits (measure): 1
llama_new_context_with_model: CPU compute buffer size = 288.00 MiB
system_info: n_threads = 8 / 6 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
main: interactive mode on.
Input prefix with BOS
Input prefix: ' [INST] '
Input suffix: ' [/INST]'
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.900, min_p = 0.050, typical_p = 1.000, temp = 0.500
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 4096, n_batch = 512, n_predict = -1, n_keep = 0
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to LLaMa.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
[INST] <<SYS>>
You are a helpful assistant. 你是一个乐于助人的助手。
<</SYS>>
[/INST] 您好,有什么我可以帮助您的吗?
[INST]
这是完全基于CPU实现的?
编译llama.cpp项目没有启动cuda?
-----
试试web
参考资料
安装gradio
pip install gradio
报错
git下载模型,报错
手动把之前的模型拷贝进目录
启动gradio
安装xformers
(venv) yeqiang@yeqiang-MS-7B23:~/Downloads/ai/Chinese-LLaMA-Alpaca-2$ pip install xformers scipy
崩溃了。。。
github加速参考:
FAST-GitHub | Fast-GitHub
huggingface加速参考
hfl/chinese-alpaca-2-1.3b at main