[Hands-On Project] HuggingFace Tutorial: First Steps, Using HF for Some Small Tasks

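Hugging Face Tutorial

I. Preparation
II. Learning pipeline
- 2.1. Trial run: downloading a model with Hugging Face
- 2.2. Example 1: sentiment analysis (positive and negative only)
- 2.3. Example 2: text generation
III. Using Tokenizer & Model
- 3.1. The tokenizer is a key component for processing text data
- 3.2. How the tokenizer processes a string
IV. A simple use of PyTorch
V. Saving (save) & loading (load) a model
VI. Finding a model to play with on your own
VII. Fine-tuning your own model
VIII. Learning to use the Hugging Face documentation!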

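I. Preparation

1. Know how to create your own virtual environment with Conda
2. Know how to activate that virtual environment
3. Have some basic familiarity with PyTorch

Official requirements:
python 3.6+
pytorch 1.1.0+
TensorFlow 2.0+
Environment used in this article:
python 3.7.1
pytorch 1.13.1 py3.7_cuda11.7_cudnn8_0
tensorflow 1.15.0 (below the official 2.0+ requirement; the examples below use PyTorch)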

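II. Learning pipeline

pipeline is a high-level API in the Hugging Face Transformers library that simplifies running common natural language processing tasks. With a few lines of code it covers the whole flow from loading a model to running inference, without requiring knowledge of the model's architecture.

pipeline supports many common tasks, including:

- Text classification (e.g. sentiment analysis): classifies the input text and returns a class label and a confidence score.
- Question answering: answers a question based on a given context.
- Text generation (e.g. dialogue generation): generates a text continuation from an input prompt.
- Translation: translates text from one language into another.
- Fill-mask: predicts a missing (masked) word or phrase.

Specifying a task name, as in pipeline("sentiment-analysis"), loads a matching pretrained model and tokenizer directly, which makes development more efficient and intuitive.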
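
To make the "load, preprocess, run, postprocess" flow more concrete, here is a purely conceptual sketch in plain Python. ToySentimentPipeline, its word lists, and its scoring rule are hypothetical stand-ins invented for illustration. They are not part of transformers and say nothing about how a real model scores text. The sketch only mirrors the idea that one call runs three stages and returns a list of label/score dictionaries.

```python
# Conceptual sketch only: a toy stand-in for the stages a pipeline call runs.
# ToySentimentPipeline and its word lists are hypothetical, not transformers code.

class ToySentimentPipeline:
    POSITIVE_WORDS = {"love", "great", "good"}
    NEGATIVE_WORDS = {"worst", "bad", "hate"}

    def preprocess(self, text):
        # Stage 1: turn raw text into "model inputs" (here: lowercase words).
        return [word.strip(".,!?") for word in text.lower().split()]

    def forward(self, words):
        # Stage 2: run the "model" (here: count words from each list).
        pos = sum(word in self.POSITIVE_WORDS for word in words)
        neg = sum(word in self.NEGATIVE_WORDS for word in words)
        return pos, neg

    def postprocess(self, counts):
        # Stage 3: turn raw scores into a label and a confidence-like score.
        pos, neg = counts
        p_positive = (pos + 1) / (pos + neg + 2)  # smoothed share of positive words
        if p_positive >= 0.5:
            return {"label": "POSITIVE", "score": p_positive}
        return {"label": "NEGATIVE", "score": 1 - p_positive}

    def __call__(self, text):
        # One call chains the three stages and returns a list of results.
        return [self.postprocess(self.forward(self.preprocess(text)))]


toy = ToySentimentPipeline()
toy_positive = toy("I love you, my wife.")
toy_negative = toy("Suzhou is the worst place!")
print(toy_positive, toy_negative)
```

A real pipeline replaces stage 1 with a tokenizer and stage 2 with a pretrained neural network, but the calling pattern stays the same.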

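2.1. Trial run: downloading a model with Hugging Face

Models downloaded through pipeline are saved by default in the .cache folder under the user's home directory. On the author's machine the path is the one shown below (recent transformers releases use .cache/huggingface/hub instead):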

C:/Users/11874/.cache/huggingface/transformers/
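
If you are not sure where the cache lives on your own machine, a small check like the following can help. It is a minimal sketch that assumes the HF_HOME environment variable overrides the cache root and that the root otherwise defaults to .cache/huggingface in your home directory. The helper name hf_cache_root is made up for this example, and other overrides (such as XDG_CACHE_HOME) are ignored for simplicity.

```python
import os
from pathlib import Path


def hf_cache_root(env=None):
    """Return the assumed cache root: $HF_HOME if set, else ~/.cache/huggingface."""
    env = os.environ if env is None else env
    custom = env.get("HF_HOME")
    if custom:
        return Path(custom)
    return Path.home() / ".cache" / "huggingface"


root = hf_cache_root()
# Older releases used the "transformers" subfolder; newer ones use "hub".
for sub in ("transformers", "hub"):
    folder = root / sub
    print(folder, "exists" if folder.is_dir() else "not found")
```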


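The examples below set a proxy in code because downloading models from Hugging Face required a network proxy in the author's environment.
Look up the port number of your own proxy; this article uses port 7890.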
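
Since the same two proxy lines are repeated in every example, one option is to wrap them in a small helper. This is only a sketch: set_proxy is a hypothetical name, and 127.0.0.1 with port 7890 is the setup described in this article, so substitute your own values.

```python
import os


def set_proxy(port, host="127.0.0.1"):
    """Point both proxy variables at a local HTTP proxy and return the URL."""
    url = f"http://{host}:{port}"
    os.environ["http_proxy"] = url
    os.environ["https_proxy"] = url
    return url


proxy_url = set_proxy(7890)  # 7890 is this article's port; use your own
print(proxy_url)
```

Call it once at the top of a script, before any model is downloaded.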

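2.2. Example 1: sentiment analysis (positive and negative only)

# Import the pipeline function from the Hugging Face Transformers library.
# pipeline is a high-level API for calling a pretrained model on a specific task.
from transformers import pipeline

# Use os.environ to set the proxy address and port to 127.0.0.1:7890.
# This is typically needed when internet access goes through a proxy, and it
# helps with connection problems when downloading models from the Hugging Face Hub.
import os
os.environ["http_proxy"] = "http://127.0.0.1:7890"  # !!! use your own proxy port; it differs per machine
os.environ["https_proxy"] = "http://127.0.0.1:7890"  # !!! use your own proxy port; it differs per machine

# Create a sentiment-analysis pipeline. It automatically downloads and loads a default pretrained model
# for the task (for example a BERT- or DistilBERT-based model) and prepares it for inference.
classifier = pipeline("sentiment-analysis")

# Run sentiment analysis on the input sentence. The model predicts a label
# ("POSITIVE" or "NEGATIVE") and returns it together with a confidence score.
res = classifier("I have been waiting for a HuggingFace course my whole life.")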

print(res)

Output:

[{'label': 'POSITIVE', 'score': 0.9433633089065552}]
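
The pipeline returns a list with one dictionary per input, each holding a label and a score. A small sketch of how such a result might be consumed is shown below. The sample values and the 0.9 threshold are illustrative assumptions, not output from the model, and label_or_uncertain is a hypothetical helper.

```python
def label_or_uncertain(result, threshold=0.9):
    """Return the predicted label, or "UNCERTAIN" when the score is below threshold."""
    return result["label"] if result["score"] >= threshold else "UNCERTAIN"


# Illustrative values in the same shape as the classifier output.
sample_results = [
    {"label": "POSITIVE", "score": 0.94},
    {"label": "NEGATIVE", "score": 0.55},
]
decisions = [label_or_uncertain(r) for r in sample_results]
print(decisions)
```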

2.3. Example 2: text generation

from transformers import pipeline

import os
os.environ["http_proxy"] = "http://127.0.0.1:7890"  # !!! use your own proxy port; it differs per machine
os.environ["https_proxy"] = "http://127.0.0.1:7890"  # !!! use your own proxy port; it differs per machine

# Create a text-generation pipeline and specify the distilgpt2 model,
# a lighter version of GPT-2. pipeline downloads it from the Hugging Face Hub automatically.
generator = pipeline("text-generation", model="distilgpt2")

res = generator(
    "In my home, I have a",  # the prompt to continue
    max_length=30,  # upper limit of 30 tokens for the text, prompt included
    num_return_sequences=2,  # return two different generated sequences
)

print(res)

Output:

[{'generated_text': 'In my home, I have a daughter, my son and her own daughter, and I have a son and daughter whose mom has been a patient with'}, 
{'generated_text': 'In my home, I have a couple dogs. Those were all my pets.\n“I started out in the farmhouse. I used to'}]
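
Each generated_text above starts with the original prompt. If only the newly generated part is wanted, it can be cut off afterwards, as in this minimal sketch. The sample item is a shortened copy of the second output above, and continuation is a hypothetical helper name.

```python
prompt = "In my home, I have a"

# Shortened copy of one generated item, in the same shape as the output above.
sample_output = [
    {"generated_text": "In my home, I have a couple dogs. Those were all my pets."},
]


def continuation(prompt, item):
    """Return the generated text without the leading prompt, if it is present."""
    text = item["generated_text"]
    return text[len(prompt):] if text.startswith(prompt) else text


continuations = [continuation(prompt, item).strip() for item in sample_output]
print(continuations)
```

The text-generation pipeline also accepts return_full_text=False, which should give a similar result without post-processing.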

III. Using Tokenizer & Model

3.1. The tokenizer is a key component for processing text data

exp 1: uses the high-level pipeline interface, which loads a default pretrained sentiment-analysis model. This is simple and well suited to quick prototyping, but the code does not say which model and tokenizer are used.
exp 2: explicitly loads a specific model, distilbert-base-uncased-finetuned-sst-2-english, and its matching tokenizer. With AutoModelForSequenceClassification and AutoTokenizer you can choose and customize the model more flexibly. This suits fine-tuning or cases that need a particular model.

from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification

import os
os.environ["http_proxy"] = "http://127.0.0.1:7890"  # !!! use your own proxy port; it differs per machine
os.environ["https_proxy"] = "http://127.0.0.1:7890"  # !!! use your own proxy port; it differs per machine

# exp 1: high-level pipeline with the default sentiment-analysis model
classifier = pipeline("sentiment-analysis")
res = classifier("I have been waiting for a HuggingFace course my whole life.")
print(res)

# exp 2: explicitly load a specific model and its matching tokenizer
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
res = classifier("I have been waiting for a HuggingFace course my whole life.")
print(res)

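Output (the two results match, which suggests that the default model in exp 1 is the same checkpoint as the one loaded in exp 2):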

[{'label': 'POSITIVE', 'score': 0.9433633089065552}]
[{'label': 'POSITIVE', 'score': 0.9433633089065552}]

3.2. How the tokenizer processes a string

from transformers import AutoTokenizer, AutoModelForSequenceClassification

import os
os.environ["http_proxy"] = "http://127.0.0.1:7890"  # !!! use your own proxy port; it differs per machine
os.environ["https_proxy"] = "http://127.0.0.1:7890"  # !!! use your own proxy port; it differs per machine

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Encode the input sentence with the tokenizer in a single call.
# The result also contains the special tokens [CLS] (101) and [SEP] (102) at the start and end.
sequence = "Playing computer game is simple."
res = tokenizer(sequence)
print(res) # {'input_ids': [101, 2652, 3274, 2208, 2003, 3722, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

# tokenizer.tokenize() splits the sentence into tokens (words or subwords).
# In natural language processing (NLP), tokenization cuts text into the smallest units the model can process.
# The result is a list of tokens, lowercased here because the model is uncased:
tokens = tokenizer.tokenize(sequence)
print(tokens) # ['playing', 'computer', 'game', 'is', 'simple', '.']

# tokenizer.convert_tokens_to_ids() converts the tokens into their IDs.
# Each token in the vocabulary has a unique ID, which is what the model actually consumes.
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids) # [2652, 3274, 2208, 2003, 3722, 1012]

# tokenizer.decode() maps the token IDs back to tokens and joins them into a readable string.
# The result is lowercase, so it is not character-for-character identical to the input.
decoded_string = tokenizer.decode(ids)
print(decoded_string) # playing computer game is simple.

Output:

{'input_ids': [101, 2652, 3274, 2208, 2003, 3722, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
['playing', 'computer', 'game', 'is', 'simple', '.']
[2652, 3274, 2208, 2003, 3722, 1012]
playing computer game is simple.
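
The round trip above (text to tokens, tokens to IDs, IDs back to text) can be imitated with a toy vocabulary to see what each step does. The sketch below is plain Python and is not the library's implementation. The vocabulary contains only the IDs printed above, and the function names merely mirror the tokenizer methods. A real WordPiece tokenizer also splits unknown words into subwords, which this toy version does not attempt.

```python
import re

# Toy vocabulary built only from the IDs printed in the output above.
VOCAB = {
    "[CLS]": 101, "[SEP]": 102,
    "playing": 2652, "computer": 3274, "game": 2208,
    "is": 2003, "simple": 3722, ".": 1012,
}
ID_TO_TOKEN = {token_id: token for token, token_id in VOCAB.items()}


def tokenize(text):
    # Lowercase (the model is uncased) and split words from punctuation.
    return re.findall(r"\w+|[^\w\s]", text.lower())


def convert_tokens_to_ids(tokens):
    # Look up each token; a word outside the toy vocabulary raises KeyError.
    return [VOCAB[token] for token in tokens]


def encode(text):
    # Add the special start/end tokens and an all-ones attention mask.
    ids = [VOCAB["[CLS]"]] + convert_tokens_to_ids(tokenize(text)) + [VOCAB["[SEP]"]]
    return {"input_ids": ids, "attention_mask": [1] * len(ids)}


def decode(ids):
    # Join tokens with spaces, then remove the space before punctuation.
    text = " ".join(ID_TO_TOKEN[token_id] for token_id in ids)
    return re.sub(r"\s+([.,!?])", r"\1", text)


sequence = "Playing computer game is simple."
toy_tokens = tokenize(sequence)
toy_ids = convert_tokens_to_ids(toy_tokens)
toy_encoded = encode(sequence)
toy_decoded = decode(toy_ids)
print(toy_encoded, toy_tokens, toy_ids, toy_decoded, sep="\n")
```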

IV. A simple use of PyTorch

from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

import os
os.environ["http_proxy"] = "http://127.0.0.1:7890"  # !!! use your own proxy port; it differs per machine
os.environ["https_proxy"] = "http://127.0.0.1:7890"  # !!! use your own proxy port; it differs per machine

# 1. Same pipeline setup as before
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

# 2. There is usually more than one sentence, so pass two sentences as a list
text_Train = ["I love you, my wife.",
              "Suzhou is the worst place!"]
# 3. Print the predicted sentiment
res = classifier(text_Train)
print(res)

# 4. Tokenize the texts as a batch. padding=True pads the shorter sequences and truncation=True cuts
# overlong ones, so every sequence in the batch has the same length. max_length=512 caps the input
# length, and return_tensors="pt" returns the result as PyTorch tensors.
batch = tokenizer(text_Train, padding=True, truncation=True, max_length=512, return_tensors="pt")
print(batch)

# 5. torch.no_grad() turns off gradient tracking to save memory and speed up computation,
# since inference does not update the model parameters.
# model(**batch) passes the preprocessed batch to the model; outputs holds the raw scores (logits).
# F.softmax(outputs.logits, dim=1) turns the logits into a probability distribution over the classes.
# torch.argmax(predictions, dim=1) picks the most probable class index for each input text.
with torch.no_grad():
    print("====================")
    outputs = model(**batch)
    print(outputs)
    predictions = F.softmax(outputs.logits, dim=1)
    print(predictions)
    labels = torch.argmax(predictions, dim=1)
    print(labels)

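Output:
The script prints, in order: the pipeline results for both sentences, the tokenized batch (input_ids and attention_mask tensors), a separator line, the raw model output containing the logits, the softmax probabilities, and the predicted class indices.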
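
To see what the softmax and argmax steps do numerically, here is a plain-Python sketch of the same arithmetic. The logits are made-up numbers, not output from the model. The ID2LABEL mapping (0 for NEGATIVE, 1 for POSITIVE) is an assumption about this checkpoint that you can check against model.config.id2label.

```python
import math

# Assumed label mapping for this checkpoint; verify with model.config.id2label.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Return the index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Made-up logits for a batch of two sentences, one row per sentence.
batch_logits = [[-4.0, 4.0], [3.0, -3.0]]

predicted = []
for logits in batch_logits:
    probs = softmax(logits)
    label = ID2LABEL[argmax(probs)]
    predicted.append(label)
    print(probs, label)
```

Because softmax preserves the ordering of the logits, argmax over the probabilities selects the same class as argmax over the raw logits.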

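V. Saving (save) & loading (load) a model

1. Saving: the save path here is a folder named saved. Because it is a relative path, the model and tokenizer are written to a saved folder under the directory you run the code from. The folder will contain:

- Tokenizer files, such as tokenizer_config.json, vocab.txt, and special_tokens_map.json.
- The model configuration and weights, such as config.json and pytorch_model.bin (recent transformers releases may write model.safetensors instead), which hold the model's architecture settings and weights.

2. Loading: later code can load the saved model and tokenizer by passing the saved directory (for example saved), which avoids downloading the model again. This is useful for offline use or for sharing a model across projects.

from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

import os
os.environ["http_proxy"] = "http://127.0.0.1:7890"  # !!! use your own proxy port; it differs per machine
os.environ["https_proxy"] = "http://127.0.0.1:7890"  # !!! use your own proxy port; it differs per machine

# Same pipeline setup as before
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

# 1. Save: write the tokenizer and the model to the same directory
save_directory = "saved"  # a folder named saved under the current working directory
tokenizer.save_pretrained(save_directory)
model.save_pretrained(save_directory)

# 2. Load: to load them again later, point from_pretrained at that directory
tok = AutoTokenizer.from_pretrained(save_directory)
mod = AutoModelForSequenceClassification.from_pretrained(save_directory)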
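
Before relying on a saved folder offline, it can be useful to confirm that the expected files are present. The sketch below uses only the standard library and the file names listed above. missing_files is a hypothetical helper, and treating either pytorch_model.bin or model.safetensors as the weight file is an assumption about which format your transformers version writes. The demo creates empty placeholder files in a temporary folder purely to show the check.

```python
import tempfile
from pathlib import Path

# File names taken from the list above; the weight file name depends on the version.
EXPECTED_FILES = ["config.json", "tokenizer_config.json", "vocab.txt", "special_tokens_map.json"]
WEIGHT_FILES = ["pytorch_model.bin", "model.safetensors"]


def missing_files(directory):
    """Return the expected file names that are absent from the directory."""
    folder = Path(directory)
    missing = [name for name in EXPECTED_FILES if not (folder / name).is_file()]
    if not any((folder / name).is_file() for name in WEIGHT_FILES):
        missing.append(" or ".join(WEIGHT_FILES))
    return missing


# Demo on a temporary folder with empty stand-in files (not a real saved model).
with tempfile.TemporaryDirectory() as tmp:
    empty_report = missing_files(tmp)  # nothing saved yet
    for name in EXPECTED_FILES + ["model.safetensors"]:
        (Path(tmp) / name).touch()
    full_report = missing_files(tmp)

print(empty_report)
print(full_report)
```

On a real run, calling missing_files("saved") after the save step should return an empty list if all of these files were written.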


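VI. Finding a model to play with on your own

1. Open the Hugging Face website.
2. Click Models.
3. Model filters: use them to find the model you need. The filters cover the task, the library used, the dataset, the language, and the license.

- Task: the type of task the model is suited to, such as text classification, translation, or image generation.
- Libraries: the framework or library the model belongs to, such as Transformers, Diffusers, TensorFlow, or PyTorch.
- Dataset: filters for models trained or evaluated on a specific dataset.
- Language: filters for models that support a specific language, such as English, Chinese, or French.
- License: filters by usage license, such as MIT, Apache-2.0, or CC-BY. Different licenses set different limits on use in commercial and non-commercial projects.

4. Hands-on: Facebook's text summarization model:

- Under Task, select summarization.
- Choose the facebook/bart-large-cnn model.
- The model page introduces the model. Scroll down to the section on how to use it, copy the example code into Python, and run it once. Remember to set your own proxy port.

The example code is shown below, with the proxy code added.
The arguments to pipeline correspond to the filters described above: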

# The next two lines are equivalent; they show that the pipeline arguments correspond to the filters described above
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
summarizer = pipeline(task = "summarization", model="facebook/bart-large-cnn")

Full code:

from transformers import pipeline
import os
os.environ["http_proxy"] = "http://127.0.0.1:7890"  # !!! use your own proxy port; it differs per machine
os.environ["https_proxy"] = "http://127.0.0.1:7890"  # !!! use your own proxy port; it differs per machine
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18.
"""
print(summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False))