transformers-AutoClass

https://huggingface.co/docs/transformers/main/en/autoclass_tutorial

An AutoClass automatically infers and loads the correct architecture for a given checkpoint.
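To make the idea of "inferring the architecture from the checkpoint" concrete, here is a toy sketch of config-driven dispatch. It is not the library's code: the classes `ToyBertModel`, `ToyGPT2Model`, `ToyAutoModel` and the registry are hypothetical names invented for illustration. The only assumption carried over from transformers is that a checkpoint's configuration names its architecture in a `model_type` field.

```python
import json


# Hypothetical stand-ins for concrete model classes.
class ToyBertModel:
    def __init__(self, config):
        self.config = config


class ToyGPT2Model:
    def __init__(self, config):
        self.config = config


# Registry mapping an architecture name to the class that implements it.
MODEL_REGISTRY = {"bert": ToyBertModel, "gpt2": ToyGPT2Model}


class ToyAutoModel:
    """Picks a concrete class by reading the architecture name from the config."""

    @classmethod
    def from_config_json(cls, config_json):
        config = json.loads(config_json)
        model_type = config["model_type"]
        try:
            model_cls = MODEL_REGISTRY[model_type]
        except KeyError:
            raise ValueError(f"Unrecognized model_type: {model_type!r}") from None
        return model_cls(config)


model = ToyAutoModel.from_config_json('{"model_type": "bert", "hidden_size": 768}')
print(type(model).__name__)
```

The caller never names the concrete class; swapping the checkpoint (here, the JSON string) is enough to get a different architecture.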

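For text, use a Tokenizer to convert text into a sequence of tokens, create a numerical representation of the tokens, and assemble them into tensors. For speech and audio, use a Feature extractor to extract sequential features from audio waveforms and convert them into tensors. Image inputs use an ImageProcessor to convert images into tensors. For multimodal inputs, use a Processor, which combines a tokenizer with a feature extractor or an image processor.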

1.AutoTokenizer

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoded_input = tokenizer("Do not meddle in the affairs of wizards, for they are subtle and quick to anger.")
print(encoded_input)
{'input_ids': [101, 2079, 2025, 19960, 10362, 1999, 1996, 3821, 1997, 16657, 1010, 2005, 2027, 2024, 11259, 1998, 4248, 2000, 4963, 1012, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

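The main tool for preprocessing text is the tokenizer. A tokenizer splits text into tokens according to a set of rules, then converts the tokens into numbers and tensors that become the model inputs. If you use a pretrained model, it is important to use the matching pretrained tokenizer: this ensures the text is split the same way as the pretraining corpus and that the same token-to-index mapping (the vocabulary) is used as during pretraining.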

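attention_mask is a binary vector with the same length as the input sequence. It marks which positions hold real tokens and which hold padding tokens. Padding tokens are used so that sentences of different lengths can be aligned and batched together. Setting attention_mask to 0 at the padded positions lets the model ignore them when computing attention, so the padding does not affect how the real tokens are processed.

token_type_ids is an integer vector with the same length as the input sequence, used to tell different sequences apart. When the input contains more than one sequence, for example the question and the answer in a question-answering task, token_type_ids marks which tokens belong to the question and which belong to the answer. This helps the model distinguish the sequences and relate them to each other.

In short, attention_mask flags padding tokens so the model can ignore them, while token_type_ids separates the sequences so the model can handle each one appropriately.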
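The two vectors can be illustrated with a plain-Python sketch that does not use the transformers library. The helper `build_pair_inputs` is a hypothetical name, and the token ids 11, 12, 13, 21, 22 are placeholders. The special ids 101 ([CLS]), 102 ([SEP]) and 0 (padding) match the printed outputs in this article, and the `[CLS] A [SEP] B [SEP]` layout is the BERT-style convention for sequence pairs.

```python
def build_pair_inputs(ids_a, ids_b, max_len, cls_id=101, sep_id=102, pad_id=0):
    """Lay out two sequences as [CLS] A [SEP] B [SEP] and pad to max_len."""
    first = [cls_id] + ids_a + [sep_id]   # segment 0: [CLS], sequence A, [SEP]
    second = ids_b + [sep_id]             # segment 1: sequence B, final [SEP]
    input_ids = first + second
    token_type_ids = [0] * len(first) + [1] * len(second)
    attention_mask = [1] * len(input_ids)  # 1 marks a real token

    n_pad = max_len - len(input_ids)
    if n_pad < 0:
        raise ValueError("max_len is shorter than the combined sequence")

    return {
        "input_ids": input_ids + [pad_id] * n_pad,
        "token_type_ids": token_type_ids + [0] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,  # 0 marks padding
    }


encoded = build_pair_inputs([11, 12, 13], [21, 22], max_len=10)
for name, values in encoded.items():
    print(name, values)
```

With the real tokenizer, passing two texts in a single call, as in tokenizer(question, answer), should produce the same kind of layout for BERT-style models.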

tokenizer.decode(encoded_input["input_ids"])
'[CLS] Do not meddle in the affairs of wizards, for they are subtle and quick to anger. [SEP]'

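The tokenizer added two special tokens to the sentence: CLS and SEP (classifier and separator). Not every model needs special tokens, but if it does, the tokenizer adds them for you automatically.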

1.1 pad

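Sentences are not always the same length, but tensors (the model inputs) must have a uniform shape. Padding is one strategy for making the tensors rectangular: it adds a special padding token to the shorter sentences. Set the padding parameter to True to pad the shorter sequences in a batch to the length of the longest one.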

batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_input = tokenizer(batch_sentences, padding=True)
print(encoded_input)
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
               [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}
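The output above can be reproduced by hand to see what padding does. The following is a minimal plain-Python sketch, not the library's implementation: `pad_batch` is a hypothetical helper, the unpadded ids are copied from the printed output with the trailing zeros removed, and the padding id is assumed to be 0 as in that output.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 for real tokens, 0 for the padding that was just appended.
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


unpadded = [
    [101, 1252, 1184, 1164, 1248, 6462, 136, 102],
    [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
    [101, 1327, 1164, 5450, 23434, 136, 102],
]

batch = pad_batch(unpadded)
for row in batch["attention_mask"]:
    print(row)
```

Every row now has 15 entries, the length of the longest sentence, and the attention mask rows match the ones printed by the tokenizer.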

1.2 truncation

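At the other extreme, a sequence may be too long for the model to handle. In that case it has to be truncated to a shorter length. Set the truncation parameter to True to truncate sequences to the maximum length the model accepts: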

batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_input = tokenizer(batch_sentences, padding=True, truncation=True)
print(encoded_input)
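The three example sentences are short, so truncation has no visible effect on them. The idea can still be sketched in plain Python. This is a simplified illustration, not the library's code: `truncate_with_special_tokens` is a hypothetical helper, the ids starting at 1000 are placeholders, and 101 and 102 are assumed to be the [CLS] and [SEP] ids as in the outputs shown earlier. The point is that the special tokens count toward the length limit, so the text itself is cut to leave room for them.

```python
def truncate_with_special_tokens(token_ids, max_length, cls_id=101, sep_id=102):
    """Cut the text tokens so that [CLS] + text + [SEP] fits in max_length."""
    if max_length < 2:
        raise ValueError("max_length must leave room for [CLS] and [SEP]")
    body = token_ids[: max_length - 2]  # reserve two slots for special tokens
    return [cls_id] + body + [sep_id]


long_text = list(range(1000, 1010))  # ten placeholder token ids
truncated = truncate_with_special_tokens(long_text, max_length=8)
print(truncated, len(truncated))

short_text = [1000, 1001]  # already fits, so nothing is removed
print(truncate_with_special_tokens(short_text, max_length=8))
```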

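1.3 build tensors

Finally, the tokenizer can return the actual tensors that are fed to the model. Set the return_tensors parameter to "pt" for PyTorch tensors (or "np" for NumPy arrays):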

batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
print(encoded_input)

2.AutoImageProcessor

from transformers import AutoImageProcessor

image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")

3.AutoFeatureExtractor

from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained(
    "ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition"
)

4.AutoProcessor

LayoutLMv2 needs an image processor to handle images and a tokenizer to handle text; AutoProcessor combines the two.

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv2-base-uncased")

5.AutoModel

Load a pretrained model for a given task:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
