Text Classification with SentencePiece

SentencePiece

Preface

Step 1: Overview

  • SentencePiece is an unsupervised text tokenizer and detokenizer (i.e., it can reconstruct the original text)
  • Mainly intended for text generation systems where the vocabulary size is predetermined before training
  • It enables training directly on raw sentences and implements subword units such as BPE and the unigram language model
  • Technical highlights
    • Purely data driven: trains the tokenizer and detokenizer from sentences alone; a pretrained model is not always required
    • Language independent: treats sentences as sequences of Unicode characters, with no language-specific logic
    • Multiple subword algorithms: BPE and unigram LM
    • Subword regularization: implements subword sampling for subword regularization and BPE dropout, which helps improve robustness and accuracy
    • Fast and lightweight: about 50k sentences per second, roughly 6 MB of memory
    • Self-contained: the same model file always produces the same tokenization
    • Fast vocabulary id generation
    • NFKC-based normalization
      • NFC: composed form; characters are normalized to single precomposed characters
      • NFD: decomposed form; characters are normalized to a base character plus combining marks, e.g. original character é —> NFD form: e + ´
      • NFKC: compatibility composition; like NFC, but some formatting distinctions may be removed during normalization
      • NFKD: compatibility decomposition; like NFD, but some formatting distinctions may be removed during normalization
  • **Aside:** HF's Tokenizers library can do all of this too... and in fact does even more
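The four normalization forms above can be checked directly with Python's standard `unicodedata` module; a quick illustration (independent of SentencePiece itself):

```python
import unicodedata

s = "\u00e9"  # é as a single precomposed character U+00E9

nfd = unicodedata.normalize("NFD", s)   # decomposes into e + combining acute accent
print([hex(ord(c)) for c in nfd])       # ['0x65', '0x301']

nfc = unicodedata.normalize("NFC", nfd)  # recomposes into the single character
print(nfc == s)                          # True

# Compatibility forms additionally fold away formatting variants,
# e.g. the "fi" ligature U+FB01 becomes the two plain letters f + i:
lig = "\ufb01"
print(unicodedata.normalize("NFKC", lig))  # fi
print(unicodedata.normalize("NFC", lig))   # unchanged: NFC keeps the ligature
```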

1. What is SentencePiece

  • A re-implementation of subword units, which mitigates the open-vocabulary problem
  • The number of unique tokens is predetermined, e.g. 8k, 16k, or 32k
  • Trains from raw sentences
    • Previous subword implementations required the input sentences to be pre-tokenized for efficient training
    • SentencePiece is fast enough to train directly from raw sentences, which is especially useful for languages without explicit word boundaries, such as Chinese and Japanese
  • Whitespace is treated as a basic symbol
    • Before: (word.) == (word .)
    • Now: (word.) != (word_.)
    • Because whitespace is preserved in the segmented sequence, detokenization is unambiguous; the old approach was irreversible
    • This is what makes language-independent tokenization possible
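The whitespace-as-symbol idea can be shown with a toy sketch in pure Python (this is an illustration of the principle, not SentencePiece's actual implementation): spaces are first escaped as the meta symbol ▁, so the piece sequence keeps enough information to restore the original string exactly.

```python
# Toy illustration of SentencePiece's whitespace handling (not the real library).
# Spaces are escaped as the meta symbol U+2581 before segmentation, so
# detokenization is lossless: concatenate the pieces and map the symbol back.

WS = "\u2581"  # '▁', the whitespace marker SentencePiece uses

def toy_tokenize(text):
    # Escape spaces, then split into per-word pieces. A real model would split
    # into learned subwords; whole words are enough to show reversibility.
    escaped = WS + text.replace(" ", WS)  # dummy prefix, like --add_dummy_prefix
    pieces = []
    current = ""
    for ch in escaped:
        if ch == WS and current:
            pieces.append(current)
            current = ""
        current += ch
    if current:
        pieces.append(current)
    return pieces

def toy_detokenize(pieces):
    return "".join(pieces).replace(WS, " ").lstrip()

text = "Hello World."
pieces = toy_tokenize(text)
print(pieces)                          # ['▁Hello', '▁World.']
print(toy_detokenize(pieces) == text)  # True
```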

2. Subword Regularization and BPE Dropout

  • Purpose: applied during subword segmentation and model training to improve the model's generalization ability and robustness
  • Subword regularization:
    • Principle: instead of fixing a single segmentation at training time, randomly sample one from multiple candidate segmentations; this strengthens the model's ability to handle diverse inputs
    • Advantages: introduces uncertainty into tokenization, improving robustness and generalization; friendly to low-resource scenarios with little data
  • BPE dropout
    • Principle: regular BPE always merges the most frequent pair, whereas BPE dropout randomly drops some merge steps; this means that during training the same word may be split into different subword sequences in different iterations
    • Advantages: introduces randomness, improving robustness and generalization; helps with OOV issues
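The merge-dropping idea can be sketched in a few lines of pure Python (a simplified assumption-laden toy, not the exact algorithm from the BPE-dropout paper): given a fixed, ordered merge table, each applicable merge is skipped with probability p, so the same word can yield different splits across calls.

```python
import random

# Toy sketch of BPE dropout (simplified; merges are applied in priority order
# and each individual merge is dropped with probability p).
def bpe_segment(word, merges, p=0.0, rng=random):
    symbols = list(word)
    for pair in merges:  # walk the merge table in priority order
        i = 0
        while i < len(symbols) - 1:
            if (symbols[i], symbols[i + 1]) == pair and rng.random() >= p:
                symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]  # merge the pair
            else:
                i += 1
    return symbols

merges = [("l", "o"), ("lo", "w"), ("e", "r")]  # made-up merge table

print(bpe_segment("lower", merges, p=0.0))  # ['low', 'er']  -- plain deterministic BPE
print(bpe_segment("lower", merges, p=1.0))  # ['l', 'o', 'w', 'e', 'r']  -- every merge dropped
# With 0 < p < 1 the output varies between these two extremes across calls.
```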

3. Installation

  • Install with pip
pip install sentencepiece
  • Build from C++ source
git clone https://github.com/google/sentencepiece.git 
cd sentencepiece
mkdir build
cd build
cmake ..
make -j $(nproc)
sudo make install
sudo update_dyld_shared_cache   # macOS
# sudo ldconfig -v --> on Ubuntu/Linux, use ldconfig instead

Step 2: Usage Guide

1. Training a SentencePiece model

spm_train --input=<input> --model_prefix=<model_name> --vocab_size=8000 --character_coverage=1.0 --model_type=<type>
--input: one-sentence-per-line corpus file. NFKC normalization is applied by default. A comma-separated list of files can also be passed.
--model_prefix: prefix for the output model name; generates <model_prefix>.model and <model_prefix>.vocab
--vocab_size: vocabulary size, e.g. 8000, 16000, 32000
--character_coverage: fraction of characters covered by the model; a good default is 0.9995 for languages with rich character sets (such as Chinese or Japanese), and 1.0 for languages with small character sets
--model_type: model type; choose from unigram (default), bpe, char, word

The remaining flags...
--input (comma separated list of input sentences)  type: std::string default: ""
--input_format (Input format. Supported format is `text` or `tsv`.)  type: std::string default: ""
--model_prefix (output model prefix)  type: std::string default: ""
--model_type (model algorithm: unigram, bpe, word or char)  type: std::string default: "unigram"
--vocab_size (vocabulary size)  type: int32 default: 8000
--accept_language (comma-separated list of languages this model can accept)  type: std::string default: ""
--self_test_sample_size (the size of self test samples)  type: int32 default: 0
--character_coverage (character coverage to determine the minimum symbols)  type: double default: 0.9995
--input_sentence_size (maximum size of sentences the trainer loads)  type: std::uint64_t default: 0
--shuffle_input_sentence (Randomly sample input sentences in advance. Valid when --input_sentence_size > 0)  type: bool default: true
--seed_sentencepiece_size (the size of seed sentencepieces)  type: int32 default: 1000000
--shrinking_factor (Keeps top shrinking_factor pieces with respect to the loss)  type: double default: 0.75
--num_threads (number of threads for training)  type: int32 default: 16
--num_sub_iterations (number of EM sub-iterations)  type: int32 default: 2
--max_sentencepiece_length (maximum length of sentence piece)  type: int32 default: 16
--max_sentence_length (maximum length of sentence in byte)  type: int32 default: 4192
--split_by_unicode_script (use Unicode script to split sentence pieces)  type: bool default: true
--split_by_number (split tokens by numbers (0-9))  type: bool default: true
--split_by_whitespace (use a white space to split sentence pieces)  type: bool default: true
--split_digits (split all digits (0-9) into separate pieces)  type: bool default: false
--treat_whitespace_as_suffix (treat whitespace marker as suffix instead of prefix.)  type: bool default: false
--allow_whitespace_only_pieces (allow pieces that only contain (consecutive) whitespace tokens)  type: bool default: false
--control_symbols (comma separated list of control symbols)  type: std::string default: ""
--control_symbols_file (load control_symbols from file.)  type: std::string default: ""
--user_defined_symbols (comma separated list of user defined symbols)  type: std::string default: ""
--user_defined_symbols_file (load user_defined_symbols from file.)  type: std::string default: ""
--required_chars (UTF8 characters in this flag are always used in the character set regardless of --character_coverage)  type: std::string default: ""
--required_chars_file (load required_chars from file.)  type: std::string default: ""
--byte_fallback (decompose unknown pieces into UTF-8 byte pieces)  type: bool default: false
--vocabulary_output_piece_score (Define score in vocab file)  type: bool default: true
--normalization_rule_name (Normalization rule name. Choose from nfkc or identity)  type: std::string default: "nmt_nfkc"
--normalization_rule_tsv (Normalization rule TSV file. )  type: std::string default: ""
--denormalization_rule_tsv (Denormalization rule TSV file.)  type: std::string default: ""
--add_dummy_prefix (Add dummy whitespace at the beginning of text)  type: bool default: true
--remove_extra_whitespaces (Removes leading, trailing, and duplicate internal whitespace)  type: bool default: true
--hard_vocab_limit (If set to false, --vocab_size is considered as a soft limit.)  type: bool default: true
--use_all_vocab (If set to true, use all tokens as vocab. Valid for word/char models.)  type: bool default: false
--unk_id (Override UNK (<unk>) id.)  type: int32 default: 0
--bos_id (Override BOS (<s>) id. Set -1 to disable BOS.)  type: int32 default: 1
--eos_id (Override EOS (</s>) id. Set -1 to disable EOS.)  type: int32 default: 2
--pad_id (Override PAD (<pad>) id. Set -1 to disable PAD.)  type: int32 default: -1
--unk_piece (Override UNK (<unk>) piece.)  type: std::string default: "<unk>"
--bos_piece (Override BOS (<s>) piece.)  type: std::string default: "<s>"
--eos_piece (Override EOS (</s>) piece.)  type: std::string default: "</s>"
--pad_piece (Override PAD (<pad>) piece.)  type: std::string default: "<pad>"
--unk_surface (Dummy surface string for <unk>. In decoding <unk> is decoded to `unk_surface`.)  type: std::string default: " ⁇ "
--train_extremely_large_corpus (Increase bit depth for unigram tokenization.)  type: bool default: false
--random_seed (Seed value for random generator.)  type: uint32 default: 4294967295
--enable_differential_privacy (Whether to add DP while training. Currently supported only by UNIGRAM model.)  type: bool default: false
--differential_privacy_noise_level (Amount of noise to add for DP)  type: float default: 0
--differential_privacy_clipping_threshold (Threshold for clipping the counts for DP)  type: std::uint64_t default: 0
--help (show help)  type: bool default: false
--version (show version)  type: bool default: false
--minloglevel (Messages logged at a lower level than this don't actually get logged anywhere)  type: int default: 0

2. Encoding raw text into sentence pieces/ids

spm_encode --model=<model_file> --output_format=piece < input > output
spm_encode --model=<model_file> --output_format=id < input > output
  • Use --extra_options to add BOS/EOS markers or to reverse the input sequence
spm_encode --extra_options=eos (add </s> only)
spm_encode --extra_options=bos:eos (add <s> and </s>)
spm_encode --extra_options=reverse:bos:eos (reverse input and add <s> and </s>)

3. Decoding sentence pieces/ids

spm_decode --model=<model_file> --input_format=piece < input > output
spm_decode --model=<model_file> --input_format=id < input > output
  • Use --extra_options to decode text in reverse order
spm_decode --extra_options=reverse < input > output

4. End-to-end example

spm_train --input=data/botchan.txt --model_prefix=m --vocab_size=1000
'''
unigram_model_trainer.cc(494) LOG(INFO) Starts training with :
input: "../data/botchan.txt"
... <snip>
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1100 obj=10.4973 num_tokens=37630 num_tokens/piece=34.2091
trainer_interface.cc(272) LOG(INFO) Saving model: m.model
trainer_interface.cc(281) LOG(INFO) Saving vocabs: m.vocab
'''

echo "I saw a girl with a telescope." | spm_encode --model=m.model
'''
▁I ▁saw ▁a ▁girl ▁with ▁a ▁ te le s c o pe .
'''

echo "I saw a girl with a telescope." | spm_encode --model=m.model --output_format=id
'''
9 459 11 939 44 11 4 142 82 8 28 21 132 6
'''

echo "9 459 11 939 44 11 4 142 82 8 28 21 132 6" | spm_decode --model=m.model --input_format=id
'''
I saw a girl with a telescope.
'''

5. Exporting the vocabulary

spm_export_vocab --model=<model_file> --output=<output file>
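The exported .vocab file is tab-separated, one piece per line with its score, and the 0-based line number is the piece id. A small sketch for loading it into id↔piece mappings (the sample content below is made-up illustrative data, not output from a real model):

```python
# Sketch: parse a SentencePiece .vocab file ("piece<TAB>score" per line,
# line number == piece id). Sample content is illustrative only.

sample_vocab = "<unk>\t0\n<s>\t0\n</s>\t0\n\u2581the\t-3.2\n\u2581a\t-3.5\n"

def load_vocab(text):
    id2piece = {}
    for idx, line in enumerate(text.splitlines()):
        piece, _score = line.split("\t")
        id2piece[idx] = piece
    return id2piece

id2piece = load_vocab(sample_vocab)
piece2id = {p: i for i, p in id2piece.items()}
print(id2piece[3])        # ▁the
print(piece2id["<unk>"])  # 0
```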

6. Examples

notebook

Step 3: Experiment

1. Dataset

A hotel review dataset, preprocessed into one sentence per line.

(screenshot)

2. Usage

  1. Training (I used a vocabulary size of 12800)

    (screenshot)

Two files are generated: a model file and a vocabulary file.
(screenshot)
The vocabulary file looks like this: (screenshot)

  2. Segmentation

    • Plain segmentation is enough here; since the task is classification, there is no need to insert eos and bos

    (screenshot)

    • Converting to ids
      • Note: the order of pieces in the generated vocabulary file corresponds exactly to the incrementing order of their ids

    (screenshot)

  3. There is no accompanying word-vector file, so the pieces still need embedding training; fastText is a good fit for this.

    • Write a small script to convert the data into the format fastText expects

    (screenshot)

  4. With both ids and word vectors available, the embedding matrix can be built

    • The mapping is

      • fast_vec —> word : vec
      • spm_vec —> id : word
      • build the embedding —> id : vec
      • Among the ~12k vocabulary entries, 20 rows are empty; that is not many, so empty entries are filled with random vectors

      (screenshot)

      • Write a script and save the result as a word-vector .npy file for the model to use

      (screenshot)

    • The idea

      • Use sentencepiece as the tokenizer to obtain a sequence of ids
      • Feed the ids to the model
      • Train the model
      • At inference time, tokenize with sentencepiece as well
    • Let's put it into practice~

      • Code: available from the resources above

      • Result: accuracy converges to roughly 96%

        (screenshot)

      • After epoch 30, fine-tuning for another 10 epochs with the embedding layer unfrozen raised accuracy by about one more percentage point

        (screenshot)
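The id : vec construction from step 4 can be sketched in pure Python. The data below is hypothetical toy data: in practice, `fast_vec` would be read from the fastText .vec file and `id2word` from the SentencePiece .vocab file.

```python
import random

# Sketch of building the embedding matrix (toy data; real inputs come from
# the fastText .vec file and the SentencePiece .vocab file).
random.seed(0)
DIM = 4  # toy dimension; real fastText vectors are e.g. 100-d

fast_vec = {                                 # word -> vec (from fastText)
    "\u2581good": [0.1, 0.2, 0.3, 0.4],
    "\u2581hotel": [0.5, 0.6, 0.7, 0.8],
}
id2word = {0: "<unk>", 1: "\u2581good", 2: "\u2581hotel"}  # id -> word (from .vocab)

# id -> vec: look each piece up in fastText; fall back to a random vector
# for the few pieces without a trained embedding (the ~20 empty rows).
embedding = [
    fast_vec.get(id2word[i], [random.uniform(-1, 1) for _ in range(DIM)])
    for i in range(len(id2word))
]
print(len(embedding), len(embedding[0]))  # 3 4
print(embedding[1])                       # [0.1, 0.2, 0.3, 0.4]
```

The resulting rows would then be saved with numpy (`np.save`) as the .npy file the model loads.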
