Grammar Error Detection and Correction with Machine Learning

1. Introduction

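Machine learning in general, and deep learning in particular, has driven major advances in natural language processing. A variety of models now make it possible to perform machine translation, text summarization, and sentiment analysis, to name just a few use cases. Today we look at another popular application: we will use Gramformer to build a machine learning pipeline for grammar error detection and correction.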

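After reading this article, you will...

Understand how Transformers can be used for natural language processing.
Have built a Gramformer-based grammar error detection and correction system with Python.
Have built the same system with HuggingFace Transformers instead of the Gramformer repository.

Let's take a look!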

2. Transformers for natural language processing

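Since the AI breakthrough of 2012, deep learning techniques have transformed the field of machine learning. That breakthrough happened in computer vision, but natural language processing is another prominent field where such models are applied.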

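Since 2017, Transformer-based models have become increasingly popular. Before we dive into grammar checking and correction with Gramformer, it helps to give a brief background on Transformers so that everyone understands the context in which Gramformer operates.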

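Written and spoken text is a sequence of words, and ultimately of letters. The way letters combine into words and words combine into phrases, i.e. the grammar of a text, carries underlying semantics, or meaning. A neural network that processes text must therefore be able to handle that meaning, which requires treating the text as an ordered sequence. A model that shuffles all words and letters before processing them would not be of much use, would it?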

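Traditionally, NLP relied on recurrent neural networks (such as LSTMs) to process text. A recurrent neural network is a network in which the output of the previous "pass" is fed into the next one through a recurrent connection. In other words, the history of what was processed earlier in the run (for example, "I am on my way to the..." processed before "supermarket") is used to predict the next output. This can be very useful in translation, for instance, where the right output often depends heavily on the meaning of what was produced before.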
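To make the recurrent connection concrete, here is a minimal sketch of a recurrent step in plain Python. It is a toy, not an LSTM: the scalar hidden state, the weights w_x and w_h, and the function names are all assumptions chosen for illustration.

```python
import math

def rnn_step(x, h_prev, w_x=0.5, w_h=0.8):
    """One recurrent step with a scalar input and a scalar hidden state.

    w_x and w_h are made-up toy weights, not learned parameters.
    """
    return math.tanh(w_x * x + w_h * h_prev)

def run_rnn(inputs, h0=0.0):
    """Process a sequence strictly in order, carrying the hidden state forward."""
    h = h0
    states = []
    for x in inputs:  # step t cannot start before step t-1 has finished
        h = rnn_step(x, h)
        states.append(h)
    return states

# A single non-zero input followed by zeros: watch its trace in the state.
states = run_rnn([1.0, 0.0, 0.0, 0.0])
print(states)
```

Each step needs the hidden state of the previous one, so the loop cannot be parallelized over the sequence. In this toy setup the trace of the first input also shrinks with every later step.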

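Precisely this recurrent segment is the bottleneck of recurrent neural networks. It means that each element of the sequence (e.g. each word) must be processed one after another. Moreover, because LSTMs work with a "memory", the memory of words processed long ago (e.g. 20 words back in a long phrase) fades, so semantic dependencies hidden in complex phrases can be lost. In other words, recurrent neural networks, LSTMs included, are quite inefficient, especially for longer sentences.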

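In 2017, Vaswani et al. introduced an entirely new architecture for language processing: the Transformer. By applying the attention mechanism in a different way, they showed that attention is all you need, meaning that recurrent segments are no longer required. The original Transformer architecture, shown in the accompanying figure, consists of N encoder segments and N decoder segments. The encoder segments jointly process the text into an intermediate representation that captures its semantics in a compressed form. This is done by computing multi-head self-attention, a mechanism that essentially lets the model weigh how important each word is to every other word in the same sequence (self-attention), from several different perspectives (multiple heads).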
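As a rough illustration of self-attention, the sketch below computes single-head scaled dot-product attention in plain Python. It is deliberately simplified: queries, keys, and values are all set equal to the input vectors, whereas a real Transformer first applies learned linear projections and runs several heads in parallel. The toy vectors and function names are assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Single-head scaled dot-product self-attention with Q = K = V = input.

    Learned projections and multiple heads are omitted for brevity.
    """
    d = len(vectors[0])
    outputs, all_weights = [], []
    for q in vectors:
        # Similarity of this token to every token in the sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)
        # Weighted average of all token vectors.
        out = [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(d)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy "token" vectors: the first two are similar, the third is different.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(tokens)
print(weights[0])
```

Every token is compared with every other token in one go, so no step has to wait for a previous one. In this example the first token attends more strongly to the similar second token than to the dissimilar third.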

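The intermediate representation produced by the encoder stack is then passed to the decoder segments, as the figure shows. The encoder takes the source sequence as input (e.g. a phrase in French), while the decoder takes the corresponding target as input (e.g. its English translation). By computing the importance of the individual words in the target phrase and then combining them with the intermediate representation of the source phrase, the model learns to produce the correct translation.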

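Besides translation, the traditional task for this sequence-to-sequence architecture, Transformers are also applied to text generation (with GPT-like architectures, which use the decoder part) and text understanding (mostly with BERT-like architectures, which use the encoder part).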

Now let's take a look at Gramformer.

3. Grammar error detection and correction with Gramformer

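Gramformer is an open-source tool for detecting and correcting grammar errors in English text:

Gramformer is a library that exposes 3 separate interfaces to a family of algorithms to detect, highlight and correct grammar errors. To make sure the corrections and highlights recommended are of high quality, it comes with a quality estimator.

GitHub (n.d.)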

3.1 Grammar error detection and correction with machine learning: example code

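Let's now look at how to build a grammar error detection and correction system with Gramformer. Below you will find instructions for installing Gramformer, using it to obtain corrected text, retrieving the individual edits, and getting highlights where errors were detected.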

3.2 Installing Gramformer

Installing Gramformer is straightforward: you can install it directly from the Gramformer GitHub repository with pip:

pip install -U git+https://github.com/PrithivirajDamodaran/Gramformer.git

Issues you may run into when installing Gramformer

Problems with lm-scorer
Errant not installed
spaCy model 'en' not found (see "SpaCy OSError: Can't find model 'en'" on Stack Overflow)
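Before running the examples, it can help to check which of these packages Python can actually find. The sketch below uses only the standard library. The helper name missing_modules is hypothetical, and the import names in REQUIRED (gramformer, errant, lm_scorer, spacy) are assumptions about how the packages above are imported.

```python
import importlib.util

# Assumed import names for the packages mentioned above.
REQUIRED = ["gramformer", "errant", "lm_scorer", "spacy"]

def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# An empty list means all of the assumed modules are importable.
print(missing_modules(REQUIRED))
```

This only checks that the modules can be located. It does not verify that a spaCy language model has been downloaded.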

4. Getting the corrected text

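Getting corrected text from Gramformer is easy and takes the following steps:

Specify the imports.
Fix the PyTorch seed.
Initialize Gramformer.
Specify the incorrect phrases.
Let Gramformer make suggestions for the phrases, including corrections.
Print the corrected phrases.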

Let's start with the imports. We import Gramformer, and PyTorch through torch.

# Imports
from gramformer import Gramformer
import torch

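Next, we fix the seed. This means that all random number generation starts from the same initial state, so any differences between runs cannot be attributed to random number generation.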

# Fix seed, also on GPU
def fix_seed(value):
  torch.manual_seed(value)
  if torch.cuda.is_available():
    torch.cuda.manual_seed_all(value)
    
fix_seed(42)
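The effect of a fixed seed can be seen without PyTorch. The sketch below uses Python's standard random module as a stand-in for the PyTorch generator; the helper name sample is hypothetical.

```python
import random

def sample(seed, n=3):
    """Draw n pseudo-random floats from a generator initialized with the given seed."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

# The same seed reproduces the same sequence; a different seed does not.
print(sample(42) == sample(42))
print(sample(42) == sample(43))
```

The same principle applies to torch.manual_seed: sampling steps inside the model start from the same generator state on every run.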

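Then we initialize Gramformer. We set models to 1, which selects the correction model, and instruct it not to use the GPU. If you have a dedicated GPU, you can of course set use_gpu to True.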

# Initialize Gramformer
grammar_correction = Gramformer(models = 1, use_gpu=False)

Next, let's create a list with three grammatically incorrect phrases:

# Incorrect phrases
phrases = [
  'How is you doing?',
  'We is on the supermarket.',
  'Hello you be in school for lecture.'
]

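...after which we can let Gramformer improve them. For each phrase, we ask Gramformer for up to two candidate corrections and then print the incorrect phrase along with the suggested improvements.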

# Improve each phrase
for phrase in phrases:
  corrections = grammar_correction.correct(phrase, max_candidates=2)
  print(f'[Incorrect phrase] {phrase}')
  for i in range(len(corrections)):
    print(f'[Suggestion #{i}] {corrections[i]}')
  print('~'*100)

Put together, this yields the following code:

# Imports
from gramformer import Gramformer
import torch

# Fix seed, also on GPU
def fix_seed(value):
  torch.manual_seed(value)
  if torch.cuda.is_available():
    torch.cuda.manual_seed_all(value)
    
fix_seed(42)

# Initialize Gramformer
grammar_correction = Gramformer(models = 1, use_gpu=False)

# Incorrect phrases
phrases = [
  'How is you doing?',
  'We is on the supermarket.',
  'Hello you be in school for lecture.'
]

# Improve each phrase
for phrase in phrases:
  corrections = grammar_correction.correct(phrase, max_candidates=2)
  print(f'[Incorrect phrase] {phrase}')
  for i in range(len(corrections)):
    print(f'[Suggestion #{i}] {corrections[i]}')
  print('~'*100)

These are the results when running it:

[Gramformer] Grammar error correct/highlight model loaded..
[Incorrect phrase] How is you doing?
[Suggestion #0] ('How are you doing?', -20.39444351196289)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[Incorrect phrase] We is on the supermarket.
[Suggestion #0] ("We're in the supermarket.", -32.21493911743164)
[Suggestion #1] ('We are at the supermarket.', -32.99837112426758)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[Incorrect phrase] Hello you be in school for lecture.
[Suggestion #0] ('Hello, are you in school for the lecture?', -48.61809539794922)
[Suggestion #1] ('Hello, you are in school for lecture.', -49.94304275512695)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Great! We have just built a grammar checker and correction tool!
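Each suggestion in the output above is a (text, score) pair. If you only want one correction per phrase, a small helper can pick it. The sketch below assumes that a higher (less negative) score means a better candidate, which matches the order in which the suggestions were printed but is not confirmed here. The helper name best_correction is hypothetical.

```python
def best_correction(candidates):
    """Return the text of the highest-scoring (text, score) candidate, or None.

    Assumption: a higher (less negative) score indicates a better correction.
    """
    if not candidates:
        return None
    text, _score = max(candidates, key=lambda candidate: candidate[1])
    return text

# Candidates copied from the output above for 'We is on the supermarket.'
candidates = [
    ("We're in the supermarket.", -32.21493911743164),
    ("We are at the supermarket.", -32.99837112426758),
]
print(best_correction(candidates))  # We're in the supermarket.
```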

5. Getting individual edits

Besides the corrected phrases, we can also print the edits that Gramformer performed:

# Print edits for each improved phrase
for phrase in phrases:
  corrections = grammar_correction.correct(phrase, max_candidates=2)
  print(f'[Incorrect phrase] {phrase}')
  for i in range(len(corrections)):
    edits = grammar_correction.get_edits(phrase, corrections[i][0])
    print(f'[Edits #{i}] {edits}')
  print('~'*100)

As the output shows, is was changed to are in the first phrase, "We is on" became "We're in" in the second, and so on.

[Incorrect phrase] How is you doing?
[Edits #0] [('VERB:SVA', 'is', 1, 2, 'are', 1, 2)]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[Incorrect phrase] We is on the supermarket.
[Edits #0] [('OTHER', 'We is on', 0, 3, "We're in", 0, 2)]
[Edits #1] [('VERB:SVA', 'is', 1, 2, 'are', 1, 2), ('PREP', 'on', 2, 3, 'at', 2, 3)]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[Incorrect phrase] Hello you be in school for lecture.
[Edits #0] [('OTHER', 'Hello', 0, 1, 'Hello,', 0, 1), ('VERB', '', 1, 1, 'are', 1, 2), ('VERB', 'be', 2, 3, '', 3, 3), ('DET', '', 6, 6, 'the', 6, 7), ('NOUN', 'lecture.', 6, 7, 'lecture?', 7, 8)]
[Edits #1] [('OTHER', 'Hello', 0, 1, 'Hello,', 0, 1), ('MORPH', 'be', 2, 3, 'are', 2, 3)]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
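Judging from the output above, each edit appears to be a tuple of the form (type, original text, original start, original end, corrected text, corrected start, corrected end), with offsets counted in whitespace-separated tokens. Under that assumption, the edits can be replayed on the original phrase to reconstruct the correction. The helper name apply_edits is hypothetical.

```python
def apply_edits(phrase, edits):
    """Replay edit tuples on a phrase and return the corrected text.

    Assumes each edit is (type, original, o_start, o_end, corrected, c_start, c_end)
    and that the offsets index whitespace-separated tokens of the phrase.
    """
    tokens = phrase.split()
    # Apply edits from right to left so earlier token offsets stay valid.
    ordered = sorted(edits, key=lambda e: (e[2], e[3]), reverse=True)
    for _type, _original, start, end, corrected, _c_start, _c_end in ordered:
        tokens[start:end] = corrected.split()
    return " ".join(tokens)

# Edits copied from the output above.
edits = [("VERB:SVA", "is", 1, 2, "are", 1, 2), ("PREP", "on", 2, 3, "at", 2, 3)]
print(apply_edits("We is on the supermarket.", edits))  # We are at the supermarket.
```

Insertions show up as edits with an empty original span (for example start 1, end 1), and deletions as edits with an empty corrected text; the slice assignment handles both cases.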

6. Getting highlights

Simply replacing get_edits with highlight produces the original phrase with the errors marked:

# Print highlights for each improved phrase
for phrase in phrases:
  corrections = grammar_correction.correct(phrase, max_candidates=2)
  print(f'[Incorrect phrase] {phrase}')
  for i in range(len(corrections)):
    highlights = grammar_correction.highlight(phrase, corrections[i][0])
    print(f'[Highlights #{i}] {highlights}')
  print('~'*100)

This produces:

[Gramformer] Grammar error correct/highlight model loaded..
[Incorrect phrase] How is you doing?
[Highlights #0] How <c type='VERB:SVA' edit='are'>is</c> you doing?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[Incorrect phrase] We is on the supermarket.
[Highlights #0] <c type='OTHER' edit='We're in'>We is on</c> the supermarket.
[Highlights #1] We <c type='VERB:SVA' edit='are'>is</c> <c type='PREP' edit='at'>on</c> the supermarket.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[Incorrect phrase] Hello you be in school for lecture.
[Highlights #0] <a type='VERB' edit='<c type='OTHER' edit='Hello,'>Hello</c> are'><c type='OTHER' edit='Hello,'>Hello</c></a> you <d type='VERB' edit=''>be</d> in school <a type='DET' edit='for the'>for</a> <c type='NOUN' edit='lecture?'>lecture.</c>
[Highlights #1] <c type='OTHER' edit='Hello,'>Hello</c> you <c type='MORPH' edit='are'>be</c> in school for lecture.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
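If you want to process the highlights programmatically, the markup can be parsed with a regular expression. The sketch below is based only on the tag shapes visible in the output above (the tags a, c, and d with type and edit attributes); reading them as add, change, and delete is an assumption. It does not handle nested markup such as the first highlight of the third phrase. The helper names are hypothetical.

```python
import re

# Matches e.g. <c type='VERB:SVA' edit='are'>is</c>; nested tags are not supported.
TAG = re.compile(r"<([acd]) type='([^']*)' edit='(.*?)'>(.*?)</\1>")

def parse_highlights(marked):
    """Extract tag name, error type, suggested edit, and original text per highlight."""
    return [
        {"op": m.group(1), "type": m.group(2), "edit": m.group(3), "text": m.group(4)}
        for m in TAG.finditer(marked)
    ]

def strip_highlights(marked):
    """Remove the markup and return the original, uncorrected phrase."""
    return TAG.sub(lambda m: m.group(4), marked)

# Highlight copied from the output above.
marked = "We <c type='VERB:SVA' edit='are'>is</c> <c type='PREP' edit='at'>on</c> the supermarket."
for item in parse_highlights(marked):
    print(item)
print(strip_highlights(marked))  # We is on the supermarket.
```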

7. Using Gramformer with HuggingFace Transformers

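According to the install requirements in its setup.py, Gramformer is built on top of HuggingFace Transformers. This means you can also run the Gramformer model with HuggingFace Transformers directly, without installing the Gramformer repository with pip. The example below shows how to use AutoTokenizer and AutoModelForSeq2SeqLM with the pretrained Gramformer tokenizer and model for grammar correction: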

# Imports
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
  
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("prithivida/grammar_error_correcter_v1")

# Load the model
model = AutoModelForSeq2SeqLM.from_pretrained("prithivida/grammar_error_correcter_v1")

# Incorrect phrases
phrases = [
  'How is you doing?',
  'We is on the supermarket.',
  'Hello you be in school for lecture.'
]

# Tokenize text
tokenized_phrases = tokenizer(phrases, return_tensors='pt', padding=True)

# Perform corrections and decode the output
corrections = model.generate(**tokenized_phrases)
corrections = tokenizer.batch_decode(corrections, skip_special_tokens=True)

# Print correction
for i in range(len(corrections)):
  original, correction = phrases[i], corrections[i]
  print(f'[Phrase] {original}')
  print(f'[Suggested phrase] {correction}')
  print('~'*100)

...with the following results:

[Phrase] How is you doing?
[Suggested phrase] How are you doing?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[Phrase] We is on the supermarket.
[Suggested phrase] We are at the supermarket.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[Phrase] Hello you be in school for lecture.
[Suggested phrase] Hello you are in school for lecture.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
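The HuggingFace version above returns one suggestion per phrase. To get several candidates, as with max_candidates in Gramformer, one option is beam search with num_return_sequences. The sketch below is an assumption about how you might do this, not how Gramformer itself generates candidates. The function names, the beam count, and max_length=128 are illustrative choices.

```python
def group_candidates(decoded, per_phrase):
    """Split a flat list of decoded outputs into one list of candidates per phrase."""
    return [decoded[i:i + per_phrase] for i in range(0, len(decoded), per_phrase)]

def suggest(phrases, per_phrase=2, model_name="prithivida/grammar_error_correcter_v1"):
    """Return per_phrase candidate corrections for each phrase using beam search."""
    # Imported lazily so group_candidates can be used without transformers installed.
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    inputs = tokenizer(phrases, return_tensors="pt", padding=True)
    outputs = model.generate(
        **inputs,
        num_beams=max(4, per_phrase),     # beam count must be >= num_return_sequences
        num_return_sequences=per_phrase,  # candidates returned per input phrase
        max_length=128,
    )
    decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    # generate() returns the candidates for phrase 0 first, then phrase 1, and so on.
    return group_candidates(decoded, per_phrase)

# Grouping works on any flat list; the strings below are placeholders.
print(group_candidates(["a1", "a2", "b1", "b2"], 2))  # [['a1', 'a2'], ['b1', 'b2']]
```

Calling suggest(phrases) downloads the model on first use, so it is left out of the example run.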