Lecture 5 Part of Speech Tagging

- POS application: Information Extraction
- POS Open Class
- Problem of word classes: Ambiguity
- Tagsets
- Penn Treebank Tags
- Derived Tags
- Tagged Text Example
- Reasons for automatic POS tagging
- Automatic Taggers
- Unknown Words

Part of Speech (POS)

Also called word classes, morphological classes, or syntactic categories
E.g.: nouns, verbs, adjectives
POS tells us about a word and its neighbors:
- Nouns are often preceded by determiners
- Verbs are often preceded by nouns
- "content" as a noun is pronounced /ˈkɑːntent/
- "content" as an adjective is pronounced /kənˈtent/

POS application: Information Extraction

Given sentence: “Brasilia, the Brazilian capital, was founded in 1960”
Extract information:
- capital(Brazil, Brasilia)
- founded(Brasilia, 1960)
The first step of information extraction is finding all POS tags:
- nouns: Brasilia, capital
- adjectives: Brazilian
- verbs: founded
- numbers: 1960

POS Open Class

Open vs. closed: how readily does a POS category take on new words?
E.g. of open classes:
- Nouns:
  - Proper vs. common: Australia, wombat
  - Mass vs. count: rice, bowls
- Verbs:
  - Rich inflection: go/goes/going/gone/went
  - Auxiliary verbs: be, have, do
  - Transitivity: wait, hit, give
- Adjectives:
  - Gradable vs. non-gradable: happy/happier/happiest, computational
- Adverbs:
  - Manner: slowly
  - Locative: here
  - Degree: really
  - Temporal: today
E.g. of closed classes:
- Prepositions:
  - in, on, with, for, of, over
- Particles:
  - off
- Determiners:
  - Articles: a, an, the
  - Demonstratives: this, that, these, those
  - Quantifiers: each, every, some, two
- Pronouns:
  - Personal: I, me, she
  - Possessive: my, our
  - Interrogative: who, what
- Conjunctions:
  - Coordinating: and, or, but
  - Subordinating: if, although, that
- Modal verbs:
  - Ability: can, could
  - Permission: can, may
  - Possibility: may, might, could, will
  - Necessity: must

Problem of word classes: Ambiguity

Many word types belong to multiple classes
POS depends on context
E.g.: flies
- In "Time flies like an arrow", flies is an inflection of the verb “fly”
- In "Fruit flies like a banana", flies is the plural form of the noun “fly”
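
One way to see how common this ambiguity is: collect the set of tags each word type receives in a tagged corpus. The sketch below is only illustrative. The two hand-tagged sentences and the function name `ambiguous_word_types` are assumptions made for this example, not part of the lecture.

```python
from collections import defaultdict

# Two hand-tagged sentences (assumed for illustration, Penn Treebank style tags).
TAGGED_WORDS = [
    ("Time", "NN"), ("flies", "VBZ"), ("like", "IN"), ("an", "DT"), ("arrow", "NN"),
    ("Fruit", "NN"), ("flies", "NNS"), ("like", "VBP"), ("a", "DT"), ("banana", "NN"),
]

def ambiguous_word_types(tagged_words):
    """Return the word types that occur with more than one tag."""
    tags_by_word = defaultdict(set)
    for word, tag in tagged_words:
        tags_by_word[word.lower()].add(tag)
    return {word: sorted(tags) for word, tags in tags_by_word.items() if len(tags) > 1}

ambiguous = ambiguous_word_types(TAGGED_WORDS)
print(ambiguous)
```

In this toy corpus both "flies" and "like" come out as ambiguous, which is why a tagger has to look at context rather than at the word alone.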

Tagsets

Tagsets

A compact representation of POS information
- Usually fewer than 4 capitalized characters, e.g. NN = noun
- Often includes inflectional distinctions
Major English tagsets:
- Brown: 87 tags
- Penn Treebank: 45 tags
- CLAWS/BNC: 61 tags
- Universal: 12 tags
At least one tagset exists for every major language

Penn Treebank Tags:

Open classes:
- NN: noun
- VB: verb
- JJ: adjective
- RB: adverb
Closed classes:
- DT: determiner
- CD: cardinal number
- IN: preposition
- PRP: personal pronoun
- MD: modal
- CC: coordinating conjunction
- RP: particle
- WP: wh-pronoun
- TO: to

Derived Tags:

Open classes:
- NN (noun, singular):
  - NNS (plural)
  - NNP (proper)
  - NNPS (proper plural)
- VB (verb, base form):
  - VBP (non-3rd-person singular present)
  - VBZ (3rd person singular present)
  - VBD (past tense)
  - VBG (gerund or present participle)
  - VBN (past participle)
- JJ (adjective):
  - JJR (comparative)
  - JJS (superlative)
- RB (adverb):
  - RBR (comparative)
  - RBS (superlative)
Closed classes:
- PRP (personal pronoun):
  - PRP$ (possessive)
- WP (wh-pronoun):
  - WP$ (possessive)
  - WDT (wh-determiner)
  - WRB (wh-adverb)
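
Because each derived tag refines a base tag, a fine-grained tag can be collapsed back to its base class when the inflectional detail is not needed. The following is a minimal sketch, assuming a hand-written table that follows the grouping in the list above. The names `BASE_TAG` and `coarse_tag` are hypothetical and not part of any library.

```python
# Hypothetical lookup table: derived Penn Treebank tag -> base tag,
# following the grouping used in these notes.
BASE_TAG = {
    "NNS": "NN", "NNP": "NN", "NNPS": "NN",
    "VBP": "VB", "VBZ": "VB", "VBD": "VB", "VBG": "VB", "VBN": "VB",
    "JJR": "JJ", "JJS": "JJ",
    "RBR": "RB", "RBS": "RB",
    "PRP$": "PRP",
    "WP$": "WP", "WDT": "WP", "WRB": "WP",
}

def coarse_tag(tag):
    """Map a derived tag to its base tag; other tags are returned unchanged."""
    return BASE_TAG.get(tag, tag)

tagged = [("Brasilia", "NNP"), ("was", "VBD"), ("founded", "VBN"), ("in", "IN"), ("1960", "CD")]
coarse = [(word, coarse_tag(tag)) for word, tag in tagged]
print(coarse)
```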

Tagged Text Example

Automatic Tagging

Reasons for automatic POS tagging

Important for morphological analysis, e.g. lemmatization
For some applications, we want to focus on certain POS
- E.g. nouns are important for information retrieval, adjectives for sentiment analysis
POS tags are very useful features for certain classification tasks
- E.g. genre attribution
POS tags can help with word sense disambiguation
- E.g. cross/NN, cross/VB, cross/JJ all have different meanings
Can be used to build larger structures, e.g. for parsing

Automatic Taggers

Rule-based taggers
Statistical taggers
- Unigram tagger
- Classifier-based tagger
- Hidden Markov Model tagger

Rule-Based Tagging

Typically starts with a list of possible tags for each word, sourced from a lexical resource or a corpus
Often includes other lexical information, e.g. verb subcategorization
Apply rules to narrow the candidates down to a single tag
Large systems have thousands of constraints
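
The narrowing step can be sketched as follows. This is a toy illustration only: the three-word lexicon, the two rules, and all function names are assumptions made for this example, and a real system would use far more (and far more careful) constraints.

```python
# Toy lexicon (assumed): each word maps to its set of possible tags.
LEXICON = {
    "the": {"DT"},
    "fly": {"NN", "VB"},
    "flies": {"NNS", "VBZ"},
}

def no_verb_after_determiner(prev_tags, candidates):
    """Toy rule: directly after a determiner, discard verb readings."""
    if prev_tags == {"DT"}:
        return {t for t in candidates if not t.startswith("VB")}
    return candidates

def verb_after_singular_noun(prev_tags, candidates):
    """Toy rule: directly after a singular noun, keep only verb readings."""
    if prev_tags == {"NN"}:
        return {t for t in candidates if t.startswith("VB")}
    return candidates

RULES = [no_verb_after_determiner, verb_after_singular_noun]

def rule_based_tag(words):
    """Start from all lexicon tags for each word, then let the rules narrow them."""
    tagged = []
    prev_tags = set()
    for word in words:
        # Words missing from the lexicon default to noun (an assumption).
        candidates = set(LEXICON.get(word.lower(), {"NN"}))
        for rule in RULES:
            narrowed = rule(prev_tags, candidates)
            if narrowed:  # never let a rule remove every candidate
                candidates = narrowed
        tagged.append((word, sorted(candidates)))
        prev_tags = candidates
    return tagged

result = rule_based_tag(["the", "fly", "flies"])
print(result)
```

Each word keeps a list of surviving tags, so any word the rules cannot fully disambiguate stays visibly ambiguous instead of being forced to a guess.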

Unigram Tagger

Assign the most common tag to each word type
Requires a corpus of tagged words
The model is just a look-up table
Approximately 90% accuracy
Often considered the baseline for more complex approaches
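
A minimal sketch of such a look-up table, built from a tiny hand-made corpus. The corpus, the function names, and the choice of NN as the fallback tag for unseen words are all assumptions for illustration.

```python
from collections import Counter, defaultdict

def train_unigram_tagger(tagged_sentences):
    """Build a word -> most frequent tag look-up table from tagged sentences."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tag_counts.most_common(1)[0][0] for word, tag_counts in counts.items()}

def unigram_tag(words, table, default_tag="NN"):
    """Tag each word by table look-up; unseen words get default_tag (an assumption)."""
    return [(word, table.get(word.lower(), default_tag)) for word in words]

# Tiny hand-tagged corpus, assumed for illustration.
CORPUS = [
    [("time", "NN"), ("flies", "VBZ")],
    [("the", "DT"), ("flies", "NNS"), ("buzz", "VBP")],
    [("she", "PRP"), ("flies", "VBZ"), ("home", "RB")],
]

table = train_unigram_tagger(CORPUS)
tagged = unigram_tag(["The", "flies", "sing"], table)
print(tagged)
```

Here "flies" is always tagged VBZ because that is its most frequent tag in the corpus, regardless of context. That context blindness is what the more complex taggers try to improve on.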

Classifier-Based Tagging

Use a standard discriminative classifier, such as logistic regression or a neural network, with features:
- Target word
- Lexical context around the word
- Already classified tags in the sentence
Can suffer from error propagation: wrong predictions from previous steps affect the next ones
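
The feature set and the left-to-right tagging loop can be sketched as below. This is not a trained model: `toy_predict` is a hand-written stand-in for a classifier such as logistic regression, and the feature names and the extra suffix feature are assumptions chosen for illustration.

```python
def extract_features(words, i, previous_tags):
    """Features for words[i]: target word, lexical context, and the previous predicted tag."""
    return {
        "word": words[i].lower(),
        "prev_word": words[i - 1].lower() if i > 0 else "<s>",
        "next_word": words[i + 1].lower() if i + 1 < len(words) else "</s>",
        "prev_tag": previous_tags[-1] if previous_tags else "<s>",
        "suffix": words[i][-2:].lower(),
    }

def greedy_tag(words, predict):
    """Tag left to right, feeding each prediction into the features of the next word."""
    tags = []
    for i in range(len(words)):
        tags.append(predict(extract_features(words, i, tags)))
    return list(zip(words, tags))

def toy_predict(features):
    """Hand-written stand-in for a trained classifier (assumption, not a real model)."""
    if features["word"] == "the":
        return "DT"
    if features["prev_tag"] == "DT":
        return "NN"
    if features["prev_tag"] == "NN" and features["word"].endswith("s"):
        return "VBZ"
    return "NN"

tagged = greedy_tag(["the", "fly", "flies"], toy_predict)
print(tagged)
```

Because `prev_tag` comes from the model's own earlier output, a wrong tag early in the sentence changes the features of every later word. That is the error propagation described above.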

Hidden Markov Models

A basic sequential model
Like sequential classifiers, an HMM uses both the previous tag and lexical evidence
Unlike classifiers, it considers all possibilities for the previous tag, and treats previous-tag evidence and lexical evidence as independent of each other
- Less sparsity
- Fast algorithms for sequential prediction
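
The standard fast algorithm for finding the best tag sequence under an HMM is Viterbi decoding. The sketch below scores each step as P(tag | previous tag) × P(word | tag), which is the independence assumption described above. All probabilities are made-up toy numbers rather than estimates from data, the three-tag set is a simplification, and the `unseen` fallback for unknown emissions is an assumption.

```python
def viterbi(words, tags, start_p, trans_p, emit_p, unseen=1e-6):
    """Return the most probable tag sequence for words under a bigram HMM."""
    # best[i][t]: probability of the best tag sequence for words[:i+1] ending in tag t
    best = [{t: start_p[t] * emit_p[t].get(words[0], unseen) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            # Consider every possible previous tag, not just a single earlier prediction.
            prev, score = max(
                ((p, best[i - 1][p] * trans_p[p][t]) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score * emit_p[t].get(words[i], unseen)
            back[i][t] = prev
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return list(zip(words, path))

# Toy model (assumed numbers; emission rows list only a few words).
TAGS = ["DT", "NN", "VB"]
START_P = {"DT": 0.6, "NN": 0.3, "VB": 0.1}
TRANS_P = {
    "DT": {"DT": 0.05, "NN": 0.9, "VB": 0.05},
    "NN": {"DT": 0.1, "NN": 0.3, "VB": 0.6},
    "VB": {"DT": 0.5, "NN": 0.4, "VB": 0.1},
}
EMIT_P = {
    "DT": {"the": 0.9},
    "NN": {"fly": 0.4, "flies": 0.4},
    "VB": {"fly": 0.3, "flies": 0.5},
}

tagged = viterbi(["the", "fly", "flies"], TAGS, START_P, TRANS_P, EMIT_P)
print(tagged)
```

A real implementation would normally work with log probabilities to avoid numerical underflow on long sentences. Plain products are used here only to keep the sketch short.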

Unknown Words

Huge problem in morphologically rich languages
Can use words seen only once (hapax legomena) to make a best guess about words never seen before
- Unknown words tend to be nouns, followed by verbs
- They are unlikely to be determiners
Can use sub-word representations to capture morphology
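
The "seen only once" idea can be sketched by estimating a tag distribution from the words that occur exactly once in the training data and using it as the guess for unseen words. The toy corpus and the function name `hapax_tag_distribution` are assumptions made for this example.

```python
from collections import Counter

def hapax_tag_distribution(tagged_words):
    """Estimate P(tag) for unknown words from the words that occur exactly once."""
    word_counts = Counter(word for word, _ in tagged_words)
    hapax_tags = Counter(tag for word, tag in tagged_words if word_counts[word] == 1)
    total = sum(hapax_tags.values())
    if total == 0:
        return {}
    return {tag: count / total for tag, count in hapax_tags.items()}

# Toy tagged corpus (assumed for illustration).
TAGGED_WORDS = [
    ("the", "DT"), ("wombat", "NN"), ("saw", "VBD"),
    ("the", "DT"), ("quokka", "NN"),
    ("the", "DT"), ("numbat", "NN"), ("slept", "VBD"),
]

distribution = hapax_tag_distribution(TAGGED_WORDS)
best_guess = max(distribution, key=distribution.get)
print(distribution, best_guess)
```

In this toy corpus the frequent word "the" contributes nothing, so determiners get no probability mass, while nouns dominate the estimate. This mirrors the tendencies listed above.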
