Table of contents 目录
- POS application: Information Extraction 词性应用:信息提取
- POS Open Class 开放类词性
- Problem of word classes: Ambiguity 词类问题:模糊性
- Tagsets 标记集
- Penn Treebank Tags:
- Derived Tags: 衍生标签
- Tagged Text Example 标记文本示例
- Reasons for automatic POS tagging 自动词性标注的原因
- Automatic Taggers 自动标注器
- Unknown Words
Part of Speech(POS)
- Also called word classes, morphological classes, or syntactic categories 也称为词类、形态类、句法类别
- E.g.: nouns, verbs, adjectives 例如:名词、动词、形容词
- POS tells us about a word and its neighbors: 词性提供了关于单词及其相邻单词的信息
  - Nouns are often preceded by determiners 名词通常由限定词前置
  - Verbs are often preceded by nouns 动词通常由名词前置
  - content as a noun is pronounced /ˈkɑːntent/
  - content as an adjective is pronounced /kənˈtent/
POS application: Information Extraction 词性应用:信息提取
- Given the sentence: “Brasilia, the Brazilian capital, was founded in 1960”
- Extract information: 提取信息
  - capital(Brazil, Brasilia)
  - founded(Brasilia, 1960)
- The first step of information extraction is finding all POS tags: 信息提取的第一步是找到所有的词性标签
  - nouns: Brasilia, capital
  - adjectives: Brazilian
  - verbs: founded
  - numbers: 1960
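The grouping step can be sketched in a few lines of Python. This is a minimal illustration, not a real tagger: the Penn Treebank tags in `TAGGED` were assigned by hand for this example, and the names `TAGGED`, `COARSE`, and `group_by_pos` are hypothetical.

```python
from collections import defaultdict

# Hand-assigned Penn Treebank tags (an assumption for illustration, not tagger output).
TAGGED = [
    ("Brasilia", "NNP"), (",", ","), ("the", "DT"), ("Brazilian", "JJ"),
    ("capital", "NN"), (",", ","), ("was", "VBD"), ("founded", "VBN"),
    ("in", "IN"), ("1960", "CD"),
]

# Map the first two characters of a tag to a coarse word class.
COARSE = {"NN": "nouns", "JJ": "adjectives", "VB": "verbs", "CD": "numbers"}

def group_by_pos(tagged):
    """Collect words under their coarse class, skipping classes not listed in COARSE."""
    groups = defaultdict(list)
    for word, tag in tagged:
        label = COARSE.get(tag[:2])
        if label is not None:
            groups[label].append(word)
    return dict(groups)

groups = group_by_pos(TAGGED)
print(groups["nouns"])    # ['Brasilia', 'capital']
print(groups["numbers"])  # ['1960']
```

Note that this sketch also files the auxiliary "was" under verbs, because its tag VBD starts with VB; a real extraction pipeline would separate auxiliaries from content verbs.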
POS Open Class 开放类词性
- Open vs. closed: How readily do POS categories take on new words? 开放类 vs. 封闭类:词性类别接受新词的频率如何?
- Examples of open classes: 开放类的例子
  - Nouns:
    - Proper(专有名词) vs. common(普通名词): Australia, wombat
    - Mass(不可数名词) vs. count(可数名词): rice, bowls
  - Verbs:
    - Rich inflection: go/goes/going/gone/went 富有变化
    - Auxiliary verbs(助动词): be, have, do
    - Transitivity: wait, hit, give 及物性
  - Adjectives:
    - Gradable(等级形容词) vs. non-gradable(非等级形容词): happy/happier/happiest, computational
  - Adverbs:
    - Manner(情状副词): slowly
    - Locative(处所副词): here
    - Degree(程度副词): really
    - Temporal(时间副词): today
- Examples of closed classes: 封闭类的例子
  - Prepositions(介词):
    - in, on, with, for, of, over
  - Particles:
    - off
  - Determiners(限定词):
    - Articles(冠词): a, an, the
    - Demonstratives(指示词): this, that, these, those
    - Quantifiers(数量词): each, every, some, two
  - Pronouns(代词):
    - Personal(人称代词): I, me, she
    - Possessive(所有格代词): my, our
    - Interrogative(疑问代词): who, what
  - Conjunctions(连词):
    - Coordinating(并列连词): and, or, but
    - Subordinating(从属连词): if, although, that
  - Modal verbs(情态动词):
    - Ability: can, could
    - Permission: can, may
    - Possibility: may, might, could, will
    - Necessity: must
Problem of word classes: Ambiguity 词类问题:模糊性
- Many word types belong to multiple classes 许多单词类型属于多个类别
- POS depends on context 词性取决于上下文
- E.g.: flies
  - In “Time flies like an arrow”, flies is an inflection of the verb “fly” 在 “Time flies like an arrow” 中,flies 是动词 “fly” 的变形
  - In “Fruit flies like a banana”, flies is the plural form of the noun “fly” 在 “Fruit flies like a banana” 中,flies 是名词 “fly” 的复数形式
Tagsets
Tagsets 标记集
- A compact representation of POS information 词性信息的紧凑表示
  - Usually fewer than 4 capitalized characters, e.g. NN = noun 通常少于4个大写字符。例如 NN = noun
  - Often includes inflectional distinctions 经常包括形态变化的区别
- Major English tagsets: 主要的英语标记集
  - Brown: 87 tags
  - Penn Treebank: 45 tags
  - CLAWS/BNC: 61 tags
  - Universal: 12 tags
- At least one tagset exists for all major languages 所有主要语言至少有一个标记集
Penn Treebank Tags:
- Open classes: 开放类
  - NN: noun 名词
  - VB: verb 动词
  - JJ: adjective 形容词
  - RB: adverb 副词
- Closed classes: 封闭类
  - DT: determiner 限定词
  - CD: cardinal number 基数
  - IN: preposition 介词
  - PRP: personal pronoun 人称代词
  - MD: modal 情态动词
  - CC: coordinating conjunction 并列连词
  - RP: particle 助词
  - WP: wh-pronoun 疑问代词
  - TO: to
Derived Tags: 衍生标签
- Open classes: 开放类
  - NN (noun singular): 单数名词
    - NNS (plural) 复数
    - NNP (proper) 专有名词
    - NNPS (proper plural) 复数专有名词
  - VB (verb infinitive): 不定式动词
    - VBP (non-3rd-person singular present) 非第三人称单数现在时
    - VBZ (3rd person singular present) 第三人称单数现在时
    - VBD (past tense) 过去时
    - VBG (gerund or present participle) 动名词或现在分词
    - VBN (past participle) 过去分词
  - JJ (adjective): 形容词
    - JJR (comparative) 比较级
    - JJS (superlative) 最高级
  - RB (adverb): 副词
    - RBR (comparative) 比较级
    - RBS (superlative) 最高级
- Closed classes: 封闭类
  - PRP (pronoun personal): 人称代词
    - PRP$ (possessive) 所有格
  - WP (wh-pronoun): 疑问代词
    - WP$ (possessive) 所有格
    - WDT (wh-determiner) 疑问限定词
    - WRB (wh-adverb) 疑问副词
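When only the coarse word class matters, derived tags can be collapsed back to the base tag they are listed under. The sketch below simply encodes the grouping from the list above as a lookup table; the names `BASE_TAG` and `base_tag` are hypothetical, and grouping WDT and WRB under WP follows this section's layout rather than any official standard.

```python
# Derived tag -> base tag, following the grouping used in the list above.
BASE_TAG = {
    "NNS": "NN", "NNP": "NN", "NNPS": "NN",
    "VBP": "VB", "VBZ": "VB", "VBD": "VB", "VBG": "VB", "VBN": "VB",
    "JJR": "JJ", "JJS": "JJ",
    "RBR": "RB", "RBS": "RB",
    "PRP$": "PRP",
    "WP$": "WP", "WDT": "WP", "WRB": "WP",
}

def base_tag(tag):
    """Return the base tag for a derived tag; tags without an entry are returned unchanged."""
    return BASE_TAG.get(tag, tag)

print(base_tag("NNPS"))  # NN
print(base_tag("DT"))    # DT
```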
Tagged Text Example 标记文本示例
Automatic Tagging
Reasons for automatic POS tagging 自动词性标注的原因
- Important for morphological analysis, e.g. lemmatization 对形态分析很重要。例如:词形还原
- For some applications, we want to focus on certain POS 对于某些应用,我们希望关注某些词性
  - E.g. nouns are important for information retrieval, adjectives for sentiment analysis 例如:名词对于信息检索很重要,形容词对于情感分析很重要
- Very useful features for certain classification tasks 对于某些分类任务,这是非常有用的特性
  - E.g. genre attribution 体裁归属
- POS tags can offer word sense disambiguation 词性标签可以提供词义消歧
  - E.g. cross/NN, cross/VB, cross/JJ all have different meanings
- Can use them to create larger structures 可以用它们来创建更大的结构
Automatic Taggers 自动标注器
- Rule-based taggers 基于规则的标注器
- Statistical taggers 统计标注器
  - Unigram tagger 一元标注器
  - Classifier-based tagger 基于分类器的标注器
  - Hidden Markov Model tagger 隐马尔科夫模型标注器
Rule-Based Tagging
- Typically starts with a list of possible tags for each word, sourced from a lexical resource or a corpus 通常从词典或语料库中为每个单词列出可能的标签开始
- Often includes other lexical information, e.g. verb subcategorization 经常包括其他词汇信息。例如:动词下类化
- Apply rules to narrow down to a single tag 应用规则以缩小到一个标签
- Large systems have thousands of constraints 大型系统有数千个约束
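The procedure above (look up candidate tags, then apply constraints until one tag remains) can be sketched as follows. Everything here is a toy assumption: `LEXICON` is a made-up five-word lexicon, there is a single hand-written constraint, unknown words default to NN, and leftover ambiguity is resolved by an arbitrary alphabetical tie-break. A real system would have thousands of constraints.

```python
# Hypothetical lexicon: each word type maps to its set of possible tags.
LEXICON = {
    "the": {"DT"},
    "old": {"JJ"},
    "can": {"MD", "NN", "VB"},
    "man": {"NN", "VB"},
    "rusts": {"VBZ"},
}

def rule_based_tag(words):
    """Start from all candidate tags per word, then prune them with a constraint."""
    # Unknown words default to NN (an assumption made for this sketch).
    candidates = [set(LEXICON.get(w.lower(), {"NN"})) for w in words]
    for i in range(1, len(candidates)):
        prev = candidates[i - 1]
        # Constraint: directly after an unambiguous determiner or adjective,
        # discard modal and verb readings, provided something else remains.
        if len(prev) == 1 and prev <= {"DT", "JJ"}:
            kept = {t for t in candidates[i] if t != "MD" and not t.startswith("VB")}
            if kept:
                candidates[i] = kept
    # Any remaining ambiguity is resolved by an arbitrary alphabetical tie-break.
    return [sorted(c)[0] for c in candidates]

print(rule_based_tag(["the", "old", "man"]))  # ['DT', 'JJ', 'NN']
```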
Unigram Tagger
- Assign most common tag to each word type 为每个单词类型分配最常见的标签
- Requires a corpus of tagged words 需要一个标记过的词语的语料库
- Just a look-up table 只是一个查找表
- Approximately 90% accuracy 精度约为90%
- Often considered the baseline for more complex approaches 通常被认为是更复杂方法的基线
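A unigram tagger fits in a few lines, which is why it makes a convenient baseline. The sketch below trains the look-up table from a tiny hand-made corpus; the corpus `TRAIN`, the function names, and the choice of NN as the default tag for unseen words are all assumptions for illustration.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Build the look-up table: each word type maps to its most frequent tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_unigram(model, words, default="NN"):
    """Tag each word independently of its context; unseen words get the default tag."""
    return [(w, model.get(w.lower(), default)) for w in words]

# Toy training corpus (hand-made for this example).
TRAIN = [
    [("time", "NN"), ("flies", "VBZ"), ("fast", "RB")],
    [("the", "DT"), ("flies", "NNS"), ("buzz", "VBP")],
    [("he", "PRP"), ("flies", "VBZ"), ("home", "RB")],
]

model = train_unigram(TRAIN)
print(tag_unigram(model, ["the", "flies", "wombat"]))
# [('the', 'DT'), ('flies', 'VBZ'), ('wombat', 'NN')]
```

The output shows the limitation of the approach: "flies" after "the" is tagged VBZ because context is ignored, even though the noun reading is the right one there.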
Classifier-Based Tagging
- Use a standard discriminative classifier, such as logistic regression or a neural network, with features such as: 使用如逻辑回归或神经网络这样的标准判别式分类器,其特征包括
  - Target word 目标词
  - Lexical context around the word 词周围的词汇上下文
  - Already classified tags in the sentence 句子中已分类的标签
- Can suffer from error propagation: wrong predictions from previous steps affect the next ones 可能受到错误传播的影响:前一步的错误预测影响下一步
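The three feature groups and the left-to-right decoding loop can be sketched as below. The feature names are illustrative, and `toy_predict` is a hypothetical hand-written stand-in for a trained classifier (for example logistic regression); only the structure of the loop is the point here.

```python
def extract_features(words, i, prev_tags):
    """Features for the word at position i: target word, lexical context, earlier tags."""
    word = words[i]
    return {
        "word": word.lower(),
        "suffix3": word[-3:].lower(),
        "is_capitalized": word[0].isupper(),
        "prev_word": words[i - 1].lower() if i > 0 else "<s>",
        "next_word": words[i + 1].lower() if i + 1 < len(words) else "</s>",
        "prev_tag": prev_tags[-1] if prev_tags else "<s>",
    }

def greedy_tag(words, predict):
    """Tag left to right, feeding each predicted tag into the features of the next word."""
    tags = []
    for i in range(len(words)):
        tags.append(predict(extract_features(words, i, tags)))
    return tags

def toy_predict(feats):
    """Hypothetical stand-in for a trained classifier."""
    if feats["word"] == "the":
        return "DT"
    if feats["prev_tag"] == "DT":
        return "NN"
    return "VBZ"

print(greedy_tag(["The", "dog", "barks"], toy_predict))  # ['DT', 'NN', 'VBZ']
```

Because `prev_tag` comes from the model's own earlier output, one wrong tag changes the features of every later word, which is exactly the error propagation problem.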
Hidden Markov Models
- A basic sequential model 一个基本的序列模型
- Like sequential classifiers, uses both the previous tag and lexical evidence 与序列分类器一样,使用前一个标签和词汇证据
- Unlike classifiers, considers all possibilities for the previous tag and treats previous-tag evidence and lexical evidence as independent of each other 与分类器不同的是,它考虑了前一个标签的所有可能性,并将前一个标签的证据和词汇证据视为相互独立的
  - Less sparsity 稀疏度较小
  - Fast algorithms for sequential prediction 针对序列预测的快速算法
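One standard fast algorithm for sequential prediction with an HMM is Viterbi decoding. The sketch below multiplies a transition probability (previous-tag evidence) by an emission probability (lexical evidence) and keeps the best path for every possible previous tag. All probabilities are invented for this example, the tag set is reduced to three tags, and the function assumes a non-empty sentence.

```python
def viterbi(words, tags, start_p, trans_p, emit_p, unk=1e-6):
    """Return the most probable tag sequence; unk is a small emission probability for unseen pairs."""
    # best[i][t] = (probability of the best path ending in tag t at position i, previous tag)
    best = [{t: (start_p[t] * emit_p[t].get(words[0], unk), None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            emission = emit_p[t].get(words[i], unk)
            # Consider every possible previous tag, not just one predicted tag.
            prev, score = max(
                ((p, best[i - 1][p][0] * trans_p[p][t] * emission) for p in tags),
                key=lambda pair: pair[1],
            )
            column[t] = (score, prev)
        best.append(column)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]

# Toy parameters (invented for illustration).
TAGS = ["DT", "NN", "VB"]
START = {"DT": 0.6, "NN": 0.3, "VB": 0.1}
TRANS = {
    "DT": {"DT": 0.05, "NN": 0.9, "VB": 0.05},
    "NN": {"DT": 0.1, "NN": 0.3, "VB": 0.6},
    "VB": {"DT": 0.5, "NN": 0.4, "VB": 0.1},
}
EMIT = {
    "DT": {"the": 0.9},
    "NN": {"dog": 0.4, "barks": 0.1, "walk": 0.1},
    "VB": {"barks": 0.5, "walk": 0.3},
}

print(viterbi(["the", "dog", "barks"], TAGS, START, TRANS, EMIT))  # ['DT', 'NN', 'VB']
```

For long sentences, summing log probabilities instead of multiplying raw probabilities avoids numerical underflow.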
Unknown Words
- A huge problem in morphologically rich languages 在形态丰富的语言中是一个巨大的问题
- Can use words seen only once (hapax legomena) as the best guess for words never seen before 可以用只出现过一次的词来对从未见过的词进行最佳猜测
  - Tend to be nouns, followed by verbs 倾向于是名词,然后是动词
  - Unlikely to be determiners 不太可能是限定词
- Can use sub-word representations to capture morphology 可以使用子词表示来捕获形态
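Both ideas can be sketched briefly. The first function estimates a tag distribution for unknown words from the words that occur exactly once in a tagged corpus; the second splits a word into character n-grams, one simple kind of sub-word representation. The corpus `CORPUS`, the function names, and the `<` and `>` boundary markers are assumptions for this example, and the first function assumes the corpus contains at least one word that occurs once.

```python
from collections import Counter

def hapax_tag_distribution(tagged_words):
    """Tag distribution over words seen exactly once, used as a proxy for unseen words."""
    word_counts = Counter(word for word, _ in tagged_words)
    hapax_tags = Counter(tag for word, tag in tagged_words if word_counts[word] == 1)
    total = sum(hapax_tags.values())
    return {tag: count / total for tag, count in hapax_tags.items()}

def char_ngrams(word, n=3):
    """Character n-grams of a word, padded with boundary markers."""
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

# Toy corpus (hand-made): "the" occurs three times, every other word once.
CORPUS = [
    ("the", "DT"), ("wombat", "NN"), ("the", "DT"),
    ("quokka", "NN"), ("slept", "VBD"), ("the", "DT"),
]

dist = hapax_tag_distribution(CORPUS)
print(max(dist, key=dist.get))  # NN
print(char_ngrams("going"))     # ['<go', 'goi', 'oin', 'ing', 'ng>']
```

In this toy corpus the determiner never appears among the once-only words, which mirrors the observation that unknown words are unlikely to be determiners.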