Named Entity Recognition (NER): A Walkthrough with Code Examples

I. Directions in NER Development

II. Chinese Datasets

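CCKS2017: evaluation data on Chinese electronic medical records.
Task 1: https://biendata.com/competition/CCKS2017_1/
Task 2: https://biendata.com/competition/CCKS2017_2/
CCKS2018: an entity recognition task in the music domain.
Task: https://biendata.com/competition/CCKS2018_2/
(CoNLL 2002) Annotated Corpus for Named Entity Recognition. Note that this corpus is English, not Chinese.
URL: https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus
NLPCC2018: a spoken language understanding evaluation for task-oriented dialogue systems.
URL: http://tcci.ccf.org.cn/conference/2018/taskdata.php
A company-provided dataset (BosonNLP, judging by the download URL) covering person names, place names, organization names, and proper nouns.
Download: https://bosonnlp.com/dev/resource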

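III. Code Examples

1. HanLP

HanLP is an NLP toolkit made up of a collection of models and algorithms. It is led by 大快搜索 and fully open source, and it aims to bring natural language processing into production use. It supports named entity recognition.

GitHub: https://github.com/hankcs/pyhanlp

Official site: http://hanlp.linrunsoft.com/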

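# Install: pip install pyhanlp
# Install from a mirror in China: pip install pyhanlp -i https://pypi.tuna.tsinghua.edu.cn/simple
# Recognize entities with the CRF segmenter
from pyhanlp import HanLP
# Build a CRF-based segmenter and run it on a sample sentence
CRFnewSegment = HanLP.newSegment("crf")
term_list = CRFnewSegment.seg("我爱北京天安门！")
# Each term prints as word/tag; ns marks a place name
print(term_list)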

[我/r, 爱/v, 北京/ns, 天安门/ns, ！/w]
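The segmenter output above attaches a part-of-speech tag to each word, and in HanLP's tag set `nr`, `ns`, and `nt` mark person, place, and organization names. Below is a minimal sketch, in plain Python, of filtering such output down to named entities. The `(word, tag)` pairs are typed in by hand from the printed result, `ENTITY_TAGS` and `extract_entities` are illustrative names rather than part of pyhanlp, and matching on the two-letter tag prefix is a simplifying assumption.

```python
# Hypothetical mapping from HanLP tag prefixes to coarse entity types.
ENTITY_TAGS = {"nr": "PERSON", "ns": "LOCATION", "nt": "ORGANIZATION"}


def extract_entities(tagged_terms):
    """Keep only the (word, tag) pairs whose tag marks a named entity."""
    entities = []
    for word, tag in tagged_terms:
        # Assumption: sub-tags share the two-letter prefix of their parent tag.
        label = ENTITY_TAGS.get(tag[:2])
        if label is not None:
            entities.append((word, label))
    return entities


# Pairs copied by hand from the printed segmentation result above.
tagged = [("我", "r"), ("爱", "v"), ("北京", "ns"), ("天安门", "ns"), ("！", "w")]
print(extract_entities(tagged))
# → [('北京', 'LOCATION'), ('天安门', 'LOCATION')]
```

With a live pyhanlp result, the pairs could presumably be built from each term's `word` and `nature` fields; that step is left out here so the sketch runs without the library.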

2. NLTK

NLTK is a Python platform for working with human language data.

GitHub: https://github.com/nltk/nltk

Official site: http://www.nltk.org/

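# Install: pip install nltk
# Install from a mirror in China: pip install nltk -i https://pypi.tuna.tsinghua.edu.cn/simple
import nltk
# One-time resource downloads this example relies on (resource names can differ between NLTK versions):
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
# nltk.download('maxent_ne_chunker'); nltk.download('words')
s = 'I love natural language processing technology!'
# Tokenize, tag parts of speech, then chunk named entities
s_token = nltk.word_tokenize(s)
s_tagged = nltk.pos_tag(s_token)
s_ner = nltk.chunk.ne_chunk(s_tagged)
# Prints an nltk.Tree in which named entities appear as labeled subtrees
print(s_ner)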

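3. spaCy

An industrial-strength natural language processing library. It did not support Chinese when this article was written; spaCy has since released Chinese pipelines (for example zh_core_web_sm).

GitHub: https://github.com/explosion/spaCy

Official site: https://spacy.io/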

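# Install: pip install spacy
# Install from a mirror in China: pip install spacy -i https://pypi.tuna.tsinghua.edu.cn/simple
# Download the English model first: python -m spacy download en_core_web_sm
import spacy
# The 'en' shortcut used by older releases was removed in spaCy 3; load the model by its full name
eng_model = spacy.load('en_core_web_sm')
s = 'I want to Beijing learning natural language processing technology!'
# Named entity recognition
s_ent = eng_model(s)
for ent in s_ent.ents:
    # label_ is the string label; label is its integer ID, which varies across spaCy versions
    print(ent, ent.label_, ent.label)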

Beijing GPE 382

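4. Stanford NER

A named entity recognition system based on conditional random fields, developed at Stanford University. Its models are trained on the CoNLL, MUC-6, MUC-7, and ACE named entity corpora.

URL: https://nlp.stanford.edu/software/CRF-NER.shtml

GitHub repository of the Python wrapper: https://github.com/Lynten/stanford-corenlp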

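# Install: pip install stanfordcorenlp
# Install from a mirror in China: pip install stanfordcorenlp -i https://pypi.tuna.tsinghua.edu.cn/simple
# Named entity recognition with stanfordcorenlp
# Download the CoreNLP package first: https://nlp.stanford.edu/software/corenlp-backup-download.html
# (Chinese also needs the Chinese models jar placed in the CoreNLP directory)
# Entity recognition on Chinese text
from stanfordcorenlp import StanfordCoreNLP
zh_model = StanfordCoreNLP(r'stanford-corenlp-full-2018-02-27', lang='zh')
s_zh = '我爱自然语言处理技术！'
ner_zh = zh_model.ner(s_zh)
s_zh1 = '我爱北京天安门！'
ner_zh1 = zh_model.ner(s_zh1)
print(ner_zh)
print(ner_zh1)
# Shut down the background CoreNLP server to free memory
zh_model.close()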

[('我爱', 'O'), ('自然', 'O'), ('语言', 'O'), ('处理', 'O'), ('技术', 'O'), ('！', 'O')]
[('我爱', 'O'), ('北京', 'STATE_OR_PROVINCE'), ('天安门', 'FACILITY'), ('！', 'O')]


# Entity recognition on English text
eng_model = StanfordCoreNLP(r'stanford-corenlp-full-2018-02-27')
s_eng = 'I love natural language processing technology!'
ner_eng = eng_model.ner(s_eng)
s_eng1 = 'I love Beijing Tiananmen!'
ner_eng1 = eng_model.ner(s_eng1)
print(ner_eng)
print(ner_eng1)
# Shut down the background CoreNLP server to free memory
eng_model.close()

[('I', 'O'), ('love', 'O'), ('natural', 'O'), ('language', 'O'), ('processing', 'O'), ('technology', 'O'), ('!', 'O')]
[('I', 'O'), ('love', 'O'), ('Beijing', 'CITY'), ('Tiananmen', 'LOCATION'), ('!', 'O')]
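The `ner` calls above return one `(token, label)` pair per token, so an entity spanning several tokens arrives in pieces. Below is a minimal sketch of merging adjacent tokens that share a non-`O` label into a single span. `group_entities` is an illustrative name, not part of stanfordcorenlp, and merging on label equality alone is a simplifying assumption: it cannot separate two adjacent entities of the same type. The second input is hand-written for illustration and is not real CoreNLP output.

```python
from itertools import groupby


def group_entities(tagged_tokens, joiner=" "):
    """Merge runs of adjacent tokens with the same non-'O' label."""
    entities = []
    # groupby yields one group per run of consecutive equal labels.
    for label, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if label == "O":
            continue  # 'O' marks tokens outside any entity
        text = joiner.join(token for token, _ in group)
        entities.append((text, label))
    return entities


# Copied from the English output printed above.
english = [('I', 'O'), ('love', 'O'), ('Beijing', 'CITY'),
           ('Tiananmen', 'LOCATION'), ('!', 'O')]
print(group_entities(english))
# → [('Beijing', 'CITY'), ('Tiananmen', 'LOCATION')]

# Hand-written example (not real CoreNLP output) with a two-token entity.
made_up = [('Stanford', 'ORGANIZATION'), ('University', 'ORGANIZATION'),
           ('is', 'O'), ('in', 'O'), ('California', 'STATE_OR_PROVINCE')]
print(group_entities(made_up))
# → [('Stanford University', 'ORGANIZATION'), ('California', 'STATE_OR_PROVINCE')]
```

For Chinese output, passing `joiner=""` joins the tokens without inserting spaces.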

5. Crfsuite

You can load your own dataset to train a CRF entity recognition model.

Documentation: https://sklearn-crfsuite.readthedocs.io/en/latest/?badge=latest

The code is available at: https://github.com/yuquanle/StudyForNLP/blob/master/NLPbasic/NER.ipynb
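To make "train on your own dataset" concrete: sklearn-crfsuite expects each sentence as a list of per-token feature dicts, paired with a list of labels. The sketch below shows one way to build that input. The feature set, the toy sentences, and the names `word2features`, `sent2features`, and `train_crf` are all illustrative assumptions, not taken from the linked notebook; the features are English-oriented (capitalization means little for Chinese), and the hyperparameter values are arbitrary.

```python
def word2features(sentence, i):
    """Build a feature dict for the i-th token of a tokenized sentence."""
    word = sentence[i]
    features = {
        "bias": 1.0,
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],
    }
    # Context features from the neighbouring tokens, with boundary markers.
    if i > 0:
        features["prev.lower"] = sentence[i - 1].lower()
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(sentence) - 1:
        features["next.lower"] = sentence[i + 1].lower()
    else:
        features["EOS"] = True  # end of sentence
    return features


def sent2features(sentence):
    return [word2features(sentence, i) for i in range(len(sentence))]


def train_crf(X, y):
    """Fit a CRF; requires `pip install sklearn-crfsuite`."""
    import sklearn_crfsuite  # imported lazily so the rest runs without it
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                               max_iterations=50)
    crf.fit(X, y)
    return crf


# Invented toy data in BIO format, for illustration only.
train_sents = [["I", "love", "Beijing", "!"], ["She", "visited", "Paris", "."]]
train_labels = [["O", "O", "B-LOC", "O"], ["O", "O", "B-LOC", "O"]]
X_train = [sent2features(s) for s in train_sents]
print(X_train[0][2])  # features for "Beijing"
```

With the library installed, `train_crf(X_train, train_labels)` returns a fitted model whose `predict` method takes feature lists built the same way. Two sentences are far too few to learn anything useful; real training needs a labeled corpus such as the datasets listed earlier.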

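IV. Summary

Named entity recognition is an important step in natural language processing applications. It detects not only entity boundaries but also entity types, and it underpins the understanding of text meaning. This article has outlined how NER research progressed from early rule- and dictionary-based methods, through traditional machine learning, to the deep learning methods of recent years; NN-CRF models, which combine a neural network with a CRF, remain the mainstream approach. Data annotation and informal text (reviews, forum posts, and other sources containing previously unseen entities) will remain two challenges. Transfer learning, adversarial learning, distant supervision, graph neural networks, attention mechanisms, NER model compression, multi-category entities, nested entities, and joint entity recognition and entity linking are all likely to be focal points of future NER research.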
