什么是分词器
1、分词器介绍 对文本进行分析处理的一种手段,基本处理逻辑为按照预先制定的分词规则,把原始文档分割成若干更小粒度的词项,粒度大小取决于分词器规则。
常用的中文分词器有ik按照切词的粒度粗细又分为:ik_max_word和ik_smart;英文分词器standard
ik_max_word会将文本做最细粒度的拆分,会穷尽各种可能的组合,适合 Term Query;
ik_smart:会做最粗粒度的拆分,适合 Phrase 查询
下面是对分词器使用的语句:
GET _analyze
{
"text": ["布布努力学习编程"]
,"analyzer": "ik_max_word"
}
{
"tokens" : [
{
"token" : "布",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_CHAR",
"position" : 0
},
{
"token" : "布",
"start_offset" : 1,
"end_offset" : 2,
"type" : "CN_CHAR",
"position" : 1
},
{
"token" : "努力学习",
"start_offset" : 2,
"end_offset" : 6,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "努力",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 3
},
{
"token" : "力学",
"start_offset" : 3,
"end_offset" : 5,
"type" : "CN_WORD",
"position" : 4
},
{
"token" : "学习",
"start_offset" : 4,
"end_offset" : 6,
"type" : "CN_WORD",
"position" : 5
},
{
"token" : "编程",
"start_offset" : 6,
"end_offset" : 8,
"type" : "CN_WORD",
"position" : 6
}
]
}
GET _analyze
{
"text": ["布布努力学习编程"]
,"analyzer": "ik_smart"
}
{
"tokens" : [
{
"token" : "布",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_CHAR",
"position" : 0
},
{
"token" : "布",
"start_offset" : 1,
"end_offset" : 2,
"type" : "CN_CHAR",
"position" : 1
},
{
"token" : "努力学习",
"start_offset" : 2,
"end_offset" : 6,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "编程",
"start_offset" : 6,
"end_offset" : 8,
"type" : "CN_WORD",
"position" : 3
}
]
}
GET _analyze
{
"text": ["布布努力学习编程"]
,"analyzer": "standard"
}
{
"tokens" : [
{
"token" : "布",
"start_offset" : 0,
"end_offset" : 1,
"type" : "<IDEOGRAPHIC>",
"position" : 0
},
{
"token" : "布",
"start_offset" : 1,
"end_offset" : 2,
"type" : "<IDEOGRAPHIC>",
"position" : 1
},
{
"token" : "努",
"start_offset" : 2,
"end_offset" : 3,
"type" : "<IDEOGRAPHIC>",
"position" : 2
},
{
"token" : "力",
"start_offset" : 3,
"end_offset" : 4,
"type" : "<IDEOGRAPHIC>",
"position" : 3
},
{
"token" : "学",
"start_offset" : 4,
"end_offset" : 5,
"type" : "<IDEOGRAPHIC>",
"position" : 4
},
{
"token" : "习",
"start_offset" : 5,
"end_offset" : 6,
"type" : "<IDEOGRAPHIC>",
"position" : 5
},
{
"token" : "编",
"start_offset" : 6,
"end_offset" : 7,
"type" : "<IDEOGRAPHIC>",
"position" : 6
},
{
"token" : "程",
"start_offset" : 7,
"end_offset" : 8,
"type" : "<IDEOGRAPHIC>",
"position" : 7
}
]
}
分词器的生效时间
- 在创建索引的时候会把索引中text类型的字段按照mapping中配置的分词器进行分词存储倒排索引;
- 在查询的时候全文检索,会对搜索条件进行分词做为查询条件去和创建索引时的分词匹配;
分词器的组成
切词器(Tokenizer):用于定义切词(分词)逻辑
词项过滤器(Token Filter):用于对分词之后的单个词项的处理逻辑
字符过滤器(Character Filter):用于处理单个字符
注意:分词器不会对源数据产生影响,分词只是对倒排索引以及搜索词的行为
字符过滤器(Character Filter)
定义:分词之前的预处理,过滤无用字符
字符过滤器分为三种:
字符过滤器-HTML标签过滤器:HTML Strip Character Filter
过滤html标签
html_strip
参数:escaped_tags 需要保留的html标签
“type”: “html_strip”
DELETE test_html_strip_filter
#字符过滤器
PUT test_html_strip_filter
{
"settings": {
"analysis": {
"char_filter": {
"my_char_filter": {
"type": "html_strip",
"escaped_tags": [
"a"
]
}
}
}
}
}
GET test_html_strip_filter/_analyze
{
"tokenizer": "standard",
"char_filter": ["my_char_filter"],
"text": ["<p>I'm so <a>happy</a>!</p>"]
}
结果:
{
"tokens" : [
{
"token" : "I'm",
"start_offset" : 3,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "so",
"start_offset" : 12,
"end_offset" : 14,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "a",
"start_offset" : 16,
"end_offset" : 17,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "happy",
"start_offset" : 18,
"end_offset" : 23,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "a",
"start_offset" : 25,
"end_offset" : 26,
"type" : "<ALPHANUM>",
"position" : 4
}
]
}
上面语句的作用是:对text文本,只保留标签,其余标签不展示
字符过滤器-字符映射过滤器:Mapping Character Filter
通过在索引的mapping映射中指定对某些字符的替换从而完成特定字符的过滤
“type”:“mapping”
##Mapping Character Filter
DELETE my_index
PUT my_index
{
"settings": {
"analysis": {
"char_filter": {
"my_char_filter":{
"type":"mapping",
"mappings":[
"臭 => *",
"傻=> *",
"逼=> *"
]
}
},
"analyzer": {
"my_analyzer":{
"tokenizer":"keyword",
"char_filter":["my_char_filter"]
}
}
}
}
}
GET my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "你就是个臭傻逼"
}
{
"tokens" : [
{
"token" : "你就是个***",
"start_offset" : 0,
"end_offset" : 7,
"type" : "word",
"position" : 0
}
]
}
字符过滤器-正则替换过滤器:Pattern Replace Character Filter
“type”:"pattern_replace"表示正则替换
##Pattern Replace Character Filter
#17611001200
DELETE my_index
PUT my_index
{
"settings": {
"analysis": {
"char_filter": {
"my_char_filter":{
"type":"pattern_replace",
"pattern":"(\\d{3})\\d{4}(\\d{4})",
"replacement":"$1****$2"
}
},
"analyzer": {
"my_analyzer":{
"tokenizer":"keyword",
"char_filter":["my_char_filter"]
}
}
}
}
}
GET my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "您的手机号是17611001200"
}
{
"tokens" : [
{
"token" : "您的手机号是184****6831",
"start_offset" : 0,
"end_offset" : 17,
"type" : "word",
"position" : 0
}
]
}
词项过滤器(Token Filter)
词项过滤器用来处理切词完成之后的词项,例如把大小写转换,删除停用词或同义词处理等。官方同样预置了很多词项过滤器,基本可以满足日常开发的需要。当然也是支持第三方也自行开发的。
standard转大小写、停用词
#转为大写
GET _analyze
{
"tokenizer": "standard",
"filter": ["uppercase"],
"text": ["www elastic co guide"]
}
#转为小写
GET _analyze
{
"tokenizer": "standard",
"filter": ["lowercase"],
"text": ["WWW ELASTIC CO GUIDE"]
}
停用词
在切词完成之后,会被干掉词项,即停用词。停用词可以自定义
在分词器插件的配置文件中可以看到停用词的定义
GET _analyze
{
"tokenizer": "standard",
"filter": ["stop"],
"text": ["what are you doing"]
}
这是IK分词器的停用词
自定义停用词
### 自定义 filter
PUT test_token_filter_stop
{
"settings": {
"analysis": {
"filter": {
"my_filter": {
"type": "stop",
"stopwords": [
"www"
],
"ignore_case": true
}
}
}
}
}
GET test_token_filter_stop/_analyze
{
"tokenizer": "standard",
"filter": ["my_filter"],
"text": ["What www WWW are you doing"]
}
同义词
#同义词
PUT test_token_filter_synonym
{
"settings": {
"analysis": {
"filter": {
"my_synonym": {
"type": "synonym",
"synonyms": ["good, nice => excellent"]
}
}
}
}
}
GET test_token_filter_synonym/_analyze
{
"tokenizer": "standard",
"filter": ["my_synonym"],
"text": ["good"]
}
切词器:Tokenizer
tokenizer 是分词器的核心组成部分之一,其主要作用是分词,或称之为切词。主要用来对原始文本进行细粒度拆分。拆分之后的每一个部分称之为一个 Term,或称之为一个词项。可以把切词器理解为预定义的切词规则。官方内置了很多种切词器,默认的切词器为 standard。