[2024 Ruankao System Architect Case Questions] How many Elasticsearch analyzers do you know? Standard, Simple, Whitespace, and Keyword

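👉 About the author: works in application security and big data, with 8 years of R&D experience and 5 years as an interviewer. Java technical expert, web architect, Alibaba Cloud Expert Blogger, Huawei Cloud Yunxiang Expert, 51CTO Expert Blogger.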



Contents

- 1. What is the Standard analyzer?
- 2. What is the Simple analyzer?
- 3. What is the Whitespace analyzer?
- 4. What is the Keyword analyzer?

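1. What is the Standard analyzer?

The standard analyzer is the default analyzer in Elasticsearch and one of the most widely used. It is built on Lucene's Standard Tokenizer, which splits natural-language text into tokens at the word boundaries defined by the Unicode Text Segmentation algorithm (Unicode Standard Annex #29), and it then lowercases every token. Because it follows Unicode rules rather than the rules of one language, it handles text in most languages reasonably well.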

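Characteristics:

Word recognition: splits text at Unicode word boundaries.
Punctuation: drops most punctuation. URLs and email addresses are not kept whole; for example, https://www.elastic.co becomes the two tokens "https" and "www.elastic.co". To keep them as single tokens, use the uax_url_email tokenizer instead.
Numbers: recognized and kept as tokens.
Special characters: a hyphen splits a word ("wi-fi" becomes "wi" and "fi"), while an apostrophe inside a word is kept ("don't" stays one token).

Example: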

POST _analyze
{
  "analyzer": "standard",
  "text": "Elasticsearch is a powerful search engine. Visit https://www.elastic.co for more information."
}

Output:

{
  "tokens": [
    { "token": "elasticsearch", "start_offset": 0, "end_offset": 13, "type": "<ALPHANUM>", "position": 0 },
    { "token": "is", "start_offset": 14, "end_offset": 16, "type": "<ALPHANUM>", "position": 1 },
    { "token": "a", "start_offset": 17, "end_offset": 18, "type": "<ALPHANUM>", "position": 2 },
    { "token": "powerful", "start_offset": 19, "end_offset": 27, "type": "<ALPHANUM>", "position": 3 },
    { "token": "search", "start_offset": 28, "end_offset": 34, "type": "<ALPHANUM>", "position": 4 },
    { "token": "engine", "start_offset": 35, "end_offset": 41, "type": "<ALPHANUM>", "position": 5 },
    { "token": "visit", "start_offset": 43, "end_offset": 48, "type": "<ALPHANUM>", "position": 6 },
    { "token": "https", "start_offset": 49, "end_offset": 54, "type": "<ALPHANUM>", "position": 7 },
    { "token": "www.elastic.co", "start_offset": 57, "end_offset": 71, "type": "<ALPHANUM>", "position": 8 },
    { "token": "for", "start_offset": 72, "end_offset": 75, "type": "<ALPHANUM>", "position": 9 },
    { "token": "more", "start_offset": 76, "end_offset": 80, "type": "<ALPHANUM>", "position": 10 },
    { "token": "information", "start_offset": 81, "end_offset": 92, "type": "<ALPHANUM>", "position": 11 }
  ]
}
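
To build intuition for why the URL ends up as two tokens, here is a minimal Python sketch that roughly imitates the standard analyzer on plain English text. It is not the real Unicode segmentation algorithm: the regular expression and the function name standard_like_analyze are assumptions made for illustration, and the sketch will differ from Elasticsearch on inputs such as CJK text or numbers like 1,000.

```python
import re

# Rough approximation for English text only (NOT the UAX #29 algorithm):
# a run of word characters, optionally continued across a single "." or "'"
# when more word characters follow immediately.
_WORD = re.compile(r"\w+(?:[.']\w+)*")


def standard_like_analyze(text: str) -> list[str]:
    """Return lowercased tokens, loosely mimicking the standard analyzer."""
    return [match.group().lower() for match in _WORD.finditer(text)]


if __name__ == "__main__":
    sample = ("Elasticsearch is a powerful search engine. "
              "Visit https://www.elastic.co for more information.")
    print(standard_like_analyze(sample))
```

The "://" separator matches neither branch of the pattern, so "https" and "www.elastic.co" come out as separate tokens, while the dots inside the host name are kept because letters follow them.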

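2. What is the Simple analyzer?

The simple analyzer is a minimal analyzer built on the lowercase tokenizer. It splits text at every character that is not a letter (spaces, punctuation, digits, and so on), keeps only the runs of letters, and converts them to lowercase.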

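Characteristics:

Simple splitting: breaks text only at non-letter characters.
Lowercasing: converts every letter to lowercase.
No numbers: digits count as non-letter characters, so they act as separators and are discarded.

Example: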

POST _analyze
{
  "analyzer": "simple",
  "text": "Elasticsearch is a powerful search engine. Visit www.elastic.co for more information."
}

Output:

{
  "tokens": [
    { "token": "elasticsearch", "start_offset": 0, "end_offset": 13, "type": "word", "position": 0 },
    { "token": "is", "start_offset": 14, "end_offset": 16, "type": "word", "position": 1 },
    { "token": "a", "start_offset": 17, "end_offset": 18, "type": "word", "position": 2 },
    { "token": "powerful", "start_offset": 19, "end_offset": 27, "type": "word", "position": 3 },
    { "token": "search", "start_offset": 28, "end_offset": 34, "type": "word", "position": 4 },
    { "token": "engine", "start_offset": 35, "end_offset": 41, "type": "word", "position": 5 },
    { "token": "visit", "start_offset": 43, "end_offset": 48, "type": "word", "position": 6 },
    { "token": "www", "start_offset": 49, "end_offset": 52, "type": "word", "position": 7 },
    { "token": "elastic", "start_offset": 53, "end_offset": 60, "type": "word", "position": 8 },
    { "token": "co", "start_offset": 61, "end_offset": 63, "type": "word", "position": 9 },
    { "token": "for", "start_offset": 64, "end_offset": 67, "type": "word", "position": 10 },
    { "token": "more", "start_offset": 68, "end_offset": 72, "type": "word", "position": 11 },
    { "token": "information", "start_offset": 73, "end_offset": 84, "type": "word", "position": 12 }
  ]
}
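
The "split at every non-letter, then lowercase" rule is small enough to sketch in a few lines of Python. This is an illustration only: the function name simple_analyze is hypothetical, and it assumes Python's str.isalpha is a close enough stand-in for the letter test Lucene uses.

```python
from itertools import groupby


def simple_analyze(text: str) -> list[str]:
    """Keep only runs of letters and lowercase them (simple-analyzer style)."""
    tokens = []
    # groupby yields consecutive runs of characters that share the same
    # isalpha() result; the non-letter runs act as separators and are dropped.
    for is_letter, chars in groupby(text, key=str.isalpha):
        if is_letter:
            tokens.append("".join(chars).lower())
    return tokens


if __name__ == "__main__":
    print(simple_analyze("Visit www.elastic.co in 2024."))
```

Here the dots split the host name into "www", "elastic", and "co", and "2024" disappears entirely because digits are not letters.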

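3. What is the Whitespace analyzer?

The whitespace analyzer is one of the simplest. It consists of the Whitespace Tokenizer alone, which splits text only at whitespace characters and does nothing to punctuation or other special characters. The example below calls the tokenizer directly; the analyzer of the same name produces the same tokens.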

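Characteristics:

Whitespace splitting: breaks text only at whitespace (spaces, tabs, newlines).
Keeps every other character: punctuation and digits stay attached to their tokens, and letter case is left unchanged.

Example: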

POST _analyze
{
  "tokenizer": "whitespace",
  "text": "Elasticsearch is a powerful search engine. Visit www.elastic.co for more information."
}

Output:

{
  "tokens": [
    { "token": "Elasticsearch", "start_offset": 0, "end_offset": 13, "type": "word", "position": 0 },
    { "token": "is", "start_offset": 14, "end_offset": 16, "type": "word", "position": 1 },
    { "token": "a", "start_offset": 17, "end_offset": 18, "type": "word", "position": 2 },
    { "token": "powerful", "start_offset": 19, "end_offset": 27, "type": "word", "position": 3 },
    { "token": "search", "start_offset": 28, "end_offset": 34, "type": "word", "position": 4 },
    { "token": "engine.", "start_offset": 35, "end_offset": 42, "type": "word", "position": 5 },
    { "token": "Visit", "start_offset": 43, "end_offset": 48, "type": "word", "position": 6 },
    { "token": "www.elastic.co", "start_offset": 49, "end_offset": 63, "type": "word", "position": 7 },
    { "token": "for", "start_offset": 64, "end_offset": 67, "type": "word", "position": 8 },
    { "token": "more", "start_offset": 68, "end_offset": 72, "type": "word", "position": 9 },
    { "token": "information.", "start_offset": 73, "end_offset": 85, "type": "word", "position": 10 }
  ]
}
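
The start_offset and end_offset values are character positions in the original text, with the end offset exclusive. The following Python sketch reproduces them for whitespace splitting; the function name whitespace_tokenize is hypothetical, and Python's notion of whitespace may differ from Lucene's for a few unusual Unicode characters.

```python
import re


def whitespace_tokenize(text: str) -> list[tuple[str, int, int]]:
    """Return (token, start_offset, end_offset) for each run of non-whitespace."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


if __name__ == "__main__":
    sample = ("Elasticsearch is a powerful search engine. "
              "Visit www.elastic.co for more information.")
    for token, start, end in whitespace_tokenize(sample):
        print(f"{start:>2}-{end:<2} {token}")
```

Because nothing is stripped or lowercased, "engine." keeps its trailing period and spans offsets 35 to 42, one character longer than the corresponding token from the standard or simple analyzer.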

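4. What is the Keyword analyzer?

The keyword analyzer is a "no-op" analyzer. It consists of the Keyword Tokenizer alone, which treats the entire input as a single token, so the text is never split. The example below calls the tokenizer directly; the analyzer of the same name produces the same token.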

Characteristics:

No splitting: the whole input is emitted as one token.
Kept as-is: no conversion or modification is applied.

Example:

POST _analyze
{
  "tokenizer": "keyword",
  "text": "Elasticsearch is a powerful search engine. Visit www.elastic.co for more information."
}

Output:

{
  "tokens": [
    { "token": "Elasticsearch is a powerful search engine. Visit www.elastic.co for more information.", "start_offset": 0, "end_offset": 85, "type": "word", "position": 0 }
  ]
}
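
For comparison with the other three, the keyword behaviour can be sketched in Python as an identity function that reports the whole input as one token. The function name keyword_tokenize is hypothetical, and the sketch only covers non-empty input.

```python
def keyword_tokenize(text: str) -> list[tuple[str, int, int]]:
    """Return the whole input as a single (token, start_offset, end_offset)."""
    # No splitting and no lowercasing: the token is the text itself.
    return [(text, 0, len(text))]


if __name__ == "__main__":
    sample = ("Elasticsearch is a powerful search engine. "
              "Visit www.elastic.co for more information.")
    print(keyword_tokenize(sample))
```

The single token spans offsets 0 to 85, the full length of the sample sentence, which is why this behaviour suits values that should only match exactly, such as IDs or status codes.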