ES-09-Elasticsearch Analyzers

Date: 2023-04-18

Overview

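Elasticsearch analyzers
Topics: the default (standard) analyzer, the ik analyzer, and custom words via the ik extension dictionary
Keywords: keyword, text, ik_max_word, ik_smart, term, term dictionary, inverted index
Official documentation: https://www.elastic.co/cn/
ik analyzer documentation: https://github.com/medcl/elasticsearch-analysis-ik

Core concepts

》Data types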

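keyword: an exact-value field; its content is not split into terms.
text: ordinary full-text content; it is split into terms by an analyzer.

》Analyzer concepts

Term: the smallest unit of storage and lookup in an index.
Term dictionary: the collection of all terms, organized with a structure such as a B+ tree or a hash map.
Inverted index: the table that maps each term to the IDs of the documents containing it.

》Default analyzer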

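Default analyzer: standard.
The standard analyzer handles Chinese poorly: by default it splits Chinese text into single characters.

》ik analyzer

Handles Chinese segmentation well: it splits Chinese text into words.
Offers two modes: fine-grained (ik_max_word) and coarse-grained (ik_smart).

》ik extension dictionary

Some specialized or custom words cannot be segmented correctly out of the box, so they have to be added to a custom extension dictionary, a feature the ik analyzer provides.

Steps

》Analyzing with the default analyzer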

Analyze a Chinese sentence with the default analyzer.

Request example

Method: GET

Request:

curl -X GET http://192.168.3.201:9200/_analyze -H 'Content-Type:application/json' -d '{ "analyzer": "standard", "text": "一句中文。"}'

analyzer: the analyzer to use; when omitted, it defaults to standard.

Response:

{
  "tokens": [
    { "token": "一", "start_offset": 0, "end_offset": 1, "type": "<IDEOGRAPHIC>", "position": 0 },
    { "token": "句", "start_offset": 1, "end_offset": 2, "type": "<IDEOGRAPHIC>", "position": 1 },
    { "token": "中", "start_offset": 2, "end_offset": 3, "type": "<IDEOGRAPHIC>", "position": 2 },
    { "token": "文", "start_offset": 3, "end_offset": 4, "type": "<IDEOGRAPHIC>", "position": 3 }
  ]
}
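The same request can also be issued from a script. Below is a minimal sketch using only Python's standard library. The helper name build_analyze_request is hypothetical, the request is built but not sent, and it uses POST, which the _analyze endpoint accepts alongside GET.

```python
import json
import urllib.request


def build_analyze_request(base_url, text, analyzer=None):
    """Build (but do not send) an _analyze request for the given text."""
    body = {"text": text}
    if analyzer is not None:
        # Leaving "analyzer" out makes Elasticsearch fall back to standard.
        body["analyzer"] = analyzer
    return urllib.request.Request(
        url=f"{base_url}/_analyze",
        data=json.dumps(body, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_analyze_request("http://192.168.3.201:9200", "一句中文。", "standard")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# To actually send it: json.load(urllib.request.urlopen(req))
```

Keeping the construction separate from the network call makes it easy to inspect the JSON body before it reaches the cluster.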

》Installing the ik analyzer

Download the plugin: https://github.com/medcl/elasticsearch-analysis-ik/releases/tag/v7.9.3

[root@192 ES]# ll
total 5124
-rw-r--r--. 1 501 games 4504423 Jan 23 15:40 elasticsearch-analysis-ik-7.9.3.zip

Note: the plugin version you download must match the version of your local Elasticsearch installation.
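The version match can be checked before installing. This is a minimal sketch, assuming the archive keeps its release file name (elasticsearch-analysis-ik-<version>.zip) and that the Elasticsearch version is read from version.number in the response of GET / on the cluster. The helper names are hypothetical and the sample response is trimmed for illustration.

```python
import re


def plugin_version(zip_name):
    """Extract the version from an ik release file name, or None if it does not fit."""
    match = re.fullmatch(r"elasticsearch-analysis-ik-(\d+(?:\.\d+)*)\.zip", zip_name)
    return match.group(1) if match else None


def versions_match(zip_name, root_response):
    """Compare the plugin version with version.number from the GET / response."""
    return plugin_version(zip_name) == root_response["version"]["number"]


# Trimmed, illustrative root response; a real one carries more fields.
root_response = {"version": {"number": "7.9.3"}}
print(versions_match("elasticsearch-analysis-ik-7.9.3.zip", root_response))
```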

Unzip it into the plugins directory under your Elasticsearch root directory, then change its owner and group to es.

[root@192 plugins]# pwd
/usr/local/es/7.9.3/plugins
[root@192 plugins]# chown es:es elasticsearch-analysis-ik-7.9.3 -R
[root@192 plugins]# ll
total 0
drwx------. 3 es es 243 Jan 23 15:41 elasticsearch-analysis-ik-7.9.3

Switch to the es user and restart Elasticsearch (I had already stopped it, so here I simply start it).

[es@192 7.9.3]$ pwd
/usr/local/es/7.9.3
[es@192 7.9.3]$ bin/elasticsearch
...
[2099-01-23T15:46:02,970][INFO ][o.e.p.PluginsService ] [node-1] loaded plugin [analysis-ik]
...

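The startup log shows that the analysis-ik plugin was loaded successfully.

》Analyzing with the ik analyzer

Analyze the same Chinese sentence with the ik analyzer.

Request example

Method: GET

Request: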

curl -X GET http://192.168.3.201:9200/_analyze -H 'Content-Type:application/json' -d '{ "analyzer": "ik_smart", "text": "一句中文。"}'

Response:

{
  "tokens": [
    { "token": "一句", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 0 },
    { "token": "中文", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 1 }
  ]
}
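To compare the two analyzers side by side, the token strings can be pulled out of each _analyze response. The sketch below uses the two responses shown in this post, reduced to the one field it reads. The helper name token_texts is hypothetical.

```python
def token_texts(analyze_response):
    """Return the token strings from an _analyze response, in order."""
    return [t["token"] for t in analyze_response["tokens"]]


# The standard and ik_smart responses from this post, reduced to "token".
standard = {"tokens": [{"token": "一"}, {"token": "句"}, {"token": "中"}, {"token": "文"}]}
ik_smart = {"tokens": [{"token": "一句"}, {"token": "中文"}]}

print(token_texts(standard))  # ['一', '句', '中', '文']
print(token_texts(ik_smart))  # ['一句', '中文']
```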

》Custom words with the ik extension dictionary

Custom word: 粉奶方配儿幼

Change to the ik analyzer's config directory.

[es@192 config]$ pwd
/usr/local/es/7.9.3/plugins/elasticsearch-analysis-ik-7.9.3/config

Create an extension dictionary file and add the custom word to it.

[es@192 config]$ vi custom.dic
粉奶方配儿幼
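The dictionary file can also be generated from a script. An ik dictionary is plain text with one word per line and should be saved as UTF-8. The sketch below writes to a temporary directory that stands in for the plugin's config directory. The helper name write_dictionary is hypothetical.

```python
import tempfile
from pathlib import Path


def write_dictionary(path, words):
    """Write one word per line as UTF-8, skipping blanks and duplicates."""
    unique = list(dict.fromkeys(w.strip() for w in words if w.strip()))
    Path(path).write_text("\n".join(unique) + "\n", encoding="utf-8")
    return unique


# A temporary directory stands in for the plugin's config directory.
with tempfile.TemporaryDirectory() as config_dir:
    dic_path = Path(config_dir) / "custom.dic"
    written = write_dictionary(dic_path, ["粉奶方配儿幼", "粉奶方配儿幼", " "])
    content = dic_path.read_text(encoding="utf-8")
    print(written)
```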

Register the extension dictionary

Open the configuration file IKAnalyzer.cfg.xml:

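<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer 扩展配置</comment>
    <entry key="ext_dict"></entry>
</properties>

After the change:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer 扩展配置</comment>
    <entry key="ext_dict">custom.dic</entry>
</properties>

Note the line that changed: <entry key="ext_dict">custom.dic</entry>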
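The same change can be made programmatically. This is a minimal sketch, assuming the configuration uses the Java properties XML layout with an entry keyed ext_dict, as in the IKAnalyzer.cfg.xml shipped with the plugin. It only transforms a string, the sample config is a reduced stand-in without the XML declaration and DOCTYPE, and the helper name set_ext_dict is hypothetical.

```python
import xml.etree.ElementTree as ET


def set_ext_dict(config_xml, dict_files):
    """Return the config XML with the ext_dict entry set to the given files."""
    root = ET.fromstring(config_xml)
    entry = root.find("entry[@key='ext_dict']")
    if entry is None:
        entry = ET.SubElement(root, "entry", {"key": "ext_dict"})
    # Several dictionary files are separated by semicolons.
    entry.text = ";".join(dict_files)
    return ET.tostring(root, encoding="unicode")


# Reduced stand-in for the config file (declaration and DOCTYPE omitted).
config = (
    "<properties>"
    "<comment>IK Analyzer 扩展配置</comment>"
    '<entry key="ext_dict"></entry>'
    "</properties>"
)
updated = set_ext_dict(config, ["custom.dic"])
print(updated)
```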

Switch to the es user and restart Elasticsearch (I had already stopped it, so here I simply start it).

[es@192 7.9.3]$ bin/elasticsearch

Request example

Using Postman, send a GET request to the following URL: http://192.168.3.201:9200/_analyze

Method: GET

Request:

curl -X GET http://192.168.3.201:9200/_analyze -H 'Content-Type:application/json' -d '{ "analyzer": "ik_smart", "text": "粉奶方配儿幼。"}'

Response:

{
  "tokens": [
    { "token": "粉奶方配儿幼", "start_offset": 0, "end_offset": 6, "type": "CN_WORD", "position": 0 }
  ]
}

》Using the ik analyzer when indexing documents

Create an index:

curl -X PUT http://192.168.3.201:9200/index001

Define the mapping for the index:

curl -X POST http://192.168.3.201:9200/index001/_mapping -H 'Content-Type:application/json' -d'{ "properties": { "content": { "type": "text", "analyzer": "ik_max_word", "search_analyzer": "ik_smart" } }}'
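In this mapping, analyzer is applied to field content at index time and search_analyzer is applied to the query text at search time. The sketch below builds the same body from a small helper so the two roles are explicit. The helper name text_field_mapping is hypothetical.

```python
import json


def text_field_mapping(field, index_analyzer, search_analyzer):
    """Build a mapping body for one text field with separate analyzers."""
    return {
        "properties": {
            field: {
                "type": "text",
                "analyzer": index_analyzer,          # applied when indexing
                "search_analyzer": search_analyzer,  # applied to query text
            }
        }
    }


mapping = text_field_mapping("content", "ik_max_word", "ik_smart")
print(json.dumps(mapping, indent=2))
```

A common rationale for this pairing is that fine-grained segmentation at index time stores more candidate terms, while coarse-grained segmentation at search time keeps the query from matching on too many fragments.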

Create a document:

curl -X POST http://192.168.3.201:9200/index001/_create/1 -H 'Content-Type:application/json' -d'{ "content": "中国人民万岁。"}'

Search with highlighting:

curl -X POST http://192.168.3.201:9200/index001/_search -H 'Content-Type:application/json' -d'{ "query": { "match": { "content": "中国" } }, "highlight": { "pre_tags": [ "", "" ], "post_tags": [ "", "" ], "fields": { "content": {} } }}'

Response:

{
  "took": 5,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 1, "relation": "eq" },
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "index001",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.2876821,
        "_source": { "content": "中国人民万岁。" },
        "highlight": { "content": [ "中国人民万岁。" ] }
      }
    ]
  }
}
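The match works through the structures introduced under Core concepts: the query text is turned into terms, and each term is looked up in the inverted index to find document IDs. The toy sketch below shows that lookup. It is not how Elasticsearch stores data internally. The term lists are hypothetical and were not produced by ik, and document 2 is invented for contrast.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Hypothetical, already-analyzed documents: doc ID -> list of terms.
docs = {
    "1": ["中国人民", "中国", "人民", "万岁"],
    "2": ["人民", "广场"],
}

index = build_inverted_index(docs)
term_dictionary = sorted(index)   # the term dictionary: all distinct terms
print(index.get("中国", []))       # IDs of documents containing the term
```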
