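11-ES Autocomplete

Date: 2023-05-04

Autocomplete

When a user types characters into the search box, we should suggest search terms related to what has been typed. Because the suggestions must be inferred from pinyin letters, we need pinyin analysis.

Pinyin analyzer

To complete by letter, documents must be analyzed into pinyin. A pinyin analysis plugin for elasticsearch is available on GitHub: https://github.com/medcl/elasticsearch-analysis-pinyin

It is installed the same way as the IK analyzer; see: https://editor.csdn.net/md/?articleId=122899932

Testing the pinyin analyzer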

POST /_analyze
{
  "text": "谁把爷电动车骑走了？",
  "analyzer": "pinyin"
}

Result:

{
  "tokens" : [
    { "token" : "shui", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 0 },
    { "token" : "sbyddcqzl", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 0 },
    { "token" : "ba", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 1 },
    { "token" : "ye", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 2 },
    { "token" : "dian", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 3 },
    { "token" : "dong", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 4 },
    { "token" : "che", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 5 },
    { "token" : "qi", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 6 },
    { "token" : "zou", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 7 },
    { "token" : "le", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 8 }
  ]
}

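Custom analyzer

The default pinyin analyzer converts each Chinese character to pinyin separately, whereas we want each term to produce one group of pinyin. To get that, the pinyin analyzer has to be customized into a custom analyzer.

An analyzer in elasticsearch is made up of three parts:

character filters: process the text before the tokenizer runs, for example by deleting or replacing characters.

tokenizer: splits the text into terms according to some rule. For example, keyword does not split the text at all; ik_smart is another tokenizer.

token filters: further process the terms emitted by the tokenizer, for example case conversion, synonym handling, or pinyin conversion.

When a document is analyzed, it passes through these three parts in order.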
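The three stages described above can be sketched in plain Java to show the order in which they run. This is a toy illustration, not how elasticsearch implements analysis: the class name AnalyzerPipelineSketch, the punctuation-stripping character filter, the whitespace tokenizer, and the lowercasing token filter are all assumptions chosen to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class AnalyzerPipelineSketch {

    // Stage 1: character filter. Works on the raw text before tokenization.
    // Stripping a few punctuation marks is an illustrative choice.
    public static String charFilter(String text) {
        return text.replaceAll("[?!,.]", "");
    }

    // Stage 2: tokenizer. Cuts the filtered text into terms.
    // Whitespace splitting stands in for a real tokenizer such as ik_smart.
    public static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                terms.add(part);
            }
        }
        return terms;
    }

    // Stage 3: token filter. Post-processes each term (lowercasing here;
    // the pinyin filter would plug in at this stage).
    public static List<String> tokenFilter(List<String> terms) {
        List<String> out = new ArrayList<>();
        for (String term : terms) {
            out.add(term.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Runs the three stages in the order an analyzer applies them.
    public static List<String> analyze(String text) {
        return tokenFilter(tokenize(charFilter(text)));
    }

    public static void main(String[] args) {
        System.out.println(analyze("Who took my E-Bike?")); // prints [who, took, my, e-bike]
    }
}
```

The point of the sketch is only the ordering: the character filter sees one string, the tokenizer turns it into terms, and the token filter sees terms one by one, which is why a pinyin filter can emit pinyin per term rather than per character.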

Declaring the custom analyzer:

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": { // custom analyzer
        "my_analyer": { // analyzer name
          "tokenizer": "ik_max_word",
          "filter": "py"
        }
      },
      "filter": { // custom token filter
        "py": { // filter name
          "type": "pinyin", // filter type
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "remove_duplicated_term": true,
          "none_chinese_pinyin_tokenize": false
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_analyer",
        "search_analyzer": "ik_smart"
      }
    }
  }
}

Testing the custom analyzer

POST /test/_analyze
{
  "text": ["谁把爷电动车骑走了"],
  "analyzer": "my_analyer"
}

Result:

{
  "tokens" : [
    { "token" : "谁", "start_offset" : 0, "end_offset" : 1, "type" : "CN_CHAR", "position" : 0 },
    { "token" : "shui", "start_offset" : 0, "end_offset" : 1, "type" : "CN_CHAR", "position" : 0 },
    { "token" : "s", "start_offset" : 0, "end_offset" : 1, "type" : "CN_CHAR", "position" : 0 },
    { "token" : "把", "start_offset" : 1, "end_offset" : 2, "type" : "CN_CHAR", "position" : 1 },
    { "token" : "ba", "start_offset" : 1, "end_offset" : 2, "type" : "CN_CHAR", "position" : 1 },
    { "token" : "b", "start_offset" : 1, "end_offset" : 2, "type" : "CN_CHAR", "position" : 1 },
    { "token" : "爷", "start_offset" : 2, "end_offset" : 3, "type" : "CN_CHAR", "position" : 2 },
    { "token" : "ye", "start_offset" : 2, "end_offset" : 3, "type" : "CN_CHAR", "position" : 2 },
    { "token" : "y", "start_offset" : 2, "end_offset" : 3, "type" : "CN_CHAR", "position" : 2 },
    { "token" : "电动车", "start_offset" : 3, "end_offset" : 6, "type" : "CN_WORD", "position" : 3 },
    { "token" : "diandongche", "start_offset" : 3, "end_offset" : 6, "type" : "CN_WORD", "position" : 3 },
    { "token" : "ddc", "start_offset" : 3, "end_offset" : 6, "type" : "CN_WORD", "position" : 3 },
    { "token" : "电动", "start_offset" : 3, "end_offset" : 5, "type" : "CN_WORD", "position" : 4 },
    { "token" : "diandong", "start_offset" : 3, "end_offset" : 5, "type" : "CN_WORD", "position" : 4 },
    { "token" : "dd", "start_offset" : 3, "end_offset" : 5, "type" : "CN_WORD", "position" : 4 },
    { "token" : "车骑", "start_offset" : 5, "end_offset" : 7, "type" : "CN_WORD", "position" : 5 },
    { "token" : "cheqi", "start_offset" : 5, "end_offset" : 7, "type" : "CN_WORD", "position" : 5 },
    { "token" : "cq", "start_offset" : 5, "end_offset" : 7, "type" : "CN_WORD", "position" : 5 },
    { "token" : "骑走", "start_offset" : 6, "end_offset" : 8, "type" : "CN_WORD", "position" : 6 },
    { "token" : "qizou", "start_offset" : 6, "end_offset" : 8, "type" : "CN_WORD", "position" : 6 },
    { "token" : "qz", "start_offset" : 6, "end_offset" : 8, "type" : "CN_WORD", "position" : 6 },
    { "token" : "走了", "start_offset" : 7, "end_offset" : 9, "type" : "CN_WORD", "position" : 7 },
    { "token" : "zoule", "start_offset" : 7, "end_offset" : 9, "type" : "CN_WORD", "position" : 7 },
    { "token" : "zl", "start_offset" : 7, "end_offset" : 9, "type" : "CN_WORD", "position" : 7 }
  ]
}

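Autocomplete queries

elasticsearch provides the Completion Suggester query for autocomplete. It matches and returns terms that begin with the user's input. To keep completion queries fast, the field being queried must meet some constraints:

The field used in a completion query must be of type completion.

The field's content is usually an array of the terms to be used for completion.

For example, take an index like this: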

PUT test1
{
  "mappings": {
    "properties": {
      "title": {
        "type": "completion"
      }
    }
  }
}

Then insert the following data:

// sample data (indexed into test1, the index with the completion field)
POST test1/_doc
{
  "title": ["Sony", "WH-1000XM3"]
}
POST test1/_doc
{
  "title": ["SK-II", "PITERA"]
}
POST test1/_doc
{
  "title": ["Nintendo", "switch"]
}

The query DSL is as follows:

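GET test1/_search
{
  "suggest": {
    "title_suggest": {
      "text": "s", // the keyword typed so far
      "completion": {
        "field": "title", // field to run the completion query on
        "skip_duplicates": true, // skip duplicate suggestions
        "size": 10 // return the top 10 results
      }
    }
  }
}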
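To make the semantics of this query concrete, here is a minimal in-memory sketch of prefix completion in Java. It only mimics the observable behaviour (prefix match, skip duplicates, size limit) with a linear scan; the real Completion Suggester uses its own in-memory index structure and its own ranking. The class name PrefixCompletionSketch, the method suggest, and the case-insensitive comparison are assumptions made for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class PrefixCompletionSketch {

    // Returns up to `size` distinct entries that start with `prefix`, ignoring case.
    // Hypothetical helper: a linear scan, not the suggester's actual algorithm.
    public static List<String> suggest(List<String> inputs, String prefix, int size) {
        String needle = prefix.toLowerCase(Locale.ROOT);
        List<String> result = new ArrayList<>();
        for (String input : inputs) {
            if (result.size() >= size) {
                break; // mirrors "size"
            }
            boolean matches = input.toLowerCase(Locale.ROOT).startsWith(needle);
            if (matches && !result.contains(input)) { // mirrors "skip_duplicates"
                result.add(input);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The same terms as the sample documents, flattened into one list.
        List<String> titles = List.of("Sony", "WH-1000XM3", "SK-II", "PITERA", "Nintendo", "switch");
        System.out.println(suggest(titles, "s", 10)); // prints [Sony, SK-II, switch]
    }
}
```

In this sketch the results come back in input order; a real suggester orders its options by its own scoring, so the order in an actual response may differ.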

Example

Entity class:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

import lombok.Data;
import lombok.NoArgsConstructor;

// Hotel is the database entity; it is assumed to live in the same package.
@Data
@NoArgsConstructor
public class HotelDoc {
    private Long id;
    private String name;
    private String address;
    private Integer price;
    private Integer score;
    private String brand;
    private String city;
    private String starName;
    private String business;
    private String location;
    private String pic;
    private Object distance;
    private Boolean isAD;
    // Terms fed to the completion field
    private List<String> suggestion;

    public HotelDoc(Hotel hotel) {
        this.id = hotel.getId();
        this.name = hotel.getName();
        this.address = hotel.getAddress();
        this.price = hotel.getPrice();
        this.score = hotel.getScore();
        this.brand = hotel.getBrand();
        this.city = hotel.getCity();
        this.starName = hotel.getStarName();
        this.business = hotel.getBusiness();
        this.location = hotel.getLatitude() + ", " + hotel.getLongitude();
        this.pic = hotel.getPic();
        // Assemble suggestion
        if (this.business.contains("/")) {
            // business holds several values and has to be split
            String[] arr = this.business.split("/");
            // Add the brand first, then every business area
            this.suggestion = new ArrayList<>();
            this.suggestion.add(this.brand);
            Collections.addAll(this.suggestion, arr);
        } else {
            this.suggestion = Arrays.asList(this.brand, this.business);
        }
    }
}
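The suggestion-assembly step in the constructor can also be pulled out into a small standalone function, which makes it easy to test without a Hotel object. The sketch below is one possible variant, not the original code: the class name SuggestionBuilderSketch, the method buildSuggestions, and the sample values are hypothetical, and the null and empty-string checks are an addition that the constructor above does not have.

```java
import java.util.ArrayList;
import java.util.List;

public class SuggestionBuilderSketch {

    // Builds the completion terms: the brand first, then each "/"-separated business area.
    // Null or blank parts are skipped (an assumption; the original constructor does not guard against null).
    public static List<String> buildSuggestions(String brand, String business) {
        List<String> suggestions = new ArrayList<>();
        if (brand != null && !brand.isEmpty()) {
            suggestions.add(brand);
        }
        if (business != null) {
            for (String part : business.split("/")) {
                String trimmed = part.trim();
                if (!trimmed.isEmpty()) {
                    suggestions.add(trimmed);
                }
            }
        }
        return suggestions;
    }

    public static void main(String[] args) {
        // Placeholder values, not real data.
        System.out.println(buildSuggestions("BrandA", "AreaOne/AreaTwo")); // prints [BrandA, AreaOne, AreaTwo]
        System.out.println(buildSuggestions("BrandA", "AreaOne"));         // prints [BrandA, AreaOne]
    }
}
```

Because split("/") returns a single-element array when there is no slash, one loop covers both branches of the original if/else.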

Method that implements the autocomplete query:

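// Imports needed by this method (elasticsearch high-level REST client);
// `client` is assumed to be a RestHighLevelClient field of the enclosing class.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.search.suggest.Suggest;
import org.elasticsearch.search.suggest.SuggestBuilder;
import org.elasticsearch.search.suggest.SuggestBuilders;
import org.elasticsearch.search.suggest.completion.CompletionSuggestion;

@Override
public List<String> getSuggestions(String prefix) {
    try {
        // 1. Prepare the request
        SearchRequest request = new SearchRequest("hotel");
        // 2. Prepare the DSL
        request.source().suggest(new SuggestBuilder().addSuggestion(
                "suggestions",
                SuggestBuilders.completionSuggestion("suggestion")
                        .prefix(prefix)
                        .skipDuplicates(true)
                        .size(10)
        ));
        // 3. Send the request
        SearchResponse response = client.search(request, RequestOptions.DEFAULT);
        // 4. Parse the result
        Suggest suggest = response.getSuggest();
        // 4.1. Look up the completion result by suggestion name
        CompletionSuggestion suggestions = suggest.getSuggestion("suggestions");
        // 4.2. Get the options
        List<CompletionSuggestion.Entry.Option> options = suggestions.getOptions();
        // 4.3. Collect the text of each option
        List<String> list = new ArrayList<>(options.size());
        for (CompletionSuggestion.Entry.Option option : options) {
            String text = option.getText().toString();
            list.add(text);
        }
        return list;
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}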
