
11 - Autocompletion in Elasticsearch (ES)

Date: 2023-05-04
Autocompletion

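When a user types characters into the search box, we should suggest search terms related to that input. Because the suggestions have to be inferred from pinyin letters, we need a pinyin analyzer.

Pinyin analyzer

To complete on letters, documents must be analyzed into pinyin. There is an Elasticsearch pinyin analysis plugin on GitHub: https://github.com/medcl/elasticsearch-analysis-pinyin

It is installed the same way as the IK analyzer: https://editor.csdn.net/md/?articleId=122899932

Testing the pinyin analyzer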

POST /_analyze
{
  "text": "谁把爷电动车骑走了?",
  "analyzer": "pinyin"
}

Result:

{
  "tokens" : [
    { "token" : "shui", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 0 },
    { "token" : "sbyddcqzl", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 0 },
    { "token" : "ba", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 1 },
    { "token" : "ye", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 2 },
    { "token" : "dian", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 3 },
    { "token" : "dong", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 4 },
    { "token" : "che", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 5 },
    { "token" : "qi", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 6 },
    { "token" : "zou", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 7 },
    { "token" : "le", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 8 }
  ]
}

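Custom analyzer

The default pinyin analyzer converts each Chinese character to pinyin on its own, whereas we want one group of pinyin per term. To get that, the pinyin analyzer has to be customized into a custom analyzer.

An analyzer in Elasticsearch is made up of three parts:

character filters: process the text before it reaches the tokenizer, for example by deleting or replacing characters.
tokenizer: splits the text into terms according to some rule. For example, keyword does not split the text at all; ik_smart is another tokenizer.
token filters: further process the terms emitted by the tokenizer, for example case conversion, synonym handling, or pinyin conversion.

When a document is analyzed, it passes through these three parts in order.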
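
The three stages can be pictured as a plain function chain. The following is only a toy model in Java, assuming a punctuation-stripping character filter, a whitespace tokenizer, and a lowercasing token filter; the class and method names are made up for illustration and have nothing to do with Elasticsearch's internal classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class AnalyzerPipelineSketch {

    /** Character filter stage: rewrites the raw text before tokenization. */
    public static String stripPunctuation(String text) {
        // Replace anything that is not a letter, digit, or whitespace with a space.
        return text.replaceAll("[^\\p{L}\\p{N}\\s]", " ");
    }

    /** Tokenizer stage: cuts the filtered text into terms (here: on whitespace). */
    public static List<String> whitespaceTokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                terms.add(part);
            }
        }
        return terms;
    }

    /** Token filter stage: post-processes every term (here: lowercasing). */
    public static List<String> lowercase(List<String> terms) {
        List<String> out = new ArrayList<>(terms.size());
        for (String term : terms) {
            out.add(term.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    /** Runs the three stages in the same order an analyzer applies them. */
    public static List<String> analyze(String text) {
        return lowercase(whitespaceTokenize(stripPunctuation(text)));
    }

    public static void main(String[] args) {
        // prints [sony, wh, 1000xm3, nintendo, switch]
        System.out.println(analyze("Sony WH-1000XM3, Nintendo Switch!"));
    }
}
```

Swapping the tokenizer or appending another token filter changes the output without touching the other stages, which is the property a custom analyzer relies on.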

Declaring the custom analyzer:

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": { // custom analyzers
        "my_analyer": { // analyzer name
          "tokenizer": "ik_max_word",
          "filter": "py"
        }
      },
      "filter": { // custom token filters
        "py": { // filter name
          "type": "pinyin", // filter type
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "remove_duplicated_term": true,
          "none_chinese_pinyin_tokenize": false
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_analyer",
        "search_analyzer": "ik_smart"
      }
    }
  }
}

Testing the custom analyzer

POST /test/_analyze
{
  "text": ["谁把爷电动车骑走了"],
  "analyzer": "my_analyer"
}

Result:

{
  "tokens" : [
    { "token" : "谁", "start_offset" : 0, "end_offset" : 1, "type" : "CN_CHAR", "position" : 0 },
    { "token" : "shui", "start_offset" : 0, "end_offset" : 1, "type" : "CN_CHAR", "position" : 0 },
    { "token" : "s", "start_offset" : 0, "end_offset" : 1, "type" : "CN_CHAR", "position" : 0 },
    { "token" : "把", "start_offset" : 1, "end_offset" : 2, "type" : "CN_CHAR", "position" : 1 },
    { "token" : "ba", "start_offset" : 1, "end_offset" : 2, "type" : "CN_CHAR", "position" : 1 },
    { "token" : "b", "start_offset" : 1, "end_offset" : 2, "type" : "CN_CHAR", "position" : 1 },
    { "token" : "爷", "start_offset" : 2, "end_offset" : 3, "type" : "CN_CHAR", "position" : 2 },
    { "token" : "ye", "start_offset" : 2, "end_offset" : 3, "type" : "CN_CHAR", "position" : 2 },
    { "token" : "y", "start_offset" : 2, "end_offset" : 3, "type" : "CN_CHAR", "position" : 2 },
    { "token" : "电动车", "start_offset" : 3, "end_offset" : 6, "type" : "CN_WORD", "position" : 3 },
    { "token" : "diandongche", "start_offset" : 3, "end_offset" : 6, "type" : "CN_WORD", "position" : 3 },
    { "token" : "ddc", "start_offset" : 3, "end_offset" : 6, "type" : "CN_WORD", "position" : 3 },
    { "token" : "电动", "start_offset" : 3, "end_offset" : 5, "type" : "CN_WORD", "position" : 4 },
    { "token" : "diandong", "start_offset" : 3, "end_offset" : 5, "type" : "CN_WORD", "position" : 4 },
    { "token" : "dd", "start_offset" : 3, "end_offset" : 5, "type" : "CN_WORD", "position" : 4 },
    { "token" : "车骑", "start_offset" : 5, "end_offset" : 7, "type" : "CN_WORD", "position" : 5 },
    { "token" : "cheqi", "start_offset" : 5, "end_offset" : 7, "type" : "CN_WORD", "position" : 5 },
    { "token" : "cq", "start_offset" : 5, "end_offset" : 7, "type" : "CN_WORD", "position" : 5 },
    { "token" : "骑走", "start_offset" : 6, "end_offset" : 8, "type" : "CN_WORD", "position" : 6 },
    { "token" : "qizou", "start_offset" : 6, "end_offset" : 8, "type" : "CN_WORD", "position" : 6 },
    { "token" : "qz", "start_offset" : 6, "end_offset" : 8, "type" : "CN_WORD", "position" : 6 },
    { "token" : "走了", "start_offset" : 7, "end_offset" : 9, "type" : "CN_WORD", "position" : 7 },
    { "token" : "zoule", "start_offset" : 7, "end_offset" : 9, "type" : "CN_WORD", "position" : 7 },
    { "token" : "zl", "start_offset" : 7, "end_offset" : 9, "type" : "CN_WORD", "position" : 7 }
  ]
}

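The result above follows a regular pattern: every term from ik_max_word appears three times, as the original text, as the joined full pinyin, and as the first letters. The Java sketch below imitates that expansion for a single term. It is only a model of the observed output, not the plugin's code, and the PINYIN lookup table is a hand-written assumption covering just a few demo characters.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class PinyinFilterSketch {

    // Hypothetical lookup table for the demo characters only.
    // The real plugin ships its own dictionary.
    private static final Map<Character, String> PINYIN = Map.of(
            '电', "dian", '动', "dong", '车', "che",
            '骑', "qi", '走', "zou", '了', "le");

    /** Expands one term into: original text, joined full pinyin, first letters. */
    public static List<String> expand(String term) {
        StringBuilder joined = new StringBuilder();
        StringBuilder initials = new StringBuilder();
        for (char c : term.toCharArray()) {
            String syllable = PINYIN.get(c);
            if (syllable == null) {
                // Character not in the demo table: emit only the original term.
                return List.of(term);
            }
            joined.append(syllable);
            initials.append(syllable.charAt(0));
        }
        // LinkedHashSet keeps insertion order and drops repeated terms,
        // loosely mirroring remove_duplicated_term.
        Set<String> out = new LinkedHashSet<>();
        out.add(term);                 // keep_original
        out.add(joined.toString());    // keep_joined_full_pinyin
        out.add(initials.toString());  // first-letter form
        return new ArrayList<>(out);
    }

    public static void main(String[] args) {
        System.out.println(expand("电动车")); // prints [电动车, diandongche, ddc]
        System.out.println(expand("走了"));   // prints [走了, zoule, zl]
    }
}
```
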
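Completion queries

Elasticsearch provides the Completion Suggester query for autocompletion. It matches and returns the terms that begin with the user's input. To keep completion queries efficient, it places some constraints on the fields of the document:

The field used in a completion query must be of type completion.

The field's content is usually an array of the terms to complete on.

For example, take an index like this: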

PUT test1
{
  "mappings": {
    "properties": {
      "title": {
        "type": "completion"
      }
    }
  }
}

Then insert the following data (into test1, the index created above):

// sample data
POST test1/_doc
{
  "title": ["Sony", "WH-1000XM3"]
}

POST test1/_doc
{
  "title": ["SK-II", "PITERA"]
}

POST test1/_doc
{
  "title": ["Nintendo", "switch"]
}

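The query DSL looks like this:

GET test1/_search
{
  "suggest": {
    "title_suggest": {
      "text": "s", // the prefix typed by the user
      "completion": {
        "field": "title", // field to run the completion query on
        "skip_duplicates": true, // skip duplicate suggestions
        "size": 10 // return the top 10 results
      }
    }
  }
}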
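
To make the prefix matching concrete, here is a small in-memory model in Java of what such a query selects from the sample data. It is a sketch under assumptions: matching is case-insensitive (to my understanding the completion field defaults to the simple analyzer, which lowercases), results come back in insertion order rather than by score, and the class name CompletionSketch is made up. Elasticsearch itself uses a dedicated in-memory structure rather than a linear scan.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class CompletionSketch {

    /** Returns up to `size` distinct inputs that start with `prefix`, ignoring case. */
    public static List<String> suggest(List<List<String>> docs, String prefix, int size) {
        String needle = prefix.toLowerCase(Locale.ROOT);
        // A LinkedHashSet drops repeats, playing the role of skip_duplicates.
        Set<String> hits = new LinkedHashSet<>();
        for (List<String> inputs : docs) {
            for (String input : inputs) {
                if (input.toLowerCase(Locale.ROOT).startsWith(needle)) {
                    hits.add(input);
                }
            }
        }
        List<String> all = new ArrayList<>(hits);
        // Truncate to the requested number of suggestions.
        return new ArrayList<>(all.subList(0, Math.min(size, all.size())));
    }

    public static void main(String[] args) {
        // Same three documents as the sample data above.
        List<List<String>> docs = List.of(
                List.of("Sony", "WH-1000XM3"),
                List.of("SK-II", "PITERA"),
                List.of("Nintendo", "switch"));
        System.out.println(suggest(docs, "s", 10)); // prints [Sony, SK-II, switch]
    }
}
```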

Example

Entity class:

import lombok.Data;
import lombok.NoArgsConstructor;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

@Data
@NoArgsConstructor
public class HotelDoc {
    private Long id;
    private String name;
    private String address;
    private Integer price;
    private Integer score;
    private String brand;
    private String city;
    private String starName;
    private String business;
    private String location;
    private String pic;
    private Object distance;
    private Boolean isAD;
    // Inputs for the completion field
    private List<String> suggestion;

    public HotelDoc(Hotel hotel) {
        this.id = hotel.getId();
        this.name = hotel.getName();
        this.address = hotel.getAddress();
        this.price = hotel.getPrice();
        this.score = hotel.getScore();
        this.brand = hotel.getBrand();
        this.city = hotel.getCity();
        this.starName = hotel.getStarName();
        this.business = hotel.getBusiness();
        this.location = hotel.getLatitude() + ", " + hotel.getLongitude();
        this.pic = hotel.getPic();
        // Assemble the suggestion list
        if (this.business.contains("/")) {
            // business holds several values and has to be split
            String[] arr = this.business.split("/");
            // Add the brand first, then every business district
            this.suggestion = new ArrayList<>();
            this.suggestion.add(this.brand);
            Collections.addAll(this.suggestion, arr);
        } else {
            this.suggestion = Arrays.asList(this.brand, this.business);
        }
    }
}
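
The suggestion assembly in the constructor can also be pulled out into a small helper that is easy to test on its own. The version below is a sketch, not part of the original project: the class SuggestionBuilderSketch and the sample values are hypothetical, and the null and blank handling is an added assumption about how missing data should be treated.

```java
import java.util.ArrayList;
import java.util.List;

public class SuggestionBuilderSketch {

    /** Builds completion inputs from a brand and a "/"-separated business string. */
    public static List<String> build(String brand, String business) {
        List<String> suggestion = new ArrayList<>();
        if (brand != null && !brand.isBlank()) {
            suggestion.add(brand.trim());
        }
        if (business != null) {
            // split returns a single element when there is no "/",
            // so no separate branch is needed for that case.
            for (String part : business.split("/")) {
                if (!part.isBlank()) {
                    suggestion.add(part.trim());
                }
            }
        }
        return suggestion;
    }

    public static void main(String[] args) {
        // Hypothetical sample values, for illustration only.
        System.out.println(build("ExampleBrand", "North Gate/Old Town"));
        // prints [ExampleBrand, North Gate, Old Town]
    }
}
```

Unlike the Arrays.asList branch in the constructor, this helper always returns a resizable list and does not throw when business is null.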

The method that runs the completion query:

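// Imports needed by the enclosing class:
// import org.elasticsearch.action.search.SearchRequest;
// import org.elasticsearch.action.search.SearchResponse;
// import org.elasticsearch.client.RequestOptions;
// import org.elasticsearch.search.suggest.Suggest;
// import org.elasticsearch.search.suggest.SuggestBuilder;
// import org.elasticsearch.search.suggest.SuggestBuilders;
// import org.elasticsearch.search.suggest.completion.CompletionSuggestion;
// import java.io.IOException;
// import java.util.ArrayList;
// import java.util.List;

@Override
public List<String> getSuggestions(String prefix) {
    try {
        // 1. Prepare the request
        SearchRequest request = new SearchRequest("hotel");
        // 2. Prepare the DSL
        request.source().suggest(new SuggestBuilder().addSuggestion(
                "suggestions",
                SuggestBuilders.completionSuggestion("suggestion")
                        .prefix(prefix)
                        .skipDuplicates(true)
                        .size(10)
        ));
        // 3. Send the request
        SearchResponse response = client.search(request, RequestOptions.DEFAULT);
        // 4. Parse the result
        Suggest suggest = response.getSuggest();
        // 4.1. Look up the completion result by the suggestion name
        CompletionSuggestion suggestions = suggest.getSuggestion("suggestions");
        // 4.2. Get the options
        List<CompletionSuggestion.Entry.Option> options = suggestions.getOptions();
        // 4.3. Collect the text of each option
        List<String> list = new ArrayList<>(options.size());
        for (CompletionSuggestion.Entry.Option option : options) {
            String text = option.getText().toString();
            list.add(text);
        }
        return list;
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}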
