Building an ES Search Service from Scratch (4): Pinyin Search

1. Introduction

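The previous article covered synonym search in ES, which made our search more capable, but that alone is not enough. In practice we may also want a search for "fanqie" to return results that contain "番茄" (tomato). That requires pinyin search, and this article shows how to set it up.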

2. Installing the ES pinyin plugin

2.1 About the pinyin plugin

GitHub: https://github.com/medcl/elasticsearch-analysis-pinyin

2.2 Installation steps

① Change to the ES bin directory

$ cd /usr/local/elasticsearch/bin/

② Install the pinyin plugin with the elasticsearch-plugin command (the plugin version must match the ES version; v5.5.3 is used here)

$ ./elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-pinyin/releases/download/v5.5.3/elasticsearch-analysis-pinyin-5.5.3.zip

③ After a successful install, an analysis-pinyin folder appears in the plugins directory

3. Custom analyzer

To use the pinyin plugin, the index must be created from a custom template, and that template must define a custom analyzer.

3.1 Configuration

① In the yb_knowledge.json template created in the previous article, edit the "settings" section and add the custom analyzer to it

"analysis": {
    "filter": {
        ...rest omitted...
        "pinyin_filter":{
            "type": "pinyin",
            "keep_first_letter": true,
            "keep_separate_first_letter": false,
            "keep_full_pinyin": true,
            "keep_joined_full_pinyin": true,
            "none_chinese_pinyin_tokenize": false,
            "remove_duplicated_term": true,
            "keep_original": true,
            "limit_first_letter_length": 50,
            "lowercase": true
        }
    },
    "analyzer": {
        ...rest omitted...
        "ik_synonym_pinyin": {
            "type": "custom",
            "tokenizer": "ik_smart",
            "filter": ["synonym_filter","pinyin_filter"]
        }
    }
}

Notes on the custom analyzer:

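First, declare a new token filter named pinyin_filter. Its type is pinyin, i.e. the pinyin plugin; the remaining options are described in the project's GitHub README.
Next, declare a new analyzer named ik_synonym_pinyin. Its type is custom; its tokenizer is ik_smart, the ik_smart segmentation mode of the IK analyzer; and filter lists the token filters to apply. Several filters can be chained, and here we use the synonym_filter from the previous article followed by the pinyin_filter defined above.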
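
To make the effect of the filter options more concrete, here is a rough Python simulation of what a pinyin token filter with the settings above might emit for a single term. It is only a sketch: the PINYIN table and the pinyin_tokens function are hypothetical and hand-written for this illustration, the real plugin uses its own full dictionary, and the order of the tokens it emits may differ.

```python
# Hypothetical, hand-written pinyin table covering only the example words;
# the real plugin ships its own full dictionary.
PINYIN = {
    "番": "fan", "茄": "qie",
    "西": "xi", "红": "hong", "柿": "shi",
}

def pinyin_tokens(term, keep_first_letter=True, keep_full_pinyin=True,
                  keep_joined_full_pinyin=True, keep_original=True,
                  limit_first_letter_length=50, lowercase=True,
                  remove_duplicated_term=True):
    """Approximate the tokens a pinyin filter emits for one Chinese term."""
    syllables = [PINYIN[ch] for ch in term]
    tokens = []
    if keep_full_pinyin:
        tokens.extend(syllables)                 # one token per character
    if keep_original:
        tokens.append(term)                      # the untouched input term
    if keep_joined_full_pinyin:
        tokens.append("".join(syllables))        # e.g. "fanqie"
    if keep_first_letter:
        initials = "".join(s[0] for s in syllables)
        tokens.append(initials[:limit_first_letter_length])  # e.g. "fq"
    if lowercase:
        tokens = [t.lower() for t in tokens]
    if remove_duplicated_term:
        # Drop repeated tokens while keeping first-seen order.
        tokens = list(dict.fromkeys(tokens))
    return tokens

print(pinyin_tokens("番茄"))
print(pinyin_tokens("西红柿"))
```

Because the analyzer keeps the original term next to its full pinyin, joined pinyin and initials, a query typed as Chinese characters, as "fanqie" or as "fq" has an indexed token it can match.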

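② At the same time, edit the properties section under "mappings" and add a fields parameter to the two searchable fields, knowledgeTitle and knowledgeContent. The fields parameter lets the same field be indexed in more than one way, turning a simple mapping into a multi-field mapping. Here we add a sub-field named pinyin that uses the ik_synonym_pinyin analyzer defined above.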

"mappings": {
    "knowledge": {
        ...rest omitted...
        "properties": {
            ...rest omitted...
            "knowledgeTitle": {
                "type": "text",
                "analyzer": "ik_synonym_max",
                "fields": {
                    "pinyin": {
                        "type": "text",
                        "analyzer": "ik_synonym_pinyin"
                    }
                }
            },
            "knowledgeContent": {
                "type": "text",
                "analyzer": "ik_synonym_max",
                "fields": {
                    "pinyin": {
                        "type": "text",
                        "analyzer": "ik_synonym_pinyin"
                    }
                }
            }
        }
    }
}
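
The article does not show a query against the new sub-fields, so the following is only a sketch of how one might look. It builds a standard multi_match request body that searches both the original fields and their pinyin sub-fields, which are addressed as knowledgeTitle.pinyin and knowledgeContent.pinyin. The helper name build_pinyin_query is an assumption made for this example.

```python
import json

def build_pinyin_query(keyword):
    """Build a multi_match query body over the plain and pinyin sub-fields."""
    return {
        "query": {
            "multi_match": {
                "query": keyword,
                "fields": [
                    "knowledgeTitle", "knowledgeTitle.pinyin",
                    "knowledgeContent", "knowledgeContent.pinyin",
                ],
            }
        }
    }

# Print the JSON body that would be sent to the index's _search endpoint.
print(json.dumps(build_pinyin_query("fanqie"), ensure_ascii=False, indent=2))
```

Searching the plain fields and the pinyin sub-fields together lets Chinese input keep using the ik_synonym_max analysis while pinyin input is served by the sub-fields.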

③ Finally, delete the previously created yb_knowledge index and restart Logstash so that the index is recreated from the updated template

Note: once the index has been rebuilt, the _analyze API can be used to check the analyzer output

curl -XGET 'http://localhost:9200/yb_knowledge/_analyze' -H 'Content-Type: application/json' -d '
{
    "analyzer":"ik_synonym_pinyin",
    "text":"番茄"
}'
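
The same check can also be scripted. The sketch below uses only the Python standard library and assumes ES is reachable at http://localhost:9200 with the yb_knowledge index in place. The function names build_analyze_request and analyze are made up for this example. The last two lines only build and print the request; nothing is sent until analyze is called.

```python
import json
import urllib.request

def build_analyze_request(text, analyzer="ik_synonym_pinyin",
                          base_url="http://localhost:9200",
                          index="yb_knowledge"):
    """Build (but do not send) an _analyze request for the given text."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def analyze(text, **kwargs):
    """Send the request and return the token strings (needs a running ES)."""
    with urllib.request.urlopen(build_analyze_request(text, **kwargs)) as resp:
        return [t["token"] for t in json.load(resp)["tokens"]]

req = build_analyze_request("番茄")
print(req.get_method(), req.full_url)
```

With a running cluster, analyze("番茄") would return the list of tokens produced by the analyzer.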

Note: with the synonym group "番茄, 西红柿, 圣女果" already configured, the analysis result is as follows

{
    "tokens": [
        {
            "token": "fan",
            "start_offset": 0,
            "end_offset": 2,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "番茄",
            "start_offset": 0,
            "end_offset": 2,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "fanqie",
            "start_offset": 0,
            "end_offset": 2,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "fq",
            "start_offset": 0,
            "end_offset": 2,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "qie",
            "start_offset": 0,
            "end_offset": 2,
            "type": "SYNONYM",
            "position": 1
        },
        {
            "token": "xi",
            "start_offset": 0,
            "end_offset": 2,
            "type": "SYNONYM",
            "position": 2
        },
        {
            "token": "hong",
            "start_offset": 0,
            "end_offset": 2,
            "type": "SYNONYM",
            "position": 3
        },
        {
            "token": "shi",
            "start_offset": 0,
            "end_offset": 2,
            "type": "SYNONYM",
            "position": 4
        },
        {
            "token": "西红柿",
            "start_offset": 0,
            "end_offset": 2,
            "type": "SYNONYM",
            "position": 4
        },
        {
            "token": "xihongshi",
            "start_offset": 0,
            "end_offset": 2,
            "type": "SYNONYM",
            "position": 4
        },
        {
            "token": "xhs",
            "start_offset": 0,
            "end_offset": 2,
            "type": "SYNONYM",
            "position": 4
        },
        {
            "token": "sheng",
            "start_offset": 0,
            "end_offset": 2,
            "type": "SYNONYM",
            "position": 5
        },
        {
            "token": "nv",
            "start_offset": 0,
            "end_offset": 2,
            "type": "SYNONYM",
            "position": 6
        },
        {
            "token": "guo",
            "start_offset": 0,
            "end_offset": 2,
            "type": "SYNONYM",
            "position": 7
        },
        {
            "token": "圣女果",
            "start_offset": 0,
            "end_offset": 2,
            "type": "SYNONYM",
            "position": 7
        },
        {
            "token": "shengnvguo",
            "start_offset": 0,
            "end_offset": 2,
            "type": "SYNONYM",
            "position": 7
        },
        {
            "token": "sng",
            "start_offset": 0,
            "end_offset": 2,
            "type": "SYNONYM",
            "position": 7
        }
    ]
}
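
A response this long is easier to read when the tokens are grouped by position. As an illustration, the sketch below does that for a trimmed copy of the response above, keeping only the first five tokens and only the token and position keys. The helper name tokens_by_position is an assumption made for this example.

```python
from collections import defaultdict

# Trimmed copy of the response above: first five tokens, two keys each.
response = {
    "tokens": [
        {"token": "fan", "position": 0},
        {"token": "番茄", "position": 0},
        {"token": "fanqie", "position": 0},
        {"token": "fq", "position": 0},
        {"token": "qie", "position": 1},
    ]
}

def tokens_by_position(resp):
    """Map each position to the list of tokens emitted at that position."""
    grouped = defaultdict(list)
    for tok in resp["tokens"]:
        grouped[tok["position"]].append(tok["token"])
    return dict(grouped)

grouped = tokens_by_position(response)
print(grouped)
```

Tokens that share a position, such as "番茄", "fanqie" and "fq" here, are alternative forms of the same term.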

4. Conclusion

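That completes pinyin search. This article and the previous one dealt only with ES plugins and the Logstash custom template, with no Java code involved. The next article will show how to highlight search results through the Java API.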

