Background
I needed to run aggregations in ES for statistical analysis, but the values of the aggregation field are Chinese, and ES's default analyzer handles Chinese very poorly: it splits a complete Chinese word into a series of individual characters and aggregates on those, which is clearly not what I wanted. Let's look at an example:
POST http://192.168.80.133:9200/my_index_name/my_type_name/_search
{
  "size": 0,
  "query": {
    "range": {
      "time": {
        "gte": 1513778040000,
        "lte": 1513848720000
      }
    }
  },
  "aggs": {
    "keywords": {
      "terms": {"field": "keywords"},
      "aggs": {
        "emotions": {
          "terms": {"field": "emotion"}
        }
      }
    }
  }
}
Response:
{ "took": 22, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 32, "max_score": 0.0, "hits": [] }, "aggregations": { "keywords": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [ { "key": "力", # 完整的词被拆分为独立的汉字 "doc_count": 2, "emotions": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [ { "key": -1, "doc_count": 1 }, { "key": 0, "doc_count": 1 } ] } }, { "key": "动", "doc_count": 2, "emotions": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [ { "key": -1, "doc_count": 1 }, { "key": 0, "doc_count": 1 } ] } } ] } }}
Since ES's default analyzer handles Chinese so poorly, is there an analyzer that does support Chinese? And if so, how do you use it?
For the first question, almighty Google had the answer: there is already an open-source analyzer that supports Chinese, IK Analysis for Elasticsearch (see https://github.com/medcl/elasticsearch-analysis-ik). In the spirit of not reinventing the wheel, I simply grabbed it to see how well it works. So how do you use the IK analyzer? It is just an ES plugin: install it and configure ES accordingly.

Installing the IK analyzer
My ES version is 2.4.1, so the IK version to download is 1.10.1 (note: you must download the IK version that matches your ES version, otherwise it will not work).
1. Download and build IK
wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v1.10.1/elasticsearch-analysis-ik-1.10.1.zip
unzip elasticsearch-analysis-ik-1.10.1.zip
cd elasticsearch-analysis-ik-1.10.1
mvn clean package
This generates the packaged plugin elasticsearch-analysis-ik-1.10.1.zip under elasticsearch-analysis-ik-1.10.1/target/releases.
2. Install the IK plugin into ES
Copy the plugin packaged above, elasticsearch-analysis-ik-1.10.1.zip, into the ES plugins directory and unzip it there.
unzip elasticsearch-analysis-ik-1.10.1.zip
rm -rf elasticsearch-analysis-ik-1.10.1.zip  # be sure to delete the zip after extracting, otherwise ES reports an error on startup
Restart ES.
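After the restart, it is worth confirming that ES actually loaded the plugin before going further. One quick check (assuming the same host as above) is the _cat API:

GET http://192.168.80.133:9200/_cat/plugins

Each node should list an analysis-ik entry; if it is missing, look for plugin-loading errors in the ES startup log.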
Using the IK analyzer
Once the IK analyzer is installed, it is ready to use in ES.
Step 1: Create the index
PUT http://192.168.80.133:9200/my_index_name
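With the index created, you can smoke-test the new analyzer before touching any mappings. A minimal sketch, again using the ES 2.x request-body form of _analyze:

POST http://192.168.80.133:9200/_analyze
{
  "analyzer": "ik_smart",
  "text": "动力"
}

If the plugin is working, the response should contain a single token 动力 instead of the two single-character tokens produced by the standard analyzer.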
Step 2: Add a mapping for the doc fields you will use
The docs I store in ES look like this:

{
  "nagtive_kw": [],
  "is_all": false,
  "emotion": 0,
  "focuce": false,
  "keywords": ["动力", "外观", "油耗"],  // the aggregation runs on the keywords field
  "source": "汽车之家",
  "time": -1,
  "machine_emotion": 0,
  "title": "no title",
  "spider": "qczj_index",
  "content": {},
  "url": "http://xxx",
  "brand": "宝马",
  "series": "宝马1系",
  "model": "2017款"
}
Since the aggregation runs on the keywords field, add a mapping that makes it use the IK analyzer:
POST http://192.168.80.133:9200/my_index_name/my_type_name/_mapping
{
  "properties": {
    "keywords": {  # make the keywords field use the IK analyzer
      "type": "string",
      "store": "no",
      "analyzer": "ik_smart",
      "search_analyzer": "ik_smart",
      "boost": 8
    }
  }
}
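To double-check that the mapping took effect, you can read it back with the standard mapping API (same host assumed):

GET http://192.168.80.133:9200/my_index_name/_mapping/my_type_name

The keywords field in the response should show "analyzer": "ik_smart".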
Note: there was a small hiccup while setting the mapping. Following the IK project's documentation, I first set the type of keywords to text, which failed:
POST http://192.168.80.133:9200/my_index_name/my_type_name/_mapping
{
  "properties": {
    "keywords": {
      "type": "text",  # the text type is not supported in 2.4.1
      "store": "no",
      "analyzer": "ik_smart",
      "search_analyzer": "ik_smart",
      "boost": 8
    }
  }
}
The error:
{ "error": { "root_cause": [ { "type": "mapper_parsing_exception", "reason": "No handler for type [text] declared on field [keywords]" } ], "type": "mapper_parsing_exception", "reason": "No handler for type [text] declared on field [keywords]" }, "status": 400}
This happens because my ES version is relatively old (2.4.1), while the text type was only introduced in ES 5.0, so it is not supported here. On ES 2.4.1 you have to use the string type instead.
Step 3: Index a doc
POST http://192.168.80.133:9200/my_index_name/my_type_name/
{
  "nagtive_kw": ["动力", "外观", "油耗"],
  "is_all": false,
  "emotion": 0,
  "focuce": false,
  "keywords": ["动力", "外观", "油耗"],  // the aggregation runs on the keywords field
  "source": "汽车之家",
  "time": -1,
  "machine_emotion": 0,
  "title": "从动次打次吃大餐",
  "spider": "qczj_index",
  "content": {},
  "url": "http://xxx",
  "brand": "宝马",
  "series": "宝马1系",
  "model": "2017款"
}
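One practical note: newly indexed docs only become visible to search and aggregations after the index refreshes, which by default happens roughly once per second. If the aggregation in the next step comes back empty while testing by hand, force a refresh first:

POST http://192.168.80.133:9200/my_index_name/_refresh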
Step 4: Run the aggregation
POST http://192.168.80.133:9200/my_index_name/my_type_name/_search
{
  "size": 0,
  "query": {
    "range": {
      "time": {
        "gte": 1513778040000,
        "lte": 1513848720000
      }
    }
  },
  "aggs": {
    "keywords": {
      "terms": {"field": "keywords"},
      "aggs": {
        "emotions": {
          "terms": {"field": "emotion"}
        }
      }
    }
  }
}
Response:
{ "took": 22, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 32, "max_score": 0.0, "hits": [] }, "aggregations": { "keywords": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [ { "key": "动力", # 完整的词没有被拆分为独立的汉字 "doc_count": 2, "emotions": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [ { "key": -1, "doc_count": 1 }, { "key": 0, "doc_count": 1 } ] } } ] } }}
References
How to install a Chinese analyzer (IK + pinyin) in Elasticsearch
A question about aggregations (aggs)
"No handler for type [text] declared on field [content]" when creating a mapping (#276)
Elasticsearch 2.4 study notes (3): a detailed guide to installing Elasticsearch 2.4 plugins