


2017-11-13 14:35 elk


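For this situation Elasticsearch provides two parameters: size and shard_size. Because Elasticsearch is distributed, an aggregation is computed independently on each shard, and the per-shard results are then merged into the final result. For example, suppose the size we want is 10 and shard_size is 100, the default given in this documentation passage:

The shard_size parameter limits how many top-scoring documents are collected in the sample processed on each shard. The default value is 100.

(That sentence comes from the documentation of the sampler aggregation. For the terms aggregation, the documented default shard_size is size * 1.5 + 10 when the request spans more than one shard.)

Suppose a term does not make the top 100 on some individual shards, yet would rank in the top 10 if the counts from all shards were added together. The merged result is then distorted: the term is undercounted or missing altogether. In that case we can raise shard_size to improve the accuracy of the overall result.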

Low-frequency terms can turn out to be the most interesting ones once all results are combined so the significant_terms aggregation can produce higher-quality results when the shard_size parameter is set to values significantly higher than the size setting. This ensures that a bigger volume of promising candidate terms are given a consolidated review by the reducing node before the final selection. Obviously large candidate term lists will cause extra network traffic and RAM usage so this is quality/cost trade off that needs to be balanced. If shard_size is set to -1 (the default) then shard_size will be automatically estimated based on the number of shards and the size parameter.
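The effect described above can be reproduced with a small simulation in plain Python. This is only a simplified model of the per-shard truncation and the final merge, not how Elasticsearch is implemented; the function names and the shard counts are made up for illustration.

```python
from collections import Counter

def shard_top_terms(shard_counts, shard_size):
    """Return the shard_size most frequent terms on one shard, with their counts."""
    return dict(Counter(shard_counts).most_common(shard_size))

def merged_top_terms(shards, size, shard_size):
    """Merge each shard's truncated list and return the overall top `size` terms."""
    merged = Counter()
    for shard_counts in shards:
        # Only the terms that survived the per-shard cut reach the merge step.
        merged.update(shard_top_terms(shard_counts, shard_size))
    return merged.most_common(size)

# Made-up term counts for three shards. Term "c" is never first on any shard,
# but its true total (4 + 4 + 4 = 12) is the highest of all terms.
shards = [
    {"a": 7, "c": 4, "b": 1},
    {"b": 6, "c": 4, "d": 1},
    {"d": 5, "c": 4, "a": 1},
]

# shard_size too small: "c" is dropped on every shard and "a" is undercounted.
print(merged_top_terms(shards, size=1, shard_size=1))  # [('a', 7)]

# shard_size large enough: every term reaches the merge step, so "c" wins.
print(merged_top_terms(shards, size=1, shard_size=3))  # [('c', 12)]
```

With shard_size=1 the winner is wrong, and its count is also understated (the true total for "a" is 8), which is exactly the distortion a larger shard_size is meant to reduce.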


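from elasticsearch import Elasticsearch

# Assumption: the original does not show how es is created; a local node is used here.
es = Elasticsearch("http://localhost:9200")

def GetTempCount():
    # Return the 10 most frequent tempid values seen in the last 24 hours.
    sresult = es.search(
        body={
            "query": {
                "bool": {
                    "must": {
                        "query_string": {
                            "analyze_wildcard": True,
                            "query": "'some words' AND _exists_:tempid"
                        }
                    },
                    "filter": {
                        "range": {"@timestamp": {"gte": "now-24h"}}
                    }
                }
            },
            "aggs": {
                "tempcount": {
                    "terms": {
                        "field": "tempid.keyword",
                        "size": 10,
                        # Large shard_size: better accuracy, more RAM and network traffic.
                        "shard_size": 300000
                    }
                }
            }
        }
    )
    return sresult["aggregations"]["tempcount"]["buckets"]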
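
To judge whether a given shard_size is large enough, it can help to look at two fields that a terms aggregation returns next to buckets: doc_count_error_upper_bound and sum_other_doc_count. The sketch below works on a plain dict shaped like sresult["aggregations"]["tempcount"]. The helper name summarize_terms_result and the sample values are made up for illustration.

```python
def summarize_terms_result(agg):
    """Summarize one terms-aggregation result given as a plain dict."""
    return {
        # (term, count) pairs in the order Elasticsearch returned them.
        "terms": [(b["key"], b["doc_count"]) for b in agg["buckets"]],
        # Upper bound on how far a returned doc_count may be off.
        "max_count_error": agg.get("doc_count_error_upper_bound"),
        # Documents that belong to terms not included in the buckets.
        "docs_outside_buckets": agg.get("sum_other_doc_count"),
    }

# Made-up response fragment, not real data.
sample = {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 1200,
    "buckets": [
        {"key": "t-001", "doc_count": 310},
        {"key": "t-002", "doc_count": 95},
    ],
}

print(summarize_terms_result(sample))
```

A doc_count_error_upper_bound of 0 suggests the returned counts are exact for this request; a larger value is a hint that raising shard_size may be worth the extra cost.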

