When using Elasticsearch, you sometimes need the terms aggregation to count and rank values, and the returned results can be off. Say, for example, you need to group a time window of documents by some id field, count how many times each id appears, and then sort by that count. Now suppose the id is close to a primary key: the values are nearly unique and almost all distinct, and in the vast majority of cases each value appears roughly the same number of times, with only a few exceptions (in statistical terms, the counts have very low variance, i.e. low dispersion). In that situation, counting occurrences and sorting by the counts can yield inaccurate results.
For this, ES provides two parameters: size and shard_size. Because Elasticsearch is distributed, an aggregation is computed independently on each shard, and the per-shard results are then merged into the final answer. Suppose, for instance, that the size we want is 10 and the default shard_size is 100. The documentation describes shard_size like this:
The shard_size parameter limits how many top-scoring documents are collected in the sample processed on each shard. The default value is 100.
If a term fails to make a shard's local top 100, it is simply absent from that shard's contribution, even when its combined count across all shards would put it in the true top 10. When that happens, the merged final result is distorted, and we can raise shard_size to improve the accuracy of the overall result. The Elasticsearch documentation for significant_terms spells out the same trade-off:
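To make the failure mode concrete, here is a small self-contained Python simulation (the shard contents and term names are made up for illustration, not taken from any real index): each shard reports only its local top shard_size terms, so a term that narrowly misses the cutoff on every shard can vanish from the merged result even though its global count belongs at the very top.

import random
from collections import Counter

random.seed(42)

# Hypothetical data: "t_hot" appears a few times on every shard, while
# other terms appear in bigger bursts on a single shard only.
num_shards = 5
shards = []
for s in range(num_shards):
    docs = ["t_hot"] * 8                                          # spread evenly
    docs += [f"t_{s}_{i}" for i in range(10) for _ in range(9)]   # shard-local terms
    random.shuffle(docs)
    shards.append(docs)

def merged_top(shards, size, shard_size):
    # Each shard keeps only its local top shard_size terms, then the
    # coordinating step merges those and takes the global top size.
    merged = Counter()
    for docs in shards:
        merged.update(dict(Counter(docs).most_common(shard_size)))
    return merged.most_common(size)

true_top = Counter(t for docs in shards for t in docs).most_common(3)
print("true top 3:        ", true_top)                      # t_hot leads with 40
print("with shard_size=10:", merged_top(shards, 3, 10))     # t_hot is dropped
print("with shard_size=99:", merged_top(shards, 3, 99))     # t_hot comes back

Raising shard_size far enough that every shard still reports t_hot restores the correct ranking, which is exactly what the real shard_size parameter controls.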
Low-frequency terms can turn out to be the most interesting ones once all results are combined so the significant_terms aggregation can produce higher-quality results when the shard_size parameter is set to values significantly higher than the size setting. This ensures that a bigger volume of promising candidate terms are given a consolidated review by the reducing node before the final selection. Obviously large candidate term lists will cause extra network traffic and RAM usage so this is quality/cost trade off that needs to be balanced. If shard_size is set to -1 (the default) then shard_size will be automatically estimated based on the number of shards and the size parameter.
In Python the query can be written like this:
from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes a cluster at the default address; adjust for your deployment

def GetTempCount():
    sresult = es.search(
        index="temp-*",
        body={
            "query": {
                "bool": {
                    "must": {
                        "query_string": {
                            "analyze_wildcard": "true",
                            "query": "'some words' AND _exists_:tempid"
                        }
                    },
                    # only look at the last 24 hours
                    "filter": {
                        "range": {
                            "@timestamp": {"gte": "now-24h"}
                        }
                    }
                }
            },
            "aggs": {
                "tempcount": {
                    "terms": {
                        "field": "tempid.keyword",
                        "size": 10,             # final number of buckets we want
                        "shard_size": 300000,   # much larger per-shard candidate list, for accuracy
                        "order": {"_count": "desc"}
                    }
                }
            }
        }
    )
    return sresult["aggregations"]["tempcount"]["buckets"]
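Assuming an Elasticsearch client is reachable (the index name, field, and query string above are from the original example), the function might be used like this; the actual bucket contents obviously depend on your data:

if __name__ == "__main__":
    # each bucket looks like {"key": <tempid>, "doc_count": <count>}
    for bucket in GetTempCount():
        print(bucket["key"], bucket["doc_count"])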