[ https://issues.apache.org/jira/browse/HIVE-12763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Pengcheng Xiong updated HIVE-12763:
-----------------------------------
    Attachment: aggrStatsPerformance.png

As per [~jpullokkaran]'s request, I tested the time and space complexity of aggrStats on my Mac.

In the attached plot, the x-axis is the number of partitions and the y-axis is the time taken, in ms, to aggregate the stats of those partitions. We can see that the aggrStats time increases as the number of partitions increases, but it still runs quite fast: 475ms for 1000 partitions. I cannot go beyond 1000 because my Mac dies when I increase it to 2000. So the time cost is pretty good, mainly because the operation we perform is simple (bitwise OR).

The space complexity is also good. There are 16 bit vectors, and each bit vector is an array of at most 31 integers, so one partition needs at most 16*31*4B (around 2KB). Multiply that by the number of partitions. In an extreme case of 1 million partitions, the total space is 16*31*4B*1M (around 2GB). This is the space we need if we want to store every bit vector in HBaseStore (without considering serialization). When we aggregate the partition stats one by one, we only need 16*31*4B*2 (around 4KB) of memory.

> Use bit vector to track NDV
> ---------------------------
>
>                 Key: HIVE-12763
>                 URL: https://issues.apache.org/jira/browse/HIVE-12763
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Pengcheng Xiong
>            Assignee: Pengcheng Xiong
>         Attachments: HIVE-12763.01.patch, HIVE-12763.02.patch,
> HIVE-12763.03.patch, HIVE-12763.04.patch, HIVE-12763.05.patch,
> aggrStatsPerformance.png
>
>
> This will improve merging of per-partition stats. It will also help merge
> NDV for auto-gather column stats.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
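The space estimate and the bitwise-OR merge described in the comment above can be sketched roughly as follows. This is only an illustration, not the code in the attached patches: the class name BitVectorSketch, the methods bytesFor and mergeInto, and the int[16][31] layout are hypothetical and chosen to match the numbers quoted in the comment. It also assumes that the factor of 2 in the one-by-one estimate stands for the running aggregate plus the partition currently being merged.

```java
// Hypothetical sketch; names and layout are assumptions, not Hive's implementation.
class BitVectorSketch {
    static final int NUM_VECTORS = 16;      // bit vectors per partition, as quoted in the comment
    static final int INTS_PER_VECTOR = 31;  // upper bound on integers per bit vector
    static final int BYTES_PER_INT = 4;

    // Bytes needed to hold the bit vectors of `partitions` partitions,
    // ignoring serialization overhead.
    static long bytesFor(long partitions) {
        return (long) NUM_VECTORS * INTS_PER_VECTOR * BYTES_PER_INT * partitions;
    }

    // Merge one partition's bit vectors into the running aggregate, in place,
    // using bitwise OR. Both arrays are assumed to have the same shape.
    static void mergeInto(int[][] aggregate, int[][] partition) {
        for (int v = 0; v < aggregate.length; v++) {
            for (int i = 0; i < aggregate[v].length; i++) {
                aggregate[v][i] |= partition[v][i];
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(bytesFor(1_000_000)); // 1984000000 bytes, around 2GB
        System.out.println(bytesFor(2));         // 3968 bytes, around 4KB

        int[][] aggregate = new int[NUM_VECTORS][INTS_PER_VECTOR];
        int[][] partition = new int[NUM_VECTORS][INTS_PER_VECTOR];
        aggregate[0][0] = 0b0101;
        partition[0][0] = 0b0011;
        mergeInto(aggregate, partition);
        System.out.println(Integer.toBinaryString(aggregate[0][0])); // 111
    }
}
```

Merging in place means only two sets of bit vectors are alive at any time, which is where the 4KB figure comes from under the assumption above.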