[ 
https://issues.apache.org/jira/browse/HIVE-12763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pengcheng Xiong updated HIVE-12763:
-----------------------------------
    Attachment: aggrStatsPerformance.png

As per [~jpullokkaran]'s request, I tested the time and space cost of 
aggrStats on my Mac. The x-axis is the number of partitions and the y-axis is 
the time taken, in ms, to aggregate the stats of those partitions. The 
aggrStats time increases with the number of partitions, but it runs quite 
fast: 475 ms for 1000 partitions. I could not go beyond 1000 because my Mac 
dies when I increase it to 2000. The time cost is good mainly because the 
operation we perform is simple (bitwise OR). The space cost is also good. We 
use 16 bit vectors, and each bit vector is an array of at most 31 integers, so 
one set of bit vectors takes at most 16*31*4B = 1984B. Multiply that by the 
number of partitions: in an extreme case of 1 million partitions, the total 
space is 16*31*4B*1M (around 2GB). This is the space we need if we want to 
store every bit vector in HBaseStore (without considering serialization). When 
we aggregate the partition stats one by one, we only need memory for 
16*31*4B*2 (around 4KB).
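
A minimal sketch of the arithmetic and the OR-based merge described above. It 
is written in Python for illustration only and is not the code in the patch. 
The list-of-lists layout and all names (empty_sketch, merge, aggregate, 
sketch_bytes) are hypothetical; only the constants 16, 31 and 4B come from 
the numbers in this comment.

```python
NUM_BIT_VECTORS = 16   # bit vectors per column, as stated in the comment
INTS_PER_VECTOR = 31   # upper bound on integers per bit vector
BYTES_PER_INT = 4      # the "4B" factor in the estimate


def empty_sketch():
    """One set of bit vectors with every bit cleared (assumed layout)."""
    return [[0] * INTS_PER_VECTOR for _ in range(NUM_BIT_VECTORS)]


def merge(a, b):
    """Combine two sets of bit vectors with an element-wise bitwise OR."""
    return [[x | y for x, y in zip(va, vb)] for va, vb in zip(a, b)]


def aggregate(partitions):
    """Fold partitions one at a time.

    Only the accumulator and the partition currently being merged are
    needed at once, which is where the factor of 2 in the memory
    estimate comes from in this sketch.
    """
    acc = empty_sketch()
    for sketch in partitions:
        acc = merge(acc, sketch)
    return acc


def sketch_bytes():
    """Upper bound on the raw size of one set of bit vectors, in bytes."""
    return NUM_BIT_VECTORS * INTS_PER_VECTOR * BYTES_PER_INT


print(sketch_bytes())              # bytes for one set of bit vectors
print(sketch_bytes() * 2)          # bytes held while aggregating one by one
print(sketch_bytes() * 1_000_000)  # bytes to store 1M partitions unserialized
```

Because OR is associative and commutative, the order in which partitions are 
folded does not change the merged bit vectors.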

> Use bit vector to track NDV
> ---------------------------
>
>                 Key: HIVE-12763
>                 URL: https://issues.apache.org/jira/browse/HIVE-12763
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Pengcheng Xiong
>            Assignee: Pengcheng Xiong
>         Attachments: HIVE-12763.01.patch, HIVE-12763.02.patch, 
> HIVE-12763.03.patch, HIVE-12763.04.patch, HIVE-12763.05.patch, 
> aggrStatsPerformance.png
>
>
> This will improve merging of per-partition stats. It will also help merge 
> NDV for auto-gathered column stats.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
