[ https://issues.apache.org/jira/browse/HIVE-9689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15171186#comment-15171186 ]
Rajdeep Surolia commented on HIVE-9689:
---------------------------------------

Hi Prasanth,

I am a Computer Science undergraduate student from Kolkata, India. I program in C/C++ and Java, and I have a good working knowledge of Apache Hadoop and Hive. I am very interested in Big Data and would like to work on projects related to it. I am also a GSoC '16 aspirant, and this looks like the right project for me. Could you tell me the prerequisites for this project? Your help will be much appreciated.

Cheers,
Rajdeep

> Store histograms and distinct value estimator's bit vectors in metastore
> ------------------------------------------------------------------------
>
>                 Key: HIVE-9689
>                 URL: https://issues.apache.org/jira/browse/HIVE-9689
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Prasanth Jayachandran
>              Labels: gsoc, gsoc2015, hive, java
>
> Hive currently uses the PCSA (Probabilistic Counting and Stochastic Averaging)
> algorithm to determine distinct cardinality. The NDV value determined from
> the UDF is stored in the metastore instead of the actual bit vectors. This
> makes it impossible to estimate the overall NDV across all the partitions (or
> selected partitions). We should ideally store the bit vectors in the metastore
> and do server-side merging of the bit vectors. We could also replace the
> current PCSA algorithm with HyperLogLog if space is a constraint.
> Hive also has a UDF for computing histograms. We can persist the histogram in
> the metastore so that the Hive optimizer can make better decisions. Having
> histograms in the metastore can also help with order by, skew join, and
> count distinct + group by optimizations.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
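
To illustrate why the issue asks for the bit vectors rather than the final NDV to be stored, here is a minimal Java sketch of a PCSA-style estimator. It is not Hive's implementation: the class name `PcsaSketch`, the bucket count, and the hash mixer are assumptions made for this example. The point it shows is that two per-partition bit vectors can be merged with a bitwise OR and then give the same estimate as a sketch built over both partitions, whereas two stored NDV numbers cannot be combined because the overlap between partitions is unknown.

```java
import java.util.Arrays;

/** Simplified PCSA-style distinct-value sketch (illustrative, not Hive's code). */
class PcsaSketch {
    // Correction constant from Flajolet and Martin's probabilistic counting.
    private static final double PHI = 0.77351;

    // One 32-bit bitmap per bucket; bit r is set if some value had rank r.
    private final int[] bitmaps;

    PcsaSketch(int numBuckets) {
        if (numBuckets <= 0) {
            throw new IllegalArgumentException("numBuckets must be positive");
        }
        bitmaps = new int[numBuckets];
    }

    /** Records one value: the hash picks a bucket and a bit position. */
    void add(Object value) {
        long h = mix(value.hashCode());
        int bucket = (int) Long.remainderUnsigned(h, bitmaps.length);
        long rest = Long.divideUnsigned(h, bitmaps.length);
        // Rank = number of trailing zero bits, capped to fit a 32-bit bitmap.
        int rank = Math.min(Long.numberOfTrailingZeros(rest), 31);
        bitmaps[bucket] |= 1 << rank;
    }

    /** Merges two sketches of the same size with a bitwise OR per bucket. */
    PcsaSketch merge(PcsaSketch other) {
        if (other.bitmaps.length != bitmaps.length) {
            throw new IllegalArgumentException("bucket counts differ");
        }
        PcsaSketch out = new PcsaSketch(bitmaps.length);
        for (int i = 0; i < bitmaps.length; i++) {
            out.bitmaps[i] = bitmaps[i] | other.bitmaps[i];
        }
        return out;
    }

    /** Estimates NDV from the mean position of the lowest unset bit. */
    long estimate() {
        long sum = 0;
        for (int b : bitmaps) {
            sum += Integer.numberOfTrailingZeros(~b); // index of lowest zero bit
        }
        double mean = (double) sum / bitmaps.length;
        return Math.round(bitmaps.length / PHI * Math.pow(2, mean));
    }

    /** True if both sketches hold identical bit vectors. */
    boolean sameBits(PcsaSketch other) {
        return Arrays.equals(bitmaps, other.bitmaps);
    }

    // SplitMix64 finalizer, used here only to spread hashCode() bits.
    private static long mix(long x) {
        x += 0x9E3779B97F4A7C15L;
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        return x ^ (x >>> 31);
    }
}
```

In this sketch, building one `PcsaSketch` per partition and calling `merge` yields exactly the bit vectors that a single sketch over all rows would hold, so the merged `estimate()` already accounts for values shared between partitions. The estimate itself is approximate, and a persisted version would also need a serialization format for the bitmaps, which is out of scope here.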