[ 
https://issues.apache.org/jira/browse/HIVE-9689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15171186#comment-15171186
 ] 

Rajdeep Surolia commented on HIVE-9689:
---------------------------------------

Hi Prasanth,
I am a Computer Science undergraduate student from Kolkata, India. I program in 
C/C++ and Java, and I have a good working knowledge of Apache Hadoop and Hive. 
I am very interested in Big Data and would like to work on projects related to 
it. I am also a GSoC '16 aspirant, and this looks like the right project for 
me. Could you tell me the prerequisites for this project? Your help would be 
much appreciated.

Cheers,
Rajdeep


> Store histograms and distinct value estimator's bit vectors in metastore
> ------------------------------------------------------------------------
>
>                 Key: HIVE-9689
>                 URL: https://issues.apache.org/jira/browse/HIVE-9689
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Prasanth Jayachandran
>              Labels: gsoc, gsoc2015, hive, java
>
> Hive currently uses the PCSA (Probabilistic Counting with Stochastic 
> Averaging) algorithm to estimate distinct cardinality. Only the NDV (number 
> of distinct values) computed by the UDF is stored in the metastore, not the 
> underlying bit vectors. Because partitions can share values, per-partition 
> NDVs cannot simply be added, so it is impossible to estimate the overall NDV 
> across all partitions (or a selected subset of them). Ideally, we should 
> store the bit vectors in the metastore and merge them on the server side. We 
> could also replace the current PCSA algorithm with HyperLogLog if space is a 
> constraint. 
> Hive also has a UDF for computing histograms. We can persist the histograms 
> in the metastore so that the Hive optimizer can make better decisions. 
> Histograms in the metastore can also help with order by, skew join, and 
> count distinct + group by optimizations.
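
The merging idea in the description can be illustrated with a small sketch.
This is not Hive's implementation: the class name PcsaSketch, the hash mixer,
and the 32-bit bitmap width are all assumptions chosen for illustration. It
only shows why stored bit vectors can be combined across partitions with a
bitwise OR, whereas stored NDV numbers cannot be combined at all.

```java
/**
 * Illustrative PCSA-style sketch (hypothetical class, not Hive code).
 * Each bucket holds one 32-bit bitmap; merging two sketches is a bitwise OR.
 */
final class PcsaSketch {
    // Correction constant from the Flajolet-Martin probabilistic counting analysis.
    private static final double PHI = 0.77351;

    private final int[] bitmaps;

    PcsaSketch(int numBitmaps) {
        if (numBitmaps <= 0 || Integer.bitCount(numBitmaps) != 1) {
            throw new IllegalArgumentException("numBitmaps must be a power of two");
        }
        this.bitmaps = new int[numBitmaps];
    }

    // Stand-in 64-bit hash (SplitMix64 finalizer); any well-mixed hash would do.
    private static long mix(long x) {
        x += 0x9E3779B97F4A7C15L;
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        return x ^ (x >>> 31);
    }

    /** Records one value: low hash bits pick the bucket, the rest pick the bit. */
    void add(long value) {
        long h = mix(value);
        int bucket = (int) (h & (bitmaps.length - 1));
        long rest = h >>> Integer.numberOfTrailingZeros(bitmaps.length);
        int bit = Math.min(Long.numberOfTrailingZeros(rest), 31);
        bitmaps[bucket] |= 1 << bit;
    }

    /** Merges another sketch into this one; equivalent to having seen both inputs. */
    void merge(PcsaSketch other) {
        if (other.bitmaps.length != bitmaps.length) {
            throw new IllegalArgumentException("sketch sizes differ");
        }
        for (int i = 0; i < bitmaps.length; i++) {
            bitmaps[i] |= other.bitmaps[i];
        }
    }

    /** NDV estimate: (m / PHI) * 2^(mean index of the lowest unset bit). */
    double estimate() {
        long sum = 0;
        for (int b : bitmaps) {
            sum += Integer.numberOfTrailingZeros(~b);
        }
        double mean = (double) sum / bitmaps.length;
        return bitmaps.length / PHI * Math.pow(2.0, mean);
    }

    /** Returns a copy of the bit vectors, i.e. what would be persisted. */
    int[] bits() {
        return bitmaps.clone();
    }
}
```

If one sketch is built per partition and the sketches are then merged, the
resulting bit vectors are identical to those of a single sketch built over the
union of the partitions, so overlapping values are not double counted. That
property is what server-side merging in the metastore would rely on.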



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
