Are you trying to add HLL UDAF for hive? If so recent versions of Hive already 
has an implementation of HLL++ which does not need bitset.
https://github.com/apache/hive/tree/master/standalone-metastore/src/main/java/org/apache/hadoop/hive/common/ndv/hll

Also the bloom filter implementation in hive uses the raw bitset backed by 
long[] which seems to be fastest.
https://github.com/apache/hive/blob/master/storage-api/src/java/org/apache/hive/common/util/BloomFilter.java#L330

Thanks
Prasanth

On Dec 11, 2017, at 8:25 PM, Nitin Vijayvargiya 
<nitinvija...@gmail.com<mailto:nitinvija...@gmail.com>> wrote:

Hi Prasanth,

Thanks, that was exactly what I was looking for. My main concern is speed, so I 
tried going with the brickhouse implementation of HLL+, and ended up having to 
make minor modifications to the code in order to have it run. My only concern 
is that the precision check tests don't always pass. Given that I had to 
comment out the @Ignore("not ready yet") over the test class, I'm unsure 
whether this code is safe to use. Anyone else had this problem?

Thanks,
Nitin




On Mon, Dec 11, 2017 5:20 PM, Prasanth Jayachandran 
pjayachand...@hortonworks.com<mailto:pjayachand...@hortonworks.com> wrote:
I did performance benchmark for roaring bitmaps when I added bloomfilters 
(hyperloglog also shares the same bitset impl) to Orc and Hive.
I found that roaring bitmap is good at compression at the cost of speed. In a 
JMH benchmark, observed around ~10x slowdown during insert and probe when using 
roaring bitmap vs the raw bitset (long array without compression).
So it essentially boils down to speed vs space tradeoff.

Thanks
Prasanth

On Dec 11, 2017, at 3:58 PM, Nitin Vijayvargiya 
<nitinvija...@gmail.com<mailto:nitinvija...@gmail.com>> wrote:

Hi David,

Thanks for the response. Yea, bloom filters are mostly for existential checks. 
I'm looking for a way to preprocess data, and then perform operations like 
union/intersection between them to find counts. Example: Number of distinct 
users visiting website A over the last 5 days (union), intersected with the 
number of distinct visitors visiting website B over the last 10 days (union).

Hyperloglog is the right tool for this, but if someone has done performance 
benchmarking between HLL and Roaring BitMap, it would save me a lot of time.

Thanks,
Nitin




On Fri, Dec 8, 2017 7:08 PM, David Capwell 
dcapw...@gmail.com<mailto:dcapw...@gmail.com> wrote:
Think bloom filter that's more dynamic.  It works well when cardinality is low, 
but grows quickly to out cost bloom filter as cardinality grows.

This data structure supports existence queries, but your email sounds like you 
want count.  If so not really the best fit.

On Dec 8, 2017 5:00 PM, "Nitin Vijayvargiya" 
<nitinvija...@gmail.com<mailto:nitinvija...@gmail.com>> wrote:
Hi all,

I'm working on speeding up distinct count calculations, and it looks like 
roaring bitmaps (RB) is the newest and meanest way for set operations. Anyone 
here have experience with them? How was the performance compared to hyperloglog 
and EWAH? A quick google search showed me that it's easier to find UDF 
implementations of hyperloglog in presto and hive, but if the hype is real, it 
might be worth spending the time to incorporate RB. Also, if anyone can point 
me to reliable implementations of UDFs using RB, I would love to check it out 
and test it myself =)

Happy Holidays!

Nitin


Reply via email to