Re: Roaring Bitmap UDFs

2018-02-05 Thread Makoto Yui
Approximate Counting using HLL+ is supported in Apache Hivemall. http://hivemall.incubator.apache.org/userguide/misc/approx.html FYI Makoto -- Makoto YUI Research Engineer, Treasure Data, Inc. http://myui.github.io/

Re: Roaring Bitmap UDFs

2017-12-12 Thread Nitin Vijayvargiya
Sounds like you've done all the heavy lifting for me! Unfortunately, it looks like the feature was added in hive 3.0.0, so I'm not going to be able to leverage that yet. Also, I'll need to merge the HLL structures in presto for querying purposes, so I'll need to use the same HLL structure for prest

Re: Roaring Bitmap UDFs

2017-12-11 Thread Prasanth Jayachandran
Are you trying to add HLL UDAF for hive? If so recent versions of Hive already has an implementation of HLL++ which does not need bitset. https://github.com/apache/hive/tree/master/standalone-metastore/src/main/java/org/apache/hadoop/hive/common/ndv/hll Also the bloom filter implementation in hiv

Re: Roaring Bitmap UDFs

2017-12-11 Thread Nitin Vijayvargiya
Hi Prasanth, Thanks, that was exactly what I was looking for. My main concern is speed, so I tried going with the brickhouse implementation of HLL+, and ended up having to make minor modifications to the code in order to have it run. My only concern is that the precision check tests don't always pa

Re: Roaring Bitmap UDFs

2017-12-11 Thread Prasanth Jayachandran
I did performance benchmark for roaring bitmaps when I added bloomfilters (hyperloglog also shares the same bitset impl) to Orc and Hive. I found that roaring bitmap is good at compression at the cost of speed. In a JMH benchmark, observed around ~10x slowdown during insert and probe when using

Re: Roaring Bitmap UDFs

2017-12-11 Thread Nitin Vijayvargiya
Hi David, Thanks for the response. Yea, bloom filters are mostly for existential checks. I'm looking for a way to preprocess data, and then perform operations like union/intersection between them to find counts. Example: Number of distinct users visiting website A over the last 5 days (union), inte

Re: Roaring Bitmap UDFs

2017-12-08 Thread David Capwell
Think bloom filter that's more dynamic. It works well when cardinality is low, but grows quickly to out cost bloom filter as cardinality grows. This data structure supports existence queries, but your email sounds like you want count. If so not really the best fit. On Dec 8, 2017 5:00 PM, "Niti

Roaring Bitmap UDFs

2017-12-08 Thread Nitin Vijayvargiya
Hi all, I'm working on speeding up distinct count calculations, and it looks like roaring bitmaps (RB) is the newest and meanest way for set operations. Anyone here have experience with them? How was the performance compared to hyperloglog and EWAH? A quick google search showed me that it's easier