Re: Roaring Bitmap UDFs

2017-12-11 Thread Prasanth Jayachandran
Are you trying to add an HLL UDAF for Hive? If so, recent versions of Hive already have an implementation of HLL++ that does not need a bitset. https://github.com/apache/hive/tree/master/standalone-metastore/src/main/java/org/apache/hadoop/hive/common/ndv/hll Also the bloom filter implementation in hiv

Re: Roaring Bitmap UDFs

2017-12-11 Thread Nitin Vijayvargiya
Hi Prasanth, Thanks, that was exactly what I was looking for. My main concern is speed, so I tried going with the brickhouse implementation of HLL+, and ended up having to make minor modifications to the code in order to have it run. My only concern is that the precision check tests don't always pa

Re: Roaring Bitmap UDFs

2017-12-11 Thread Prasanth Jayachandran
I ran a performance benchmark for roaring bitmaps when I added bloom filters (HyperLogLog also shares the same bitset impl) to ORC and Hive. I found that roaring bitmaps compress well at the cost of speed. In a JMH benchmark, I observed roughly a 10x slowdown during insert and probe when using

Re: Roaring Bitmap UDFs

2017-12-11 Thread Nitin Vijayvargiya
Hi David, Thanks for the response. Yeah, bloom filters are mostly for existence checks. I'm looking for a way to preprocess data, and then perform operations like union/intersection between them to find counts. Example: Number of distinct users visiting website A over the last 5 days (union), inte
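
To make the union/intersection counting described above concrete, here is a minimal sketch in Java. It uses the standard `java.util.BitSet` as a stand-in for a compressed bitmap such as Roaring, and it assumes user IDs have already been mapped to small non-negative integers. The class name, method names, and sample data (including "site B") are hypothetical and are not taken from Hive, brickhouse, or the RoaringBitmap library.

```java
import java.util.BitSet;

public class DistinctUserCounts {

    /** Builds a bitmap with one bit set per user ID (assumed small, non-negative ints). */
    public static BitSet bitmapOf(int... userIds) {
        BitSet bits = new BitSet();
        for (int id : userIds) {
            bits.set(id);
        }
        return bits;
    }

    /** Returns a new bitmap holding every ID present in at least one input. */
    public static BitSet union(BitSet... bitmaps) {
        BitSet acc = new BitSet();
        for (BitSet b : bitmaps) {
            acc.or(b); // in-place OR on the accumulator only; inputs are untouched
        }
        return acc;
    }

    /** Returns a new bitmap holding only the IDs present in both inputs. */
    public static BitSet intersection(BitSet a, BitSet b) {
        BitSet copy = (BitSet) a.clone(); // clone so the caller's bitmap is not modified
        copy.and(b);
        return copy;
    }

    public static void main(String[] args) {
        // Hypothetical preprocessed data: one bitmap per site per day.
        BitSet siteADay1 = bitmapOf(1, 2, 3);
        BitSet siteADay2 = bitmapOf(3, 4);
        BitSet siteB = bitmapOf(2, 4, 9);

        // Distinct users on site A across both days (union).
        BitSet siteA = union(siteADay1, siteADay2);
        System.out.println("distinct users on A: " + siteA.cardinality());

        // Of those, users who also appear in the site B bitmap (intersection).
        BitSet both = intersection(siteA, siteB);
        System.out.println("users on A and B: " + both.cardinality());
    }
}
```

The per-day bitmaps are the preprocessed artifact; the counts come from `cardinality()` after combining them. Unlike an HLL sketch, this gives exact counts, at the cost of memory proportional to the largest ID rather than a fixed sketch size.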