Re: HyperLogLogUDT

2015-09-13 Thread Yin Huai
opinions on this? Kind regards, Herman van Hövell tot Westerflier QuestTec B.V.

Re: HyperLogLogUDT

2015-09-12 Thread Nick Pentreath
orp hvanhov...@questtec.nl +599 9 521 4402 2015-09-12 10:07 GMT+02:00 Nick Pentreath: Inspired by this

Re: HyperLogLogUDT

2015-09-12 Thread Yin Huai
: Inspired by this post: http://eugenezhulenev.com/blog/2015/07/15/interactive-audience-analytics-with-spark-and-hyperloglog/, I've started putting together something based on the Spark 1.5 UDAF

Re: HyperLogLogUDT

2015-09-12 Thread Nick Pentreath
interface: https://gist.github.com/MLnick/eca566604f2e4e3c6141 Some questions - 1. How do I get the UDAF to accept input arguments of different type? We can hash anything basical

Re: HyperLogLogUDT

2015-09-12 Thread Herman van Hövell tot Westerflier
seems we'd need to build a new UDAF for each input type, which seems strange - I should be able to use one UDAF that can handle raw input of different types, as well as handle existing HLLs that can be merged/aggregated (e.g. for grouped dat

Re: HyperLogLogUDT

2015-09-12 Thread Nick Pentreath
ould I ensure this works for Tungsten (ie against raw bytes in memory)? Or does the new Aggregate2 stuff automatically do that? Where should I look for examples on how this works internally? 3. I've based this on the Sum and Avg examples for the new UDAF

Re: HyperLogLogUDT

2015-09-12 Thread Nick Pentreath
HLLs that can be merged/aggregated (e.g. for grouped data) 2. @Reynold, how would I ensure this works for Tungsten (ie against raw bytes in memory)? Or does the new Aggregate2 stuff automatically do that? Where should I look for examples on how this works internally?

Re: HyperLogLogUDT

2015-09-12 Thread Herman van Hövell tot Westerflier
r examples on how this works internally? 3. I've based this on the Sum and Avg examples for the new UDAF interface - any suggestions or issue please advise. Is the intermediate buffer efficient? 4. The current HyperLogLogUDT is private - so I've had to make my own one

Re: HyperLogLogUDT

2015-09-12 Thread Nick Pentreath
ernally? 3. I've based this on the Sum and Avg examples for the new UDAF interface - any suggestions or issue please advise. Is the intermediate buffer efficient? 4. The current HyperLogLogUDT is private - so I've had to make my own one which is a bit pointless as it's copy-pasted. Any

Re: HyperLogLogUDT

2015-07-01 Thread Reynold Xin
also very easily do arbitrary aggregates (say monthly, annually) and still be able to get a unique count for that period by merging the daily HLLS. I did this a while back as a Hive UDAF (https://github.com/ML

Re: HyperLogLogUDT

2015-07-01 Thread Nick Pentreath
annually) and still be able to get a unique count for that period by merging the daily HLLS. I did this a while back as a Hive UDAF (https://github.com/MLnick/hive-udf) which returns a Struct field containing a "cardinality"

Re: HyperLogLogUDT

2015-07-01 Thread Daniel Darabos
"cardinality" field and a "binary" field containing the serialized HLL. I was wondering if there would be interest in something like this? I am not so clear on how UDTs work with regards to SerDe - so could one adapt

Re: HyperLogLogUDT

2015-07-01 Thread Nick Pentreath
t so clear on how UDTs work with regards to SerDe - so could one adapt the HyperLogLogUDT to be a Struct with the serialized HLL as a field as well as count as a field? Then I assume this would automatically play nicely with DataFrame I/O etc. The gotcha is one needs to then call

HyperLogLogUDT

2015-06-23 Thread Nick Pentreath
ry" field containing the serialized HLL. I was wondering if there would be interest in something like this? I am not so clear on how UDTs work with regards to SerDe - so could one adapt the HyperLogLogUDT to be a Struct with the serialized HLL as a field as well as count as a field?