>>>>>>> opinions on this?
>>>>>>>
>>>>>>> Kind regards,
>>>>>>>
>>>>>>> Herman van Hövell tot Westerflier
>>>>>>>
>>>>>>> QuestTec B.V.
>>>>>>> hvanhov...@questtec.nl
>>>>>>> +599 9 521 4402

2015-09-12 10:07 GMT+02:00 Nick Pentreath :

Inspired by this post:
http://eugenezhulenev.com/blog/2015/07/15/interactive-audience-analytics-with-spark-and-hyperloglog/,
I've started putting together something based on the Spark 1.5 UDAF
interface: https://gist.github.com/MLnick/eca566604f2e4e3c6141
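
For reference, the rough shape is something like the sketch below (a
simplified version, not the exact gist code, and the class name is mine).
It uses stream-lib's HyperLogLogPlus, which Spark already depends on:

import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class HLLCountDistinct(precision: Int = 12) extends UserDefinedAggregateFunction {

  // Single string input column for now - see question 1 below.
  def inputSchema: StructType = StructType(StructField("value", StringType) :: Nil)

  // The intermediate buffer holds the serialized HLL.
  def bufferSchema: StructType = StructType(StructField("hll", BinaryType) :: Nil)

  def dataType: DataType = LongType
  def deterministic: Boolean = true

  private def read(bytes: Array[Byte]): HyperLogLogPlus =
    HyperLogLogPlus.Builder.build(bytes)

  def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = new HyperLogLogPlus(precision).getBytes

  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) {
      val hll = read(buffer.getAs[Array[Byte]](0))
      hll.offer(input.getString(0))
      // Deserializing and reserializing per row is what makes me wonder
      // about buffer efficiency (question 3 below).
      buffer(0) = hll.getBytes
    }

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    val merged = read(buffer1.getAs[Array[Byte]](0))
    merged.addAll(read(buffer2.getAs[Array[Byte]](0)))
    buffer1(0) = merged.getBytes
  }

  def evaluate(buffer: Row): Any =
    read(buffer.getAs[Array[Byte]](0)).cardinality()
}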

Some questions -

1. How do I get the UDAF to accept input arguments of different types? We
can hash anything, basically. Currently it seems we'd need to build a new
UDAF for each input type, which seems strange - I should be able to use
one UDAF that can handle raw input of different types, as well as handle
existing HLLs that can be merged/aggregated (e.g. for grouped data). One
possible workaround is sketched after this list.
2. @Reynold, how would I ensure this works for Tungsten (i.e. against raw
bytes in memory)? Or does the new Aggregate2 stuff automatically do that?
Where should I look for examples on how this works internally?
3. I've based this on the Sum and Avg examples for the new UDAF interface
- if there are any suggestions or issues, please advise. Is the
intermediate buffer efficient?
4. The current HyperLogLogUDT is private - so I've had to make my own one,
which is a bit pointless as it's copy-pasted. Any thoughts?
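
On (1), one workaround I can think of - not sure it's the right approach -
is a small factory that stamps out a UDAF per input type, since stream-lib's
offer() will hash any object. A sketch (the names here are hypothetical):

import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

object HLLCountDistinct {
  def forType(dt: DataType, precision: Int = 12): UserDefinedAggregateFunction =
    new HLLCountDistinct(precision) {
      override def inputSchema: StructType =
        StructType(StructField("value", dt) :: Nil)
      override def update(buffer: MutableAggregationBuffer, input: Row): Unit =
        if (!input.isNullAt(0)) {
          val hll = HyperLogLogPlus.Builder.build(buffer.getAs[Array[Byte]](0))
          // Hash the raw value, whatever its type.
          hll.offer(input.get(0).asInstanceOf[AnyRef])
          buffer(0) = hll.getBytes
        }
    }
}

So HLLCountDistinct.forType(IntegerType) and so on - though ideally the
input type would be resolved at analysis time rather than by hand.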

With this approach one can also very easily do arbitrary aggregates (say
monthly, annually) and still be able to get a unique count for that period
by merging the daily HLLs.
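
For example (hypothetical names, given some `events` DataFrame with date,
segment and user_id columns, and assuming a variant of the UDAF whose
output is the serialized HLL rather than the count, plus a merging
counterpart):

import org.apache.spark.sql.functions.month
// assumes the usual `import sqlContext.implicits._` for $-columns

val hllSketch = new HLLSketch()     // hypothetical: dataType = BinaryType
val hllMerge = new HLLMergeCount()  // hypothetical: merges HLLs, returns LongType

val daily = events.groupBy($"date", $"segment")
  .agg(hllSketch($"user_id").as("users_hll"))

// Monthly uniques come from merging the daily sketches - no need to
// rescan the raw events.
val monthly = daily.groupBy(month($"date").as("month"), $"segment")
  .agg(hllMerge($"users_hll").as("unique_users"))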

I did this a while back as a Hive UDAF
(https://github.com/MLnick/hive-udf), which returns a Struct field
containing a "cardinality" field and a "binary" field containing the
serialized HLL.

I was wondering if there would be interest in something like this? I am
not so clear on how UDTs work with regards to SerDe - so could one adapt
the HyperLogLogUDT to be a Struct with the serialized HLL as a field as
well as count as a field? Then I assume this would automatically play
nicely with DataFrame I/O etc. The gotcha is one needs to then call ...
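
Concretely, the shape I have in mind (field names as in the Hive version;
just a sketch):

import org.apache.spark.sql.types._

// The UDAF's dataType would be this struct, so the serialized sketch
// round-trips through DataFrame I/O like any other struct column.
val hllResultType = StructType(Seq(
  StructField("cardinality", LongType),
  StructField("binary", BinaryType)
))

// evaluate() would then return Row(hll.cardinality(), hll.getBytes),
// and downstream queries could select the cardinality field directly, or
// feed the binary field back into a merge for further roll-ups.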