Hi,
I've implemented jobs using DataSketches in Kafka Streams, Flink streaming,
and Spark batch (through the provided Hive UDFs). Things went smoothly in
all setups, with one gotcha: the Hive UDFs represent incoming strings as
UTF-8 byte arrays (or something like that, I forget the details by now), so
if you're mixing sketches from two sources (Kafka Streams + Spark batch in
my case) you have to take care to cast the input items to the proper types
before adding them to the sketches.
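
To illustrate why the representation matters, here is a minimal sketch in
plain Python. It does not use the DataSketches API: ToyDistinctCounter and
item_hash are hypothetical stand-ins that keep an exact set of hashes, and
the choice of UTF-16 for the "streaming" side is only an assumption made to
show a mismatch. The point is that the same logical string, fed in as
different bytes, hashes to different values, so a union of the two sketches
counts it twice.

```python
import hashlib


def item_hash(data: bytes) -> int:
    """Stand-in for a sketch's internal hash: first 64 bits of SHA-256."""
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


class ToyDistinctCounter:
    """Exact set of item hashes. A toy stand-in, NOT the DataSketches API."""

    def __init__(self):
        self.hashes = set()

    def update(self, data: bytes) -> None:
        self.hashes.add(item_hash(data))

    def union(self, other: "ToyDistinctCounter") -> "ToyDistinctCounter":
        merged = ToyDistinctCounter()
        merged.hashes = self.hashes | other.hashes
        return merged

    def estimate(self) -> int:
        return len(self.hashes)


users = ["alice", "bob", "carol"]

# Assumed setup: the batch side sees strings as UTF-8 byte arrays.
batch = ToyDistinctCounter()
for user in users:
    batch.update(user.encode("utf-8"))

# Assumed setup: the streaming side feeds a different representation.
streaming = ToyDistinctCounter()
for user in users:
    streaming.update(user.encode("utf-16-le"))

# Same three users, but the union sees six distinct hashes.
mismatched = streaming.union(batch).estimate()

# Converting to one representation before updating fixes the union.
normalized = ToyDistinctCounter()
for user in users:
    normalized.update(user.encode("utf-8"))
consistent = normalized.union(batch).estimate()

print(mismatched, consistent)
```

With a real sketch the effect is the same in kind: the union overestimates
because identical items no longer collide. Which representation each
integration actually uses should be checked against its own documentation.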

A mailing list thread concerning that issue:
http://mail-archives.apache.org/mod_mbox/datasketches-users/202008.mbox/browser
(thread continues into September)

Regards,
Marko

On Tue, 6 Apr 2021 at 20:55, Jon Malkin <jmal...@apache.org> wrote:

> I'll echo what Ben said -- if a pre-existing solution does what you need,
> certainly use that.
>
> Having said that, I want to revisit frequent directions in light of the
> work Charlie did on using it for ridge regression. And when I asked
> internally I was told that Flink is where at least my company seems to be
> going for such jobs. So when I get a chance to dive into that, I'll be
> learning how to do it in Flink.
>
>   jon
>
> On Tue, Apr 6, 2021 at 11:26 AM Ben Krug <ben.k...@imply.io> wrote:
>
>> I can't answer about Spark or Flink, but as a druid person, I'll put in a
>> plug for druid for the "if necessary" case.  It can ingest from kafka and
>> aggregate and do sketches during ingestion.  (It's a whole new ballpark,
>> though, if you're not already using it.)
>>
>> On Tue, Apr 6, 2021 at 9:56 AM Alex Garland <agarl...@expediagroup.com>
>> wrote:
>>
>>> Hi
>>>
>>>
>>>
>>> New to DataSketches and looking forward to using it; it seems like a
>>> great library.
>>>
>>>
>>>
>>> My team are evaluating it to profile streaming data (in Kafka) in
>>> 5-minute windows. The obvious options for stream processing (given
>>> experience within our org) would be either Flink or Spark Streaming.
>>>
>>>
>>>
>>> Two questions:
>>>
>>>    - Would I be right in thinking that there are not existing
>>>    integrations as libraries for either of these platforms? Absolutely
>>>    fine if not, just confirming understanding.
>>>    - Is there any view (from either the maintainers or the wider
>>>    community) on whether either of those two are easier to integrate with
>>>    DataSketches? We would also consider other streaming platforms if
>>>    necessary, but as mentioned wider usage within the org would lean us
>>>    against that if at all possible.
>>>
>>>
>>>
>>> Many thanks
>>>
>>
