Re: Choice of Flink vs Spark for using DataSketches with streaming data

Alex Garland Thu, 08 Apr 2021 01:43:48 -0700

Thanks all very much for the responses so far. They are definitely useful, but 
I think it might help to narrow the focus if I explain a little more of the 
context of what we are trying to do.


Firstly, we want to emit the profile metrics as a stream (Kafka topic) as well, 
which I assume would mean we wouldn’t want to use Druid (which, if I understand 
correctly, is more in the spirit of a next-gen, low-latency analytics DB).

We are definitely interested in Flink, as it looks like a good route to a Kappa 
architecture with a single set of code handling profiling of both batch and 
stream data sets. I appreciate that some of the following may be more about 
Flink than DataSketches per se, but I will post it for the record.

I started looking at the Table/SQL API, as this seems to be encouraged for 
Kappa use cases. It looked like the required interface for user-defined 
aggregate functions in Flink SQL should allow wrapping the sketch objects as 
accumulators, but when we tried this in practice we ran into issues: Flink 
can’t extract a data type for CpcSketch, at least partly because it has 
private fields (i.e. seed).
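
As a rough illustration of one possible workaround (an assumption, not something confirmed in this thread): the accumulator the framework sees can be kept as plain bytes, with the sketch rebuilt from and written back to those bytes inside each aggregate call, so no type has to be extracted for the sketch class itself. The sketch below is plain Python rather than Flink code, to keep it self-contained. `TinyKmvSketch` and `DistinctCountAggregate` are hypothetical stand-ins, not DataSketches or Flink APIs, and the method names only loosely mirror a user-defined aggregate.

```python
import hashlib
import struct


class TinyKmvSketch:
    """Hypothetical stand-in for a mergeable distinct-count sketch.

    Keeps the k smallest 64-bit hashes (K-Minimum-Values). This is NOT the
    DataSketches API; it only mimics the update/merge/serialize life cycle.
    """

    def __init__(self, k=256, hashes=()):
        self.k = k
        self.hashes = set(sorted(set(hashes))[:k])

    def update(self, item):
        digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
        self.hashes.add(int.from_bytes(digest, "big"))
        if len(self.hashes) > self.k:
            self.hashes.remove(max(self.hashes))  # keep only the k smallest

    def merge(self, other):
        return TinyKmvSketch(self.k, self.hashes | other.hashes)

    def estimate(self):
        if len(self.hashes) < self.k:
            return float(len(self.hashes))  # exact while under capacity
        return (self.k - 1) * 2.0**64 / max(self.hashes)

    def to_bytes(self):
        ordered = sorted(self.hashes)
        return struct.pack(f">I{len(ordered)}Q", self.k, *ordered)

    @classmethod
    def from_bytes(cls, data):
        (k,) = struct.unpack_from(">I", data)
        count = (len(data) - 4) // 8
        return cls(k, struct.unpack_from(f">{count}Q", data, 4))


class DistinctCountAggregate:
    """Aggregate whose accumulator is opaque bytes (hypothetical shape)."""

    def create_accumulator(self):
        return TinyKmvSketch().to_bytes()

    def accumulate(self, acc, value):
        sketch = TinyKmvSketch.from_bytes(acc)
        sketch.update(value)
        return sketch.to_bytes()

    def merge(self, acc_a, acc_b):
        merged = TinyKmvSketch.from_bytes(acc_a).merge(TinyKmvSketch.from_bytes(acc_b))
        return merged.to_bytes()

    def get_value(self, acc):
        return TinyKmvSketch.from_bytes(acc).estimate()


agg = DistinctCountAggregate()
acc = agg.create_accumulator()
for value in ["a", "b", "a", "c"]:
    acc = agg.accumulate(acc, value)
print(agg.get_value(acc))
```

The trade-off in this pattern is that the sketch is deserialized and reserialized on every record, which a real job would probably want to avoid or at least measure.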

We’re next looking at whether this is easier using the DataStream API. It would 
be useful if anyone could confirm the following:

  *   Would I be right in thinking that where other people have integrated 
Flink and DataSketches, it has been using the DataStream API?
  *   Are there any good code examples publicly available (GitHub?) that might 
help steer/validate our approach?

In the longer term (later this year), one option we might consider is creating 
an OSS configurable library/framework for running checks based on DataSketches 
in Flink (we also need to see whether for example Bullet already covers a lot 
of what we need in terms of setting up stream queries). If anyone else feels 
there is a gap and might be interested in collaborating, please let me know and 
I can publish more details of what we’re proposing if and when that evolves.

Many thanks


From: Marko Mušnjak <marko.musn...@gmail.com>
Date: Tuesday, 6 April 2021 at 20:21
To: users@datasketches.apache.org <users@datasketches.apache.org>
Subject: [External] Re: Choice of Flink vs Spark for using DataSketches with 
streaming data
Hi,
I've implemented jobs using DataSketches in Kafka Streams, Flink streaming, and 
in Spark batch (through the Hive UDFs provided). Things went smoothly in all 
setups, with the gotcha that the Hive UDFs represent incoming strings as UTF-8 
byte arrays (or something like that, I forget the details by now), so if you're 
mixing sketches from two sources (Kafka Streams + Spark batch in my case) you 
have to take care to cast the input items to the proper types before adding 
them to sketches.

A mailing list thread concerning that issue: 
http://mail-archives.apache.org/mod_mbox/datasketches-users/202008.mbox/browser 
(thread continues into September)
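
The gotcha above can be illustrated in general terms: a sketch hashes the bytes it is given, so the same logical value fed in under two different representations looks like two distinct items. The sketch below uses only the Python standard library. The `blake2b` hash, the `user-42` value, and the UTF-16 encoding are illustrative assumptions, not what DataSketches or the Hive UDFs actually do internally.

```python
import hashlib


def item_hash(raw: bytes) -> int:
    """Stand-in for the hash a sketch applies to each incoming item."""
    return int.from_bytes(hashlib.blake2b(raw, digest_size=8).digest(), "big")


def distinct_hashes(representations):
    """Number of distinct hashes produced by a list of byte representations."""
    return len({item_hash(raw) for raw in representations})


user_id = "user-42"  # hypothetical value arriving from two pipelines

# Pipeline A feeds UTF-8 bytes; pipeline B feeds a different encoding.
from_pipeline_a = user_id.encode("utf-8")
from_pipeline_b = user_id.encode("utf-16-le")

# Mixed representations: one logical value is counted as two items.
mixed = distinct_hashes([from_pipeline_a, from_pipeline_b])

# Normalizing to one agreed representation before updating fixes the count.
normalized = distinct_hashes([
    from_pipeline_a,
    from_pipeline_b.decode("utf-16-le").encode("utf-8"),
])

print(mixed, normalized)
```

Whichever representation is chosen, the point is that every producer converts to it before updating a sketch, so that sketches from different sources can be merged meaningfully.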

Regards,
Marko

On Tue, 6 Apr 2021 at 20:55, Jon Malkin <jmal...@apache.org> wrote:
I'll echo what Ben said -- if a pre-existing solution does what you need, 
certainly use that.

Having said that, I want to revisit Frequent Directions in light of the work 
Charlie did on using it for ridge regression. And when I asked internally I was 
told that Flink is where at least my company seems to be going for such jobs. 
So when I get a chance to dive into that, I'll be learning how to do it in 
Flink.

  jon

On Tue, Apr 6, 2021 at 11:26 AM Ben Krug <ben.k...@imply.io> wrote:
I can't answer about Spark or Flink, but as a Druid person, I'll put in a plug 
for Druid for the "if necessary" case.  It can ingest from Kafka and aggregate 
and build sketches during ingestion.  (It's a whole new ballpark, though, if 
you're not already using it.)

On Tue, Apr 6, 2021 at 9:56 AM Alex Garland <agarl...@expediagroup.com> wrote:
Hi

I'm new to DataSketches and looking forward to using it; it seems like a great 
library.

My team are evaluating it to profile streaming data (in Kafka) in 5-minute 
windows. The obvious options for stream processing (given experience within our 
org) would be either Flink or Spark Streaming.
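
To make the 5-minute windowing concrete, here is a minimal sketch in plain Python of assigning events to tumbling windows and keeping one profile accumulator per window. It assumes each event is a hypothetical `(timestamp_ms, value)` pair, and it uses an exact set where a real job would hold a sketch. It is not Flink or Spark code and ignores late data and watermarks.

```python
from collections import defaultdict

WINDOW_MS = 5 * 60 * 1000  # 5-minute tumbling windows


def window_start(timestamp_ms, size_ms=WINDOW_MS):
    """Start of the tumbling window that contains timestamp_ms."""
    return timestamp_ms - (timestamp_ms % size_ms)


def profile_by_window(events):
    """Build one profile per window from (timestamp_ms, value) events."""
    counts = defaultdict(int)
    seen = defaultdict(set)  # exact set as a stand-in for a distinct-count sketch
    for timestamp_ms, value in events:
        start = window_start(timestamp_ms)
        counts[start] += 1
        seen[start].add(value)
    return {
        start: {"count": counts[start], "distinct": len(seen[start])}
        for start in sorted(counts)
    }


events = [(0, "a"), (299_999, "b"), (300_000, "a"), (610_000, "c"), (620_000, "c")]
for start, profile in profile_by_window(events).items():
    print(start, profile)
```

Because each window owns its own accumulator, swapping the set for a mergeable sketch would leave the windowing logic unchanged.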

Two questions:

  *   Would I be right in thinking that there are no existing integration 
libraries for either of these platforms? Absolutely fine if not, just 
confirming my understanding.
  *   Is there any view (from either the maintainers or the wider community) on 
whether either of those two are easier to integrate with DataSketches? We would 
also consider other streaming platforms if necessary, but as mentioned wider 
usage within the org would lean us against that if at all possible.

Many thanks
