Thanks Will and Marko

I don’t think we need to decrement/retract values for any reason, and our requirements, were we to use Flink SQL, would not currently involve the OVER syntax.

It seems today like we’ve managed to get the DataSketches CPC sketch integrated okay with an aggregate function in the vanilla Java (DataStream) Flink API, which would at least get things working for streaming use cases, if not completely smoothly for Kappa. We will be doing some further testing to confirm that when we properly distribute the process (versus a local integration test), serialization, for example, doesn’t throw up any surprises.
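
For reference, the aggregate we have working is roughly the following shape (a simplified sketch; the class name and constant are illustrative rather than our actual code):

    import org.apache.datasketches.cpc.CpcSketch;
    import org.apache.datasketches.cpc.CpcUnion;
    import org.apache.flink.api.common.functions.AggregateFunction;

    // Distinct-count aggregate backed by a CPC sketch. Flink has no built-in
    // serializer for CpcSketch, so the accumulator falls back to Kryo; that is
    // one of the things we still need to verify in a properly distributed run.
    public class CpcDistinctCount implements AggregateFunction<String, CpcSketch, Double> {

        private static final int LG_K = 11; // sketch precision parameter (2^11 buckets)

        @Override
        public CpcSketch createAccumulator() {
            return new CpcSketch(LG_K);
        }

        @Override
        public CpcSketch add(String value, CpcSketch acc) {
            acc.update(value);
            return acc;
        }

        @Override
        public Double getResult(CpcSketch acc) {
            return acc.getEstimate();
        }

        @Override
        public CpcSketch merge(CpcSketch a, CpcSketch b) {
            // CPC sketches merge through a CpcUnion rather than directly.
            CpcUnion union = new CpcUnion(LG_K);
            union.update(a);
            union.update(b);
            return union.getResult();
        }
    }

For the 5-minute windows mentioned earlier in the thread, this plugs in as, e.g., keyedStream.window(TumblingProcessingTimeWindows.of(Time.minutes(5))).aggregate(new CpcDistinctCount()).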

Given that the above seems to work, I think (still TBC) the issue with using it in Flink SQL may be around missing TypeInformation and/or hard constraints on what types of classes can be used as accumulators in the SQL abstraction.
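
If it does turn out to be the type-inference side, one thing we may try (untested; method names from the type inference section of the Flink UDF docs) is overriding getTypeInference() on the table aggregate function so the accumulator is declared as an opaque RAW type, rather than letting Flink reflect over CpcSketch:

    import org.apache.datasketches.cpc.CpcSketch;
    import org.apache.flink.table.api.DataTypes;
    import org.apache.flink.table.catalog.DataTypeFactory;
    import org.apache.flink.table.types.inference.TypeInference;
    import org.apache.flink.table.types.inference.TypeStrategies;

    // Inside the Table API AggregateFunction subclass: declare the accumulator
    // as a RAW type with a generic serializer, bypassing reflective extraction
    // of CpcSketch's private fields.
    @Override
    public TypeInference getTypeInference(DataTypeFactory typeFactory) {
        return TypeInference.newBuilder()
                .typedArguments(DataTypes.STRING())
                .accumulatorTypeStrategy(
                        TypeStrategies.explicit(typeFactory.createRawDataType(CpcSketch.class)))
                .outputTypeStrategy(TypeStrategies.explicit(DataTypes.DOUBLE()))
                .build();
    }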

Will try and find time to update this thread further on testing outcomes (and 
also if we do eventually get the SQL approach working).


From: Marko Mušnjak <marko.musn...@gmail.com>
Date: Thursday, 8 April 2021 at 17:35
To: users@datasketches.apache.org <users@datasketches.apache.org>
Subject: [External] Re: [E] Re: Choice of Flink vs Spark for using DataSketches 
with streaming data
The basic streaming windowed aggregations (in the Java/Scala API, 
https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/stream/operators/windows.html#aggregatefunction)
 don't require the retract method, but it looks like the SQL/Table API requires 
retract support for aggregate functions, if they need to be used in OVER 
windows: 
https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/functions/udfs.html#mandatory-and-optional-methods
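
For reference, the retract method the Table API runtime looks for has this shape (signature per the docs above; the body is just an illustration of the problem, since sketches cannot remove previously added items):

    // Only mandatory when the function is used in OVER windows. Sketches are
    // not subtractable, so a sketch-backed UDAF has little choice but to refuse.
    public void retract(CpcSketch acc, String value) {
        throw new UnsupportedOperationException(
                "CPC sketches cannot retract previously added items");
    }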

Marko

On Thu, 8 Apr 2021 at 17:56, Will Lauer <wla...@verizonmedia.com> wrote:
Last time I looked at the Flink API for implementing aggregators, it looked 
like it required a "decrement" function to remove entries from the aggregate in 
addition to the standard "aggregate" function to add entries to the aggregate. 
The documentation was unclear, but it looked like this was a requirement for 
getting streaming, windowed operations to work. I don't know if you are going 
to need that functionality for your Kappa architecture, but you should double 
check, as implementing such functionality with sketches may be impossible, and 
might pose a blocker for you.

It's possible that this has been resolved now. It's equally possible that I completely misunderstood the Flink doc, but it's something that jumped out at me and made me very nervous about reimplementing our current Pig-based pipeline in Flink, due to our use of sketches.

Will



Will Lauer

Senior Principal Architect, Audience & Advertising Reporting
Data Platforms & Systems Engineering

M 508 561 6427
1908 S. First St
Champaign, IL 61822


On Thu, Apr 8, 2021 at 3:43 AM Alex Garland <agarl...@expediagroup.com> wrote:
Thanks all very much for the responses so far. Definitely useful, but I think it might help to narrow focus if I explain a little more of the context of what we are trying to do.

Firstly, we want to emit the profile metrics as a stream (Kafka topic) as well, which I assume would mean we wouldn’t want to use Druid (which is more in the spirit of a next-gen/low-latency analytics DB, if I understand correctly?).

We are definitely interested in Flink as it looks like this may be a good route 
to create a Kappa architecture with a single set of code handling profiling of 
batch and stream data sets. Appreciate some of the following may be a bit more 
about Flink than DataSketches per se, but will post for the record.

I started looking at the Table/SQL API as this seems to be something that is being encouraged for Kappa use cases. It looked like the required interface for user-defined aggregate functions in Flink SQL should allow wrapping of the sketch objects as accumulators, but when we tried this in practice we hit issues – Flink can’t extract a data type for CpcSketch, at least partly due to it having private fields (e.g. the seed).
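
If the sketch itself can’t be accepted as an accumulator, one workaround we may try (an untested sketch; class names are ours) is holding the serialized sketch inside a plain POJO that Flink’s extraction can handle, at the cost of a heapify/serialize round trip on every call:

    import org.apache.datasketches.cpc.CpcSketch;
    import org.apache.flink.table.functions.AggregateFunction;

    public class CpcDistinctSql extends AggregateFunction<Double, CpcDistinctSql.CpcAcc> {

        // Public static POJO with public fields: safe for Flink's type extraction.
        public static class CpcAcc {
            public byte[] bytes; // compact CPC image, null until the first update
        }

        private static final int LG_K = 11;

        @Override
        public CpcAcc createAccumulator() {
            return new CpcAcc();
        }

        // Called per input row; costly because the sketch is rehydrated and
        // reserialized on every call.
        public void accumulate(CpcAcc acc, String value) {
            CpcSketch sketch = acc.bytes == null
                    ? new CpcSketch(LG_K)
                    : CpcSketch.heapify(acc.bytes);
            sketch.update(value);
            acc.bytes = sketch.toByteArray();
        }

        @Override
        public Double getValue(CpcAcc acc) {
            return acc.bytes == null ? 0.0 : CpcSketch.heapify(acc.bytes).getEstimate();
        }
    }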

We’re next looking at whether this is easier using the DataStream API; if anyone can confirm the following, it would be useful:

  *   Would I be right in thinking that where other people have integrated Flink and DataSketches, it has been using the DataStream API?
  *   Are there any good code examples publicly available (GitHub?) that might help steer/validate our approach?

In the longer term (later this year), one option we might consider is creating a configurable OSS library/framework for running checks based on DataSketches in Flink (we also need to see whether, for example, Bullet already covers a lot of what we need in terms of setting up stream queries). If anyone else feels there is a gap and might be interested in collaborating, please let me know and I can publish more details of what we’re proposing if and when that evolves.

Many thanks


From: Marko Mušnjak <marko.musn...@gmail.com>
Date: Tuesday, 6 April 2021 at 20:21
To: users@datasketches.apache.org <users@datasketches.apache.org>
Subject: [External] Re: Choice of Flink vs Spark for using DataSketches with 
streaming data
Hi,
I've implemented jobs using DataSketches in Kafka Streams, Flink streaming, and in Spark batch (through the Hive UDFs provided). Things went smoothly in all setups, with the gotcha that Hive UDFs represent incoming strings as UTF-8 byte arrays (or something like that, I forget by now), so if you're mixing sketches from two sources (Kafka Streams + Spark batch in my case) you have to take care to cast the input items to proper types before adding them to sketches.
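
As a minimal illustration of that type sensitivity (values made up):

    import org.apache.datasketches.cpc.CpcSketch;

    // The same logical value presented as different Java types hashes
    // differently, so sketches from two pipelines only merge meaningfully
    // if both sides normalise inputs to the same type first.
    public class TypeGotcha {
        public static void main(String[] args) {
            CpcSketch asLong = new CpcSketch();
            asLong.update(123L);      // hashed as an 8-byte long

            CpcSketch asString = new CpcSketch();
            asString.update("123");   // hashed as UTF-8 bytes

            // A union of these two sketches estimates ~2 distinct items, not 1.
        }
    }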

A mailing list thread concerning that issue: http://mail-archives.apache.org/mod_mbox/datasketches-users/202008.mbox/browser (thread continues into September)

Regards,
Marko

On Tue, 6 Apr 2021 at 20:55, Jon Malkin <jmal...@apache.org> wrote:
I'll echo what Ben said -- if a pre-existing solution does what you need, 
certainly use that.

Having said that, I want to revisit Frequent Directions in light of the work Charlie did on using it for ridge regression. And when I asked internally, I was told that Flink is where at least my company seems to be going for such jobs. So when I get a chance to dive into that, I'll be learning how to do it in Flink.

  jon

On Tue, Apr 6, 2021 at 11:26 AM Ben Krug <ben.k...@imply.io> wrote:
I can't answer about Spark or Flink, but as a Druid person, I'll put in a plug for Druid for the "if necessary" case.  It can ingest from Kafka and aggregate and do sketches during ingestion.  (It's a whole new ballpark, though, if you're not already using it.)

On Tue, Apr 6, 2021 at 9:56 AM Alex Garland <agarl...@expediagroup.com> wrote:
Hi

New to DataSketches and looking forward to using it. Seems like a great library.

My team are evaluating it to profile streaming data (in Kafka) in 5-minute 
windows. The obvious options for stream processing (given experience within our 
org) would be either Flink or Spark Streaming.

Two questions:

  *   Would I be right in thinking that there are no existing integrations packaged as libraries for either of these platforms? Absolutely fine if not, just confirming understanding.
  *   Is there any view (from either the maintainers or the wider community) on whether either of those two is easier to integrate with DataSketches? We would also consider other streaming platforms if necessary, but, as mentioned, wider usage within the org would lean us away from that if at all possible.

Many thanks
