Re: [E] Re: Choice of Flink vs Spark for using DataSketches with streaming data

Will Lauer Thu, 08 Apr 2021 08:56:17 -0700

Last time I looked at the Flink API for implementing aggregators, it looked
like it required a "decrement" function to remove entries from the
aggregate in addition to the standard "aggregate" function to add entries
to the aggregate. The documentation was unclear, but it looked like this
was a requirement for getting streaming, windowed operations to work. I
don't know if you are going to need that functionality for your Kappa
architecture, but you should double check, as implementing such
functionality with sketches may be impossible, and might pose a blocker for
you.


It's possible that this has been resolved now. Its equally possible that I
completely misunderstood the flink doc, but its something that jumped out
at me and made be very nervous about reimplementing our current PIG based
pipelinein Flink due to our use of sketches.

Will


<http://www.verizonmedia.com>

Will Lauer

Senior Principal Architect, Audience & Advertising Reporting
Data Platforms & Systems Engineering

M 508 561 6427
1908 S. First St
Champaign, IL 61822

<http://www.facebook.com/verizonmedia>   <http://twitter.com/verizonmedia>
<https://www.linkedin.com/company/verizon-media/>
<http://www.instagram.com/verizonmedia>



On Thu, Apr 8, 2021 at 3:43 AM Alex Garland <agarl...@expediagroup.com>
wrote:

> Thanks all very much for the responses so far. Definitely useful but I
> think it might help to narrow focus if I explain a little more context of
> what we are trying to do.
>
>
>
> Firstly, we want to emit the profile metrics as a stream (Kafka topic) as
> well, which I assume would mean we wouldn’t want to use Druid (which is
> more in the spirit of a next-gen/ low-latency analytics DB if I understand
> correctly?)
>
>
>
> We are definitely interested in Flink as it looks like this may be a good
> route to create a Kappa architecture with a single set of code handling
> profiling of batch and stream data sets. Appreciate some of the following
> may be a bit more about Flink than DataSketches per se but will post for
> the record.
>
>
>
> I started looking at the Table/ SQL API as this seems to be something that
> is being encouraged for Kappa use cases. It looked like the required
> interface for user-defined aggregate functions in Flink SQL should allow
> wrapping of the Sketch objects as accumulators, but when we tried this in
> practice we got issues – Flink can’t extract a data type for CpCSketch, at
> least partly due to it having private fields (i.e. seed).
>
>
>
> We’re next looking at whether this is easier using the DataStreams API, if
> anyone can confirm the following it would be useful:
>
>    - Would I be right in thinking that where other people have integrated
>    Flink and DataSketches it has been using DataStreams API?
>    - Are there any good code examples publicly available (GitHub?) that
>    might help steer/ validate our approach?
>
>
>
> In the longer term (later this year), one option we might consider is
> creating an OSS configurable library/ framework for running checks based on
> DataSketches in Flink (we also need to see whether for example Bullet
> already covers a lot of what we need in terms of setting up stream
> queries). If anyone else feels there is a gap and might be interested in
> collaborating, please let me know and I can publish more details of what
> we’re proposing if and when that evolves.
>
>
>
> Many thanks
>
>
>
>
>
> *From: *Marko Mušnjak <marko.musn...@gmail.com>
> *Date: *Tuesday, 6 April 2021 at 20:21
> *To: *users@datasketches.apache.org <users@datasketches.apache.org>
> *Subject: *[External] Re: Choice of Flink vs Spark for using DataSketches
> with streaming data
>
> Hi,
>
> I've implemented jobs using datasketches in Kafka Streams, Flink
> streaming, and in Spark batch (through the Hive UDFs provided). Things went
> smoothly in all setups, with the gotcha that hive UDFs represent incoming
> strings as utf-8 byte arrays (or something like that, i forgot by now), so
> if you're mixing sketches from two sources (Kafka Streams + Spark batch in
> my case) you have to take care to cast the input items to proper types
> before adding them to sketches.
>
>
>
> A mailing list thread concerning that issue:
> http://mail-archives.apache.org/mod_mbox/datasketches-users/202008.mbox/browser
> <https://urldefense.proofpoint.com/v2/url?u=http-3A__mail-2Darchives.apache.org_mod-5Fmbox_datasketches-2Dusers_202008.mbox_browser&d=DwMF-g&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=vGHo2vqhE2ZeS_hHdb4Y3eoJ4WjVKhEg5Xld1w9ptEQ&m=rlQqLTsRN-Zk7JPNbgmYgzlNfDcsR3i0UHYOehNFmKY&s=C8YudYrE_tA-E2w3NVRGI2CeGZE382Hng5tiFLCydEs&e=>
> (thread continues into September)
>
>
>
> Regards,
>
> Marko
>
>
>
> On Tue, 6 Apr 2021 at 20:55, Jon Malkin <jmal...@apache.org> wrote:
>
> I'll echo what Ben said -- if a pre-existing solution does what you need,
> certainly use that.
>
>
>
> Having said that, I want to revisit frequent directions in light of the
> work Charlie did on using it for ridge regression. And when I asked
> internally I was told that Flink is where at least my company seems to be
> going for such jobs. So when I get a chance to dive into that, I'll be
> learning how to do it in Flink.
>
>
>
>   jon
>
>
>
> On Tue, Apr 6, 2021 at 11:26 AM Ben Krug <ben.k...@imply.io> wrote:
>
> I can't answer about Spark or Flink, but as a druid person, I'll put in a
> plug for druid for the "if necessary" case.  It can ingest from kafka and
> aggregate and do sketches during ingestion.  (It's a whole new ballpark,
> though, if you're not already using it.)
>
>
>
> On Tue, Apr 6, 2021 at 9:56 AM Alex Garland <agarl...@expediagroup.com>
> wrote:
>
> Hi
>
>
>
> New to DataSketches and looking forward to using, seems like a great
> library.
>
>
>
> My team are evaluating it to profile streaming data (in Kafka) in 5-minute
> windows. The obvious options for stream processing (given experience within
> our org) would be either Flink or Spark Streaming.
>
>
>
> Two questions:
>
>    - Would I be right in thinking that there are not existing
>    integrations as libraries for either of these platforms? Absolutely fine if
>    not, just confirming understanding.
>    - Is there any view (from either the maintainers or the wider
>    community) on whether either of those two are easier to integrate with
>    DataSketches? We would also consider other streaming platforms if
>    necessary, but as mentioned wider usage within the org would lean us
>    against that if at all possible.
>
>
>
> Many thanks
>
>

Re: [E] Re: Choice of Flink vs Spark for using DataSketches with streaming data

Reply via email to