The basic streaming windowed aggregations (in the Java/Scala DataStream API: https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/stream/operators/windows.html#aggregatefunction) don't require a retract method, but it looks like the SQL/Table API does require retract support for aggregate functions if they are to be used in OVER windows: https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/functions/udfs.html#mandatory-and-optional-methods
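For reference, a windowed distinct count on the DataStream API only needs the add/merge path. A minimal, illustrative sketch (assuming datasketches-java's CpcSketch as the accumulator; note that Flink will fall back to Kryo to serialize it unless you register a custom serializer or keep the accumulator as serialized bytes):

  import org.apache.datasketches.cpc.CpcSketch;
  import org.apache.datasketches.cpc.CpcUnion;
  import org.apache.flink.api.common.functions.AggregateFunction;

  /**
   * Distinct-count aggregate over a windowed stream of String ids.
   * Only createAccumulator/add/getResult/merge are needed -- no retraction.
   */
  public class CpcDistinctCount implements AggregateFunction<String, CpcSketch, Double> {

      private static final int LG_K = 11; // sketch size/accuracy trade-off

      @Override
      public CpcSketch createAccumulator() {
          return new CpcSketch(LG_K);
      }

      @Override
      public CpcSketch add(String value, CpcSketch acc) {
          acc.update(value);
          return acc;
      }

      @Override
      public Double getResult(CpcSketch acc) {
          return acc.getEstimate();
      }

      @Override
      public CpcSketch merge(CpcSketch a, CpcSketch b) {
          // CPC sketches are merged through a CpcUnion; no element removal is ever needed.
          CpcUnion union = new CpcUnion(LG_K);
          union.update(a);
          union.update(b);
          return union.getResult();
      }
  }

This would be wired into a job with something like stream.keyBy(...).window(TumblingEventTimeWindows.of(Time.minutes(5))).aggregate(new CpcDistinctCount()).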
Marko

On Thu, 8 Apr 2021 at 17:56, Will Lauer <wla...@verizonmedia.com> wrote:

> Last time I looked at the Flink API for implementing aggregators, it looked
> like it required a "decrement" function to remove entries from the aggregate,
> in addition to the standard "aggregate" function to add entries to the
> aggregate. The documentation was unclear, but it looked like this was a
> requirement for getting streaming, windowed operations to work. I don't know
> if you are going to need that functionality for your Kappa architecture, but
> you should double-check, as implementing such functionality with sketches may
> be impossible and might pose a blocker for you.
>
> It's possible that this has been resolved now. It's equally possible that I
> completely misunderstood the Flink doc, but it's something that jumped out at
> me and made me very nervous about reimplementing our current Pig-based
> pipeline in Flink, due to our use of sketches.
>
> Will
>
> Will Lauer
> Senior Principal Architect, Audience & Advertising Reporting
> Data Platforms & Systems Engineering
> M 508 561 6427
> 1908 S. First St, Champaign, IL 61822
>
> On Thu, Apr 8, 2021 at 3:43 AM Alex Garland <agarl...@expediagroup.com> wrote:
>
>> Thanks all very much for the responses so far. Definitely useful, but I
>> think it might help to narrow focus if I explain a little more of the
>> context of what we are trying to do.
>>
>> Firstly, we want to emit the profile metrics as a stream (Kafka topic) as
>> well, which I assume would mean we wouldn't want to use Druid (which is
>> more in the spirit of a next-gen/low-latency analytics DB, if I understand
>> correctly?).
>>
>> We are definitely interested in Flink, as it looks like this may be a good
>> route to create a Kappa architecture with a single set of code handling
>> profiling of batch and stream data sets. I appreciate some of the following
>> may be a bit more about Flink than DataSketches per se, but I will post it
>> for the record.
>>
>> I started looking at the Table/SQL API, as this seems to be something that
>> is being encouraged for Kappa use cases. It looked like the required
>> interface for user-defined aggregate functions in Flink SQL should allow
>> wrapping of the sketch objects as accumulators, but when we tried this in
>> practice we hit issues: Flink can't extract a data type for CpcSketch, at
>> least partly due to it having private fields (i.e. the seed).
>>
>> We're next looking at whether this is easier using the DataStream API. If
>> anyone can confirm the following, it would be useful:
>>
>> - Would I be right in thinking that where other people have integrated
>>   Flink and DataSketches, it has been using the DataStream API?
>> - Are there any good code examples publicly available (GitHub?) that might
>>   help steer/validate our approach?
>>
>> In the longer term (later this year), one option we might consider is
>> creating an OSS configurable library/framework for running checks based on
>> DataSketches in Flink (we also need to see whether, for example, Bullet
>> already covers a lot of what we need in terms of setting up stream
>> queries). If anyone else feels there is a gap and might be interested in
>> collaborating, please let me know and I can publish more details of what
>> we're proposing if and when that evolves.
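On the type-extraction problem above: one possible (untested, purely illustrative) workaround is to make the Table API accumulator a plain POJO that carries the serialized sketch as a byte[], which Flink's type extraction can handle, at the cost of heapifying and re-serializing the sketch on every row. Flink's RAW type hints (@DataTypeHint) may be a cleaner route if they can be made to work with CpcSketch.

  import org.apache.datasketches.cpc.CpcSketch;
  import org.apache.datasketches.cpc.CpcUnion;
  import org.apache.datasketches.memory.Memory;
  import org.apache.flink.table.functions.AggregateFunction;

  /**
   * Hypothetical Table/SQL UDAF: the accumulator is a POJO with a single
   * byte[] field, so Flink never needs to extract a data type for CpcSketch.
   */
  public class CpcDistinctCountUdaf extends AggregateFunction<Double, CpcDistinctCountUdaf.Acc> {

      private static final int LG_K = 11;

      /** POJO accumulator; only a byte[] field, so no custom type info needed. */
      public static class Acc {
          public byte[] sketchBytes = new CpcSketch(LG_K).toByteArray();
      }

      @Override
      public Acc createAccumulator() {
          return new Acc();
      }

      public void accumulate(Acc acc, String value) {
          // Deserialize, update, re-serialize on every row -- simple but not cheap.
          CpcSketch sketch = CpcSketch.heapify(Memory.wrap(acc.sketchBytes));
          sketch.update(value);
          acc.sketchBytes = sketch.toByteArray();
      }

      public void merge(Acc acc, Iterable<Acc> others) {
          CpcUnion union = new CpcUnion(LG_K);
          union.update(CpcSketch.heapify(Memory.wrap(acc.sketchBytes)));
          for (Acc other : others) {
              union.update(CpcSketch.heapify(Memory.wrap(other.sketchBytes)));
          }
          acc.sketchBytes = union.getResult().toByteArray();
      }

      @Override
      public Double getValue(Acc acc) {
          return CpcSketch.heapify(Memory.wrap(acc.sketchBytes)).getEstimate();
      }
  }

It could then be registered with something like tableEnv.createTemporarySystemFunction("cpc_distinct", CpcDistinctCountUdaf.class) and called as cpc_distinct(user_id) in a windowed GROUP BY.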
>> Many thanks
>>
>> From: Marko Mušnjak <marko.musn...@gmail.com>
>> Date: Tuesday, 6 April 2021 at 20:21
>> To: users@datasketches.apache.org <users@datasketches.apache.org>
>> Subject: [External] Re: Choice of Flink vs Spark for using DataSketches with streaming data
>>
>> Hi,
>>
>> I've implemented jobs using DataSketches in Kafka Streams, Flink streaming,
>> and in Spark batch (through the Hive UDFs provided). Things went smoothly in
>> all setups, with the gotcha that the Hive UDFs represent incoming strings as
>> UTF-8 byte arrays (or something like that, I've forgotten by now), so if
>> you're mixing sketches from two sources (Kafka Streams + Spark batch in my
>> case) you have to take care to cast the input items to the proper types
>> before adding them to the sketches.
>>
>> A mailing list thread concerning that issue:
>> http://mail-archives.apache.org/mod_mbox/datasketches-users/202008.mbox/browser
>> (the thread continues into September)
>>
>> Regards,
>> Marko
>>
>> On Tue, 6 Apr 2021 at 20:55, Jon Malkin <jmal...@apache.org> wrote:
>>
>> I'll echo what Ben said -- if a pre-existing solution does what you need,
>> certainly use that.
>>
>> Having said that, I want to revisit frequent directions in light of the work
>> Charlie did on using it for ridge regression. And when I asked internally, I
>> was told that Flink is where at least my company seems to be going for such
>> jobs. So when I get a chance to dive into that, I'll be learning how to do
>> it in Flink.
>>
>> jon
>>
>> On Tue, Apr 6, 2021 at 11:26 AM Ben Krug <ben.k...@imply.io> wrote:
>>
>> I can't answer about Spark or Flink, but as a Druid person I'll put in a
>> plug for Druid for the "if necessary" case. It can ingest from Kafka and
>> aggregate and do sketches during ingestion. (It's a whole new ballpark,
>> though, if you're not already using it.)
>>
>> On Tue, Apr 6, 2021 at 9:56 AM Alex Garland <agarl...@expediagroup.com> wrote:
>>
>> Hi,
>>
>> New to DataSketches and looking forward to using it; it seems like a great
>> library.
>>
>> My team are evaluating it to profile streaming data (in Kafka) in 5-minute
>> windows. The obvious options for stream processing (given experience within
>> our org) would be either Flink or Spark Streaming.
>>
>> Two questions:
>>
>> - Would I be right in thinking that there are no existing integration
>>   libraries for either of these platforms? Absolutely fine if not, just
>>   confirming my understanding.
>> - Is there any view (from either the maintainers or the wider community) on
>>   whether either of those two is easier to integrate with DataSketches? We
>>   would also consider other streaming platforms if necessary, but as
>>   mentioned, wider usage within the org would lean us away from that if at
>>   all possible.
>>
>> Many thanks
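As a footnote to Marko's point above about casting input items to the proper types: the sketches hash the raw input representation, so the same logical value fed as different Java types ends up as different items. A small, illustrative example (using a theta sketch from datasketches-java; the same caveat applies to CPC):

  import org.apache.datasketches.theta.SetOperation;
  import org.apache.datasketches.theta.Union;
  import org.apache.datasketches.theta.UpdateSketch;

  public class TypeMismatchDemo {
      public static void main(String[] args) {
          UpdateSketch fromStream = UpdateSketch.builder().build();
          UpdateSketch fromBatch = UpdateSketch.builder().build();

          // Same logical item, fed as different Java types.
          fromStream.update(123L);   // e.g. a parsed numeric id
          fromBatch.update("123");   // e.g. the same id left as a string

          Union union = SetOperation.builder().buildUnion();
          union.update(fromStream);
          union.update(fromBatch);

          // Each sketch estimates ~1 distinct item, but the union estimates ~2
          // because the hash inputs differ. Normalize types before updating.
          System.out.println(union.getResult().getEstimate());
      }
  }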