The basic streaming windowed aggregations (in the Java/Scala DataStream API: https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/stream/operators/windows.html#aggregatefunction) don't require a retract method, but it looks like the SQL/Table API does require retract support for aggregate functions if they are to be used in OVER windows: https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/functions/udfs.html#mandatory-and-optional-methods
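For reference, a windowed distinct count on the DataStream API only needs the add/merge path. A minimal, illustrative sketch (assuming datasketches-java's CpcSketch as the accumulator; note that Flink will fall back to Kryo to serialize it unless you register a custom serializer or keep the accumulator as serialized bytes):

  import org.apache.datasketches.cpc.CpcSketch;
  import org.apache.datasketches.cpc.CpcUnion;
  import org.apache.flink.api.common.functions.AggregateFunction;

  /**
   * Distinct-count aggregate over a windowed stream of String ids.
   * Only createAccumulator/add/getResult/merge are needed -- no retraction.
   */
  public class CpcDistinctCount implements AggregateFunction<String, CpcSketch, Double> {

      private static final int LG_K = 11; // sketch size/accuracy trade-off

      @Override
      public CpcSketch createAccumulator() {
          return new CpcSketch(LG_K);
      }

      @Override
      public CpcSketch add(String value, CpcSketch acc) {
          acc.update(value);
          return acc;
      }

      @Override
      public Double getResult(CpcSketch acc) {
          return acc.getEstimate();
      }

      @Override
      public CpcSketch merge(CpcSketch a, CpcSketch b) {
          // CPC sketches are merged through a CpcUnion; no element removal is ever needed.
          CpcUnion union = new CpcUnion(LG_K);
          union.update(a);
          union.update(b);
          return union.getResult();
      }
  }

This would be wired into a job with something like stream.keyBy(...).window(TumblingEventTimeWindows.of(Time.minutes(5))).aggregate(new CpcDistinctCount()).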
Marko

On Thu, 8 Apr 2021 at 17:56, Will Lauer <wla...@verizonmedia.com> wrote:

> Last time I looked at the Flink API for implementing aggregators, it looked
> like it required a "decrement" function to remove entries from the aggregate,
> in addition to the standard "aggregate" function to add entries to the
> aggregate. The documentation was unclear, but it looked like this was a
> requirement for getting streaming, windowed operations to work. I don't know
> if you are going to need that functionality for your Kappa architecture, but
> you should double-check, as implementing such functionality with sketches may
> be impossible and might pose a blocker for you.
>
> It's possible that this has been resolved now. It's equally possible that I
> completely misunderstood the Flink doc, but it's something that jumped out at
> me and made me very nervous about reimplementing our current Pig-based
> pipeline in Flink, due to our use of sketches.
>
> Will
>
> Will Lauer
> Senior Principal Architect, Audience & Advertising Reporting
> Data Platforms & Systems Engineering
> M 508 561 6427
> 1908 S. First St, Champaign, IL 61822
>
> On Thu, Apr 8, 2021 at 3:43 AM Alex Garland <agarl...@expediagroup.com> wrote:
>
>> Thanks all very much for the responses so far. Definitely useful, but I
>> think it might help to narrow focus if I explain a little more of the
>> context of what we are trying to do.
>>
>> Firstly, we want to emit the profile metrics as a stream (Kafka topic) as
>> well, which I assume would mean we wouldn't want to use Druid (which is
>> more in the spirit of a next-gen/low-latency analytics DB, if I understand
>> correctly?).
>>
>> We are definitely interested in Flink, as it looks like this may be a good
>> route to create a Kappa architecture with a single set of code handling
>> profiling of batch and stream data sets. I appreciate some of the following
>> may be a bit more about Flink than DataSketches per se, but I will post it
>> for the record.
>>
>> I started looking at the Table/SQL API, as this seems to be something that
>> is being encouraged for Kappa use cases. It looked like the required
>> interface for user-defined aggregate functions in Flink SQL should allow
>> wrapping of the sketch objects as accumulators, but when we tried this in
>> practice we hit issues: Flink can't extract a data type for CpcSketch, at
>> least partly due to it having private fields (i.e. the seed).
>>
>> We're next looking at whether this is easier using the DataStream API. If
>> anyone can confirm the following, it would be useful:
>>
>> - Would I be right in thinking that where other people have integrated
>>   Flink and DataSketches, it has been using the DataStream API?
>> - Are there any good code examples publicly available (GitHub?) that might
>>   help steer/validate our approach?
>>
>> In the longer term (later this year), one option we might consider is
>> creating an OSS configurable library/framework for running checks based on
>> DataSketches in Flink (we also need to see whether, for example, Bullet
>> already covers a lot of what we need in terms of setting up stream
>> queries). If anyone else feels there is a gap and might be interested in
>> collaborating, please let me know and I can publish more details of what
>> we're proposing if and when that evolves.
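On the type-extraction problem above: one possible (untested, purely illustrative) workaround is to make the Table API accumulator a plain POJO that carries the serialized sketch as a byte[], which Flink's type extraction can handle, at the cost of heapifying and re-serializing the sketch on every row. Flink's RAW type hints (@DataTypeHint) may be a cleaner route if they can be made to work with CpcSketch.

  import org.apache.datasketches.cpc.CpcSketch;
  import org.apache.datasketches.cpc.CpcUnion;
  import org.apache.datasketches.memory.Memory;
  import org.apache.flink.table.functions.AggregateFunction;

  /**
   * Hypothetical Table/SQL UDAF: the accumulator is a POJO with a single
   * byte[] field, so Flink never needs to extract a data type for CpcSketch.
   */
  public class CpcDistinctCountUdaf extends AggregateFunction<Double, CpcDistinctCountUdaf.Acc> {

      private static final int LG_K = 11;

      /** POJO accumulator; only a byte[] field, so no custom type info needed. */
      public static class Acc {
          public byte[] sketchBytes = new CpcSketch(LG_K).toByteArray();
      }

      @Override
      public Acc createAccumulator() {
          return new Acc();
      }

      public void accumulate(Acc acc, String value) {
          // Deserialize, update, re-serialize on every row -- simple but not cheap.
          CpcSketch sketch = CpcSketch.heapify(Memory.wrap(acc.sketchBytes));
          sketch.update(value);
          acc.sketchBytes = sketch.toByteArray();
      }

      public void merge(Acc acc, Iterable<Acc> others) {
          CpcUnion union = new CpcUnion(LG_K);
          union.update(CpcSketch.heapify(Memory.wrap(acc.sketchBytes)));
          for (Acc other : others) {
              union.update(CpcSketch.heapify(Memory.wrap(other.sketchBytes)));
          }
          acc.sketchBytes = union.getResult().toByteArray();
      }

      @Override
      public Double getValue(Acc acc) {
          return CpcSketch.heapify(Memory.wrap(acc.sketchBytes)).getEstimate();
      }
  }

It could then be registered with something like tableEnv.createTemporarySystemFunction("cpc_distinct", CpcDistinctCountUdaf.class) and called as cpc_distinct(user_id) in a windowed GROUP BY.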
>> Many thanks
>>
>> From: Marko Mušnjak <marko.musn...@gmail.com>
>> Date: Tuesday, 6 April 2021 at 20:21
>> To: users@datasketches.apache.org <users@datasketches.apache.org>
>> Subject: [External] Re: Choice of Flink vs Spark for using DataSketches with streaming data
>>
>> Hi,
>>
>> I've implemented jobs using DataSketches in Kafka Streams, Flink streaming,
>> and in Spark batch (through the Hive UDFs provided). Things went smoothly in
>> all setups, with the gotcha that the Hive UDFs represent incoming strings as
>> UTF-8 byte arrays (or something like that, I've forgotten by now), so if
>> you're mixing sketches from two sources (Kafka Streams + Spark batch in my
>> case) you have to take care to cast the input items to the proper types
>> before adding them to the sketches.
>>
>> A mailing list thread concerning that issue:
>> http://mail-archives.apache.org/mod_mbox/datasketches-users/202008.mbox/browser
>> (the thread continues into September)
>>
>> Regards,
>> Marko
>>
>> On Tue, 6 Apr 2021 at 20:55, Jon Malkin <jmal...@apache.org> wrote:
>>
>> I'll echo what Ben said -- if a pre-existing solution does what you need,
>> certainly use that.
>>
>> Having said that, I want to revisit frequent directions in light of the work
>> Charlie did on using it for ridge regression. And when I asked internally, I
>> was told that Flink is where at least my company seems to be going for such
>> jobs. So when I get a chance to dive into that, I'll be learning how to do
>> it in Flink.
>>
>> jon
>>
>> On Tue, Apr 6, 2021 at 11:26 AM Ben Krug <ben.k...@imply.io> wrote:
>>
>> I can't answer about Spark or Flink, but as a Druid person I'll put in a
>> plug for Druid for the "if necessary" case. It can ingest from Kafka and
>> aggregate and do sketches during ingestion. (It's a whole new ballpark,
>> though, if you're not already using it.)
>>
>> On Tue, Apr 6, 2021 at 9:56 AM Alex Garland <agarl...@expediagroup.com> wrote:
>>
>> Hi,
>>
>> New to DataSketches and looking forward to using it; it seems like a great
>> library.
>>
>> My team are evaluating it to profile streaming data (in Kafka) in 5-minute
>> windows. The obvious options for stream processing (given experience within
>> our org) would be either Flink or Spark Streaming.
>>
>> Two questions:
>>
>> - Would I be right in thinking that there are no existing integration
>>   libraries for either of these platforms? Absolutely fine if not, just
>>   confirming my understanding.
>> - Is there any view (from either the maintainers or the wider community) on
>>   whether either of those two is easier to integrate with DataSketches? We
>>   would also consider other streaming platforms if necessary, but as
>>   mentioned, wider usage within the org would lean us away from that if at
>>   all possible.
>>
>> Many thanks
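As a footnote to Marko's point above about casting input items to the proper types: the sketches hash the raw input representation, so the same logical value fed as different Java types ends up as different items. A small, illustrative example (using a theta sketch from datasketches-java; the same caveat applies to CPC):

  import org.apache.datasketches.theta.SetOperation;
  import org.apache.datasketches.theta.Union;
  import org.apache.datasketches.theta.UpdateSketch;

  public class TypeMismatchDemo {
      public static void main(String[] args) {
          UpdateSketch fromStream = UpdateSketch.builder().build();
          UpdateSketch fromBatch = UpdateSketch.builder().build();

          // Same logical item, fed as different Java types.
          fromStream.update(123L);   // e.g. a parsed numeric id
          fromBatch.update("123");   // e.g. the same id left as a string

          Union union = SetOperation.builder().buildUnion();
          union.update(fromStream);
          union.update(fromBatch);

          // Each sketch estimates ~1 distinct item, but the union estimates ~2
          // because the hash inputs differ. Normalize types before updating.
          System.out.println(union.getResult().getEstimate());
      }
  }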