Thanks all very much for the responses so far. Definitely useful but I think it might help to narrow focus if I explain a little more context of what we are trying to do.
Firstly, we want to emit the profile metrics as a stream (Kafka topic) as well, which I assume would mean we wouldn’t want to use Druid (which is more in the spirit of a next-gen/ low-latency analytics DB if I understand correctly?) We are definitely interested in Flink as it looks like this may be a good route to create a Kappa architecture with a single set of code handling profiling of batch and stream data sets. Appreciate some of the following may be a bit more about Flink than DataSketches per se but will post for the record. I started looking at the Table/ SQL API as this seems to be something that is being encouraged for Kappa use cases. It looked like the required interface for user-defined aggregate functions in Flink SQL should allow wrapping of the Sketch objects as accumulators, but when we tried this in practice we got issues – Flink can’t extract a data type for CpCSketch, at least partly due to it having private fields (i.e. seed). We’re next looking at whether this is easier using the DataStreams API, if anyone can confirm the following it would be useful: * Would I be right in thinking that where other people have integrated Flink and DataSketches it has been using DataStreams API? * Are there any good code examples publicly available (GitHub?) that might help steer/ validate our approach? In the longer term (later this year), one option we might consider is creating an OSS configurable library/ framework for running checks based on DataSketches in Flink (we also need to see whether for example Bullet already covers a lot of what we need in terms of setting up stream queries). If anyone else feels there is a gap and might be interested in collaborating, please let me know and I can publish more details of what we’re proposing if and when that evolves. Many thanks From: Marko Mušnjak <marko.musn...@gmail.com> Date: Tuesday, 6 April 2021 at 20:21 To: users@datasketches.apache.org <users@datasketches.apache.org> Subject: [External] Re: Choice of Flink vs Spark for using DataSketches with streaming data Hi, I've implemented jobs using datasketches in Kafka Streams, Flink streaming, and in Spark batch (through the Hive UDFs provided). Things went smoothly in all setups, with the gotcha that hive UDFs represent incoming strings as utf-8 byte arrays (or something like that, i forgot by now), so if you're mixing sketches from two sources (Kafka Streams + Spark batch in my case) you have to take care to cast the input items to proper types before adding them to sketches. A mailing list thread concerning that issue: http://mail-archives.apache.org/mod_mbox/datasketches-users/202008.mbox/browser (thread continues into September) Regards, Marko On Tue, 6 Apr 2021 at 20:55, Jon Malkin <jmal...@apache.org<mailto:jmal...@apache.org>> wrote: I'll echo what Ben said -- if a pre-existing solution does what you need, certainly use that. Having said that, I want to revisit frequent directions in light of the work Charlie did on using it for ridge regression. And when I asked internally I was told that Flink is where at least my company seems to be going for such jobs. So when I get a chance to dive into that, I'll be learning how to do it in Flink. jon On Tue, Apr 6, 2021 at 11:26 AM Ben Krug <ben.k...@imply.io<mailto:ben.k...@imply.io>> wrote: I can't answer about Spark or Flink, but as a druid person, I'll put in a plug for druid for the "if necessary" case. It can ingest from kafka and aggregate and do sketches during ingestion. (It's a whole new ballpark, though, if you're not already using it.) On Tue, Apr 6, 2021 at 9:56 AM Alex Garland <agarl...@expediagroup.com<mailto:agarl...@expediagroup.com>> wrote: Hi New to DataSketches and looking forward to using, seems like a great library. My team are evaluating it to profile streaming data (in Kafka) in 5-minute windows. The obvious options for stream processing (given experience within our org) would be either Flink or Spark Streaming. Two questions: * Would I be right in thinking that there are not existing integrations as libraries for either of these platforms? Absolutely fine if not, just confirming understanding. * Is there any view (from either the maintainers or the wider community) on whether either of those two are easier to integrate with DataSketches? We would also consider other streaming platforms if necessary, but as mentioned wider usage within the org would lean us against that if at all possible. Many thanks