If this can encourage Lee: I'm one of the Flink users who already use DataSketches, and I find it an amazing library. When I was trying it out (last year) I tried to stimulate some discussion [1], but at that time it was probably too early. I really hope that things are now mature enough for both communities!
[1] http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-and-sketches-td26852.html

Best,
Flavio

On Mon, Apr 27, 2020 at 7:37 PM leerho <lee...@gmail.com> wrote:

> Hi Arvid,
>
> Note: I am dual listing this thread on both dev lists for better tracking.
>
> > 1. I'm curious on how you would estimate the effort to port datasketches
> >    to Flink? It already has a Java API, but how difficult would it be to
> >    subdivide the tasks into parallel chunks of work? Since it's already
> >    ported on Pig, I think we could use this port as a baseline
>
> Most systems (including systems like Druid, Hive, Pig, Spark, PostgreSQL,
> Databases, Streaming Platforms, Map-Reduce Platforms, etc.) have some sort
> of aggregation API, which allows users to plug in custom aggregation
> functions. Typical API functions found in these APIs are Initialize(),
> Update() (or Add()), Merge(), and getResult(). How these are named and how
> they operate varies considerably from system to system. These APIs are
> sometimes called User Defined Functions (UDFs) or User Defined Aggregation
> Functions (UDAFs).
>
> DataSketches is a library of sketching (streaming) aggregation functions,
> each of which performs a specific type of aggregation: for example,
> counting unique items, determining quantiles and histograms of unknown
> distributions, identifying the most frequent items (heavy hitters) in a
> stream, etc. The advantage of using DataSketches is that the sketches are
> extremely fast, small in size, and have well-defined error properties
> established by published scientific papers that describe the underlying
> mathematics.
>
> The task of porting DataSketches is usually developing a thin wrapping
> layer that translates the specific UDAF API of Flink to the equivalent API
> methods of the targeted sketches in the library. This is best done by
> someone with deep knowledge of the UDAF code of the targeted system. We
> are certainly available to answer questions about the DataSketches APIs.
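The "thin wrapping layer" idea above can be made concrete with a small, self-contained sketch. Everything in it is an assumption made for illustration: the `Aggregator` interface is a hypothetical stand-in for a host system's UDAF API (Flink's own `AggregateFunction` has a broadly similar shape), and `ToyKmvSketch` is a toy K-Minimum-Values distinct counter written for this example, not a class from the DataSketches library. It only shows how Initialize/Update/Merge/getResult might map onto a mergeable sketch.

```java
import java.util.TreeSet;

class UdafSketchDemo {

    /** Hypothetical stand-in for a host system's aggregation (UDAF) API. */
    interface Aggregator<IN, ACC, OUT> {
        ACC initialize();              // create an empty accumulator
        ACC update(ACC acc, IN item);  // fold one input item into the accumulator
        ACC merge(ACC a, ACC b);       // combine two partial accumulators
        OUT getResult(ACC acc);        // extract the final answer
    }

    /** Toy K-Minimum-Values distinct counter; NOT a DataSketches class. */
    static final class ToyKmvSketch {
        final int k;
        final TreeSet<Long> smallest = new TreeSet<>(); // the k smallest hashes seen

        ToyKmvSketch(int k) {
            if (k < 2) throw new IllegalArgumentException("k must be >= 2");
            this.k = k;
        }

        void offerHash(long hash) {
            smallest.add(hash);
            if (smallest.size() > k) smallest.pollLast(); // drop the largest
        }

        void update(long item) { offerHash(mix(item)); }

        double estimate() {
            // Below capacity the count is exact (barring hash collisions).
            if (smallest.size() < k) return smallest.size();
            // Normalize the k-th smallest hash into (0, 1] and invert it.
            double kth = smallest.last() / (double) Long.MAX_VALUE;
            return (k - 1) / kth;
        }

        /** SplitMix64-style mixer, masked to a non-negative long. */
        static long mix(long z) {
            z += 0x9E3779B97F4A7C15L;
            z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
            z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
            return (z ^ (z >>> 31)) & Long.MAX_VALUE;
        }
    }

    /** The thin wrapper: each UDAF method delegates to the sketch. */
    static final class DistinctCountUdaf
            implements Aggregator<Long, ToyKmvSketch, Double> {
        private final int k;

        DistinctCountUdaf(int k) { this.k = k; }

        public ToyKmvSketch initialize() { return new ToyKmvSketch(k); }

        public ToyKmvSketch update(ToyKmvSketch acc, Long item) {
            acc.update(item);
            return acc;
        }

        public ToyKmvSketch merge(ToyKmvSketch a, ToyKmvSketch b) {
            // The k smallest hashes of the union come from the two k-smallest sets.
            ToyKmvSketch out = new ToyKmvSketch(k);
            for (long h : a.smallest) out.offerHash(h);
            for (long h : b.smallest) out.offerHash(h);
            return out;
        }

        public Double getResult(ToyKmvSketch acc) { return acc.estimate(); }
    }

    public static void main(String[] args) {
        DistinctCountUdaf udaf = new DistinctCountUdaf(256);
        ToyKmvSketch left = udaf.initialize();   // e.g. one parallel subtask
        ToyKmvSketch right = udaf.initialize();  // e.g. another parallel subtask
        for (long i = 0; i < 60_000; i++) udaf.update(left, i);
        for (long i = 40_000; i < 100_000; i++) udaf.update(right, i);
        System.out.println("Estimated distinct items: "
                + udaf.getResult(udaf.merge(left, right)));
    }
}
```

The point of the design is that the accumulator is mergeable: partial results built independently on parallel chunks of the stream can be combined without revisiting the input, which is what makes the "subdivide into parallel chunks" question mostly a matter of wiring up merge(). A real port would replace the toy sketch with the corresponding DataSketches class and the stand-in interface with the target system's actual UDAF interface.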
> Although we did write the UDAF layers for Hive and Pig, we did that as a
> proof of concept and as an example of how to write such layers. We are a
> small team and are not in a position to support these integration layers
> for every system out there.
>
> > 2. Do you have any idea who is usually driving the adoptions?
>
> To start, you only need to write the UDAF layer for the sketches that you
> think would be in most demand by your users. The big 4 categories are
> distinct (unique) counting, quantiles, frequent-items, and sampling. This
> is a natural way of subdividing the task: choose the sketches you want to
> adapt and in what order. Each sketch is independent, so it can be adapted
> whenever it is needed.
>
> Please let us know if you have any further questions :)
>
> Lee.
>
> On Mon, Apr 27, 2020 at 2:11 AM Arvid Heise <ar...@ververica.com> wrote:
>
> > Hi Lee,
> >
> > I must admit that I also heard of data sketches for the first time (there
> > are really many Apache projects).
> >
> > Datasketches sounds really exciting. As a (former) data engineer, I can
> > 100% say that this is something that (end-)users want and need, and it
> > would make so much sense to have it in Flink from the get-go.
> > Flink, however, is already quite an old project, which grew at a strong
> > pace, leading to some 150 modules in the core. We are currently in the
> > process of restructuring that and reducing the number of things in the
> > core, so that build times and stability improve.
> >
> > To counter that we created Flink packages [1], which include everything
> > new that we deem not to be essential. I'd propose to incorporate a Flink
> > datasketch package there. If it seems like it's becoming essential, we
> > can still move it to core at a later point.
> >
> > As I have seen on the page, there are already plenty of adoptions. That
> > leaves me with a few questions.
> >
> > 1. I'm curious on how you would estimate the effort to port datasketches
> >    to Flink?
> >    It already has a Java API, but how difficult would it be to
> >    subdivide the tasks into parallel chunks of work? Since it's already
> >    ported on Pig, I think we could use this port as a baseline.
> > 2. Do you have any idea who is usually driving the adoptions?
> >
> > [1] https://flink-packages.org/
> >
> > On Sun, Apr 26, 2020 at 8:07 AM leerho <lee...@gmail.com> wrote:
> >
> > > Hello All,
> > >
> > > I am a committer on DataSketches.apache.org
> > > <http://datasketches.apache.org/> and am just learning about Flink.
> > > Since Flink is designed for stateful stream processing, I would think
> > > it would make sense to have the DataSketches library integrated into
> > > its core so all users of Flink could take advantage of these advanced
> > > streaming algorithms. If there is interest in the Flink community for
> > > this capability, please contact us at d...@datasketches.apache.org or
> > > on our datasketches-dev Slack channel.
> > >
> > > Cheers,
> > > Lee.
> >
> > --
> > Arvid Heise | Senior Java Developer
> > Ververica GmbH