One more point I forgot to mention: Flink SQL supports Hive UDFs[1]. I haven't tested it, but the DataSketches Hive package should work out of the box.

Seth

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/table/hive/hive_functions.html

On Mon, Apr 27, 2020 at 2:27 PM Seth Wiesman <sjwies...@gmail.com> wrote:

> Hi Lee,
>
> I really like this project; I used it with Flink a few years ago when it
> was still Yahoo DataSketches. The projects clearly complement each other.
> As Arvid mentioned, the Flink community is trying to foster an ecosystem
> larger than what is in the main Flink repository. The reason is that the
> project has grown to such a scale that it cannot reasonably maintain
> everything. To encourage that sort of growth, Flink is extensively
> pluggable, which means that components do not need to live within the
> main repository to be treated as first-class.
>
> I'd like to outline some things the DataSketches community could do to
> integrate with Flink.
>
> 1) Create a page on the Flink Packages website.
>
> The Flink community hosts a website called Flink Packages to increase
> the visibility of ecosystem projects with the Flink user base[1].
> DataSketches is usable from Flink today, so I'd encourage you to create
> a page right away.
>
> 2) Implement TypeInformation for DataSketches
>
> TypeInformation is Flink's internal type system and is used as a factory
> for creating serializers for different types. These serializers are what
> Flink uses when shuffling data around the cluster and when storing
> records in state backends as state. You would provide TypeInformation
> instances for the different sketch types, which would just be wrappers
> around the existing serializers in the DataSketches codebase. This
> should be relatively straightforward. There is no DataStream aggregation
> API in the way you are describing, so this is the *only* step you would
> need to take to provide first-class support for the Flink DataStream
> API[2][3].
>
> 3) Implement sketch UDFs
>
> Along with its Java API, Flink also offers a relational API and UDFs.
> The community could provide UDFs for DataSketches, like the ones that
> exist for Hive.
> Doing so only requires implementing the aggregation function
> interface[4]. Flink SQL offers the concept of modules, which are
> collections of SQL UDFs that can easily be loaded into the system[5]. A
> DataSketches SQL module would provide a simple way for users to get
> started and would expose these UDFs as if they were native to Flink.
>
> I hope this helps. I look forward to watching the DataSketches community
> grow!
>
> Seth
>
> [1] https://flink-packages.org/
> [2] https://ci.apache.org/projects/flink/flink-docs-stable/dev/types_serialization.html
> [3] https://ci.apache.org/projects/flink/flink-docs-stable/dev/datastream_api.html
> [4] https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/table/functions/udfs.html#aggregation-functions
> [5] https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/table/modules.html
>
> On Mon, Apr 27, 2020 at 12:57 PM Flavio Pompermaier <pomperma...@okkam.it>
> wrote:
>
>> If this can encourage Lee: I'm one of the Flink users who already use
>> datasketches, and I find it an amazing library.
>> When I was trying it out (last year) I tried to stimulate some
>> discussion[1], but at that time it was probably too early.
>> I really hope that things are now mature for both communities!
>>
>> [1] http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-and-sketches-td26852.html
>>
>> Best,
>> Flavio
>>
>> On Mon, Apr 27, 2020 at 7:37 PM leerho <lee...@gmail.com> wrote:
>>
>> > Hi Arvid,
>> >
>> > Note: I am dual listing this thread on both dev lists for better
>> > tracking.
>> >
>> > > 1. I'm curious on how you would estimate the effort to port
>> > > datasketches to Flink? It already has a Java API, but how difficult
>> > > would it be to subdivide the tasks into parallel chunks of work?
>> > > Since it's already ported to Pig, I think we could use this port
>> > > as a baseline.
>> >
>> > Most systems (including Druid, Hive, Pig, Spark, PostgreSQL, other
>> > databases, streaming platforms, map-reduce platforms, etc.) have some
>> > sort of aggregation API, which allows users to plug in custom
>> > aggregation functions. Typical functions found in these APIs are
>> > Initialize(), Update() (or Add()), Merge(), and getResult(). How these
>> > are named and how they operate varies considerably from system to
>> > system. These APIs are sometimes called User Defined Functions (UDFs)
>> > or User Defined Aggregation Functions (UDAFs).
>> >
>> > DataSketches is a library of sketching (streaming) aggregation
>> > functions, each of which performs a specific type of aggregation: for
>> > example, counting unique items, determining quantiles and histograms
>> > of unknown distributions, identifying the most frequent items (heavy
>> > hitters) in a stream, etc. The advantage of using DataSketches is that
>> > the sketches are extremely fast, small in size, and have well-defined
>> > error properties established by published scientific papers that
>> > describe the underlying mathematics.
>> >
>> > The task of porting DataSketches is usually a matter of developing a
>> > thin wrapping layer that translates the specific UDAF API of Flink to
>> > the equivalent API methods of the targeted sketches in the library.
>> > This is best done by someone with deep knowledge of the UDAF code of
>> > the targeted system. We are certainly available to answer questions
>> > about the DataSketches APIs. Although we did write the UDAF layers for
>> > Hive and Pig, we did that as a proof of concept and as an example of
>> > how to write such layers. We are a small team and are not in a
>> > position to support these integration layers for every system out
>> > there.
>> >
>> > > 2. Do you have any idea who is usually driving the adoptions?
>> >
>> > To start, you only need to write the UDAF layer for the sketches that
>> > you think would be in most demand by your users. The big 4 categories
>> > are distinct (unique) counting, quantiles, frequent items, and
>> > sampling. This is a natural way of subdividing the task: choose the
>> > sketches you want to adapt and in what order. Each sketch is
>> > independent, so it can be adapted whenever it is needed.
>> >
>> > Please let us know if you have any further questions :)
>> >
>> > Lee.
>> >
>> > On Mon, Apr 27, 2020 at 2:11 AM Arvid Heise <ar...@ververica.com>
>> > wrote:
>> >
>> > > Hi Lee,
>> > >
>> > > I must admit that I also heard of data sketches for the first time
>> > > (there are really many Apache projects).
>> > >
>> > > Datasketches sounds really exciting. As a (former) data engineer, I
>> > > can 100% say that this is something that (end-)users want and need,
>> > > and it would make so much sense to have it in Flink from the get-go.
>> > > Flink, however, is already quite an old project, which grew at a
>> > > strong pace, leading to some 150 modules in the core. We are
>> > > currently in the process of restructuring that and reducing the
>> > > number of things in the core, so that build times and stability
>> > > improve.
>> > >
>> > > To counter that we created Flink packages [1], which includes
>> > > everything new that we deem to not be essential. I'd propose to
>> > > incorporate a Flink datasketch package there. If it seems like it's
>> > > becoming essential, we can still move it to core at a later point.
>> > >
>> > > As I have seen on the page, there are already plenty of adoptions.
>> > > That leaves me with a few questions.
>> > >
>> > > 1. I'm curious on how you would estimate the effort to port
>> > > datasketches to Flink? It already has a Java API, but how difficult
>> > > would it be to subdivide the tasks into parallel chunks of work?
>> > > Since it's already ported to Pig, I think we could use this port
>> > > as a baseline.
>> > > 2. Do you have any idea who is usually driving the adoptions?
>> > >
>> > > [1] https://flink-packages.org/
>> > >
>> > > On Sun, Apr 26, 2020 at 8:07 AM leerho <lee...@gmail.com> wrote:
>> > >
>> > > > Hello All,
>> > > >
>> > > > I am a committer on DataSketches.apache.org
>> > > > <http://datasketches.apache.org/> and just learning about Flink.
>> > > > Since Flink is designed for stateful stream processing, I would
>> > > > think it would make sense to have the DataSketches library
>> > > > integrated into its core so all users of Flink could take
>> > > > advantage of these advanced streaming algorithms. If there is
>> > > > interest in the Flink community for this capability, please
>> > > > contact us at d...@datasketches.apache.org or on our
>> > > > datasketches-dev Slack channel.
>> > > > Cheers,
>> > > > Lee.
>> > >
>> > > --
>> > > Arvid Heise | Senior Java Developer
>> > > <https://www.ververica.com/>
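
For readers following the thread, the four calls Lee lists (Initialize(), Update(), Merge(), getResult()) can be sketched roughly as below. This is a minimal Python illustration under stated assumptions, not DataSketches or Flink code: KmvSketch is a toy K-Minimum-Values distinct counter written only for this example, DistinctCountAggregate and run_partition are hypothetical names, and to_bytes/from_bytes merely stand in for the kind of existing sketch serializer that a Flink TypeInformation wrapper would delegate to.

```python
import hashlib
import struct


class KmvSketch:
    """Toy K-Minimum-Values distinct-count sketch (illustration only)."""

    def __init__(self, k=256):
        self.k = k
        self.mins = []  # sorted list of the k smallest 64-bit hashes seen so far

    def update(self, item):
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        self._keep_smallest({int.from_bytes(digest[:8], "big")})

    def merge(self, other):
        self._keep_smallest(set(other.mins))

    def _keep_smallest(self, hashes):
        self.mins = sorted(set(self.mins) | hashes)[: self.k]

    def estimate(self):
        if len(self.mins) < self.k:
            return float(len(self.mins))  # fewer than k distinct hashes: exact
        # k-th smallest hash, scaled to (0, 1), gives the usual KMV estimate
        return (self.k - 1) * 2.0**64 / self.mins[-1]

    def to_bytes(self):
        # Stand-in for a sketch's own serializer: k, count, then the hashes
        return struct.pack(f">II{len(self.mins)}Q", self.k, len(self.mins), *self.mins)

    @classmethod
    def from_bytes(cls, data):
        k, n = struct.unpack_from(">II", data)
        sketch = cls(k)
        sketch.mins = list(struct.unpack_from(f">{n}Q", data, 8))
        return sketch


class DistinctCountAggregate:
    """Hypothetical thin adapter exposing the four calls named in the thread."""

    def __init__(self, k=256):
        self.k = k

    def initialize(self):
        return KmvSketch(self.k)

    def update(self, acc, value):
        acc.update(value)
        return acc

    def merge(self, a, b):
        a.merge(b)
        return a

    def get_result(self, acc):
        return acc.estimate()


def run_partition(agg, values):
    """Simulates one parallel task building its own partial sketch."""
    acc = agg.initialize()
    for value in values:
        agg.update(acc, value)
    return acc


agg = DistinctCountAggregate(k=256)
left = run_partition(agg, (f"user-{i}" for i in range(0, 100)))
right = run_partition(agg, (f"user-{i}" for i in range(50, 150)))
# Simulate shipping the partial result between workers as bytes
right = KmvSketch.from_bytes(right.to_bytes())
merged = agg.merge(left, right)
print(agg.get_result(merged))
```

The two partitions overlap on 50 users, and the merged sketch counts each user once. Because partial sketches merge without access to the raw data, each parallel task can aggregate independently, which is the property that makes the thin wrapper approach described above workable.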