Hi Lee,

I must admit that this is also the first time I have heard of DataSketches (there really are many Apache projects).
DataSketches sounds really exciting. As a (former) data engineer, I can say with 100% confidence that this is something that (end) users want and need, and it would make a lot of sense to have it in Flink from the get-go.

Flink, however, is already quite an old project, and it has grown at a strong pace, leading to some 150 modules in the core. We are currently in the process of restructuring it and reducing the number of things in the core, so that build times and stability improve.

To that end, we created Flink packages [1], which include everything new that we do not deem essential. I'd propose incorporating a Flink DataSketches package there. If it turns out to be essential, we can still move it to the core at a later point.

As I have seen on the page, there are already plenty of adoptions. That leaves me with a few questions.

1. How would you estimate the effort to port DataSketches to Flink? It already has a Java API, but how difficult would it be to subdivide the tasks into parallel chunks of work? Since it has already been ported to Pig, I think we could use that port as a baseline.
2. Do you have any idea who usually drives the adoptions?

[1] https://flink-packages.org/

On Sun, Apr 26, 2020 at 8:07 AM leerho <lee...@gmail.com> wrote:

> Hello All,
>
> I am a committer on DataSketches.apache.org
> <http://datasketches.apache.org/> and just learning about Flink, Since
> Flink is designed for stateful stream processing I would think it would
> make sense to have the DataSketches library integrated into its core so all
> users of Flink could take advantage of these advanced streaming
> algorithms. If there is interest in the Flink community for this
> capability, please contact us at d...@datasketches.apache.org or on our
> datasketches-dev Slack channel.
> Cheers,
> Lee.
>

--
Arvid Heise | Senior Java Developer
<https://www.ververica.com/>

Follow us @VervericaData
--
Join Flink Forward <https://flink-forward.org/> - The Apache Flink Conference
Stream Processing | Event Driven | Real Time
--
Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
--
Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji (Toni) Cheng