Re: Integration of DataSketches into Flink

Seth Wiesman Mon, 27 Apr 2020 14:16:12 -0700

One more point I forgot to mention.

Flink SQL supports Hive UDFs[1]. I haven't tested it, but the DataSketches
Hive package should just work out of the box.


Seth

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/table/hive/hive_functions.html

On Mon, Apr 27, 2020 at 2:27 PM Seth Wiesman <[email protected]> wrote:

> Hi Lee,
>
> I really like this project, I used it with Flink a few years ago when it
> was still Yahoo DataSketches. The projects clearly complement each other.
> As Arvid mentioned, the Flink community is trying to foster an ecosystem
> larger than what is in the main Flink repository. The reason is that the
> project has grown to such a scale that it cannot reasonably maintain
> everything. To encourage that sort of growth, Flink is extensively
> pluggable which means that components do not need to live within the main
> repository to be treated first-class.
>
> I'd like to outline some things the DataSketch community could do to
> integrate with Flink.
>
> 1) Create a page on the flink packages website.
>
> The flink community hosts a website called flink packages to increase the
> visibility of ecosystem projects with the flink user base[1]. Datasketches
> are usable from Flink today so I'd encourage you to create a page right
> away.
>
> 2) Implement TypeInformation for DataSketches
>
> TypeInformation is Flink's internal type system and is used as a factory
> for creating serializers for different types. These serializers are what
> Flink uses when shuffling data around the cluster and when storing records
> in state backends as state. You would provide type information instances
> for the different sketch types, which would just be wrappers around the
> existing serializers in the data sketch codebase. This should be relatively
> straightforward. There is no DataStream aggregation API in the way you are
> describing so this is the *only* step you would need to take to provide
> first-class support for Flink DataStream API[2][3].
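Step 2 above can be illustrated with a minimal plain-Java sketch of the wrapping idea. Everything here is an assumption for illustration: `ToySketch` is a hypothetical stand-in for a sketch that already has its own byte format, and `ToySketchSerializer` only mirrors the serialize/deserialize shape of a Flink serializer using `java.io.DataOutput` and `java.io.DataInput`. It does not extend any Flink class, and it uses no DataSketches API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.util.TreeSet;

/** Hypothetical stand-in for a sketch that already knows how to write itself to bytes. */
final class ToySketch {
    private final TreeSet<Long> values = new TreeSet<>();

    void update(long v) { values.add(v); }

    int size() { return values.size(); }

    boolean contains(long v) { return values.contains(v); }

    /** The sketch's own format: a count followed by the stored values. */
    byte[] toByteArray() {
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + Long.BYTES * values.size());
        buf.putInt(values.size());
        for (long v : values) buf.putLong(v);
        return buf.array();
    }

    static ToySketch fromByteArray(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        int n = buf.getInt();
        ToySketch s = new ToySketch();
        for (int i = 0; i < n; i++) s.update(buf.getLong());
        return s;
    }
}

/** Thin wrapper: length-prefix the sketch's bytes so they can be read back from a stream. */
final class ToySketchSerializer {
    void serialize(ToySketch sketch, DataOutput target) throws IOException {
        byte[] bytes = sketch.toByteArray();
        target.writeInt(bytes.length);
        target.write(bytes);
    }

    ToySketch deserialize(DataInput source) throws IOException {
        byte[] bytes = new byte[source.readInt()];
        source.readFully(bytes);
        return ToySketch.fromByteArray(bytes);
    }
}

final class SerializerRoundTripDemo {
    /** Writes the sketch to an in-memory stream and reads it back. */
    static ToySketch roundTrip(ToySketch in) {
        try {
            ToySketchSerializer serializer = new ToySketchSerializer();
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            serializer.serialize(in, new DataOutputStream(bos));
            byte[] wire = bos.toByteArray();
            return serializer.deserialize(new DataInputStream(new ByteArrayInputStream(wire)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ToySketch s = new ToySketch();
        s.update(1L);
        s.update(2L);
        s.update(2L);
        System.out.println("values after round trip: " + roundTrip(s).size());
    }
}
```

A real integration would put this delegation inside Flink's own serializer and type information classes and call the sketch library's existing byte conversion methods instead of the toy format shown here.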
>
> 3) Implement sketch UDFs
>
> Along with its Java API, Flink also offers a relational API and UDFs. The
> community could provide UDFs for datasketches like Hive. To do so only
> requires implementing the aggregation function interface[4]. Flink SQL
> offers the concept of modules, which are a collection of SQL UDFs that can
> easily be loaded in the system[5]. A DataSketch SQL module would provide a
> simple way for users to get started and expose these UDFs as if they were
> native to Flink.
>
> I hope this helps, I look forward to watching the DataSketch community
> grow!
>
> Seth
>
> [1] https://flink-packages.org/
> [2]
> https://ci.apache.org/projects/flink/flink-docs-stable/dev/types_serialization.html
> [3]
> https://ci.apache.org/projects/flink/flink-docs-stable/dev/datastream_api.html
> [4]
> https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/table/functions/udfs.html#aggregation-functions
> [5]
> https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/table/modules.html
>
>
> On Mon, Apr 27, 2020 at 12:57 PM Flavio Pompermaier <[email protected]>
> wrote:
>
>> If this can encourage Lee I'm one of the Flink users that already use
>> datasketches and I found it an amazing library.
>> When I was trying it out (last year) I tried to stimulate some
>> discussion[1]
>> but at that time it was probably too early..
>> I really hope that now things are mature for both communities!
>>
>> [1]
>>
>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-and-sketches-td26852.html
>>
>> Best,
>> Flavio
>>
>> On Mon, Apr 27, 2020 at 7:37 PM leerho <[email protected]> wrote:
>>
>> > Hi Arvid,
>> >
>> > Note: I am dual listing this thread on both dev lists for better
>> tracking.
>> >
>> >    1. I'm curious on how you would estimate the effort to port
>> datasketches
>> > >    to Flink? It already has a Java API, but how difficult would it be
>> to
>> > >    subdivide the tasks into parallel chunks of work? Since it's
>> already
>> > > ported
>> > >    on Pig, I think we could use this port as a baseline
>> >
>> >
>> > Most systems (including systems like Druid, Hive, Pig, Spark,
>> PostgreSQL,
>> > Databases, Streaming Platforms, Map-Reduce Platforms, etc) have some
>> sort
>> > of aggregation API, which allows users to plug in custom aggregation
>> > functions.  Typical API functions found in these APIs are Initialize(),
>> > Update() (or Add()), Merge(), and getResult().  How these are named and
>> > how they operate varies considerably from system to system.  These APIs are
>> sometimes
>> > called User Defined Functions (UDFs) or User Defined Aggregation
>> Functions
>> > (UDAFs).
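The four-function pattern described above can be sketched in plain Java as follows. The `Aggregator` interface and its method names are hypothetical and chosen only to match the names in the paragraph; each real system defines its own variant. The accumulator here is an exact `HashSet`, which a sketch-backed aggregation would replace with a sketch object.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical four-method aggregation contract; real systems name these differently. */
interface Aggregator<IN, ACC, OUT> {
    ACC initialize();
    ACC update(ACC acc, IN value);
    ACC merge(ACC left, ACC right);
    OUT getResult(ACC acc);
}

/** Exact distinct count, used here only to show the lifecycle. */
final class ExactDistinctCount implements Aggregator<String, Set<String>, Integer> {
    @Override public Set<String> initialize() { return new HashSet<>(); }

    @Override public Set<String> update(Set<String> acc, String value) {
        acc.add(value);
        return acc;
    }

    @Override public Set<String> merge(Set<String> left, Set<String> right) {
        left.addAll(right);
        return left;
    }

    @Override public Integer getResult(Set<String> acc) { return acc.size(); }
}

final class AggregationLifecycleDemo {
    /** Aggregates each partition independently, then merges the partial results. */
    static <IN, ACC, OUT> OUT run(Aggregator<IN, ACC, OUT> agg, List<List<IN>> partitions) {
        ACC total = agg.initialize();
        for (List<IN> partition : partitions) {
            ACC partial = agg.initialize();
            for (IN value : partition) {
                partial = agg.update(partial, value);
            }
            total = agg.merge(total, partial);
        }
        return agg.getResult(total);
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
            Arrays.asList("a", "b", "a"),
            Arrays.asList("b", "c"));
        System.out.println("distinct: " + run(new ExactDistinctCount(), partitions));
    }
}
```

The merge step is what lets a system aggregate partitions in parallel and combine the partial results afterwards.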
>> >
>> > DataSketches is a library of Sketching (streaming) aggregation
>> functions,
>> > each of which performs a specific type of aggregation. For example,
>> counting
>> > unique items, determining quantiles and histograms of unknown
>> > distributions, identifying most frequent items (heavy hitters) from a
>> > stream, etc.   The advantage of using DataSketches is that they are
>> > extremely fast, small in size, and have well defined error properties
>> > defined by published scientific papers that define the underlying
>> > mathematics.
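To make the "small in size" and mergeable properties concrete, here is a toy k-minimum-values distinct counter in plain Java. It is an illustrative assumption, not the DataSketches implementation or API: the class name, the hash mixer, and the estimator are chosen for brevity, and both sketches in a merge are assumed to use the same `k`.

```java
import java.util.TreeSet;

/** Toy k-minimum-values distinct counter (illustrative; not the DataSketches implementation). */
final class KmvSketch {
    private final int k;
    private final TreeSet<Long> smallest = new TreeSet<>(); // the k smallest hashes seen so far

    KmvSketch(int k) {
        if (k < 2) throw new IllegalArgumentException("k must be at least 2");
        this.k = k;
    }

    /** 64-bit mixing function (the SplitMix64 finalizer), shifted into the non-negative range. */
    private static long hash(long x) {
        x += 0x9E3779B97F4A7C15L;
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        return (x ^ (x >>> 31)) >>> 1;
    }

    void update(long item) { offer(hash(item)); }

    private void offer(long h) {
        if (smallest.size() < k) {
            smallest.add(h);
        } else if (h < smallest.last() && smallest.add(h)) {
            smallest.pollLast(); // evict the largest so only k values are retained
        }
    }

    /** Keeps the k smallest hashes of the union, the same state as sketching the combined stream. */
    void merge(KmvSketch other) {
        for (long h : other.smallest) offer(h);
    }

    double estimate() {
        if (smallest.size() < k) return smallest.size(); // exact below k, barring hash collisions
        double kthFraction = smallest.last() / (double) Long.MAX_VALUE;
        return (k - 1) / kthFraction;
    }
}

final class KmvDemo {
    public static void main(String[] args) {
        KmvSketch left = new KmvSketch(256);
        KmvSketch right = new KmvSketch(256);
        for (long i = 0; i < 50_000; i++) left.update(i);
        for (long i = 25_000; i < 75_000; i++) right.update(i);
        left.merge(right);
        System.out.printf("estimated distinct: %.0f (true: 75000)%n", left.estimate());
    }
}
```

Memory stays bounded by `k` retained hashes regardless of stream length, and larger `k` trades space for a tighter estimate. The library's sketches provide the published error bounds mentioned above; this toy version makes no such guarantee.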
>> >
>> > The task of porting DataSketches is usually developing a thin wrapping
>> > layer that translates the specific UDAF API of Flink to the equivalent
>> API
>> > methods of the targeted sketches in the library.  This is best done by
>> > someone with deep knowledge of the UDAF code of the targeted system.
>>  We
>> > are certainly available to answer questions about the DataSketches APIs.
>> >  Although we did write the UDAF layers for Hive and Pig, we did that as
>> a
>> > proof of concept and example on how to write such layers.  We are a
>> small
>> > team and are not in a position to support these integration layers for
>> > every system out there.
>> >
>> > 2. Do you have any idea who is usually driving the adoptions?
>> >
>> >
>> > To start, you only need to write the UDAF layer for the sketches that
>> you
>> > think would be in most demand by your users.  The big 4 categories are
>> > distinct (unique) counting, quantiles, frequent-items, and sampling.
>> This
>> > is a natural way of subdividing the task: choose the sketches you want
>> to
>> > adapt and in what order.  Each sketch is independent so it can be
>> adapted
>> > whenever it is needed.
>> >
>> > Please let us know if you have any further questions :)
>> >
>> > Lee.
>> >
>> >
>> >
>> >
>> > On Mon, Apr 27, 2020 at 2:11 AM Arvid Heise <[email protected]>
>> wrote:
>> >
>> > > Hi Lee,
>> > >
>> > > I must admit that I also heard of data sketches for the first time
>> (there
>> > > are really many Apache projects).
>> > >
>> > > Datasketches sounds really exciting. As a (former) data engineer, I
>> can
>> > > 100% say that this is something that (end-)users want and need and it
>> > would
>> > > make so much sense to have it in Flink from the get-go.
>> > > Flink, however, is quite an old project already, which grew at a strong
>> > pace
>> > > leading to some 150 modules in the core. We are currently in the
>> process
>> > to
>> > > restructure that and reduce the number of things in the core, such
>> that
>> > > build times and stability improve.
>> > >
>> > > To counter that we created Flink packages [1], which includes
>> everything
>> > > new that we deem to not be essential. I'd propose to incorporate a
>> Flink
>> > > datasketch package there. If it seems like it's becoming essential, we
>> > can
>> > > still move it to core at a later point.
>> > >
>> > > As I have seen on the page, there are already plenty of adoptions.
>> That
>> > > leaves a few questions to me.
>> > >
>> > >    1. I'm curious on how you would estimate the effort to port
>> > datasketches
>> > >    to Flink? It already has a Java API, but how difficult would it be
>> to
>> > >    subdivide the tasks into parallel chunks of work? Since it's
>> already
>> > > ported
>> > >    on Pig, I think we could use this port as a baseline.
>> > >    2. Do you have any idea who is usually driving the adoptions?
>> > >
>> > >
>> > > [1] https://flink-packages.org/
>> > >
>> > > On Sun, Apr 26, 2020 at 8:07 AM leerho <[email protected]> wrote:
>> > >
>> > > > Hello All,
>> > > >
>> > > > I am a committer on DataSketches.apache.org
>> > > > <http://datasketches.apache.org/> and just learning about Flink.
>> > Since
>> > > > Flink is designed for stateful stream processing I would think it
>> would
>> > > > make sense to have the DataSketches library integrated into its
>> core so
>> > > all
>> > > > users of Flink could take advantage of these advanced streaming
>> > > > algorithms.  If there is interest in the Flink community for this
>> > > > capability, please contact us at [email protected] or on
>> our
>> > > > datasketches-dev Slack channel.
>> > > > Cheers,
>> > > > Lee.
>> > > >
>> > >
>> > >
>> > > --
>> > >
>> > > Arvid Heise | Senior Java Developer
>> > >
>> > > <https://www.ververica.com/>
>> > >
>> > > Follow us @VervericaData
>> > >
>> > > --
>> > >
>> > > Join Flink Forward <https://flink-forward.org/> - The Apache Flink
>> > > Conference
>> > >
>> > > Stream Processing | Event Driven | Real Time
>> > >
>> > > --
>> > >
>> > > Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
>> > >
>> > > --
>> > > Ververica GmbH
>> > > Registered at Amtsgericht Charlottenburg: HRB 158244 B
>> > > Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason,
>> Ji
>> > > (Toni) Cheng
>> > >
>>
>
