Hi All,

I would like to propose Datark [1] as a new Apache Incubator project. The
full proposal [2] has more details.

Datark is an intermediate (shuffle and spilled) data service for big data
compute engines (Apache Spark, Apache Flink, Apache Hive, etc.) to boost
performance, stability, and flexibility. It aims at enabling computing
engines to fully embrace the disaggregated architecture. In many cases,
intermediate data depends on large local disks and is often a major cause
of inefficiency, instability, and inflexibility in the lifecycle of a
distributed job. Datark addresses these problems through the following core
designs:

1. Push-based shuffle plus partition data aggregation to turn random IO
access into sequential access.
2. FileSystem-like API to support writing spilled data.
3. Hierarchical storage from memory to DFS/object store to enable fast
cache and massive storage space.
4. Engine-agnostic APIs for easy integration with various engines.
5. Extended fault tolerance and data replication to increase reliability.
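
To make design 1 a bit more concrete, here is a toy sketch (not Datark's
actual code or API) of how pushing and aggregating partition data changes
the read pattern. ShuffleWorker, run_push_shuffle, and pull_based_reads are
hypothetical names for illustration only, and an in-memory list stands in
for a per-partition file on a worker.

```python
from collections import defaultdict


def pull_based_reads(num_mappers, num_partitions):
    # Pull-based shuffle: every reducer fetches one small block from every
    # mapper's local output, so the block reads scale as M * R.
    return num_mappers * num_partitions


class ShuffleWorker:
    """Hypothetical stand-in for a remote shuffle worker."""

    def __init__(self):
        # One append-only buffer per partition (models one file per partition).
        self.partition_files = defaultdict(list)

    def push(self, partition_id, records):
        # Mappers push data; the worker appends it to the partition's buffer.
        self.partition_files[partition_id].extend(records)

    def read_partition(self, partition_id):
        # A reducer reads its whole partition as one sequential scan.
        return list(self.partition_files[partition_id])


def run_push_shuffle(mapper_outputs, num_partitions):
    worker = ShuffleWorker()
    for records in mapper_outputs:
        # Each mapper groups its (int key, value) records by target partition.
        buckets = defaultdict(list)
        for key, value in records:
            buckets[key % num_partitions].append((key, value))
        # Then it pushes each group to the worker instead of writing locally.
        for partition_id, bucket in buckets.items():
            worker.push(partition_id, bucket)
    return worker
```

With M mappers and R partitions, the pull-based model needs M * R small
block reads, while in this sketch each of the R reducers performs a single
sequential read of its aggregated partition.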

Datark is currently running in production at Alibaba and many other
companies, serving petabytes of data per day. Its open source users also
include Shopee, NetEase, Bilibili, BOSS, and Synnex. Most of these users
have made contributions to the project, forming an active community with
dozens of developers.

The proposed initial committers are interested in joining the ASF to
strengthen collaboration and build a more vibrant community. We believe the
Datark project will provide tremendous value for the community if it is
accepted into the Apache Incubator.

I will serve as the champion of this project. Many thanks to our four other
mentors:

* Becket Qin (j...@apache.org)
* Duo Zhang (zhang...@apache.org)
* Lidong Dai (lidong...@apache.org)
* Willem Jiang (ningji...@apache.org)

FWIW, although the solutions differ, the issues Datark aims to resolve
have some overlap with Apache Uniffle (incubating) [3]. We noticed this
during the discussion phase of the Uniffle incubation (when we were also
preparing for incubation) and had an open and friendly discussion about
whether we could join forces [4]. In the end, we decided to develop
independently for the time being [5].

Looking forward to your feedback. Thanks.

Best Regards,
Yu

[1] https://github.com/alibaba/RemoteShuffleService
[2] https://cwiki.apache.org/confluence/display/INCUBATOR/DatarkProposal
[3] https://uniffle.apache.org/
[4] https://lists.apache.org/thread/1w74z5f0pb7bhslhzcl5x7rdj9s9objz
[5] https://lists.apache.org/thread/pg8lzhzc1794x3yloqp169j0mdzqs3yw
