Re: [DISCUSS] Incubating Proposal for Datark

2022-09-24 Thread david zollo
Hi guys,
A remote shuffle service plays a very important role in the modern big data
stack. As a mentor of this project, I'm very glad to see that it can
join the Apache Incubator.


Best Regards

---
Apache DolphinScheduler PMC Chair & Apache SeaTunnel PPMC
David
Linkedin: https://www.linkedin.com/in/davidzollo
Twitter: @WorkflowEasy 
---


On Sat, Sep 24, 2022 at 1:10 PM Benedict Jin  wrote:

> Hi,
>
> +1, it's a wonderful project, best of luck!
>
> Best Regards,
> Benedict Jin
>
> On 2022/09/23 15:11:51 Kelu Tao wrote:
> > Cool. Good Luck ~
> >
> > On 2022/09/22 03:45:10 Yu Li wrote:
> > > Hi All,
> > >
> > > I would like to propose Datark [1] as a new Apache Incubator project;
> > > please see the proposal [2] for more details.
> > >
> > > Datark is an intermediate (shuffle and spilled) data service for big
> > > data compute engines (Apache Spark, Apache Flink, Apache Hive, etc.)
> > > that boosts performance, stability, and flexibility. It aims to enable
> > > compute engines to fully embrace the disaggregated architecture. In
> > > many cases, intermediate data depends on large local disks and is
> > > often a major cause of inefficiency, instability, and inflexibility in
> > > the lifecycle of a distributed job. Datark addresses these problems
> > > through the following core designs:
> > >
> > > 1. Push-based shuffle plus partition data aggregation to turn random IO
> > > access into sequential access.
> > > 2. FileSystem-like API to support writing spilled data.
> > > 3. Hierarchical storage from memory to DFS/object store to enable fast
> > > cache and massive storage space.
> > > 4. Engine-agnostic APIs for easy integration with various engines.
> > > 5. Extended fault tolerance and data replication to increase
> > > reliability.
> > >
> > > Datark is currently used in production at Alibaba and many other
> > > companies, serving petabytes of data per day. Beyond that, it has
> > > more open source users, including Shopee, NetEase, Bilibili, BOSS,
> > > and Synnex. Most of these users have made contributions to the
> > > project, forming an active community with dozens of developers.
> > >
> > > The proposed initial committers are interested in joining the ASF to
> > > strengthen collaboration and build a more vibrant community. We
> > > believe the Datark project will provide tremendous value for the
> > > community if it is accepted into the Apache Incubator.
> > >
> > > I will help this project as the champion. Many thanks to our four
> > > other mentors:
> > >
> > > * Becket Qin (j...@apache.org)
> > > * Duo Zhang (zhang...@apache.org)
> > > * Lidong Dai (lidong...@apache.org)
> > > * Willem Jiang (ningji...@apache.org)
> > >
> > > FWIW, although the solutions differ, the issues Datark aims to
> > > resolve overlap somewhat with those of Apache Uniffle (incubating)
> > > [3]. We noticed this during the discussion phase of Uniffle's
> > > incubation (when we were also preparing for incubation) and had an
> > > open and friendly discussion about whether we could join forces [4],
> > > and finally decided to develop independently for the time being [5].
> > >
> > > Looking forward to your feedback. Thanks.
> > >
> > > Best Regards,
> > > Yu
> > >
> > > [1] https://github.com/alibaba/RemoteShuffleService
> > > [2] https://cwiki.apache.org/confluence/display/INCUBATOR/DatarkProposal
> > > [3] https://uniffle.apache.org/
> > > [4] https://lists.apache.org/thread/1w74z5f0pb7bhslhzcl5x7rdj9s9objz
> > > [5] https://lists.apache.org/thread/pg8lzhzc1794x3yloqp169j0mdzqs3yw
> > >
> >
> > -
> > To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> > For additional commands, e-mail: general-h...@incubator.apache.org


Re: [DISCUSS] Incubating Proposal for Datark

2022-09-24 Thread MINX Feng
It is an interesting project. Good luck to Datark, may this project live long
and prosper.

Best wishes!
Ethan




Re: [DISCUSS] Incubating Proposal for Datark

2022-09-24 Thread MINX Feng
+1

> On Sep 25, 2022, at 08:44, MINX Feng wrote:
> 
> It is an interesting project. Good luck to Datark, may this project live
> long and prosper.
> 
> Best wishes!
> Ethan



Recall: [DISCUSS] Incubating Proposal for Datark

2022-09-24 Thread Feng Ethan
Feng Ethan would like to recall the message "[DISCUSS] Incubating Proposal for Datark".

Re: [DISCUSS] Incubating Proposal for Datark

2022-09-24 Thread Gabriel Lee
This is my +1.
Datark has become a popular remote shuffle service over the past few years,
and I'm very glad to see it will be one of us soon.

Have fun and enjoy it!

Best,
Gabriel

On Sun, 25 Sept 2022 at 08:45, MINX Feng  wrote:

> It is an interesting project. Good luck to Datark, may this project live
> long and prosper.
>
> Best wishes!
> Ethan