I think we could start a vote on this SPIP. It has been discussed for a while, and the current doc is fairly complete for now. We have also seen a lot of demand in the community for building their own shuffle storage.
Thanks,
Saisai

Imran Rashid <iras...@apache.org> wrote on Tue, Jun 11, 2019 at 3:27 AM:

> I would be happy to shepherd this.
>
> On Wed, Jun 5, 2019 at 7:33 PM Matt Cheah <mch...@palantir.com> wrote:
>
>> Hi everyone,
>>
>> I wanted to pick this back up again. The discussion has quieted down, both on this thread and on the document.
>>
>> We made a few revisions to the document to hopefully make it easier to read and to clarify our criteria for success in the project. Some of the APIs have also been adjusted based on further discussion and things we've learned.
>>
>> I was hoping to discuss what our next steps could be here. Specifically:
>>
>> 1. Would any PMC member be willing to become the shepherd for this SPIP?
>> 2. Is there any more feedback regarding this proposal?
>> 3. What would we need to do to take this to a voting phase and to begin proposing our work against upstream Spark?
>>
>> Thanks,
>>
>> -Matt Cheah
>>
>> *From:* "Yifei Huang (PD)" <yif...@palantir.com>
>> *Date:* Monday, May 13, 2019 at 1:04 PM
>> *To:* Mridul Muralidharan <mri...@gmail.com>
>> *Cc:* Bo Yang <b...@uber.com>, Ilan Filonenko <i...@cornell.edu>, Imran Rashid <iras...@cloudera.com>, Justin Uang <ju...@palantir.com>, Liang Tang <lat...@linkedin.com>, Marcelo Vanzin <van...@cloudera.com>, Matei Zaharia <matei.zaha...@gmail.com>, Matt Cheah <mch...@palantir.com>, Min Shen <ms...@linkedin.com>, Reynold Xin <r...@databricks.com>, Ryan Blue <rb...@netflix.com>, Vinoo Ganesh <vgan...@palantir.com>, Will Manning <wmann...@palantir.com>, "b...@fb.com" <b...@fb.com>, "dev@spark.apache.org" <dev@spark.apache.org>, "fel...@uber.com" <fel...@uber.com>, "f...@linkedin.com" <f...@linkedin.com>, "tgraves...@gmail.com" <tgraves...@gmail.com>, "yez...@linkedin.com" <yez...@linkedin.com>, "yue...@memverge.com" <yue...@memverge.com>
>> *Subject:* Re: [DISCUSS][SPARK-25299] SPIP: Shuffle storage API
>>
>> Hi Mridul - thanks for
>> taking the time to give us feedback! Thoughts on the points that you mentioned:
>>
>> The API is meant to work with the existing SortShuffleManager algorithm. There aren't strict requirements on how other ShuffleManager implementations must behave, so it seems impractical to design an API that could also satisfy those unknown requirements. However, we do believe that the API is fairly generic: it uses OutputStreams for writes and InputStreams for reads, and indexes the data by a (shuffleId, mapId, reduceId) combination. So if other shuffle algorithms treat the data in the same chunks and want an interface for storage, they can also use this API from within their implementations.
>>
>> About speculative execution: we originally assumed that each shuffle task is deterministic, which meant that even if a later mapper overrode a previously committed mapper's value, the contents would be the same. Having searched some tickets and read https://github.com/apache/spark/pull/22112/files more carefully, I think there are problems with our original assumption if the writer writes all attempts of a task to the same location. For example, suppose the writer implementation writes each partition to the remote host in a sequence of chunks. A reducer might then read data half written by the original task and half written by the concurrently running speculative task, which would not be the correct contents if the mapper output is unordered. Therefore, writes by a single mapper might have to be transactional, which is not clear from the API and seems rather complex to reason about, so we shouldn't expect this from implementers.
>>
>> However, this doesn't affect the fundamentals of the API, since we only need to add an attemptId to the storage data index (which can be stored within the MapStatus) to solve the problem of concurrent writes.
>> This would also make it clearer that the writer should use the attempt ID as part of the index, to ensure that writes from speculative tasks don't interfere with one another (we can add that to the API docs as well).
>>
>> *From:* Mridul Muralidharan <mri...@gmail.com>
>> *Date:* Wednesday, May 8, 2019 at 8:18 PM
>> *To:* "Yifei Huang (PD)" <yif...@palantir.com>
>> *Cc:* (same recipient list as above)
>> *Subject:* Re: [DISCUSS][SPARK-25299] SPIP: Shuffle storage API
>>
>> Unfortunately I do not have the bandwidth to do a detailed review, but a few things come to mind after a quick read:
>>
>> - While it might be tactically beneficial to align with the existing implementation, a clean design which does not tie into the existing shuffle implementation would be preferable (if it can be done without over-engineering). The shuffle implementation can change, and there are custom implementations and experiments which differ quite a bit from what comes with Apache Spark.
>> - Please keep speculative execution in mind while designing the interfaces: in Spark, implicitly due to the task-scheduler logic, you won't have conflicts at an executor for a (shuffleId, mapId) or (shuffleId, mapId, reducerId) tuple.
>>
>> When you externalize it, there can be conflicts: passing a way to distinguish different task attempts for the same partition would be necessary for nontrivial implementations.
>>
>> This would be a welcome and much-needed enhancement to Spark - looking forward to its progress!
>>
>> Regards,
>> Mridul
>>
>> On Wed, May 8, 2019 at 11:24 AM Yifei Huang (PD) <yif...@palantir.com> wrote:
>>
>> Hi everyone,
>>
>> For the past several months, we have been working on an API for pluggable storage of shuffle data. In this SPIP, we describe the proposed API, its implications, and how it fits into other work being done in the Spark shuffle space. If you're interested in Spark shuffle, and especially if you have done some work in this area already, please take a look at the SPIP and give us your thoughts and feedback.
>>
>> Jira ticket: https://issues.apache.org/jira/browse/SPARK-25299
>> SPIP: https://docs.google.com/document/d/1d6egnL6WHOwWZe8MWv3m8n4PToNacdx7n_0iMSWwhCQ/edit
>>
>> Thank you!
>>
>> Yifei Huang and Matt Cheah
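To make the thread's description concrete, here is a minimal sketch of what a stream-based shuffle storage plugin could look like: writes go through OutputStreams, reads through InputStreams, and blocks are indexed by a (shuffleId, mapId, reduceId) tuple, as described in Yifei's reply. All names here (ShuffleStorage, InMemoryShuffleStorage, openForWrite, openForRead) are hypothetical illustrations, not the actual SPIP interfaces; the in-memory map stands in for a remote storage backend.

```java
import java.io.*;
import java.util.*;

// Hypothetical stream-based storage interface, indexed by (shuffleId, mapId, reduceId).
interface ShuffleStorage {
    OutputStream openForWrite(int shuffleId, long mapId, int reduceId) throws IOException;
    InputStream openForRead(int shuffleId, long mapId, int reduceId) throws IOException;
}

// A toy in-memory implementation, standing in for a remote/pluggable backend.
class InMemoryShuffleStorage implements ShuffleStorage {
    private final Map<String, byte[]> blocks = new HashMap<>();

    private static String key(int shuffleId, long mapId, int reduceId) {
        return shuffleId + "-" + mapId + "-" + reduceId;
    }

    @Override
    public OutputStream openForWrite(int shuffleId, long mapId, int reduceId) {
        return new ByteArrayOutputStream() {
            @Override
            public void close() throws IOException {
                super.close();
                // "Commit" the block only on close, so readers never observe partial data.
                blocks.put(key(shuffleId, mapId, reduceId), toByteArray());
            }
        };
    }

    @Override
    public InputStream openForRead(int shuffleId, long mapId, int reduceId) {
        byte[] data = blocks.get(key(shuffleId, mapId, reduceId));
        if (data == null) throw new NoSuchElementException("no such shuffle block");
        return new ByteArrayInputStream(data);
    }
}

public class ShuffleStorageDemo {
    // Write one map-output partition, then read it back through the same index.
    public static String roundTrip() throws IOException {
        ShuffleStorage storage = new InMemoryShuffleStorage();
        try (OutputStream out = storage.openForWrite(0, 1L, 2)) {
            out.write("partition-data".getBytes("UTF-8"));
        }
        try (InputStream in = storage.openForRead(0, 1L, 2)) {
            return new String(in.readAllBytes(), "UTF-8");
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip());
    }
}
```

The commit-on-close convention is one way a backend could avoid exposing half-written blocks; as the thread notes, it is not sufficient on its own once speculative attempts share the same index.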
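The speculative-execution fix discussed in the thread (qualifying the index with an attemptId and recording the committed attempt, e.g. in the MapStatus) can also be sketched. This is an illustrative toy, not Spark internals: attempts write to attempt-qualified keys so they never collide, and the reducer reads only whichever attempt committed first.

```java
import java.util.*;
import java.util.concurrent.*;

// Toy model of attempt-indexed shuffle storage. Two attempts of the same map
// task write to distinct (shuffleId, mapId, reduceId, attemptId) keys, so a
// reducer can never see an interleaving of their outputs. A stand-in for the
// MapStatus records which attempt was committed. All names are hypothetical.
public class AttemptIndexedShuffle {
    private final Map<String, byte[]> blocks = new ConcurrentHashMap<>();
    // Stand-in for MapStatus: committed attempt per (shuffleId, mapId).
    private final Map<String, Integer> committedAttempt = new ConcurrentHashMap<>();

    private static String blockKey(int shuffleId, long mapId, int reduceId, int attemptId) {
        return shuffleId + "-" + mapId + "-" + reduceId + "-" + attemptId;
    }

    public void write(int shuffleId, long mapId, int reduceId, int attemptId, byte[] data) {
        blocks.put(blockKey(shuffleId, mapId, reduceId, attemptId), data);
    }

    // First attempt to commit wins; the late speculative duplicate is a no-op.
    public void commit(int shuffleId, long mapId, int attemptId) {
        committedAttempt.putIfAbsent(shuffleId + "-" + mapId, attemptId);
    }

    // The reducer consults the committed attempt, then fetches exactly that output.
    public byte[] read(int shuffleId, long mapId, int reduceId) {
        int attemptId = committedAttempt.get(shuffleId + "-" + mapId);
        return blocks.get(blockKey(shuffleId, mapId, reduceId, attemptId));
    }

    public static void main(String[] args) {
        AttemptIndexedShuffle shuffle = new AttemptIndexedShuffle();
        // Original attempt (0) and a speculative attempt (1) both write reduceId 0.
        shuffle.write(0, 7L, 0, 0, "attempt-0-output".getBytes());
        shuffle.write(0, 7L, 0, 1, "attempt-1-output".getBytes());
        shuffle.commit(0, 7L, 0); // attempt 0 finishes first and commits
        shuffle.commit(0, 7L, 1); // the speculative commit is ignored
        System.out.println(new String(shuffle.read(0, 7L, 0)));
        // prints "attempt-0-output"
    }
}
```

This mirrors Mridul's point: once storage is externalized, the task scheduler no longer guarantees that (shuffleId, mapId, reduceId) is conflict-free, so the attempt identifier has to become part of the index.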