I think we could start a vote on this SPIP. It has been discussed for a while, and the current doc is fairly complete for now. We have also seen a lot of demand in the community for building their own shuffle storage.
Thanks,
Saisai

Imran Rashid <iras...@apache.org> wrote on Tue, Jun 11, 2019 at 3:27 AM:

> I would be happy to shepherd this.
>
> On Wed, Jun 5, 2019 at 7:33 PM Matt Cheah <mch...@palantir.com> wrote:
>
>> Hi everyone,
>>
>> I wanted to pick this back up again. The discussion has quieted down, both on this thread and on the document.
>>
>> We made a few revisions to the document to hopefully make it easier to read and to clarify our criteria for success in the project. Some of the APIs have also been adjusted based on further discussion and things we've learned.
>>
>> I was hoping to discuss what our next steps could be here. Specifically:
>>
>> 1. Would any PMC member be willing to become the shepherd for this SPIP?
>> 2. Is there any more feedback regarding this proposal?
>> 3. What would we need to do to take this to a voting phase and to begin proposing our work against upstream Spark?
>>
>> Thanks,
>>
>> -Matt Cheah
>>
>> *From:* "Yifei Huang (PD)" <yif...@palantir.com>
>> *Date:* Monday, May 13, 2019 at 1:04 PM
>> *To:* Mridul Muralidharan <mri...@gmail.com>
>> *Cc:* Bo Yang <b...@uber.com>, Ilan Filonenko <i...@cornell.edu>, Imran Rashid <iras...@cloudera.com>, Justin Uang <ju...@palantir.com>, Liang Tang <lat...@linkedin.com>, Marcelo Vanzin <van...@cloudera.com>, Matei Zaharia <matei.zaha...@gmail.com>, Matt Cheah <mch...@palantir.com>, Min Shen <ms...@linkedin.com>, Reynold Xin <r...@databricks.com>, Ryan Blue <rb...@netflix.com>, Vinoo Ganesh <vgan...@palantir.com>, Will Manning <wmann...@palantir.com>, "b...@fb.com" <b...@fb.com>, "dev@spark.apache.org" <dev@spark.apache.org>, "fel...@uber.com" <fel...@uber.com>, "f...@linkedin.com" <f...@linkedin.com>, "tgraves...@gmail.com" <tgraves...@gmail.com>, "yez...@linkedin.com" <yez...@linkedin.com>, "yue...@memverge.com" <yue...@memverge.com>
>> *Subject:* Re: [DISCUSS][SPARK-25299] SPIP: Shuffle storage API
>>
>> Hi Mridul - thanks for
>> taking the time to give us feedback! Thoughts on the points that you mentioned:
>>
>> The API is meant to work with the existing SortShuffleManager algorithm. There aren't strict requirements on how other ShuffleManager implementations must behave, so it seems impractical to design an API that could also satisfy those unknown requirements. However, we do believe that the API is fairly generic: it uses OutputStreams for writes and InputStreams for reads, and indexes the data by a (shuffleId, mapId, reduceId) combination. So if other shuffle algorithms treat the data in the same chunks and want an interface for storage, they can also use this API from within their implementations.
>>
>> About speculative execution: we originally assumed that each shuffle task is deterministic, which meant that even if a later mapper overrode a previously committed mapper's value, the contents would be the same. Having searched some tickets and read https://github.com/apache/spark/pull/22112/files more carefully, I think there are problems with our original assumption if the writer writes all attempts of a task to the same location. For example, suppose the writer implementation writes each partition to the remote host in a sequence of chunks. A reducer might then read data half written by the original task and half written by the concurrently running speculative task, which would not be the correct contents if the mapper output is unordered. Therefore, writes by a single mapper might have to be transactional, which is not clear from the API and seems rather complex to reason about, so we shouldn't expect this from implementers.
>>
>> However, this doesn't affect the fundamentals of the API, since we only need to add an attemptId to the storage data index (which can be stored within the MapStatus) to solve the problem of concurrent writes.
>> This would also make it clearer that the writer should use the attempt ID as part of the index, to ensure that writes from speculative tasks don't interfere with one another (we can add that to the API docs as well).
>>
>> *From:* Mridul Muralidharan <mri...@gmail.com>
>> *Date:* Wednesday, May 8, 2019 at 8:18 PM
>> *To:* "Yifei Huang (PD)" <yif...@palantir.com>
>> *Cc:* (same recipient list as above)
>> *Subject:* Re: [DISCUSS][SPARK-25299] SPIP: Shuffle storage API
>>
>> Unfortunately I do not have the bandwidth to do a detailed review, but a few things come to mind after a quick read:
>>
>> - While it might be tactically beneficial to align with the existing implementation, a clean design which does not tie into the existing shuffle implementation would be preferable (if it can be done without over-engineering). The shuffle implementation can change, and there are custom implementations and experiments which differ quite a bit from what comes with Apache Spark.
>> - Please keep speculative execution in mind while designing the interfaces: in Spark, implicitly due to the task-scheduler logic, you won't have conflicts at an executor for a (shuffleId, mapId) or (shuffleId, mapId, reducerId) tuple.
>>
>> When you externalize it, there can be conflicts: passing a way to distinguish different task attempts for the same partition would be necessary for nontrivial implementations.
>>
>> This would be a welcome and much-needed enhancement to Spark - looking forward to its progress!
>>
>> Regards,
>> Mridul
>>
>> On Wed, May 8, 2019 at 11:24 AM Yifei Huang (PD) <yif...@palantir.com> wrote:
>>
>> Hi everyone,
>>
>> For the past several months, we have been working on an API for pluggable storage of shuffle data. In this SPIP, we describe the proposed API, its implications, and how it fits into other work being done in the Spark shuffle space. If you're interested in Spark shuffle, and especially if you have done some work in this area already, please take a look at the SPIP and give us your thoughts and feedback.
>>
>> Jira ticket: https://issues.apache.org/jira/browse/SPARK-25299
>> SPIP: https://docs.google.com/document/d/1d6egnL6WHOwWZe8MWv3m8n4PToNacdx7n_0iMSWwhCQ/edit
>>
>> Thank you!
>>
>> Yifei Huang and Matt Cheah
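To make the thread's description concrete, here is a minimal sketch of what a stream-based shuffle storage plugin could look like: writes go through OutputStreams, reads through InputStreams, and blocks are indexed by a (shuffleId, mapId, reduceId) tuple, as described in Yifei's reply. All names here (ShuffleStorage, InMemoryShuffleStorage, openForWrite, openForRead) are hypothetical illustrations, not the actual SPIP interfaces; the in-memory map stands in for a remote storage backend.

```java
import java.io.*;
import java.util.*;

// Hypothetical stream-based storage interface, indexed by (shuffleId, mapId, reduceId).
interface ShuffleStorage {
    OutputStream openForWrite(int shuffleId, long mapId, int reduceId) throws IOException;
    InputStream openForRead(int shuffleId, long mapId, int reduceId) throws IOException;
}

// A toy in-memory implementation, standing in for a remote/pluggable backend.
class InMemoryShuffleStorage implements ShuffleStorage {
    private final Map<String, byte[]> blocks = new HashMap<>();

    private static String key(int shuffleId, long mapId, int reduceId) {
        return shuffleId + "-" + mapId + "-" + reduceId;
    }

    @Override
    public OutputStream openForWrite(int shuffleId, long mapId, int reduceId) {
        return new ByteArrayOutputStream() {
            @Override
            public void close() throws IOException {
                super.close();
                // "Commit" the block only on close, so readers never observe partial data.
                blocks.put(key(shuffleId, mapId, reduceId), toByteArray());
            }
        };
    }

    @Override
    public InputStream openForRead(int shuffleId, long mapId, int reduceId) {
        byte[] data = blocks.get(key(shuffleId, mapId, reduceId));
        if (data == null) throw new NoSuchElementException("no such shuffle block");
        return new ByteArrayInputStream(data);
    }
}

public class ShuffleStorageDemo {
    // Write one map-output partition, then read it back through the same index.
    public static String roundTrip() throws IOException {
        ShuffleStorage storage = new InMemoryShuffleStorage();
        try (OutputStream out = storage.openForWrite(0, 1L, 2)) {
            out.write("partition-data".getBytes("UTF-8"));
        }
        try (InputStream in = storage.openForRead(0, 1L, 2)) {
            return new String(in.readAllBytes(), "UTF-8");
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip());
    }
}
```

The commit-on-close convention is one way a backend could avoid exposing half-written blocks; as the thread notes, it is not sufficient on its own once speculative attempts share the same index.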
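The speculative-execution fix discussed in the thread (qualifying the index with an attemptId and recording the committed attempt, e.g. in the MapStatus) can also be sketched. This is an illustrative toy, not Spark internals: attempts write to attempt-qualified keys so they never collide, and the reducer reads only whichever attempt committed first.

```java
import java.util.*;
import java.util.concurrent.*;

// Toy model of attempt-indexed shuffle storage. Two attempts of the same map
// task write to distinct (shuffleId, mapId, reduceId, attemptId) keys, so a
// reducer can never see an interleaving of their outputs. A stand-in for the
// MapStatus records which attempt was committed. All names are hypothetical.
public class AttemptIndexedShuffle {
    private final Map<String, byte[]> blocks = new ConcurrentHashMap<>();
    // Stand-in for MapStatus: committed attempt per (shuffleId, mapId).
    private final Map<String, Integer> committedAttempt = new ConcurrentHashMap<>();

    private static String blockKey(int shuffleId, long mapId, int reduceId, int attemptId) {
        return shuffleId + "-" + mapId + "-" + reduceId + "-" + attemptId;
    }

    public void write(int shuffleId, long mapId, int reduceId, int attemptId, byte[] data) {
        blocks.put(blockKey(shuffleId, mapId, reduceId, attemptId), data);
    }

    // First attempt to commit wins; the late speculative duplicate is a no-op.
    public void commit(int shuffleId, long mapId, int attemptId) {
        committedAttempt.putIfAbsent(shuffleId + "-" + mapId, attemptId);
    }

    // The reducer consults the committed attempt, then fetches exactly that output.
    public byte[] read(int shuffleId, long mapId, int reduceId) {
        int attemptId = committedAttempt.get(shuffleId + "-" + mapId);
        return blocks.get(blockKey(shuffleId, mapId, reduceId, attemptId));
    }

    public static void main(String[] args) {
        AttemptIndexedShuffle shuffle = new AttemptIndexedShuffle();
        // Original attempt (0) and a speculative attempt (1) both write reduceId 0.
        shuffle.write(0, 7L, 0, 0, "attempt-0-output".getBytes());
        shuffle.write(0, 7L, 0, 1, "attempt-1-output".getBytes());
        shuffle.commit(0, 7L, 0); // attempt 0 finishes first and commits
        shuffle.commit(0, 7L, 1); // the speculative commit is ignored
        System.out.println(new String(shuffle.read(0, 7L, 0)));
        // prints "attempt-0-output"
    }
}
```

This mirrors Mridul's point: once storage is externalized, the task scheduler no longer guarantees that (shuffleId, mapId, reduceId) is conflict-free, so the attempt identifier has to become part of the index.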