It would allow columnar processing to be extended through the shuffle. For example, if I were writing an FPGA-accelerated extension, it could replace ShuffleExchangeExec with a version that takes a ColumnarBatch as input instead of a Row. That extended ShuffleExchangeExec could partition the incoming batch and, instead of producing a ShuffledRowRDD for the exchange, produce something like a ShuffleBatchRDD, so that serialization and deserialization happen in a column-based format for a faster exchange, assuming columnar processing also continues after the exchange. This is just like providing a columnar version of any other Catalyst operator, except that this operator is a bit more complex.
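To make the idea concrete, here is a rough, Spark-free sketch of partitioning and exchanging a batch column by column. Everything in it is an assumption for illustration: a dict of lists stands in for a ColumnarBatch, a modulo on an integer key stands in for the partitioner, JSON stands in for a real columnar wire format, and the function names are hypothetical rather than any Spark API.

```python
import json

def partition_batch(batch, key, num_partitions):
    """Split one columnar batch (dict of column name -> list of values)
    into num_partitions columnar batches, keyed on an integer column."""
    # One pass over the key column decides the target partition of every row.
    targets = [k % num_partitions for k in batch[key]]
    parts = [{name: [] for name in batch} for _ in range(num_partitions)]
    # Scatter each column independently; no row objects are ever built.
    for name, values in batch.items():
        for target, value in zip(targets, values):
            parts[target][name].append(value)
    return parts

def serialize_batch(batch):
    # Hypothetical wire format: each column is written out contiguously.
    return json.dumps(batch).encode("utf-8")

def deserialize_batch(payload):
    # The receiving side gets columns back directly, ready for columnar operators.
    return json.loads(payload.decode("utf-8"))

batch = {"id": [1, 2, 3, 4], "name": ["a", "b", "c", "d"]}
parts = partition_batch(batch, "id", 2)
received = [deserialize_batch(serialize_batch(p)) for p in parts]
```

The point of the sketch is only that the data stays in column form on both sides of the exchange, which is what a batch-based RDD would buy over a row-based one.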
On Wed, May 15, 2019 at 12:15 PM Imran Rashid <iras...@cloudera.com.invalid> wrote:

> sorry I am late to the discussion here -- the jira mentions using this
> extensions for dealing with shuffles, can you explain that part? I don't
> see how you would use this to change shuffle behavior at all.
>
> On Tue, May 14, 2019 at 10:59 AM Thomas graves <tgra...@apache.org> wrote:
>
>> Thanks for replying, I'll extend the vote til May 26th to allow your
>> and other people feedback who haven't had time to look at it.
>>
>> Tom
>>
>> On Mon, May 13, 2019 at 4:43 PM Holden Karau <hol...@pigscanfly.ca> wrote:
>> >
>> > I’d like to ask this vote period to be extended, I’m interested but I
>> > don’t have the cycles to review it in detail and make an informed vote
>> > until the 25th.
>> >
>> > On Tue, May 14, 2019 at 1:49 AM Xiangrui Meng <m...@databricks.com> wrote:
>> >>
>> >> My vote is 0. Since the updated SPIP focuses on ETL use cases, I don't
>> >> feel strongly about it. I would still suggest doing the following:
>> >>
>> >> 1. Link the POC mentioned in Q4. So people can verify the POC result.
>> >> 2. List public APIs we plan to expose in Appendix A. I did a quick
>> >> check. Beside ColumnarBatch and ColumnarVector, we also need to make the
>> >> following public. People who are familiar with SQL internals should help
>> >> assess the risk.
>> >> * ColumnarArray
>> >> * ColumnarMap
>> >> * unsafe.types.CalendarInterval
>> >> * ColumnarRow
>> >> * UTF8String
>> >> * ArrayData
>> >> * ...
>> >> 3. I still feel using Pandas UDF as the mid-term success doesn't match
>> >> the purpose of this SPIP. It does make some code cleaner. But I guess for
>> >> ETL use cases, it won't bring much value.