"if others think it would be helpful, we can cancel this vote, update the SPIP to clarify exactly what I am proposing, and then restart the vote after we have gotten more agreement on what APIs should be exposed"
That'd be very useful. At least I was confused by what the SPIP was about. No point voting on something when there is still a lot of confusion about what it is.

On Mon, Apr 22, 2019 at 10:58 AM, Bobby Evans <revans2@gmail.com> wrote:

Xiangrui Meng,

I provided some examples in the original discussion thread:
https://lists.apache.org/thread.html/f7cdc2cbfb1dafa001422031ff6a3a6dc7b51efc175327b0bbfe620e@%3Cdev.spark.apache.org%3E

But the concrete use case that we have is GPU-accelerated ETL on Spark, primarily as data preparation and feature engineering for ML tools like XGBoost, which, by the way, exposes a Spark-specific Scala API, not just a Python one. We built a proof of concept and saw decent performance gains, enough to more than pay for the added cost of a GPU, with the potential for even better performance in the future. With that proof of concept, we were able to make all of the processing columnar end-to-end for many queries, so there really weren't any data conversion costs to overcome, but we did want the design flexible enough to include a cost-based optimizer.

It looks like there is some confusion around this SPIP, especially in how it relates to features in other SPIPs around data exchange between different systems. I didn't want to update the text of this SPIP while it was under an active vote, but if others think it would be helpful, we can cancel this vote, update the SPIP to clarify exactly what I am proposing, and then restart the vote after we have gotten more agreement on what APIs should be exposed.

Thanks,

Bobby

On Mon, Apr 22, 2019 at 10:49 AM Xiangrui Meng <mengxr@gmail.com> wrote:

Per Robert's comment on the JIRA, ETL is the main use case for the SPIP. I think the SPIP should list a concrete ETL use case (from the POC?) that can benefit from this public Java/Scala API, does vectorization, and significantly boosts the performance even with the data conversion overhead.

The current mid-term success (Pandas UDF) doesn't match the purpose of the SPIP, and it can be done without exposing any public APIs.

Depending on how much benefit it brings, we might agree that a public Java/Scala API is needed. Then we might want to step slightly into how. I saw four options mentioned in the JIRA and discussion threads:

1. Expose `Array[Byte]` in Arrow format. Let the user decode it using an Arrow library.
2. Expose `ArrowRecordBatch`. It makes Spark expose third-party APIs.
3. Expose `ColumnarBatch`, which is also used by Spark internals, and make it Arrow-compatible. It makes it hard for us to change Spark internals in the future.
4. Expose something like `SparkRecordBatch` that is Arrow-compatible and maintain conversion between the internal `ColumnarBatch` and `SparkRecordBatch`. It might cause conversion overhead in the future if our internal format diverges from Arrow.

Note that both 3 and 4 will make many APIs public in order to be Arrow-compatible. So we should really give concrete ETL cases to prove that it is important for us to do so.
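For concreteness, option 1 on the consumer side would look roughly like the sketch below, assuming the exposed bytes carry an Arrow IPC stream. The class and method names come from the Arrow Java library, not from anything this SPIP itself defines, and `decodeArrowBytes` is just a hypothetical user-side helper.

    import java.io.ByteArrayInputStream

    import org.apache.arrow.memory.RootAllocator
    import org.apache.arrow.vector.ipc.ArrowStreamReader

    // Rough sketch of option 1: Spark hands back Array[Byte] and the user
    // decodes it with the Arrow Java library (bytes assumed to be an IPC stream).
    def decodeArrowBytes(bytes: Array[Byte]): Unit = {
      val allocator = new RootAllocator(Long.MaxValue)
      val reader = new ArrowStreamReader(new ByteArrayInputStream(bytes), allocator)
      try {
        val root = reader.getVectorSchemaRoot
        while (reader.loadNextBatch()) {
          // Each loaded batch is visible through the schema root's vectors.
          println(s"rows=${root.getRowCount}, cols=${root.getFieldVectors.size}")
        }
      } finally {
        reader.close()
        allocator.close()
      }
    }

The flip side is that user code then links directly against a specific Arrow version, which is exactly the stability concern raised later in this thread.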
On Mon, Apr 22, 2019 at 8:27 AM Tom Graves <tgraves_cs@yahoo.com> wrote:

Since there is still discussion and Spark Summit is this week, I'm going to extend the vote until Friday the 26th.

Tom

On Monday, April 22, 2019, 8:44:00 AM CDT, Bobby Evans <revans2@gmail.com> wrote:

Yes, it is technically possible for the layout to change. No, it is not going to happen. It is already baked into several different official libraries which are widely used, not just for holding and processing the data, but also for transfer of the data between the various implementations. There would have to be a really serious reason to force an incompatible change at this point. So in the worst case, we can version the layout and bake that into the API that exposes the internal layout of the data. That way, code that wants to program against a Java API can do so using the API that Spark provides. Those who want to interface with something that expects the data in Arrow format will already have to know which version of the format it was programmed against, and in the worst case, if the layout does change, we can support the new layout if needed.

On Sun, Apr 21, 2019 at 12:45 AM Bryan Cutler <cutlerb@gmail.com> wrote:

The Arrow data format is not yet stable, meaning there are no guarantees on backwards/forwards compatibility. Once version 1.0 is released, it will have those guarantees, but it's hard to say when that will be. The remaining work to get there can be seen at https://cwiki.apache.org/confluence/display/ARROW/Columnar+Format+1.0+Milestone. So yes, it is a risk that exposing Spark data as Arrow could cause an issue if handled by a different version that is not compatible. That being said, changes to the format are not taken lightly and are backwards compatible when possible. I think it would be fair to mark the APIs exposing Arrow data as experimental for the time being and to state clearly in the docs which version must be used to be compatible. Also, adding features like this and SPARK-24579 will probably help adoption of Arrow and accelerate a 1.0 release. Adding the Arrow dev list to CC.

Bryan

On Sat, Apr 20, 2019 at 5:25 PM Matei Zaharia <matei.zaharia@gmail.com> wrote:

Okay, that makes sense, but is the Arrow data format stable? If not, we risk breakage when Arrow changes in the future and some libraries using this feature begin to use the new Arrow code.

Matei

On Apr 20, 2019, at 1:39 PM, Bobby Evans <revans2@gmail.com> wrote:

I want to be clear that this SPIP is not proposing exposing Arrow APIs/classes through any Spark APIs. SPARK-24579 is doing that, and because of the overlap between the two SPIPs I scaled this one back to concentrate just on the columnar processing aspects. Sorry for the confusion; I didn't update the JIRA description clearly enough when we adjusted it during the discussion on the JIRA. As part of the columnar processing, we plan on providing Arrow-formatted data, but that will be exposed through a Spark-owned API.
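As a rough illustration of what "exposed through a Spark-owned API" could mean for user code, here is a minimal sketch against the existing org.apache.spark.sql.vectorized classes; whether the final API looks like this, or like a new `SparkRecordBatch` wrapper, is exactly what the SPIP still has to pin down, and `sumFirstIntColumn` is only a hypothetical example.

    import org.apache.spark.sql.vectorized.{ColumnarBatch, ColumnVector}

    // Minimal sketch: process one columnar batch through Spark-owned classes,
    // without ever touching Arrow types directly.
    def sumFirstIntColumn(batch: ColumnarBatch): Long = {
      val col: ColumnVector = batch.column(0) // assumes column 0 holds 32-bit integers
      var sum = 0L
      var row = 0
      while (row < batch.numRows()) {
        if (!col.isNullAt(row)) sum += col.getInt(row)
        row += 1
      }
      sum
    }

The point of routing everything through Spark-owned classes is that Arrow can stay an implementation detail of the data layout, rather than part of Spark's public binary surface.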
On Sat, Apr 20, 2019 at 1:03 PM Matei Zaharia <matei.zaharia@gmail.com> wrote:

FYI, I'd also be concerned about exposing the Arrow API or format as a public API if it's not yet stable. Is stabilization of the API and format coming soon on the roadmap there? Maybe someone can work with the Arrow community to make that happen.

We've been bitten lots of times by API changes forced by external libraries, even when those were widely popular. For example, we used Guava's Optional for a while, which changed at some point, and we also had issues with Protobuf and Scala itself (especially how Scala's APIs appear in Java). API breakage might not be as serious in dynamic languages like Python, where you can often keep compatibility with old behaviors, but it really hurts in Java and Scala.

The problem is especially bad for us because of two aspects of how Spark is used:

1) Spark is used for production data transformation jobs that people need to keep running for a long time. Nobody wants to make changes to a job that's been working fine and computing something correctly for years just to get a bug fix from the latest Spark release or whatever. It's much better if they can upgrade Spark without editing every job.

2) Spark is often used as "glue" to combine data processing code in other libraries, and these might start to require different versions of our dependencies. For example, the Guava class exposed in Spark became a problem when third-party libraries started requiring a new version of Guava: those new libraries just couldn't work with Spark. Protobuf was especially bad because some users wanted to read data stored as Protobufs (or in a format that uses Protobuf inside), so they needed a different version of the library in their main data processing code.

If there was some guarantee that this stuff would remain backward-compatible, we'd be in a much better place. It's not that hard to keep a storage format backward-compatible: just document the format and extend it only in ways that don't break the meaning of old data (for example, add new version numbers or field types that are read in a different way). It's a bit harder for a Java API, but maybe Spark could just expose byte arrays directly and work on those if the API is not guaranteed to stay stable (that is, we'd still use our own classes to manipulate the data internally, and end users could use the Arrow library if they want it).

Matei

On Apr 20, 2019, at 8:38 AM, Bobby Evans <revans2@gmail.com> wrote:

I think you misunderstood the point of this SPIP. I responded to your comments in the SPIP JIRA.

On Sat, Apr 20, 2019 at 12:52 AM Xiangrui Meng <mengxr@gmail.com> wrote:

I posted my comment in the JIRA. Main concerns here:

1. Exposing third-party Java APIs in Spark is risky. Arrow might have a 1.0 release someday.
2. ML/DL systems that can benefit from columnar format are mostly in Python.

3. Simple operations, though they benefit from vectorization, might not be worth the data exchange overhead.

So would an improved Pandas UDF API be good enough? For example, SPARK-26412 (a UDF that takes an iterator of Arrow batches).

Sorry that I didn't join the discussion earlier! Hope it is not too late :)

On Fri, Apr 19, 2019 at 1:20 PM <tcondie@gmail.com> wrote:

+1 (non-binding) for better columnar data processing support.

From: Jules Damji <dmatrix@comcast.net>
Sent: Friday, April 19, 2019 12:21 PM
To: Bryan Cutler <cutlerb@gmail.com>
Cc: Dev <dev@spark.apache.org>
Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

+1 (non-binding)

Sent from my iPhone

Pardon the dumb thumb typos :)

On Apr 19, 2019, at 10:30 AM, Bryan Cutler <cutlerb@gmail.com> wrote:

+1 (non-binding)

On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe <jlowe@apache.org> wrote:

+1 (non-binding). Looking forward to seeing better support for processing columnar data.

Jason

On Tue, Apr 16, 2019 at 10:38 AM Tom Graves <tgraves_cs@yahoo.com.invalid> wrote:

Hi everyone,

I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for extended Columnar Processing Support. The proposal is to extend the support to allow for more columnar processing.

You can find the full proposal in the JIRA at https://issues.apache.org/jira/browse/SPARK-27396. There was also a DISCUSS thread on the dev mailing list.

Please vote as early as you can; I will leave the vote open until next Monday (the 22nd), 2 pm CST, to give people plenty of time.

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don't think this is a good idea because ...

Thanks!

Tom Graves