Agreed. Tom, could you cancel the vote?
On Mon, Apr 22, 2019 at 1:07 PM Reynold Xin <r...@databricks.com> wrote:

> "if others think it would be helpful, we can cancel this vote, update the
> SPIP to clarify exactly what I am proposing, and then restart the vote
> after we have gotten more agreement on what APIs should be exposed"
>
> That'd be very useful. At least I was confused by what the SPIP was about.
> No point voting on something when there is still a lot of confusion about
> what it is.
>
> On Mon, Apr 22, 2019 at 10:58 AM, Bobby Evans <reva...@gmail.com> wrote:
>
>> Xiangrui Meng,
>>
>> I provided some examples in the original discussion thread:
>> https://lists.apache.org/thread.html/f7cdc2cbfb1dafa001422031ff6a3a6dc7b51efc175327b0bbfe620e@%3Cdev.spark.apache.org%3E
>>
>> But the concrete use case that we have is GPU-accelerated ETL on Spark,
>> primarily as data preparation and feature engineering for ML tools like
>> XGBoost, which, by the way, exposes a Spark-specific Scala API, not just
>> a Python one. We built a proof of concept and saw decent performance
>> gains, enough to more than pay for the added cost of a GPU, with the
>> potential for even better performance in the future. With that proof of
>> concept, we were able to make all of the processing columnar end-to-end
>> for many queries, so there really weren't any data conversion costs to
>> overcome, but we did want the design to be flexible enough to include a
>> cost-based optimizer.
>>
>> It looks like there is some confusion around this SPIP, especially in how
>> it relates to features in other SPIPs around data exchange between
>> different systems. I didn't want to update the text of this SPIP while it
>> was under an active vote, but if others think it would be helpful, we can
>> cancel this vote, update the SPIP to clarify exactly what I am proposing,
>> and then restart the vote after we have gotten more agreement on what
>> APIs should be exposed.
>>
>> Thanks,
>>
>> Bobby
>>
>> On Mon, Apr 22, 2019 at 10:49 AM Xiangrui Meng <men...@gmail.com> wrote:
>>
>> Per Robert's comment on the JIRA, ETL is the main use case for the SPIP.
>> I think the SPIP should list a concrete ETL use case (from the POC?) that
>> can benefit from this *public Java/Scala API*, does *vectorization*, and
>> significantly *boosts the performance* even with data conversion
>> overhead.
>>
>> The current mid-term success (Pandas UDF) doesn't match the purpose of
>> the SPIP, and it can be done without exposing any public APIs.
>>
>> Depending on how much benefit it brings, we might agree that a public
>> Java/Scala API is needed. Then we might want to step slightly into how. I
>> saw four options mentioned in the JIRA and discussion threads:
>>
>> 1. Expose `Array[Byte]` in Arrow format. Let the user decode it using an
>> Arrow library.
>> 2. Expose `ArrowRecordBatch`. It makes Spark expose third-party APIs.
>> 3. Expose `ColumnarBatch`, which is also used by Spark internals, and
>> make it Arrow-compatible. It makes it hard for us to change Spark
>> internals in the future.
>> 4. Expose something like `SparkRecordBatch` that is Arrow-compatible and
>> maintain conversion between the internal `ColumnarBatch` and
>> `SparkRecordBatch`. It might cause conversion overhead in the future if
>> our internals diverge from Arrow.
>>
>> Note that both 3 and 4 will make many APIs public in order to be
>> Arrow-compatible, so we should really give concrete ETL cases to prove
>> that it is important for us to do so.
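For concreteness, a minimal sketch of what option 3 above might look like from
a library author's point of view, written against the existing
org.apache.spark.sql.vectorized classes. The sumColumn helper is illustrative
only and not part of any proposed API:

    import org.apache.spark.sql.vectorized.{ColumnarBatch, ColumnVector}

    // Illustrative only: a columnar operator that sums one DoubleType column
    // of a batch without ever materializing rows.
    def sumColumn(batch: ColumnarBatch, ordinal: Int): Double = {
      val col: ColumnVector = batch.column(ordinal)
      var total = 0.0
      var row = 0
      while (row < batch.numRows()) {
        if (!col.isNullAt(row)) {
          total += col.getDouble(row)
        }
        row += 1
      }
      total
    }

Under option 4, the same loop would instead be written against a hypothetical
Spark-owned wrapper such as `SparkRecordBatch`, with Spark maintaining the
conversion to and from the internal `ColumnarBatch`.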
>> On Mon, Apr 22, 2019 at 8:27 AM Tom Graves <tgraves...@yahoo.com> wrote:
>>
>> Since there is still discussion and Spark Summit is this week, I'm going
>> to extend the vote until Friday the 26th.
>>
>> Tom
>>
>> On Monday, April 22, 2019, 8:44:00 AM CDT, Bobby Evans <reva...@gmail.com> wrote:
>>
>> Yes, it is technically possible for the layout to change. No, it is not
>> going to happen. It is already baked into several different official
>> libraries which are widely used, not just for holding and processing the
>> data, but also for transfer of the data between the various
>> implementations. There would have to be a really serious reason to force
>> an incompatible change at this point. So in the worst case, we can
>> version the layout and bake that into the API that exposes the internal
>> layout of the data. That way, code that wants to program against a Java
>> API can do so using the API that Spark provides. Those who want to
>> interface with something that expects the data in Arrow format will
>> already have to know what version of the format it was programmed
>> against, and in the worst case, if the layout does change, we can support
>> the new layout if needed.
>>
>> On Sun, Apr 21, 2019 at 12:45 AM Bryan Cutler <cutl...@gmail.com> wrote:
>>
>> The Arrow data format is not yet stable, meaning there are no guarantees
>> on backwards/forwards compatibility. Once version 1.0 is released, it
>> will have those guarantees, but it's hard to say when that will be. The
>> remaining work to get there can be seen at
>> https://cwiki.apache.org/confluence/display/ARROW/Columnar+Format+1.0+Milestone.
>> So yes, it is a risk that exposing Spark data as Arrow could cause an
>> issue if handled by a different version that is not compatible. That
>> being said, changes to the format are not taken lightly and are backwards
>> compatible when possible. I think it would be fair to mark the APIs
>> exposing Arrow data as experimental for the time being, and to clearly
>> state in the docs the version that must be used to be compatible. Also,
>> adding features like this and SPARK-24579 will probably help adoption of
>> Arrow and accelerate a 1.0 release. Adding the Arrow dev list to CC.
>>
>> Bryan
>>
>> On Sat, Apr 20, 2019 at 5:25 PM Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>
>> Okay, that makes sense, but is the Arrow data format stable? If not, we
>> risk breakage when Arrow changes in the future and libraries using this
>> feature begin to use the new Arrow code.
>>
>> Matei
>>
>> On Apr 20, 2019, at 1:39 PM, Bobby Evans <reva...@gmail.com> wrote:
>>
>> I want to be clear that this SPIP is not proposing exposing Arrow
>> APIs/classes through any Spark APIs. SPARK-24579 is doing that, and
>> because of the overlap between the two SPIPs I scaled this one back to
>> concentrate just on the columnar processing aspects. Sorry for the
>> confusion, as I didn't update the JIRA description clearly enough when we
>> adjusted it during the discussion on the JIRA. As part of the columnar
>> processing, we plan on providing Arrow-formatted data, but that will be
>> exposed through a Spark-owned API.
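To make the "version the layout and expose it through a Spark-owned API" idea
above concrete, a rough sketch follows. None of these names exist in Spark or
in the SPIP; ArrowCompatibleBatch, arrowLayoutVersion, and the version numbers
are purely hypothetical placeholders:

    // Hypothetical sketch only: a Spark-owned facade that exposes
    // Arrow-formatted buffers plus the layout revision they were produced
    // against, so callers can detect a mismatch before handing the data to
    // an Arrow library.
    trait ArrowCompatibleBatch {
      /** Arrow columnar-format revision this batch is laid out in
       *  (placeholder values, e.g. (0, 13)). */
      def arrowLayoutVersion: (Int, Int)

      /** Number of rows in the batch. */
      def numRows: Int

      /** The batch serialized as Arrow IPC stream bytes. */
      def toArrowStreamBytes: Array[Byte]
    }

A caller built against a different layout revision could then fail fast, or
fall back to the serialized form, instead of misreading the buffers.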
>> On Sat, Apr 20, 2019 at 1:03 PM Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>
>> FYI, I'd also be concerned about exposing the Arrow API or format as a
>> public API if it's not yet stable. Is stabilization of the API and format
>> coming soon on the roadmap there? Maybe someone can work with the Arrow
>> community to make that happen.
>>
>> We've been bitten lots of times by API changes forced by external
>> libraries, even when those libraries were widely popular. For example, we
>> used Guava's Optional for a while, which changed at some point, and we
>> also had issues with Protobuf and Scala itself (especially how Scala's
>> APIs appear in Java). API breakage might not be as serious in dynamic
>> languages like Python, where you can often keep compatibility with old
>> behaviors, but it really hurts in Java and Scala.
>>
>> The problem is especially bad for us because of two aspects of how Spark
>> is used:
>>
>> 1) Spark is used for production data transformation jobs that people need
>> to keep running for a long time. Nobody wants to make changes to a job
>> that's been working fine and computing something correctly for years just
>> to get a bug fix from the latest Spark release or whatever. It's much
>> better if they can upgrade Spark without editing every job.
>>
>> 2) Spark is often used as "glue" to combine data processing code in other
>> libraries, and these might start to require different versions of our
>> dependencies. For example, the Guava class exposed in Spark became a
>> problem when third-party libraries started requiring a new version of
>> Guava: those new libraries just couldn't work with Spark. Protobuf was
>> especially bad because some users wanted to read data stored as Protobufs
>> (or in a format that uses Protobuf inside), so they needed a different
>> version of the library in their main data processing code.
>>
>> If there were some guarantee that this stuff would remain
>> backward-compatible, we'd be in a much better place. It's not that hard
>> to keep a storage format backward-compatible: just document the format
>> and extend it only in ways that don't break the meaning of old data (for
>> example, add new version numbers or field types that are read in a
>> different way). It's a bit harder for a Java API, but maybe Spark could
>> just expose byte arrays directly and work on those if the API is not
>> guaranteed to stay stable (that is, we'd still use our own classes to
>> manipulate the data internally, and end users could use the Arrow library
>> if they want it).
>>
>> Matei
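From the caller's side, Matei's byte-array suggestion above would look roughly
like the following: Spark hands back Arrow stream-format bytes, and the user
decodes them with the Arrow Java library on their own. This is only a sketch
under the assumption that arrowBytes holds a serialized Arrow IPC stream; how
Spark would produce those bytes is exactly what the SPIPs are debating:

    import java.io.ByteArrayInputStream
    import org.apache.arrow.memory.RootAllocator
    import org.apache.arrow.vector.ipc.ArrowStreamReader

    // Decode Arrow IPC stream bytes entirely on the caller's side with the
    // Arrow Java library; no Arrow types appear in the Spark-facing API.
    def readArrowStream(arrowBytes: Array[Byte]): Unit = {
      val allocator = new RootAllocator(Long.MaxValue)
      val reader =
        new ArrowStreamReader(new ByteArrayInputStream(arrowBytes), allocator)
      try {
        while (reader.loadNextBatch()) {
          val root = reader.getVectorSchemaRoot
          println(s"read batch: ${root.getRowCount} rows, " +
            s"${root.getFieldVectors.size()} columns")
        }
      } finally {
        reader.close()
        allocator.close()
      }
    }

The trade-off is the extra copy and serialization relative to exposing live
buffers, which is the data conversion overhead mentioned elsewhere in the
thread.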
>> On Apr 20, 2019, at 8:38 AM, Bobby Evans <reva...@gmail.com> wrote:
>>
>> I think you misunderstood the point of this SPIP. I responded to your
>> comments in the SPIP JIRA.
>>
>> On Sat, Apr 20, 2019 at 12:52 AM Xiangrui Meng <men...@gmail.com> wrote:
>>
>> I posted my comment in the JIRA. Main concerns here:
>>
>> 1. Exposing third-party Java APIs in Spark is risky. Arrow might have a
>> 1.0 release someday.
>> 2. ML/DL systems that can benefit from columnar format are mostly in
>> Python.
>> 3. Simple operations, though they benefit from vectorization, might not
>> be worth the data exchange overhead.
>>
>> So would an improved Pandas UDF API be good enough? For example,
>> SPARK-26412 (UDF that takes an iterator of Arrow batches).
>>
>> Sorry, I should have joined the discussion earlier! Hope it is not too
>> late :)
>>
>> On Fri, Apr 19, 2019 at 1:20 PM <tcon...@gmail.com> wrote:
>>
>> +1 (non-binding) for better columnar data processing support.
>>
>> From: Jules Damji <dmat...@comcast.net>
>> Sent: Friday, April 19, 2019 12:21 PM
>> To: Bryan Cutler <cutl...@gmail.com>
>> Cc: Dev <dev@spark.apache.org>
>> Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support
>>
>> +1 (non-binding)
>>
>> Sent from my iPhone
>> Pardon the dumb thumb typos :)
>>
>> On Apr 19, 2019, at 10:30 AM, Bryan Cutler <cutl...@gmail.com> wrote:
>>
>> +1 (non-binding)
>>
>> On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe <jl...@apache.org> wrote:
>>
>> +1 (non-binding). Looking forward to seeing better support for processing
>> columnar data.
>>
>> Jason
>>
>> On Tue, Apr 16, 2019 at 10:38 AM Tom Graves <tgraves...@yahoo.com.invalid> wrote:
>>
>> Hi everyone,
>>
>> I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for
>> extended Columnar Processing Support. The proposal is to extend the
>> support to allow for more columnar processing. You can find the full
>> proposal in the jira at: https://issues.apache.org/jira/browse/SPARK-27396.
>> There was also a DISCUSS thread in the dev mailing list.
>>
>> Please vote as early as you can. I will leave the vote open until next
>> Monday (the 22nd), 2pm CST, to give people plenty of time.
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don't think this is a good idea because ...
>>
>> Thanks!
>>
>> Tom Graves
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org