"if others think it would be helpful, we can cancel this vote, update the SPIP to clarify exactly what I am proposing, and then restart the vote after we have gotten more agreement on what APIs should be exposed"
That'd be very useful. At least I was confused by what the SPIP was about. No point voting on something when there is still a lot of confusion about what it is.

On Mon, Apr 22, 2019 at 10:58 AM, Bobby Evans <revans2@gmail.com> wrote:

Xiangrui Meng,

I provided some examples in the original discussion thread:
https://lists.apache.org/thread.html/f7cdc2cbfb1dafa001422031ff6a3a6dc7b51efc175327b0bbfe620e@%3Cdev.spark.apache.org%3E

But the concrete use case that we have is GPU-accelerated ETL on Spark, primarily as data preparation and feature engineering for ML tools like XGBoost, which, by the way, exposes a Spark-specific Scala API, not just a Python one. We built a proof of concept and saw decent performance gains, enough to more than pay for the added cost of a GPU, with the potential for even better performance in the future. With that proof of concept, we were able to make all of the processing columnar end-to-end for many queries, so there really weren't any data conversion costs to overcome, but we did want the design flexible enough to include a cost-based optimizer.

It looks like there is some confusion around this SPIP, especially in how it relates to features in other SPIPs around data exchange between different systems. I didn't want to update the text of this SPIP while it was under an active vote, but if others think it would be helpful, we can cancel this vote, update the SPIP to clarify exactly what I am proposing, and then restart the vote after we have gotten more agreement on what APIs should be exposed.

Thanks,

Bobby

On Mon, Apr 22, 2019 at 10:49 AM Xiangrui Meng <mengxr@gmail.com> wrote:

Per Robert's comment on the JIRA, ETL is the main use case for the SPIP. I think the SPIP should list a concrete ETL use case (from the POC?) that can benefit from this public Java/Scala API, does vectorization, and significantly boosts the performance even with the data conversion overhead.

The current mid-term success (Pandas UDF) doesn't match the purpose of the SPIP, and it can be done without exposing any public APIs.

Depending on how much benefit it brings, we might agree that a public Java/Scala API is needed. Then we might want to step slightly into how. I saw four options mentioned in the JIRA and discussion threads:

1. Expose `Array[Byte]` in Arrow format. Let the user decode it using an Arrow library.
2. Expose `ArrowRecordBatch`. It makes Spark expose third-party APIs.
3. Expose `ColumnarBatch`, which is also used by Spark internals, and make it Arrow-compatible. It makes it hard for us to change Spark internals in the future.
4. Expose something like `SparkRecordBatch` that is Arrow-compatible and maintain conversion between the internal `ColumnarBatch` and `SparkRecordBatch`. It might cause conversion overhead in the future if our internal format diverges from Arrow.

Note that both 3 and 4 will make many APIs public in order to be Arrow-compatible. So we should really give concrete ETL cases to prove that it is important for us to do so.
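For concreteness, option 1 on the consumer side would look roughly like the sketch below, assuming the exposed bytes carry an Arrow IPC stream. The class and method names come from the Arrow Java library, not from anything this SPIP itself defines, and `decodeArrowBytes` is just a hypothetical user-side helper.

    import java.io.ByteArrayInputStream

    import org.apache.arrow.memory.RootAllocator
    import org.apache.arrow.vector.ipc.ArrowStreamReader

    // Rough sketch of option 1: Spark hands back Array[Byte] and the user
    // decodes it with the Arrow Java library (bytes assumed to be an IPC stream).
    def decodeArrowBytes(bytes: Array[Byte]): Unit = {
      val allocator = new RootAllocator(Long.MaxValue)
      val reader = new ArrowStreamReader(new ByteArrayInputStream(bytes), allocator)
      try {
        val root = reader.getVectorSchemaRoot
        while (reader.loadNextBatch()) {
          // Each loaded batch is visible through the schema root's vectors.
          println(s"rows=${root.getRowCount}, cols=${root.getFieldVectors.size}")
        }
      } finally {
        reader.close()
        allocator.close()
      }
    }

The flip side is that user code then links directly against a specific Arrow version, which is exactly the stability concern raised later in this thread.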
On Mon, Apr 22, 2019 at 8:27 AM Tom Graves <tgraves_cs@yahoo.com> wrote:

Since there is still discussion and Spark Summit is this week, I'm going to extend the vote until Friday the 26th.

Tom

On Monday, April 22, 2019, 8:44:00 AM CDT, Bobby Evans <revans2@gmail.com> wrote:

Yes, it is technically possible for the layout to change. No, it is not going to happen. It is already baked into several different official libraries which are widely used, not just for holding and processing the data, but also for transfer of the data between the various implementations. There would have to be a really serious reason to force an incompatible change at this point. So in the worst case, we can version the layout and bake that into the API that exposes the internal layout of the data. That way, code that wants to program against a Java API can do so using the API that Spark provides. Those who want to interface with something that expects the data in Arrow format will already have to know which version of the format it was programmed against, and in the worst case, if the layout does change, we can support the new layout if needed.

On Sun, Apr 21, 2019 at 12:45 AM Bryan Cutler <cutlerb@gmail.com> wrote:

The Arrow data format is not yet stable, meaning there are no guarantees on backwards/forwards compatibility. Once version 1.0 is released, it will have those guarantees, but it's hard to say when that will be. The remaining work to get there can be seen at https://cwiki.apache.org/confluence/display/ARROW/Columnar+Format+1.0+Milestone. So yes, it is a risk that exposing Spark data as Arrow could cause an issue if handled by a different version that is not compatible. That being said, changes to the format are not taken lightly and are backwards compatible when possible. I think it would be fair to mark the APIs exposing Arrow data as experimental for the time being and to state clearly in the docs which version must be used to be compatible. Also, adding features like this and SPARK-24579 will probably help adoption of Arrow and accelerate a 1.0 release. Adding the Arrow dev list to CC.

Bryan

On Sat, Apr 20, 2019 at 5:25 PM Matei Zaharia <matei.zaharia@gmail.com> wrote:

Okay, that makes sense, but is the Arrow data format stable? If not, we risk breakage when Arrow changes in the future and some libraries using this feature begin to use the new Arrow code.

Matei

On Apr 20, 2019, at 1:39 PM, Bobby Evans <revans2@gmail.com> wrote:

I want to be clear that this SPIP is not proposing exposing Arrow APIs/classes through any Spark APIs. SPARK-24579 is doing that, and because of the overlap between the two SPIPs I scaled this one back to concentrate just on the columnar processing aspects. Sorry for the confusion; I didn't update the JIRA description clearly enough when we adjusted it during the discussion on the JIRA. As part of the columnar processing, we plan on providing Arrow-formatted data, but that will be exposed through a Spark-owned API.
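As a rough illustration of what "exposed through a Spark-owned API" could mean for user code, here is a minimal sketch against the existing org.apache.spark.sql.vectorized classes; whether the final API looks like this, or like a new `SparkRecordBatch` wrapper, is exactly what the SPIP still has to pin down, and `sumFirstIntColumn` is only a hypothetical example.

    import org.apache.spark.sql.vectorized.{ColumnarBatch, ColumnVector}

    // Minimal sketch: process one columnar batch through Spark-owned classes,
    // without ever touching Arrow types directly.
    def sumFirstIntColumn(batch: ColumnarBatch): Long = {
      val col: ColumnVector = batch.column(0) // assumes column 0 holds 32-bit integers
      var sum = 0L
      var row = 0
      while (row < batch.numRows()) {
        if (!col.isNullAt(row)) sum += col.getInt(row)
        row += 1
      }
      sum
    }

The point of routing everything through Spark-owned classes is that Arrow can stay an implementation detail of the data layout, rather than part of Spark's public binary surface.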
On Sat, Apr 20, 2019 at 1:03 PM Matei Zaharia <matei.zaharia@gmail.com> wrote:

FYI, I'd also be concerned about exposing the Arrow API or format as a public API if it's not yet stable. Is stabilization of the API and format coming soon on the roadmap there? Maybe someone can work with the Arrow community to make that happen.

We've been bitten lots of times by API changes forced by external libraries, even when those were widely popular. For example, we used Guava's Optional for a while, which changed at some point, and we also had issues with Protobuf and Scala itself (especially how Scala's APIs appear in Java). API breakage might not be as serious in dynamic languages like Python, where you can often keep compatibility with old behaviors, but it really hurts in Java and Scala.

The problem is especially bad for us because of two aspects of how Spark is used:

1) Spark is used for production data transformation jobs that people need to keep running for a long time. Nobody wants to make changes to a job that's been working fine and computing something correctly for years just to get a bug fix from the latest Spark release or whatever. It's much better if they can upgrade Spark without editing every job.

2) Spark is often used as "glue" to combine data processing code in other libraries, and these might start to require different versions of our dependencies. For example, the Guava class exposed in Spark became a problem when third-party libraries started requiring a new version of Guava: those new libraries just couldn't work with Spark. Protobuf was especially bad because some users wanted to read data stored as Protobufs (or in a format that uses Protobuf inside), so they needed a different version of the library in their main data processing code.

If there was some guarantee that this stuff would remain backward-compatible, we'd be in a much better place. It's not that hard to keep a storage format backward-compatible: just document the format and extend it only in ways that don't break the meaning of old data (for example, add new version numbers or field types that are read in a different way). It's a bit harder for a Java API, but maybe Spark could just expose byte arrays directly and work on those if the API is not guaranteed to stay stable (that is, we'd still use our own classes to manipulate the data internally, and end users could use the Arrow library if they want it).

Matei

On Apr 20, 2019, at 8:38 AM, Bobby Evans <revans2@gmail.com> wrote:

I think you misunderstood the point of this SPIP. I responded to your comments in the SPIP JIRA.

On Sat, Apr 20, 2019 at 12:52 AM Xiangrui Meng <mengxr@gmail.com> wrote:

I posted my comment in the JIRA. Main concerns here:

1. Exposing third-party Java APIs in Spark is risky. Arrow might have a 1.0 release someday.
2. ML/DL systems that can benefit from columnar format are mostly in Python.

3. Simple operations, though they benefit from vectorization, might not be worth the data exchange overhead.

So would an improved Pandas UDF API be good enough? For example, SPARK-26412 (a UDF that takes an iterator of Arrow batches).

Sorry that I didn't join the discussion earlier! Hope it is not too late :)

On Fri, Apr 19, 2019 at 1:20 PM <tcondie@gmail.com> wrote:

+1 (non-binding) for better columnar data processing support.

From: Jules Damji <dmatrix@comcast.net>
Sent: Friday, April 19, 2019 12:21 PM
To: Bryan Cutler <cutlerb@gmail.com>
Cc: Dev <dev@spark.apache.org>
Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

+1 (non-binding)

Sent from my iPhone

Pardon the dumb thumb typos :)

On Apr 19, 2019, at 10:30 AM, Bryan Cutler <cutlerb@gmail.com> wrote:

+1 (non-binding)

On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe <jlowe@apache.org> wrote:

+1 (non-binding). Looking forward to seeing better support for processing columnar data.

Jason

On Tue, Apr 16, 2019 at 10:38 AM Tom Graves <tgraves_cs@yahoo.com.invalid> wrote:

Hi everyone,

I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for extended Columnar Processing Support. The proposal is to extend the support to allow for more columnar processing.

You can find the full proposal in the JIRA at https://issues.apache.org/jira/browse/SPARK-27396. There was also a DISCUSS thread on the dev mailing list.

Please vote as early as you can; I will leave the vote open until next Monday (the 22nd), 2 pm CST, to give people plenty of time.

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don't think this is a good idea because ...

Thanks!

Tom Graves