+1 as well

On Thu, Sep 7, 2017 at 9:12 PM, Michael Armbrust <mich...@databricks.com> wrote:

+1

On Thu, Sep 7, 2017 at 9:32 AM, Ryan Blue <rb...@netflix.com.invalid> wrote:

+1 (non-binding)

Thanks for making the updates reflected in the current PR. It would be great to see the doc updated before it is finally published, though.

> Right now it feels like this SPIP is focused more on getting the basics right for what many datasources are already doing in API V1 combined with other private APIs, vs pushing forward the state of the art for performance.

I think that's the right approach for this SPIP. We can add the support you're talking about later, with a more specific plan that doesn't block fixing the problems this proposal addresses.

On Thu, Sep 7, 2017 at 2:00 AM, Herman van Hövell tot Westerflier <hvanhov...@databricks.com> wrote:

+1 (binding)

I personally believe there is quite a big difference between having a generic data source interface with a low surface area and pushing a significant part of query processing down into a datasource. The latter has a much wider surface area and would require us to stabilize most of the internal Catalyst APIs, which would be a significant maintenance burden on the community and could slow development velocity considerably. If you want to write such integrations, you should be prepared to work with Catalyst internals and accept that things might change across minor versions (and in some cases even maintenance releases). If you are willing to go down that road, your best bet is to use the already existing Spark session extensions, which let you write such integrations and can serve as an `escape hatch`, as sketched below.
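A minimal sketch of that escape hatch, using the SparkSessionExtensions API that shipped in Spark 2.2 (the rule below is a hypothetical no-op placeholder; a real integration would rewrite plan fragments its datasource can execute natively):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical placeholder rule: a real integration would pattern-match on
// the plan and substitute custom nodes here. This touches Catalyst
// internals, so expect breakage across minor releases, as noted above.
object MyPushdownRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

// Register the rule when building the session (e.g. in spark-shell).
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("extensions-escape-hatch")
  .withExtensions(_.injectOptimizerRule(_ => MyPushdownRule))
  .getOrCreate()
```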
On Thu, Sep 7, 2017 at 10:23 AM, Andrew Ash <and...@andrewash.com> wrote:

+0 (non-binding)

I think there are benefits to unifying all the Spark-internal datasources into a common public API, for sure. It will serve as a forcing function to ensure that those internal datasources aren't advantaged over datasources developed externally as plugins to Spark, and that all Spark features are available to all datasources.

But I also think this read-path proposal avoids the more difficult questions around how to keep pushing datasource performance forward. James Baker (my colleague) had a number of questions about advanced pushdowns (combined sorting and filtering), and Reynold also noted that pushdown of aggregates and joins is desirable on longer timeframes as well. The Spark community has seen similar requests: aggregate pushdown in SPARK-12686, join pushdown in SPARK-20259, and arbitrary plan pushdown in SPARK-12449. Clearly a number of people are interested in this kind of performance work for datasources.

To leave enough space for datasource developers to continue experimenting with advanced interactions between Spark and their datasources, I'd propose we leave some sort of escape valve that lets these datasources keep pushing the boundaries without forking Spark. Possibly that looks like an additional unsupported/unstable interface that pushes down an entire (unstable-API) logical plan and is expected to break on every release: Spark attempts the full-plan pushdown, and if that fails, ignores it and continues with the rest of the V2 API for compatibility. Or maybe it looks like something else that we don't know of yet. Possibly this falls outside the desired goals for the V2 API and should instead be a separate SPIP.
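For concreteness, a purely hypothetical sketch of the shape such an explicitly-unstable hook could take (none of these names exist in Spark or in the SPIP):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.types.StructType

// Hypothetical mix-in: Spark offers the raw (internal, version-specific)
// logical plan; the source either claims the whole plan or declines, and
// on decline Spark falls back to the normal V2 read path.
trait SupportsFullPlanPushdown {
  // Some(reader) if the source can execute the entire plan itself;
  // None to let Spark plan the query as usual.
  def pushFullPlan(plan: LogicalPlan): Option[FullPlanReader]
}

// Hypothetical handle for reading the result of a fully pushed-down plan.
trait FullPlanReader {
  def schema: StructType
  def rows(): Iterator[Row]
}
```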
If we had a plan for this kind of escape valve for advanced datasource developers, I'd be an unequivocal +1. Right now it feels like this SPIP is focused more on getting the basics right for what many datasources are already doing in API V1 combined with other private APIs, vs pushing forward the state of the art for performance.

Andrew

On Wed, Sep 6, 2017 at 10:56 PM, Suresh Thalamati <suresh.thalam...@gmail.com> wrote:

+1 (non-binding)

On Sep 6, 2017, at 7:29 PM, Wenchen Fan <cloud0...@gmail.com> wrote:

Hi all,

In the previous discussion, we decided to split the read and write paths of Data Source V2 into two SPIPs, and I'm sending this email to call a vote for the Data Source V2 read path only.

The full design document for the Data Source API V2 is:
https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit

The ready-for-review PR that implements the basic infrastructure for the read path is:
https://github.com/apache/spark/pull/19136

The vote will be up for the next 72 hours. Please reply with your vote:

+1: Yeah, let's go forward and implement the SPIP.
+0: Don't really care.
-1: I don't think this is a good idea because of the following technical reasons.

Thanks!
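For readers new to the proposal, a rough sketch of the mix-in capability style the design doc describes for the read path (trait names here are illustrative approximations, not the exact interfaces in the PR):

```scala
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.types.StructType

// A reader advertises optional capabilities by implementing extra traits;
// sources that support nothing beyond a plain scan implement only this.
trait DataSourceV2Reader {
  // The schema of the data actually returned, after any pruning.
  def readSchema(): StructType
}

// Optional capability: column pruning.
trait SupportsPushDownRequiredColumns { self: DataSourceV2Reader =>
  def pruneColumns(requiredSchema: StructType): Unit
}

// Optional capability: filter pushdown. Returns the filters the source
// could NOT handle, which Spark must still evaluate on its side.
trait SupportsPushDownFilters { self: DataSourceV2Reader =>
  def pushFilters(filters: Array[Filter]): Array[Filter]
}
```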