+1!

Cheers,
Liwei
On Thu, Oct 12, 2017 at 7:11 PM, vaquar khan <vaquar.k...@gmail.com> wrote:

> +1
>
> Regards,
> Vaquar khan
>
> On Oct 11, 2017 10:14 PM, "Weichen Xu" <weichen...@databricks.com> wrote:
>
> +1
>
> On Thu, Oct 12, 2017 at 10:36 AM, Xiao Li <gatorsm...@gmail.com> wrote:
>
>> +1
>>
>> Xiao
>>
>> On Mon, 9 Oct 2017 at 7:31 PM Reynold Xin <r...@databricks.com> wrote:
>>
>>> +1
>>>
>>> One thing with MetadataSupport - it's a bad idea to call it that unless
>>> adding new functions to that trait wouldn't break source/binary
>>> compatibility in the future.
>>>
>>> On Mon, Oct 9, 2017 at 6:07 PM, Wenchen Fan <cloud0...@gmail.com> wrote:
>>>
>>>> I'm adding my own +1 (binding).
>>>>
>>>> On Tue, Oct 10, 2017 at 9:07 AM, Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>
>>>>> I'm going to update the proposal: for the last point, although the
>>>>> user-facing API (`df.write.format(...).option(...).mode(...).save()`)
>>>>> mixes data and metadata operations, we are still able to separate them
>>>>> in the data source write API. We can have a mix-in trait
>>>>> `MetadataSupport` which has a method `create(options)`, so that data
>>>>> sources can mix in this trait and provide metadata creation support.
>>>>> Spark will call this `create` method inside `DataFrameWriter.save` if
>>>>> the specified data source has it.
>>>>>
>>>>> Note that file format data sources can ignore this new trait and still
>>>>> write data without metadata (they don't have metadata anyway).
>>>>>
>>>>> With this updated proposal, I'm calling a new vote for the data source
>>>>> v2 write path.
>>>>>
>>>>> The vote will be up for the next 72 hours. Please reply with your vote:
>>>>>
>>>>> +1: Yeah, let's go forward and implement the SPIP.
>>>>> +0: Don't really care.
>>>>> -1: I don't think this is a good idea because of the following
>>>>> technical reasons.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> On Tue, Oct 3, 2017 at 12:03 AM, Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> Now that we have merged the infrastructure of the data source v2 read
>>>>>> path and had some discussion about the write path, I'm sending this
>>>>>> email to call a vote for the Data Source v2 write path.
>>>>>>
>>>>>> The full document of the Data Source API V2 is:
>>>>>> https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit
>>>>>>
>>>>>> The ready-for-review PR that implements the basic infrastructure for
>>>>>> the write path:
>>>>>> https://github.com/apache/spark/pull/19269
>>>>>>
>>>>>> The Data Source V1 write path asks implementations to write a
>>>>>> DataFrame directly, which is painful:
>>>>>> 1. Exposing an upper-level API like DataFrame to the Data Source API
>>>>>> is bad for maintenance.
>>>>>> 2. Data sources may need to preprocess the input data before writing,
>>>>>> e.g. cluster/sort the input by some columns. It's better to do this
>>>>>> preprocessing in Spark instead of in the data source.
>>>>>> 3. Data sources need to take care of transactions themselves, which
>>>>>> is hard, and different data sources may come up with very similar
>>>>>> approaches to transactions, which leads to a lot of duplicated code.
>>>>>>
>>>>>> To solve these pain points, I'm proposing the data source v2 write
>>>>>> framework, which is very similar to the read framework, i.e.,
>>>>>> WriteSupport -> DataSourceV2Writer -> DataWriterFactory -> DataWriter.
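To make the chain above concrete, here is a minimal Scala sketch of how these pieces could fit together. Only the trait names (WriteSupport, DataSourceV2Writer, DataWriterFactory, DataWriter, plus MetadataSupport from the updated proposal) come from this thread; every method signature below is an assumption for illustration, not the actual API in PR 19269.

    import org.apache.spark.sql.{Row, SaveMode}

    // Rough sketch of the proposed chain; all signatures are illustrative
    // assumptions, not the exact API in the ready-for-review PR.
    trait DataSourceV2

    // Mixed into a DataSourceV2 that supports writing.
    trait WriteSupport { self: DataSourceV2 =>
      def createWriter(options: Map[String, String], mode: SaveMode): DataSourceV2Writer
    }

    // Driver-side, job-level writer: creates per-task factories and owns
    // job-level commit/abort.
    trait DataSourceV2Writer {
      def createWriterFactory(): DataWriterFactory
      def commit(messages: Seq[WriterCommitMessage]): Unit
      def abort(): Unit
    }

    // Serializable factory shipped to executors; produces one writer per task.
    trait DataWriterFactory extends Serializable {
      def createWriter(partitionId: Int, attemptNumber: Int): DataWriter
    }

    // Executor-side, task-level writer with task-level commit/abort.
    trait DataWriter {
      def write(record: Row): Unit
      def commit(): WriterCommitMessage  // handed back to the job-level commit
      def abort(): Unit
    }

    // Placeholder for whatever a task reports back on a successful commit.
    trait WriterCommitMessage extends Serializable

    // Optional mix-in from the updated proposal: metadata creation (e.g.
    // create the target table), called by DataFrameWriter.save before the
    // write job runs, if the data source provides it.
    trait MetadataSupport { self: DataSourceV2 =>
      def create(options: Map[String, String]): Unit
    }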
>>>>>>
>>>>>> The Data Source V2 write path follows the existing FileCommitProtocol
>>>>>> and has task/job-level commit/abort, so that data sources can
>>>>>> implement transactions more easily.
>>>>>>
>>>>>> We can create a mix-in trait for DataSourceV2Writer to specify
>>>>>> requirements on the input data, like clustering and ordering.
>>>>>>
>>>>>> Spark provides a very simple protocol for users to connect to data
>>>>>> sources. A common way to write a dataframe to a data source is
>>>>>> `df.write.format(...).option(...).mode(...).save()`.
>>>>>> Spark passes the options and save mode to the data source and
>>>>>> schedules the write job on the input data. The data source should
>>>>>> take care of the metadata, e.g., the JDBC data source can create the
>>>>>> table if it doesn't exist, or fail the job and ask users to create
>>>>>> the table in the corresponding database first. Data sources can
>>>>>> define options for users to carry metadata information like
>>>>>> partitioning/bucketing.
>>>>>>
>>>>>> The vote will be up for the next 72 hours. Please reply with your
>>>>>> vote:
>>>>>>
>>>>>> +1: Yeah, let's go forward and implement the SPIP.
>>>>>> +0: Don't really care.
>>>>>> -1: I don't think this is a good idea because of the following
>>>>>> technical reasons.
>>>>>>
>>>>>> Thanks!
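The mix-in trait for input-data requirements is not spelled out in the thread; one possible shape (trait and method names are assumptions) would let a job-level writer declare clustering and ordering columns, so that Spark performs the repartition/sort before any DataWriter sees a row:

    // Hypothetical requirement mix-in for a DataSourceV2Writer: the writer
    // tells Spark how the input should be clustered and ordered, and Spark
    // does that preprocessing before calling DataWriter.write.
    trait SupportsInputRequirements { self: DataSourceV2Writer =>
      def requiredClustering: Seq[String] = Nil  // columns to cluster by
      def requiredOrdering: Seq[String] = Nil    // sort columns within each cluster
    }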
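On the user-facing side, a write against such a source goes through the existing DataFrameWriter API exactly as quoted above. A hedged example, where the format name `com.example.v2sink` and its options are made up for illustration:

    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder().appName("dsv2-write-example").getOrCreate()
    val df = spark.range(0, 1000).selectExpr("id", "id % 10 AS bucket")

    // Spark forwards the options and save mode to the data source and
    // schedules the write job; the source itself handles the metadata
    // (e.g. create the target table, or fail and ask the user to create it).
    df.write
      .format("com.example.v2sink")      // hypothetical DataSourceV2 with WriteSupport
      .option("table", "events")         // metadata carried via options
      .option("partitionBy", "bucket")   // e.g. partitioning info as an option
      .mode(SaveMode.Append)
      .save()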