Re: [VOTE][SPIP] SPARK-22026 data source v2 write path

2017-10-12 Thread vaquar khan
+1

Regards,
Vaquar khan

On Oct 11, 2017 10:14 PM, "Weichen Xu"  wrote:

+1

On Thu, Oct 12, 2017 at 10:36 AM, Xiao Li  wrote:

> +1
>
> Xiao
>
> On Mon, 9 Oct 2017 at 7:31 PM Reynold Xin  wrote:
>
>> +1
>>
>> One thing with MetadataSupport: it's a bad idea to call it that unless
>> adding new functions to that trait wouldn't break source/binary
>> compatibility in the future.
>>
>>
>> On Mon, Oct 9, 2017 at 6:07 PM, Wenchen Fan  wrote:
>>
>>> I'm adding my own +1 (binding).
>>>
>>> On Tue, Oct 10, 2017 at 9:07 AM, Wenchen Fan 
>>> wrote:
>>>
 I'm going to update the proposal: for the last point, although the
 user-facing API (`df.write.format(...).option(...).mode(...).save()`)
 mixes data and metadata operations, we are still able to separate them in
 the data source write API. We can have a mix-in trait `MetadataSupport`
 which has a method `create(options)`, so that data sources can mix in this
 trait and provide metadata creation support. Spark will call this `create`
 method inside `DataFrameWriter.save` if the specified data source has it.

 Note that file format data sources can ignore this new trait and still
 write data without metadata (they don't have metadata anyway).
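
 For illustration, a rough sketch of what this trait could look like (the
 parameter type and exact signature below are only a sketch, not the final
 API):

 trait MetadataSupport {
   // called by `DataFrameWriter.save` before the write job is scheduled,
   // so the data source can create tables or other metadata from the options
   def create(options: java.util.Map[String, String]): Unit
 }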

 With this updated proposal, I'm calling a new vote for the data source
 v2 write path.

 The vote will be up for the next 72 hours. Please reply with your vote:

 +1: Yeah, let's go forward and implement the SPIP.
 +0: Don't really care.
 -1: I don't think this is a good idea because of the following
 technical reasons.

 Thanks!

 On Tue, Oct 3, 2017 at 12:03 AM, Wenchen Fan 
 wrote:

> Hi all,
>
> Now that we have merged the infrastructure of the data source v2 read
> path and had some discussion about the write path, I'm sending this email
> to call a vote for the Data Source v2 write path.
>
> The full document of the Data Source API V2 is:
> https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit
>
> The ready-for-review PR that implements the basic infrastructure for
> the write path:
> https://github.com/apache/spark/pull/19269
>
>
> The Data Source V1 write path asks implementations to write a
> DataFrame directly, which is painful:
> 1. Exposing an upper-level API like DataFrame to the Data Source API is
> not good for maintenance.
> 2. Data sources may need to preprocess the input data before writing,
> e.g., cluster/sort the input by some columns. It's better to do such
> preprocessing in Spark rather than in the data source.
> 3. Data sources need to handle transactions themselves, which is hard.
> Different data sources may come up with very similar approaches to
> transactions, which leads to a lot of duplicated code.
>
> To solve these pain points, I'm proposing a data source v2 write
> framework that is very similar to the read framework, i.e.,
> WriteSupport -> DataSourceV2Writer -> DataWriterFactory -> DataWriter.
>
> The Data Source V2 write path follows the existing FileCommitProtocol
> and has task/job-level commit/abort, so that data sources can implement
> transactions more easily.
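>
> As a rough illustration of that chain and the task/job-level commit/abort
> hooks (the interface and method signatures below are only a simplified
> sketch, not the final API):
>
> import org.apache.spark.sql.Row
>
> trait WriterCommitMessage extends Serializable
>
> trait WriteSupport {
>   // entry point: create a writer for a specific write job
>   def createWriter(options: java.util.Map[String, String]): DataSourceV2Writer
> }
>
> trait DataSourceV2Writer {
>   // creates a factory that is serialized and sent to the executors
>   def createWriterFactory(): DataWriterFactory
>   // job-level commit/abort, called on the driver after all tasks finish
>   def commit(messages: Array[WriterCommitMessage]): Unit
>   def abort(messages: Array[WriterCommitMessage]): Unit
> }
>
> trait DataWriterFactory extends Serializable {
>   def createWriter(partitionId: Int, attemptNumber: Int): DataWriter
> }
>
> trait DataWriter {
>   def write(row: Row): Unit
>   // task-level commit/abort, called on the executors for each write task
>   def commit(): WriterCommitMessage
>   def abort(): Unit
> }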
>
> We can create a mix-in trait for DataSourceV2Writer to specify
> requirements on the input data, like clustering and ordering.
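>
> For example, such a mix-in might look roughly like this (the trait and
> method names below are hypothetical, just to illustrate the idea):
>
> // hypothetical mix-in: Spark would cluster and sort the input rows by
> // these columns before handing them to the data writers
> trait SupportsWriteRequirement {
>   def requiredClustering(): Seq[String]  // columns to cluster the input by
>   def requiredOrdering(): Seq[String]    // columns to sort by within each partition
> }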
>
> Spark provides a very simple protocol for users to connect to data
> sources. A common way to write a DataFrame to a data source is
> `df.write.format(...).option(...).mode(...).save()`.
> Spark passes the options and save mode to the data source and schedules
> the write job on the input data. The data source should take care of the
> metadata itself; e.g., the JDBC data source can create the table if it
> doesn't exist, or fail the job and ask users to create the table in the
> corresponding database first. Data sources can define options that let
> users carry metadata information like partitioning/bucketing.
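>
> For example, writing to the JDBC data source might look like this (the
> connection URL and table name below are made up, just for illustration):
>
> df.write
>   .format("jdbc")
>   .option("url", "jdbc:postgresql://localhost/testdb")  // made-up URL
>   .option("dbtable", "events")                          // made-up table name
>   .mode("append")
>   .save()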
>
>
> The vote will be up for the next 72 hours. Please reply with your vote:
>
> +1: Yeah, let's go forward and implement the SPIP.
> +0: Don't really care.
> -1: I don't think this is a good idea because of the following
> technical reasons.
>
> Thanks!
>


>>>
>>


SparkR is now available on CRAN

2017-10-12 Thread Shivaram Venkataraman
Hi all

I'm happy to announce that the most recent release of Spark, 2.1.2, is now
available for download as an R package from CRAN at
https://cran.r-project.org/web/packages/SparkR/ . This makes it easy for new
R users to get started with SparkR, and the package includes code to
download the corresponding Spark binaries.
https://issues.apache.org/jira/browse/SPARK-15799 has more details on this.

Many thanks to everyone who helped put this together -- especially Felix
Cheung for making a number of fixes to meet the CRAN requirements and
Holden Karau for the 2.1.2 release.

Thanks
Shivaram


Re: SparkR is now available on CRAN

2017-10-12 Thread Reynold Xin
This is huge!




Re: SparkR is now available on CRAN

2017-10-12 Thread Holden Karau
That's wonderful news! :) Now we have Spark on CRAN, PyPI, and Maven, so the
on-ramp should be easy for everyone. Excited to see more SparkR users
joining us :)



-- 
Twitter: https://twitter.com/holdenkarau


Re: [VOTE][SPIP] SPARK-22026 data source v2 write path

2017-10-12 Thread Liwei Lin
+1 !

Cheers,
Liwei
