Thanks for testing. We should probably include a section for this in the SparkR programming guide given how popular CSV files are in R. Feel free to open a PR for that if you get a chance.
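A guide section along those lines could be distilled from the example further down in this thread, roughly as follows (a sketch only, assuming Spark 1.4 with the spark-csv package built for Scala 2.10; the file name is a placeholder):

```
# Launch SparkR with the spark-csv package pulled in from Spark Packages
./bin/sparkR --packages com.databricks:spark-csv_2.10:1.0.3

# In the SparkR shell: load a CSV file into a DataFrame,
# treating the first line as the column header
df <- read.df(sqlContext, "./flights.csv", "com.databricks.spark.csv", header = "true")
head(df)
```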
Shivaram

On Tue, Jun 2, 2015 at 2:20 PM, Eskilson,Aleksander <alek.eskil...@cerner.com> wrote:

> Seems to work great in the master build. It’s really good to have this
> functionality.
>
> Regards,
> Alek Eskilson
>
> From: Eskilson, Aleksander <alek.eskil...@cerner.com>
> Date: Tuesday, June 2, 2015 at 2:59 PM
> To: "shiva...@eecs.berkeley.edu" <shiva...@eecs.berkeley.edu>
> Cc: Burak Yavuz <brk...@gmail.com>, "dev@spark.apache.org" <dev@spark.apache.org>
> Subject: Re: CSV Support in SparkR
>
> Ah, alright, cool. I’ll rebuild and let you know.
>
> Thanks again,
> Alek
>
> From: Shivaram Venkataraman <shiva...@eecs.berkeley.edu>
> Reply-To: "shiva...@eecs.berkeley.edu" <shiva...@eecs.berkeley.edu>
> Date: Tuesday, June 2, 2015 at 2:57 PM
> To: Aleksander Eskilson <alek.eskil...@cerner.com>
> Cc: "shiva...@eecs.berkeley.edu" <shiva...@eecs.berkeley.edu>, Burak Yavuz <brk...@gmail.com>, "dev@spark.apache.org" <dev@spark.apache.org>
> Subject: Re: CSV Support in SparkR
>
> There was a bug in the SparkContext creation that I fixed yesterday:
> https://github.com/apache/spark/commit/6b44278ef7cd2a278dfa67e8393ef30775c72726
>
> If you build from master it should be fixed. Also, I think we might have
> an rc4, which should include this.
>
> Thanks
> Shivaram
>
> On Tue, Jun 2, 2015 at 12:56 PM, Eskilson,Aleksander <alek.eskil...@cerner.com> wrote:
>
>> Hey, that’s pretty convenient. Unfortunately, although the package
>> seems to pull fine into the session, I’m getting class-not-found exceptions
>> with:
>>
>> Caused by: org.apache.spark.SparkException: Job aborted due to stage
>> failure: Task 0 in stage 6.0 failed 4 times, most recent failure: Lost task
>> 0.3 in stage 6.0: java.lang.ClassNotFoundException:
>> com.databricks.spark.csv.CsvRelation$anonfun$buildScan$1
>>
>> That smells like a path issue to me, and I made sure the ivy repo was
>> part of my PATH, but functions like showDF() still fail with that error.
>> Did I miss a setting, or should including the package in the sparkR
>> invocation load that in?
>>
>> I’ve run
>>
>> df <- read.df(sqlCtx, "./data.csv", "com.databricks.spark.csv",
>>               header="true", delimiter="|")
>> showDF(df, 10)
>>
>> (my data is pipe-delimited, and the default SQL context is sqlCtx)
>>
>> Thanks,
>> Alek
>>
>> From: Shivaram Venkataraman <shiva...@eecs.berkeley.edu>
>> Reply-To: "shiva...@eecs.berkeley.edu" <shiva...@eecs.berkeley.edu>
>> Date: Tuesday, June 2, 2015 at 2:08 PM
>> To: Burak Yavuz <brk...@gmail.com>
>> Cc: Aleksander Eskilson <alek.eskil...@cerner.com>, "dev@spark.apache.org" <dev@spark.apache.org>, Shivaram Venkataraman <shiva...@eecs.berkeley.edu>
>> Subject: Re: CSV Support in SparkR
>>
>> Hi Alek,
>>
>> As Burak said, you can already use spark-csv with SparkR in the 1.4
>> release. Right now I use it with something like this:
>>
>> # Launch SparkR
>> ./bin/sparkR --packages com.databricks:spark-csv_2.10:1.0.3
>>
>> df <- read.df(sqlContext, "./nycflights13.csv",
>>               "com.databricks.spark.csv", header="true")
>>
>> You can also pass other options to spark-csv as arguments to `read.df`.
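Those extra options are handed through `read.df` straight to the data source, so a pipe-delimited file like the one earlier in this thread could be read with something along these lines (a sketch; option names such as `delimiter` should be checked against the spark-csv documentation for the version in use):

```
# Arguments after the source name are passed to spark-csv unchanged;
# here: treat the first row as a header and split fields on "|"
df <- read.df(sqlContext, "./data.csv", "com.databricks.spark.csv",
              header = "true", delimiter = "|")
showDF(df, 10)
```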
>> Let us know if this works.
>>
>> Thanks
>> Shivaram
>>
>> On Tue, Jun 2, 2015 at 12:03 PM, Burak Yavuz <brk...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> cc'ing Shivaram here, because he worked on this yesterday.
>>>
>>> If I'm not mistaken, you can use the following workflow:
>>>
>>> ```./bin/sparkR --packages com.databricks:spark-csv_2.10:1.0.3```
>>>
>>> and then
>>>
>>> ```df <- read.df(sqlContext, "/data", "csv", header = "true")```
>>>
>>> Best,
>>> Burak
>>>
>>> On Tue, Jun 2, 2015 at 11:52 AM, Eskilson,Aleksander <alek.eskil...@cerner.com> wrote:
>>>
>>>> Are there any intentions to provide first-class support for CSV files
>>>> as one of the loadable file types in SparkR? Databricks’ spark-csv API [1]
>>>> has support for SQL, Python, and Java/Scala, and implements most of the
>>>> arguments of R’s read.table API [2], but currently there is no way to load
>>>> CSV data in SparkR (1.4.0) besides separating our headers from the data,
>>>> loading into an RDD, splitting by our delimiter, and then converting to a
>>>> SparkR DataFrame with a vector of the columns gathered from the header.
>>>>
>>>> Regards,
>>>> Alek Eskilson
>>>>
>>>> [1] -- https://github.com/databricks/spark-csv
>>>> [2] -- http://www.inside-r.org/r-doc/utils/read.table
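As for the workaround described in the original question: SparkR 1.4 keeps its RDD functions private, so the split-the-RDD approach has to reach into internal APIs. For a file small enough to parse on the driver, a rough public-API approximation of the same idea is sketched below (assumes a running SparkR shell with `sqlContext`; the path and delimiter are placeholders taken from the examples above):

```
# Parse the delimited file on the driver with base R's read.table,
# letting it handle the header row and the "|" separator ...
local_df <- read.table("./data.csv", header = TRUE, sep = "|",
                       stringsAsFactors = FALSE)

# ... then hand the local data.frame to SparkR to get a distributed DataFrame
df <- createDataFrame(sqlContext, local_df)
showDF(df, 10)
```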