Re: SparkR Supported Types - Please add "bigint"

2015-07-23 Thread Exie
Interestingly, after more digging, df.printSchema() in plain Spark (as opposed to SparkR) shows the columns as long, not bigint. root |-- localEventDtTm: timestamp (nullable = true) |-- asset: string (nullable = true) |-- assetCategory: string (nullable = true) |-- assetType: string (nullable = true) |-- event: s
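For reference, a minimal Scala sketch of the naming mismatch (the path and column name are hypothetical, and sqlContext is the one predefined in spark-shell): JSON schema inference maps whole-number fields to LongType, which printSchema() reports as "long", while "bigint" is simply the SQL name for the same 64-bit integer type.

    // Hypothetical path and column; assumes a spark-shell 1.4 session.
    val df = sqlContext.read.json("s3n://someBucket/events/")
    df.printSchema()   // e.g. |-- eventCount: long (nullable = true)
    // The same column is described as "bigint" when the schema is expressed in SQL terms.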

SparkR Supported Types - Please add "bigint"

2015-07-23 Thread Exie
Hi Folks, Using Spark to read in JSON files and detect the schema, it gives me a DataFrame with a "bigint" field. R then fails to import the DataFrame as it can't convert the type. > head(mydf) Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot coerce class ""jobj"" to a data.fram
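One possible workaround, sketched in Scala with hypothetical column names (not something confirmed in the thread): cast the inferred 64-bit column to a type SparkR can convert, such as double or integer, before the data frame is handed to R.

    import org.apache.spark.sql.functions.col

    // Hypothetical columns "asset" and "eventCount"; the cast removes the bigint
    // column so the collected data no longer trips SparkR's type conversion.
    val raw   = sqlContext.read.json("s3n://someBucket/events/")
    val fixed = raw.select(col("asset"), col("eventCount").cast("double").as("eventCount"))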

Re: Spark run errors on Raspberry Pi

2015-06-30 Thread Exie
FWIW, I had some trouble getting Spark running on a Pi. My core problem was using snappy for compression, as it comes as a pre-made binary for i386 and I couldn't find one for ARM. To work around it, there was an option to use LZO instead, and then everything worked. Off the top of my head, it was
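For what it's worth, a hedged sketch of the kind of setting involved: Spark's built-in codecs are snappy, lz4 and lzf, and lzf is pure JVM, so it sidesteps the missing ARM binary entirely. The post mentions LZO, which is not one of Spark's built-in codec names, so the exact option the author used may have differed.

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical app name; spark.io.compression.codec is the standard Spark
    // setting for block compression. "lzf" avoids the snappy native library.
    val conf = new SparkConf()
      .setAppName("pi-test")
      .set("spark.io.compression.codec", "lzf")
    val sc = new SparkContext(conf)

The same setting can also go in conf/spark-defaults.conf as spark.io.compression.codec lzf.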

Re: s3 bucket access/read file

2015-06-30 Thread Exie
Not sure if this helps, but the options I set are slightly different: val hadoopConf = sc.hadoopConfiguration; hadoopConf.set("fs.s3n.awsAccessKeyId", "key"); hadoopConf.set("fs.s3n.awsSecretAccessKey", "secret") Try setting them for s3n as opposed to just s3. Good luck!
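A self-contained version of the settings quoted above (placeholder credentials, bucket and file):

    // Placeholder key, secret and path; note the fs.s3n.* prefix, matching s3n:// URLs.
    val hadoopConf = sc.hadoopConfiguration
    hadoopConf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
    hadoopConf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

    val lines = sc.textFile("s3n://yourBucket/somePath/somefile.txt")
    println(lines.count())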

Re: Spark 1.4.0: read.df() causes excessive IO

2015-06-30 Thread Exie
Just to add to this, here's some more info: val myDF = hiveContext.read.parquet("s3n://myBucket/myPath/") Produces these... 2015-07-01 03:25:50,450 INFO [pool-14-thread-4] (org.apache.hadoop.fs.s3native.NativeS3FileSystem) - Opening 's3n://myBucket/myPath/part-r-00339.parquet' for reading That
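A sketch of the read that triggers those log lines (bucket and path are placeholders). In 1.4.0 the Parquet reader merges schemas across part files, which means opening every footer; later releases (1.5+) expose an option to skip the merge, shown here only as a comment since it does not apply to 1.4.0.

    // Each part-r-*.parquet footer is opened while the schema is resolved.
    val myDF = hiveContext.read.parquet("s3n://myBucket/myPath/")

    // On Spark 1.5+ the merge can be disabled explicitly:
    // val myDF = hiveContext.read.option("mergeSchema", "false").parquet("s3n://myBucket/myPath/")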

Spark 1.4.0: Parquet partitions / folder hierarchy changed from 1.3.1

2015-06-30 Thread Exie
So I was delighted with Spark 1.3.1 using Parquet 1.6.0, which would "partition" data into folders. So I set up some parquet data partitioned by date. This enabled us to reference a single day/month/year, minimizing how much data was scanned. e.g.: val myDataFrame = hiveContext.read.parquet("s3n://my
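A sketch of the layout being described, with hypothetical partition columns (year, month, day) and bucket: writing with partitionBy produces the year=/month=/day= folder hierarchy, and reading a sub-folder limits the scan to that slice of the data.

    // Assumes myDataFrame has year, month and day columns.
    myDataFrame.write
      .partitionBy("year", "month", "day")
      .parquet("s3n://myBucket/events/")

    // Read back a single day without touching the rest of the data set:
    val oneDay = hiveContext.read.parquet("s3n://myBucket/events/year=2015/month=6/day=30/")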

Spark 1.4.0: read.df() causes excessive IO

2015-06-29 Thread Exie
Hi Folks, I just stepped up from 1.3.1 to 1.4.0, the most notable difference for me so far is the data frame reader/writer. Previously: val myData = hiveContext.load("s3n://someBucket/somePath/","parquet") Now: val myData = hiveContext.read.parquet("s3n://someBucket/somePath") Using the ori
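Side by side, with a placeholder bucket and path, the pre-1.4 generic load() call and its 1.4 DataFrameReader equivalent; both return a DataFrame over the same Parquet data.

    // Spark 1.3.x style (still present in 1.4, but deprecated):
    val oldStyle = hiveContext.load("s3n://someBucket/somePath/", "parquet")

    // Spark 1.4.0 style:
    val newStyle = hiveContext.read.parquet("s3n://someBucket/somePath/")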

Spark 1.3.0 -> 1.3.1 produces java.lang.NoSuchFieldError: NO_FILTER

2015-05-14 Thread Exie
Hello Bright Sparks, I was using Spark 1.3.0 to push data out to Parquet files. They have been working great: super fast and an easy way to persist data frames, etc. However, I just swapped out Spark 1.3.0 and picked up the tarball for 1.3.1. I unzipped it, copied my config over, and then went to read one
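A hedged diagnostic sketch, not something from the thread: NoSuchFieldError: NO_FILTER usually indicates mixed Parquet jar versions on the classpath, since NO_FILTER was added to ParquetMetadataConverter in newer parquet-hadoop releases. Printing which jar supplies the class can show where the older copy is coming from; the class name below assumes the pre-rename "parquet." package used in the Spark 1.3.x era.

    // Prints the jar that actually provides the Parquet metadata converter class.
    val cls = Class.forName("parquet.format.converter.ParquetMetadataConverter")
    println(cls.getProtectionDomain.getCodeSource.getLocation)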