Interestingly, after more digging, df.printSchema() in raw Spark shows the
columns as long, not bigint.
root
|-- localEventDtTm: timestamp (nullable = true)
|-- asset: string (nullable = true)
|-- assetCategory: string (nullable = true)
|-- assetType: string (nullable = true)
|-- event: s
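Hedged aside: "bigint" is just Spark SQL's DDL name for LongType, so the long
printed here and the "bigint" that schema inference reports should be the same
underlying type. A quick check from the Scala shell (column name hypothetical):
// returns LongType whether the DDL called it bigint or long
df.schema("someLongColumn").dataType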
Hi Folks,
Using Spark to read in JSON files and detect the schema, it gives me a
dataframe with a "bigint" field. R then fails to import the dataframe as it
can't convert the type.
> head(mydf)
Error in as.data.frame.default(x[[i]], optional = TRUE) :
cannot coerce class '"jobj"' to a data.frame
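One possible workaround, sketched with a hypothetical column name rather than
taken from the thread: cast the 64-bit column to something narrower on the
Spark side before handing the frame to R, e.g. from the Scala shell:
import org.apache.spark.sql.functions.col
// cast the long/bigint column to double so the R side can coerce it
val rFriendly = df.withColumn("someCount", col("someCount").cast("double"))
// expose it for the R session to pick up
rFriendly.registerTempTable("r_friendly")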
FWIW, I had some trouble getting Spark running on a Pi.
My core problem was using snappy for compression, as it comes as a pre-made
binary for i386 and I couldn't find one for ARM.
To work around it there was an option to use LZO instead, and then everything
worked.
Off the top of my head, it was
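probably something along these lines (a sketch from memory rather than my
exact settings; there are two separate places snappy can show up):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
// Spark's internal shuffle/broadcast compression: "lzf" is a pure-JVM codec,
// so no native snappy binary is needed
val conf = new SparkConf().setAppName("pi-spark").set("spark.io.compression.codec", "lzf")
val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)
// Parquet output compression is a separate SQL setting; "lzo" and "gzip" are
// accepted alternatives to "snappy"
hiveContext.setConf("spark.sql.parquet.compression.codec", "lzo")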
Not sure if this helps, but the options I set are slightly different:
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3n.awsAccessKeyId", "key")
hadoopConf.set("fs.s3n.awsSecretAccessKey", "secret")
Try setting them to s3n as opposed to just s3.
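In other words, the property prefix has to match the URL scheme you actually
read from (fs.s3n.* pairs with s3n:// paths, fs.s3.* with s3://). With the keys
above set, a read would look like this (placeholder bucket/path):
val myDF = hiveContext.read.parquet("s3n://someBucket/somePath/")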
Good luck!
Just to add to this, here's some more info:
val myDF = hiveContext.read.parquet("s3n://myBucket/myPath/")
Produces these...
2015-07-01 03:25:50,450 INFO [pool-14-thread-4]
(org.apache.hadoop.fs.s3native.NativeS3FileSystem) - Opening
's3n://myBucket/myPath/part-r-00339.parquet' for reading
That
So I was delighted with Spark 1.3.1 using Parquet 1.6.0, which would
"partition" data into folders. So I set up some Parquet data partitioned by
date. This enabled us to reference a single day/month/year, minimizing how
much data was scanned.
e.g.:
val myDataFrame =
hiveContext.read.parquet("s3n://my
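A hedged sketch of the layout being described, with placeholder names and the
1.4-style writer API used purely for illustration:
// write, partitioned into year=/month=/day= folders
myDataFrame.write.partitionBy("year", "month", "day")
  .parquet("s3n://someBucket/events/")
// read back and filter on the partition columns so only one day's folders get scanned
val oneDay = hiveContext.read.parquet("s3n://someBucket/events/")
  .filter("year = 2015 and month = 7 and day = 1")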
Hi Folks,
I just stepped up from 1.3.1 to 1.4.0; the most notable difference for me so
far is the DataFrame reader/writer. Previously:
val myData = hiveContext.load("s3n://someBucket/somePath/","parquet")
Now:
val myData = hiveContext.read.parquet("s3n://someBucket/somePath")
Using the ori
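The writer side changed the same way, if that helps anyone else (placeholder
paths): the 1.3-style
myData.save("s3n://someBucket/out/", "parquet")
becomes, in 1.4,
myData.write.parquet("s3n://someBucket/out/")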
Hello Bright Sparks,
I was using Spark 1.3.0 to push data out to Parquet files. They have been
working great: a super fast, easy way to persist data frames, etc.
However, I just swapped out Spark 1.3.0 and picked up the tarball for 1.3.1.
I unzipped it, copied my config over, and then went to read one
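(For reference, the 1.3.0-era persist call being described is presumably
something along the lines of myDF.saveAsParquetFile("s3n://someBucket/somePath/"),
with a placeholder name and path.)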