On 19 Nov 2016, at 17:21, vr spark <vrspark...@gmail.com> wrote:
> Hi, I am looking for Scala or Python code samples to convert a local TSV
> file to an ORC file and store it on distributed cloud storage (OpenStack).
> So I need these three samples. Please suggest:
> 1. read TSV
> 2. convert to ORC
> 3. store on distributed cloud storage
> thanks, VR

All options, nine lines of code, assuming a Spark context has already been
set up with the permissions to write to AWS, and the relevant JARs for S3A
on the classpath. The read operation is inefficient: to infer the schema it
scans the (here, remote) file twice. That may be acceptable for an example,
but I wouldn't do it in production. The source is a real file belonging to
Amazon; the destination is a bucket of mine. More details at:
http://www.slideshare.net/steve_l/apache-spark-and-object-stores

val csvdata = spark.read.options(Map(
    "header" -> "true",
    "ignoreLeadingWhiteSpace" -> "true",
    "ignoreTrailingWhiteSpace" -> "true",
    "timestampFormat" -> "yyyy-MM-dd HH:mm:ss.SSSZZZ",
    "inferSchema" -> "true",
    "mode" -> "FAILFAST"))
  .csv("s3a://landsat-pds/scene_list.gz")
csvdata.write.mode("overwrite").orc("s3a://hwdev-stevel-demo2/landsatOrc")
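Since the original question was about TSV rather than CSV, here is a sketch
of that variant under the same assumptions (a live Spark session, S3A JARs
and credentials in place). The input path, output bucket, and column names
are placeholders, not real resources. Supplying an explicit schema also
avoids the double scan that inferSchema causes:

import org.apache.spark.sql.types._

// Hypothetical schema for the TSV file; adjust to your columns.
val schema = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("value", DoubleType, nullable = true)))

val tsvdata = spark.read
  .schema(schema)                 // explicit schema: no inference, one scan
  .option("header", "true")
  .option("sep", "\t")            // tab separator turns the CSV reader into a TSV reader
  .option("mode", "FAILFAST")
  .csv("file:///data/input.tsv")  // hypothetical local path

// Same ORC write as above, against a hypothetical bucket of yours.
tsvdata.write.mode("overwrite").orc("s3a://my-bucket/landsat/orc")

The same "sep" option works from PySpark if you'd rather stay in Python; the
read/write calls are otherwise identical.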