On 19 Nov 2016, at 17:21, vr spark <vrspark...@gmail.com> wrote:

Hi,
I am looking for Scala or Python code samples to convert a local TSV file to an ORC 
file and store it on distributed cloud storage (OpenStack).

So I need these 3 samples. Please suggest.

1. read TSV
2. convert to ORC
3. store on distributed cloud storage


thanks
VR

All three steps, nine lines of code, assuming a Spark context has already been set up 
with the permissions to write to AWS, and the relevant JARs for S3A on the 
classpath. The read operation is inefficient: to infer the schema it scans the 
(here, remote) file twice. That may be acceptable for an example, but I wouldn't do 
it in production. The source is a real file belonging to Amazon; the destination is a 
bucket of mine.

More details at: 
http://www.slideshare.net/steve_l/apache-spark-and-object-stores


val csvdata = spark.read.options(Map(
  "header" -> "true",
  "ignoreLeadingWhiteSpace" -> "true",
  "ignoreTrailingWhiteSpace" -> "true",
  "timestampFormat" -> "yyyy-MM-dd HH:mm:ss.SSSZZZ",
  "inferSchema" -> "true",
  "mode" -> "FAILFAST"))
    .csv("s3a://landsat-pds/scene_list.gz")
csvdata.write.mode("overwrite").orc("s3a://hwdev-stevel-demo2/landsatOrc")
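
Since the original ask was a local TSV rather than a remote CSV, and the schema 
inference above costs two passes over the source, here is a minimal sketch of the 
production-leaning variant: an explicit schema, a tab separator, and a single read. 
The field names, local path and bucket below are placeholders, not anything taken 
from the example above.

import org.apache.spark.sql.types._

// Explicit schema: no inference pass, so the source is only read once.
// The field names and types are placeholders for whatever the TSV holds.
val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType, nullable = true),
  StructField("created", TimestampType, nullable = true)))

val tsvData = spark.read
  .options(Map(
    "header" -> "true",
    "sep" -> "\t",                    // tab-separated input
    "timestampFormat" -> "yyyy-MM-dd HH:mm:ss.SSSZZZ",
    "mode" -> "FAILFAST"))
  .schema(schema)                     // skip inferSchema, single pass over the file
  .csv("file:///data/input.tsv")      // hypothetical local TSV path

// The ORC write is the same call whatever the destination filesystem;
// the URI just has to match a connector on the classpath.
tsvData.write.mode("overwrite").orc("s3a://example-bucket/landing/orc")

For the OpenStack part of the question, the write call stays the same; once an 
object store connector (e.g. Hadoop's Swift module) and its credentials are 
configured, you would swap the s3a:// URI for that store's URI.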
