Hi,

The source file I have is on my local machine, and it's pretty huge, around 150 GB. How should I go about it?
On Sun, Nov 20, 2016 at 8:52 AM, Steve Loughran <ste...@hortonworks.com> wrote:

> On 19 Nov 2016, at 17:21, vr spark <vrspark...@gmail.com> wrote:
>
> Hi,
> I am looking for Scala or Python code samples to convert a local TSV file
> to an ORC file and store it on distributed cloud storage (OpenStack).
>
> So, I need these 3 samples. Please suggest.
>
> 1. read tsv
> 2. convert to orc
> 3. store on distributed cloud storage
>
> thanks
> VR
>
>
> All options, 9 lines of code, assuming a Spark context has already been
> set up with the permissions to write to AWS, and the relevant JARs for S3A
> to work on the classpath. The read operation is inefficient: to determine
> the schema it scans the (here, remote) file twice. That may be OK for an
> example, but I wouldn't do it in production. The source is a real file
> belonging to Amazon; the dest is a bucket of mine.
>
> More details at: http://www.slideshare.net/steve_l/apache-spark-and-object-stores
>
>
> val csvdata = spark.read.options(Map(
>     "header" -> "true",
>     "ignoreLeadingWhiteSpace" -> "true",
>     "ignoreTrailingWhiteSpace" -> "true",
>     "timestampFormat" -> "yyyy-MM-dd HH:mm:ss.SSSZZZ",
>     "inferSchema" -> "true",
>     "mode" -> "FAILFAST"))
>   .csv("s3a://landsat-pds/scene_list.gz")
> csvdata.write.mode("overwrite").orc("s3a://hwdev-stevel-demo2/landsatOrc")
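[Editor's note: for the 150 GB local TSV case asked about above, a minimal sketch of the same three steps is below. It is not from this thread: the file path, Swift service name, container name, and column schema are all made-up placeholders, and it assumes a spark-shell with the hadoop-openstack (Swift) connector on the classpath and configured. Declaring the schema up front avoids the second full pass over the data that Steve notes inferSchema would trigger, which matters at 150 GB.]

    import org.apache.spark.sql.types._

    // Hypothetical schema: replace with the real columns of your TSV.
    // An explicit schema means Spark reads the 150 GB file only once.
    val schema = StructType(Seq(
      StructField("id", LongType, nullable = false),
      StructField("name", StringType, nullable = true),
      StructField("created", TimestampType, nullable = true)))

    val tsv = spark.read
      .option("header", "true")
      .option("sep", "\t")             // TSV, so override the default comma
      .option("mode", "FAILFAST")
      .schema(schema)
      .csv("file:///data/source.tsv")  // local path: in a cluster, every
                                       // executor must see this file, so
                                       // either run in local mode or put
                                       // the file on shared storage first

    // swift://container.service/path is the Hadoop Swift connector's URL
    // form; "mycontainer" and "sahara" are placeholder names here.
    tsv.write.mode("overwrite").orc("swift://mycontainer.sahara/tsvAsOrc")

[The write is otherwise identical to Steve's S3A example; only the destination filesystem scheme changes.]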