Hi, the source file I have is on my local machine, and it's pretty huge,
around 150 GB. How should I go about it?
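
A minimal sketch of one way to do it, assuming Spark runs in local mode or the
file sits on a mount every executor can read; the local path and destination
bucket below are placeholders, not real locations:

// Read the local TSV (uncompressed text is splittable, so the read
// parallelizes) and write it out as ORC part files.
val bigTsv = spark.read
  .option("sep", "\t")
  .option("header", "true")
  .csv("file:///data/source.tsv")      // hypothetical 150 GB source

bigTsv.write
  .mode("overwrite")
  .orc("s3a://my-bucket/bigTsvOrc")    // hypothetical destination

The output is a directory of part files rather than one 150 GB object, which
is usually what you want for anything reading it downstream.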

On Sun, Nov 20, 2016 at 8:52 AM, Steve Loughran <ste...@hortonworks.com>
wrote:

>
> On 19 Nov 2016, at 17:21, vr spark <vrspark...@gmail.com> wrote:
>
> Hi,
> I am looking for Scala or Python code samples to convert a local TSV file
> to an ORC file and store it on distributed cloud storage (OpenStack).
>
> So I need these three samples. Please suggest:
>
> 1. read TSV
> 2. convert to ORC
> 3. store on distributed cloud storage
>
>
> thanks
> VR
>
>
> All options, nine lines of code, assuming a Spark context has already been
> set up with the permissions to write to AWS, and the relevant JARs for S3A
> on the classpath. The read is inefficient: to infer the schema it scans
> the (here, remote) file twice. That may be OK for an example, but I
> wouldn't do it in production. The source is a real file belonging to
> Amazon; the destination is a bucket of mine. A schema-declared variant and
> a note on the Swift destination are sketched after the example.
>
> More details at: http://www.slideshare.net/steve_l/apache-spark-and-object-stores
>
>
> // Read the gzipped CSV; inferSchema costs an extra pass over the data.
> val csvdata = spark.read.options(Map(
>   "header" -> "true",
>   "ignoreLeadingWhiteSpace" -> "true",
>   "ignoreTrailingWhiteSpace" -> "true",
>   "timestampFormat" -> "yyyy-MM-dd HH:mm:ss.SSSZZZ",
>   "inferSchema" -> "true",
>   "mode" -> "FAILFAST"))
>     .csv("s3a://landsat-pds/scene_list.gz")
>
> // Write it back out as ORC, replacing any existing output.
> csvdata.write.mode("overwrite").orc("s3a://hwdev-stevel-demo2/landsatOrc")
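>
> To skip the second scan, the same read can declare the schema up front
> instead of inferring it; a sketch, with illustrative column names rather
> than the file's real layout:
>
> import org.apache.spark.sql.types._
>
> // Hypothetical columns; substitute the real schema of the file.
> val schema = StructType(Seq(
>   StructField("entityId", StringType),
>   StructField("cloudCover", DoubleType)))
>
> val csvOnce = spark.read
>   .schema(schema)             // no inference pass over the data
>   .options(Map(
>     "header" -> "true",
>     "mode" -> "FAILFAST"))
>   .csv("s3a://landsat-pds/scene_list.gz")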
>
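> For the original question's OpenStack destination, the write side is the
> same call with Hadoop's swift:// filesystem instead of s3a://. A sketch,
> assuming hadoop-openstack is on the classpath and the
> fs.swift.service.<provider> credentials are configured; the container and
> provider names here are placeholders:
>
> csvdata.write
>   .mode("overwrite")
>   .orc("swift://mycontainer.myprovider/landsatOrc")
>
> And for a TSV source rather than CSV, add "sep" -> "\t" to the read
> options.
>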
