Re: Rdd - zip with index

2021-03-25 Thread Mich Talebzadeh
Pretty easy if you do it efficiently:

  gunzip --to-stdout csvfile.gz | bzip2 > csvfile.bz2

Just create a simple bash file to do it and print timings. cat convert_file.sh:

  #!/bin/bash
  # Convert a gzipped CSV to the splittable bzip2 format, printing timings
  GZFILE="csvfile.gz"
  FILE_NAME=$(basename "$GZFILE" .gz)
  BZFILE="$FILE_NAME.bz2"
  echo "$(date) === Started compressing $GZFILE"
  gunzip --to-stdout "$GZFILE" | bzip2 > "$BZFILE"
  echo "$(date) === Finished, created $BZFILE"

Re: Rdd - zip with index

2021-03-25 Thread KhajaAsmath Mohammed
Hi Mich, Yes, you are right. We were getting gz files and this is causing the issue. I will be changing it to bzip2 or another splittable format and try running it again today. Thanks, Asmath
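A minimal Scala sketch of what reading the converted file could look like (the path and session setup here are assumptions, not from the thread):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder.appName("bz2Read").getOrCreate()

  // bzip2 is a splittable codec, so Spark can divide one large .bz2
  // file into many input splits and read them in parallel
  val df = spark.read.option("header", "true").csv("/data/csvfile.bz2")

  println(df.rdd.getNumPartitions)  // well above 1 for a large file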

Re: Rdd - zip with index

2021-03-25 Thread Mich Talebzadeh
Hi Asmath, Have you actually managed to run this single file? Because Spark (as brought up a few times already) will read the whole of the GZ file into a single partition, so a single task does all the work and can hit an out-of-memory error. HTH
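One way to check this behaviour in spark-shell (file path hypothetical): the partition count of a gzipped input stays at 1 regardless of its size.

  // gzip is not splittable: the whole file becomes one input split,
  // so a single task reads the entire 10 GB
  val gz = spark.read.option("header", "true").csv("/data/csvfile.gz")
  println(gz.rdd.getNumPartitions)  // 1 for a single .gz file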

Re: Rdd - zip with index

2021-03-24 Thread ayan guha
Hi "I still dont understand how the spark will split the large file" -- This is actually achieved by something called InputFormat, which in turn depends on what type of file it is and what is the block size. Ex: If you have block size of 64MB, then a 10GB file will roughly translate to 10240/64 =

Re: Rdd - zip with index

2021-03-24 Thread Sean Owen
Right, that's all you do to tell it to treat the first line of the files as a header defining col names. Yes, .gz files still aren't splittable by nature. One huge .csv file would be split into partitions, but one .gz file would not, which can be a problem. To be clear, you do not need to do an…
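A short sketch of the option Sean describes, assuming a spark-shell session (path and the inferSchema choice are assumptions):

  // header=true makes Spark take column names from the first line
  // and drop that line from the data itself
  val df = spark.read
    .option("header", "true")
    .option("inferSchema", "true")   // optional: derive column types too
    .csv("/data/large.csv")

  df.printSchema()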

Re: Rdd - zip with index

2021-03-24 Thread KhajaAsmath Mohammed
Thanks Mich. I understood what I am supposed to do now and will try these options. I still don't understand how Spark will split the large file. I have a 10 GB file which I want to split automatically after reading. I can split and load the file before reading, but it is a very big requirement change.

Re: Rdd - zip with index

2021-03-24 Thread Mich Talebzadeh
How does Spark establish there is a csv header, as a matter of interest? Example:

  val df = spark.read.option("header", true).csv(location)

I need to tell Spark to ignore the header, correct? From Spark Read CSV file into DataFrame — SparkByExamples

Re: Rdd - zip with index

2021-03-24 Thread Sean Owen
No need to do that. Reading the header with Spark automatically is trivial.

Re: Rdd - zip with index

2021-03-24 Thread Mich Talebzadeh
If it is a csv then it is a flat file somewhere in a directory, I guess. Get the header out by doing:

  /usr/bin/zcat csvfile.gz | head -n 1

Title Number,Tenure,Property Address,District,County,Region,Postcode,Multiple Address Indicator,Price Paid,Proprietor Name (1),Company Registration No. (1),Pro…

Re: Rdd - zip with index

2021-03-23 Thread ayan guha
Best case is use a dataframe, and df.columns will automatically give you the column names. Are you sure your file is indeed in csv? Maybe it is easier if you share the code?
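A sketch of the DataFrame route (hypothetical path), which avoids the zipWithIndex step entirely:

  val df = spark.read.option("header", "true").csv("/data/large.csv")

  // df.columns returns the header-derived names as an Array[String]
  df.columns.foreach(println)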

Re: Rdd - zip with index

2021-03-23 Thread Sean Owen
It would split 10GB of CSV into multiple partitions by default, unless it's gzipped. Something else is going on here.

Re: Rdd - zip with index

2021-03-23 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
I'm not a Spark core developer and do not want to confuse you, but it seems logical to me that just reading from a single file (no matter what format the file is in) gives no parallelism unless you repartition by some column just after the csv load. But if you're telling me you've already tried…

Re: Rdd - zip with index

2021-03-23 Thread Sean Owen
I don't think that would change partitioning? Try .repartition(). It isn't necessary to write it out, let alone in Avro.
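A sketch of that suggestion; the partition count of 200 and the path are arbitrary example values:

  // The initial read of a .gz file is still a single task, but
  // repartitioning right after the read lets all downstream stages
  // run in parallel, with no intermediate Avro write needed
  val df = spark.read.option("header", "true").csv("/data/csvfile.gz")
  val parallel = df.repartition(200)
  println(parallel.rdd.getNumPartitions)  // 200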

Re: Rdd - zip with index

2021-03-23 Thread KhajaAsmath Mohammed
So Spark by default doesn't split the large 10gb file when loaded?

Re: Rdd - zip with index

2021-03-23 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
Hi Mohammed, I think that the reason only one executor is running, with a single partition, is that you have a single file that might be read/loaded into memory. In order to achieve better parallelism I'd suggest splitting the csv file. Another question: why are you using an RDD?

Rdd - zip with index

2021-03-23 Thread KhajaAsmath Mohammed
Hi, I have a 10gb file that should be loaded into a Spark dataframe. This file is csv with a header, and we were using rdd.zipWithIndex to get the column names and convert to avro accordingly. I am assuming this is why it is taking a long time: only one executor runs and it never achieves parallelism. Is there an easy way…
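Pulling the thread's suggestions together into one sketch (paths are hypothetical, and writing Avro assumes the external spark-avro package is on the classpath, e.g. --packages org.apache.spark:spark-avro_2.12:3.1.1):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder.appName("csvToAvro").getOrCreate()

  // header=true takes the column names from the first line, so no
  // rdd.zipWithIndex pass is needed; keep the source uncompressed or
  // in bzip2 so the read itself is parallel
  val df = spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("/data/input.csv")

  df.write.format("avro").save("/data/output_avro")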