Re: Rdd - zip with index

KhajaAsmath Mohammed Wed, 24 Mar 2021 09:05:40 -0700

Thanks Mich. I understood what I am supposed to do now, will try these
options.


I still dont understand how the spark will split the large file. I have a
10 GB file which I want to split automatically after reading. I can split
and load the file before reading but it is a very big requirement change
for all our data pipeline.

Is there a way to split the file once it is read to achieve parallelism ?
I will group groupby on one column to see if that improves my job.

On Wed, Mar 24, 2021 at 10:56 AM Mich Talebzadeh <[email protected]>
wrote:

> How does Spark establish there is a csv header as a matter of interest?
>
> Example
>
> val df = spark.read.option("header", true).csv(location)
>
> I need to tell spark to ignore the header correct?
>
> From Spark Read CSV file into DataFrame — SparkByExamples
> <https://sparkbyexamples.com/spark/spark-read-csv-file-into-dataframe/>
>
> If you have a header with column names on file, you need to explicitly
> specify true for header option using option("header",true)
> <https://sparkbyexamples.com/spark/spark-read-csv-file-into-dataframe/#header>
>  not
> mentioning this, the API treats header as a data record.
>
> Second point which may not be applicable to the newer versions of Spark. My
> understanding is that the gz file is not splittable, therefore Spark needs
> to read the whole file using a single core which will slow things down (CPU
> intensive). After the read is done the data can be shuffled to increase
> parallelism.
>
> HTH
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Wed, 24 Mar 2021 at 12:40, Sean Owen <[email protected]> wrote:
>
>> No need to do that. Reading the header with Spark automatically is
>> trivial.
>>
>> On Wed, Mar 24, 2021 at 5:25 AM Mich Talebzadeh <
>> [email protected]> wrote:
>>
>>> If it is a csv then it is a flat file somewhere in a directory I guess.
>>>
>>> Get the header out by doing
>>>
>>> */usr/bin/zcat csvfile.gz |head -n 1*
>>> Title Number,Tenure,Property
>>> Address,District,County,Region,Postcode,Multiple Address Indicator,Price
>>> Paid,Proprietor Name (1),Company Registration No. (1),Proprietorship
>>> Category (1),Country Incorporated (1),Proprietor (1) Address (1),Proprietor
>>> (1) Address (2),Proprietor (1) Address (3),Proprietor Name (2),Company
>>> Registration No. (2),Proprietorship Category (2),Country Incorporated
>>> (2),Proprietor (2) Address (1),Proprietor (2) Address (2),Proprietor (2)
>>> Address (3),Proprietor Name (3),Company Registration No. (3),Proprietorship
>>> Category (3),Country Incorporated (3),Proprietor (3) Address (1),Proprietor
>>> (3) Address (2),Proprietor (3) Address (3),Proprietor Name (4),Company
>>> Registration No. (4),Proprietorship Category (4),Country Incorporated
>>> (4),Proprietor (4) Address (1),Proprietor (4) Address (2),Proprietor (4)
>>> Address (3),Date Proprietor Added,Additional Proprietor Indicator
>>>
>>>
>>> 10GB is not much of a big CSV file
>>>
>>> that will resolve the header anyway.
>>>
>>>
>>> Also how are you running the spark, in a local mode (single jvm) or
>>> other distributed modes (yarn, standalone) ?
>>>
>>>
>>> HTH
>>>
>>

Re: Rdd - zip with index

Reply via email to