Best case is to use a DataFrame: df.columns will automatically give you the
column names. Are you sure your file is really CSV? Maybe it would be easier
if you shared the code?
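
For example, a minimal sketch assuming PySpark (the path is a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# header=True makes Spark take the column names from the first row.
df = spark.read.csv("/path/to/data.csv", header=True)
print(df.columns)  # column names, straight from the header row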

On Wed, 24 Mar 2021 at 2:12 pm, Sean Owen <sro...@gmail.com> wrote:

> It would split 10GB of CSV into multiple partitions by default, unless
> it's gzipped. Something else is going on here.
>
> On Tue, Mar 23, 2021 at 10:04 PM Yuri Oleynikov (יורי אולייניקוב)
> <yur...@gmail.com> wrote:
>
>> I’m not a Spark core developer and don’t want to confuse you, but it
>> seems logical to me that just reading from a single file (no matter what
>> format the file is in) gives you no parallelism unless you repartition by
>> some column right after the CSV load. But you’re saying you’ve already
>> tried repartition with no luck...
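
If repartition really didn’t help, it may be worth double-checking how it
was called. A rough PySpark sketch (the column name and partition count
are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.csv("/path/to/data.csv", header=True)
# Redistribute the single input partition across the cluster;
# "some_column" and 200 are illustrative values only.
df = df.repartition(200, "some_column")
print(df.rdd.getNumPartitions())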
>>
>>
>> > On 24 Mar 2021, at 03:47, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com>
>> wrote:
>> >
>> > So Spark by default doesn’t split a large 10GB file when it’s loaded?
>> >
>> > Sent from my iPhone
>> >
>> >> On Mar 23, 2021, at 8:44 PM, Yuri Oleynikov (יורי אולייניקוב) <
>> yur...@gmail.com> wrote:
>> >>
>> >> Hi, Mohammed
>> >> I think the reason that only one executor is running, with a single
>> partition, is that you have a single file, which might be read/loaded
>> into memory.
>> >>
>> >> To achieve better parallelism, I’d suggest splitting the csv file.
>> >>
>>
>> --
Best Regards,
Ayan Guha
