@Prem Sahoo <prem.re...@gmail.com>, could you test both versions of Spark+Hadoop after replacing your "write to MinIO" statement with write.format("noop")? That would tell us whether the issue lies on the reader side or the writer side.
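A minimal sketch of what I mean, assuming `df` is the DataFrame read from the CSV source (e.g. in spark-shell):

    // The noop sink runs the full read/parse pipeline but writes nothing,
    // so the elapsed time isolates the reader side.
    spark.time {
      df.write.format("noop").mode("overwrite").save()
    }

If both stacks take about the same time with noop, the regression is on the write/commit path; if the gap remains, it is on the read path.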
On Sun, Mar 23, 2025 at 4:53, Prem Gmail (<premprakashsa...@gmail.com>) wrote:

> The V2 writer on Spark 3.5.2 and Hadoop 3.4.1 should be much faster than
> Spark 3.2.0 and Hadoop 2.7.6, but that's not the case. I also tried the
> magic committer option, which was again slower. So something changed
> internally that made this slow. May I know what?
> Sent from my iPhone
>
> On Mar 22, 2025, at 11:05 PM, Kristopher Kane <kk...@etsy.com> wrote:
>
> We've seen significant performance gains in CSV going from 3.1 -> 3.5.
>
> You've pointed out exactly the change in fileoutputcommitter: v1 (safe,
> serial, slow) -> v2 (unsafe on object stores without atomic rename,
> parallel, faster). In V1, the output files are moved serially by the
> driver from the staging directory to the final directory; in V2 this is
> done at the task level. It's possible the MinIO implementation is
> overwhelmed by concurrent renames, but that's unlikely at 2.6 GB. V2 is
> much, much faster on high-performing object stores and on HDFS.
>
> The difference between 27 and 34 seconds in Spark can be caused by many
> things and wouldn't have surfaced on my radar.
>
> This is probably a question for the user mailing list.
>
> Kris
>
> On Sat, Mar 22, 2025 at 10:30 PM Prem Sahoo <prem.re...@gmail.com> wrote:
>
>> This is inside my current project, so I can't move the data to the
>> public domain. But it does seem something changed that caused this
>> slowness.
>> Sent from my iPhone
>>
>> On Mar 22, 2025, at 10:23 PM, Ángel Álvarez Pascua <
>> angel.alvarez.pas...@gmail.com> wrote:
>>
>> Could you take three thread dumps from one of the executors while Spark
>> is performing the conversion? You can use the Spark UI for that.
>>
>> On Sun, Mar 23, 2025 at 3:20, Ángel Álvarez Pascua (<
>> angel.alvarez.pas...@gmail.com>) wrote:
>>
>>> Without the data, it's difficult to analyze. Could you provide some
>>> synthetic data so I can investigate this further? The schema and a few
>>> fake sample rows should be sufficient.
>>>
>>> On Sun, Mar 23, 2025 at 3:17, Prem Sahoo (<prem.re...@gmail.com>)
>>> wrote:
>>>
>>>> I am providing the schema, and the schema is correct: it has all the
>>>> columns present in the CSV. So we can rule the schema out as the cause
>>>> of the slowness. Maybe there are other contributing factors.
>>>> Sent from my iPhone
>>>>
>>>> On Mar 22, 2025, at 10:05 PM, Ángel Álvarez Pascua <
>>>> angel.alvarez.pas...@gmail.com> wrote:
>>>>
>>>> Hey, just this week I found some issues with the Univocity library
>>>> that Spark uses internally to read CSV files:
>>>>
>>>> *Spark CSV Read Low Performance: EOFExceptions in Univocity Parser*
>>>> https://issues.apache.org/jira/projects/SPARK/issues/SPARK-51579
>>>>
>>>> I initially assumed this issue had existed since Spark started using
>>>> this library, but perhaps something changed between the versions you
>>>> mentioned.
>>>>
>>>> Are you providing a schema, or are you letting Spark infer it? I've
>>>> also noticed that when the schema doesn't match the columns in the CSV
>>>> files (for example, a different number of columns), exceptions are
>>>> thrown internally.
>>>>
>>>> Given all this, my initial hypothesis is that thousands upon thousands
>>>> of exceptions are being thrown internally, only to be handled by the
>>>> Univocity parser, so the user isn't even aware of what's happening.
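>>>>
>>>> One way to surface those hidden failures is to read with an explicit
>>>> schema and FAILFAST mode. A minimal sketch, with hypothetical column
>>>> names and a placeholder path:
>>>>
>>>>     import org.apache.spark.sql.types._
>>>>
>>>>     // Hypothetical two-column schema; the real job would declare all 100 columns.
>>>>     val schema = StructType(Seq(
>>>>       StructField("id", LongType, nullable = true),
>>>>       StructField("name", StringType, nullable = true)
>>>>     ))
>>>>
>>>>     val df = spark.read
>>>>       .schema(schema)
>>>>       .option("header", "true")
>>>>       .option("mode", "FAILFAST") // fail loudly on malformed rows instead of handling them silently
>>>>       .csv("s3a://my-bucket/input/data.csv") // placeholder path
>>>>
>>>> If the job fails immediately under FAILFAST, the slowness is likely in
>>>> malformed-row handling rather than in the committer.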
>>>>
>>>> On Sun, Mar 23, 2025 at 2:40, Prem Sahoo (<prem.re...@gmail.com>)
>>>> wrote:
>>>>
>>>>> Hello,
>>>>> I read a 2.7 GB CSV file with 100 columns. Converting it to Parquet
>>>>> takes 28 seconds on Spark 3.2 with Hadoop 2.7.6, but 34 seconds on
>>>>> Spark 3.5.2 with Hadoop 3.4.1. That is a bad regression.
>>>>> Sent from my iPhone
>>>>>
>>>>> On Mar 22, 2025, at 9:21 PM, Ángel Álvarez Pascua <
>>>>> angel.alvarez.pas...@gmail.com> wrote:
>>>>>
>>>>> Sure. I love performance challenges and mysteries!
>>>>>
>>>>> Please, could you provide an example project or the steps to build
>>>>> one?
>>>>>
>>>>> Thanks.
>>>>>
>>>>> On Sun, Mar 23, 2025 at 2:17, Prem Sahoo <prem.re...@gmail.com> wrote:
>>>>>
>>>>>> Hello Team,
>>>>>> I was working with Spark 3.2 and Hadoop 2.7.6, writing to MinIO
>>>>>> object storage. It was slower than writing to MapR FS with the same
>>>>>> stack. I then moved to the upgraded versions, Spark 3.5.2 and Hadoop
>>>>>> 3.4.1, writing to MinIO with the V2 fileoutputcommitter, and the
>>>>>> measured performance was worse than with the old stack. I then tried
>>>>>> the magic committer, which came out even slower than V2, so with the
>>>>>> latest stack the performance has degraded. Could someone please
>>>>>> assist?
>>>>>> Sent from my iPhone
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
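For completeness, the two committer setups being compared can be toggled in the session config. A minimal sketch (the app name is a placeholder, and the magic-committer option assumes the spark-hadoop-cloud module is on the classpath):

    import org.apache.spark.sql.SparkSession

    // Option A: classic FileOutputCommitter, algorithm version 2
    // (output files are moved by tasks in parallel, not serially by the driver).
    val spark = SparkSession.builder()
      .appName("csv-to-parquet-benchmark") // placeholder name
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      // Option B: the S3A magic committer instead -- it avoids renames entirely
      // but requires the spark-hadoop-cloud module:
      //   .config("spark.hadoop.fs.s3a.committer.name", "magic")
      //   .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
      //   .config("spark.sql.sources.commitProtocolClass",
      //     "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
      //   .config("spark.sql.parquet.output.committer.class",
      //     "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
      .getOrCreate()

Timing both variants against the same input should show whether the regression tracks the committer at all, or whether it sits on the read side as the noop test above would indicate.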