The problem is on the writer's side. It takes longer to write to MinIO with
Spark 3.5.2 and Hadoop 3.4.1, so it seems something changed between Hadoop
2.7.6 and 3.4.1 that made the write process slower.
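
One way to double-check the writer-side conclusion is to cache the parsed
input and time only the Parquet write on both stacks. A rough sketch
(schema name and paths below are placeholders):

    val df = spark.read.schema(schema).option("header", "true")
      .csv("s3a://my-bucket/input.csv")
      .cache()
    df.count() // materialize the cache so the CSV read is excluded

    val t0 = System.nanoTime()
    df.write.mode("overwrite").parquet("s3a://my-bucket/output")
    println(s"write took ${(System.nanoTime() - t0) / 1e9} s")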

On Sun, Mar 23, 2025 at 12:09 AM Ángel Álvarez Pascua <
angel.alvarez.pas...@gmail.com> wrote:

> @Prem Sahoo <prem.re...@gmail.com> ,  could you test both versions of
> Spark+Hadoop by replacing your "write to MinIO" statement with
> write.format("noop")? This would help us determine whether the issue lies
> on the reader side or the writer side.
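>
> For example, a rough sketch (the schema and bucket path are placeholders
> for your actual ones):
>
>     val df = spark.read.schema(schema).csv("s3a://my-bucket/input/*.csv")
>     // The data is still fully read and parsed, but then discarded, so
>     // the measured time reflects the reader side only.
>     df.write.format("noop").mode("overwrite").save()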
>
> El dom, 23 mar 2025 a las 4:53, Prem Gmail (<premprakashsa...@gmail.com>)
> escribió:
>
>> The V2 committer with Spark 3.5.2 and Hadoop 3.4.1 should be much faster
>> than Spark 3.2.0 and Hadoop 2.7.6, but that's not the case. I also tried
>> the magic committer option, which is slower still. So something changed
>> internally that made this slow. May I know what?
>> Sent from my iPhone
>>
>> On Mar 22, 2025, at 11:05 PM, Kristopher Kane <kk...@etsy.com> wrote:
>>
>> 
>> We've seen significant performance gains in CSV going from 3.1 -> 3.5.
>>
>> You've pointed out exactly the change in FileOutputCommitter: v1 (safe,
>> serial, slow) -> v2 (parallel, faster, but unsafe on object stores
>> without atomic rename).  In v1, the output files are moved serially by
>> the driver from the staging directory to the final directory; in v2 the
>> moves are done at the task level.  It's possible the MinIO implementation
>> is overwhelmed by concurrent renames, but that's unlikely at 2.6 GB.
>> v2 is much, much faster on high-performing object stores and HDFS.
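>>
>> For reference, a minimal sketch of how the committer algorithm version
>> is usually switched (it's a Hadoop setting, so it has to reach the
>> Hadoop configuration):
>>
>>     // v1: driver renames task output serially at job commit (safe, slow)
>>     // v2: each task renames its own output at task commit (parallel,
>>     //     faster, but unsafe on stores without atomic rename)
>>     spark.sparkContext.hadoopConfiguration
>>       .set("mapreduce.fileoutputcommitter.algorithm.version", "2")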
>>
>> The difference between 27 seconds and 34 seconds in Spark can be caused
>> by many things and wouldn't have registered on my radar.
>>
>> Probably an email for the user mailing list.
>>
>> Kris
>>
>> On Sat, Mar 22, 2025 at 10:30 PM Prem Sahoo <prem.re...@gmail.com> wrote:
>>
>>> This data is inside my current project, so I can't move it into the
>>> public domain. But it seems something has changed that is causing this
>>> slowness.
>>> Sent from my iPhone
>>>
>>> On Mar 22, 2025, at 10:23 PM, Ángel Álvarez Pascua <
>>> angel.alvarez.pas...@gmail.com> wrote:
>>>
>>> 
>>> Could you take three thread dumps from one of the executors while Spark
>>> is performing the conversion? You can use the Spark UI for that.
>>>
>>> El dom, 23 mar 2025 a las 3:20, Ángel Álvarez Pascua (<
>>> angel.alvarez.pas...@gmail.com>) escribió:
>>>
>>>> Without the data, it's difficult to analyze. Could you provide some
>>>> synthetic data so I can investigate this further? The schema and a few
>>>> sample fake rows should be sufficient.
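>>>>
>>>> Something along these lines would be enough (column names, types and
>>>> row count below are made up; the point is matching the width and rough
>>>> size of your real file):
>>>>
>>>>     import org.apache.spark.sql.functions._
>>>>
>>>>     // ~100 columns of random doubles written out as CSV
>>>>     val cols = (1 to 100).map(i => rand().alias(s"c$i"))
>>>>     spark.range(5000000).select(cols: _*)
>>>>       .write.option("header", "true")
>>>>       .csv("s3a://my-bucket/synthetic_csv")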
>>>>
>>>> El dom, 23 mar 2025 a las 3:17, Prem Sahoo (<prem.re...@gmail.com>)
>>>> escribió:
>>>>
>>>>> I am providing the schema, and the schema is correct: it has all the
>>>>> columns present in the CSV. So we can rule that out as the cause of
>>>>> the slowness. Maybe there are other contributing factors.
>>>>> Sent from my iPhone
>>>>>
>>>>> On Mar 22, 2025, at 10:05 PM, Ángel Álvarez Pascua <
>>>>> angel.alvarez.pas...@gmail.com> wrote:
>>>>>
>>>>> 
>>>>>
>>>>> Hey, just this week I found some issues with the Univocity library
>>>>> that Spark internally uses to read CSV files.
>>>>>
>>>>> *Spark CSV Read Low Performance: EOFExceptions in Univocity Parser*
>>>>> https://issues.apache.org/jira/projects/SPARK/issues/SPARK-51579
>>>>>
>>>>> I initially assumed this issue had existed since Spark started using
>>>>> this library, but perhaps something changed in the versions you mentioned.
>>>>>
>>>>> Are you providing a schema, or are you letting Spark infer it? I've
>>>>> also noticed that when the schema doesn't match the columns in the CSV
>>>>> files (for example, different number of columns), exceptions are thrown
>>>>> internally.
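>>>>>
>>>>> As a quick sanity check (field names below are placeholders), an
>>>>> explicit schema both avoids the extra inference pass and makes a
>>>>> column-count mismatch visible instead of silently handled:
>>>>>
>>>>>     import org.apache.spark.sql.types._
>>>>>
>>>>>     val schema = StructType(Seq(
>>>>>       StructField("col1", StringType),
>>>>>       StructField("col2", DoubleType)
>>>>>       // ... one StructField per CSV column
>>>>>     ))
>>>>>
>>>>>     val df = spark.read
>>>>>       .schema(schema)
>>>>>       .option("header", "true")
>>>>>       .csv("s3a://my-bucket/input/*.csv")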
>>>>>
>>>>> Given all this, my initial hypothesis is that thousands upon thousands
>>>>> of exceptions are being thrown internally, only to be handled by the
>>>>> Univocity parser—so the user isn't even aware of what's happening.
>>>>>
>>>>>
>>>>> El dom, 23 mar 2025 a las 2:40, Prem Sahoo (<prem.re...@gmail.com>)
>>>>> escribió:
>>>>>
>>>>>> Hello ,
>>>>>> I am reading a CSV file of about 2.7 GB with 100 columns. When I
>>>>>> convert it to Parquet with Spark 3.2 and Hadoop 2.7.6 it takes 28
>>>>>> seconds, but with Spark 3.5.2 and Hadoop 3.4.1 it takes 34 seconds.
>>>>>> That is a bad regression.
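>>>>>>
>>>>>> For context, the conversion is essentially just the following
>>>>>> (paths are placeholders):
>>>>>>
>>>>>>     val t0 = System.nanoTime()
>>>>>>     spark.read.schema(schema).option("header", "true")
>>>>>>       .csv("s3a://my-bucket/input.csv")
>>>>>>       .write.mode("overwrite")
>>>>>>       .parquet("s3a://my-bucket/output_parquet")
>>>>>>     println(s"took ${(System.nanoTime() - t0) / 1e9} s")
>>>>>>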
>>>>>> Sent from my iPhone
>>>>>>
>>>>>> On Mar 22, 2025, at 9:21 PM, Ángel Álvarez Pascua <
>>>>>> angel.alvarez.pas...@gmail.com> wrote:
>>>>>>
>>>>>> 
>>>>>> Sure. I love performance challenges and mysteries!
>>>>>>
>>>>>> Please, could you provide an example project or the steps to build
>>>>>> one?
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> El dom, 23 mar 2025, 2:17, Prem Sahoo <prem.re...@gmail.com>
>>>>>> escribió:
>>>>>>
>>>>>>> Hello Team,
>>>>>>> I was working with Spark 3.2 and Hadoop 2.7.6, writing to MinIO
>>>>>>> object storage. It was slower compared to writing to MapR FS with the
>>>>>>> same tech stack. I then moved to the upgraded stack of Spark 3.5.2 and
>>>>>>> Hadoop 3.4.1, which writes to MinIO with the V2 FileOutputCommitter,
>>>>>>> and found the performance to be worse than on the old tech stack. I
>>>>>>> then tried the magic committer, which came out slower than V2, so with
>>>>>>> the latest tech stack the performance has degraded. Could someone
>>>>>>> please assist?
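>>>>>>>
>>>>>>> For reference, a sketch of the committer settings involved, assuming
>>>>>>> the spark-hadoop-cloud module is on the classpath (values shown are
>>>>>>> illustrative; endpoint and credential settings are omitted):
>>>>>>>
>>>>>>>     import org.apache.spark.sql.SparkSession
>>>>>>>
>>>>>>>     val spark = SparkSession.builder()
>>>>>>>       // S3A "magic" committer (Hadoop 3.x, no rename at job commit)
>>>>>>>       .config("spark.hadoop.fs.s3a.committer.name", "magic")
>>>>>>>       .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
>>>>>>>       .config("spark.sql.sources.commitProtocolClass",
>>>>>>>         "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
>>>>>>>       .config("spark.sql.parquet.output.committer.class",
>>>>>>>         "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
>>>>>>>       .getOrCreate()
>>>>>>>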
>>>>>>> Sent from my iPhone
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>>
>>>>>>>
