That's why I used the words "synthetic" and "fake" when referring to data.
Anyway, the most important thing might be the thread dumps.
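
In case it helps, here is a rough sketch of how those dumps can be collected:
either from the Executors tab of the Spark UI (each executor row has a "Thread
Dump" link), or programmatically from the monitoring REST API. Everything below
(driver host, application id, executor id) is a placeholder, not taken from
your environment.

  import scala.io.Source
  import java.nio.file.{Files, Paths}

  // Sketch: pull three thread dumps from one executor via the REST API,
  // a few seconds apart, while the CSV-to-Parquet job is running.
  val driverUi = "http://driver-host:4040"        // hypothetical driver UI address
  val appId    = "app-20250323031500-0001"        // hypothetical application id
  val execId   = "1"                              // hypothetical executor id

  for (i <- 1 to 3) {
    val dump = Source.fromURL(
      s"$driverUi/api/v1/applications/$appId/executors/$execId/threads").mkString
    Files.write(Paths.get(s"threaddump-$i.json"), dump.getBytes("UTF-8"))
    Thread.sleep(5000)
  }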

On Sun, Mar 23, 2025 at 3:29 AM, Prem Sahoo (<prem.re...@gmail.com>)
wrote:

> This is inside my current project; I can't move the data into the public
> domain. But it seems something has changed that is causing this slowness.
> Sent from my iPhone
>
> On Mar 22, 2025, at 10:23 PM, Ángel Álvarez Pascua <
> angel.alvarez.pas...@gmail.com> wrote:
>
> Could you take three thread dumps from one of the executors while Spark is
> performing the conversion? You can use the Spark UI for that.
>
> On Sun, Mar 23, 2025 at 3:20 AM, Ángel Álvarez Pascua (<
> angel.alvarez.pas...@gmail.com>) wrote:
>
>> Without the data, it's difficult to analyze. Could you provide some
>> synthetic data so I can investigate this further? The schema and a few
>> sample fake rows should be sufficient.
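>>
>> In case it helps, a minimal sketch of what such synthetic data could look
>> like, purely for illustration (the column names, types and values below are
>> invented placeholders, not your real schema):
>>
>>   import org.apache.spark.sql.SparkSession
>>
>>   // Sketch: build a few fake rows with the same shape as the real data and
>>   // write them out as CSV (all names and values here are made up).
>>   val spark = SparkSession.builder().appName("synthetic-csv").getOrCreate()
>>   import spark.implicits._
>>
>>   val fake = Seq(
>>     ("id-001", 12.34, "2025-03-22"),
>>     ("id-002", 56.78, "2025-03-23")
>>   ).toDF("record_id", "amount", "event_date")
>>
>>   fake.write.option("header", "true").csv("/tmp/synthetic_sample")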
>>
>> On Sun, Mar 23, 2025 at 3:17 AM, Prem Sahoo (<prem.re...@gmail.com>)
>> wrote:
>>
>>> I am providing the schema, and the schema is correct: it has all the
>>> columns present in the CSV. So we can rule this out as the cause of the
>>> slowness. Maybe there are some other contributing factors.
>>> Sent from my iPhone
>>>
>>> On Mar 22, 2025, at 10:05 PM, Ángel Álvarez Pascua <
>>> angel.alvarez.pas...@gmail.com> wrote:
>>>
>>>
>>> Hey, just this week I found some issues with the Univocity library that
>>> Spark internally uses to read CSV files.
>>>
>>> *Spark CSV Read Low Performance: EOFExceptions in Univocity Parser*
>>> https://issues.apache.org/jira/projects/SPARK/issues/SPARK-51579
>>>
>>> I initially assumed this issue had existed since Spark started using
>>> this library, but perhaps something changed in the versions you mentioned.
>>>
>>> Are you providing a schema, or are you letting Spark infer it? I've also
>>> noticed that when the schema doesn't match the columns in the CSV files
>>> (for example, different number of columns), exceptions are thrown
>>> internally.
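>>>
>>> As a point of comparison, a minimal sketch of both read paths (the column
>>> names and the path below are placeholders, and `spark` is assumed to be an
>>> existing SparkSession):
>>>
>>>   import org.apache.spark.sql.types._
>>>
>>>   // Explicit schema: no inference pass; mismatches are handled according to
>>>   // the configured mode (PERMISSIVE / DROPMALFORMED / FAILFAST).
>>>   val schema = StructType(Seq(
>>>     StructField("record_id", StringType),
>>>     StructField("amount", DoubleType)
>>>   ))
>>>
>>>   val withSchema = spark.read
>>>     .schema(schema)
>>>     .option("header", "true")
>>>     .csv("s3a://bucket/input/")
>>>
>>>   // Inferred schema: Spark has to scan the data to guess types, which adds work.
>>>   val inferred = spark.read
>>>     .option("header", "true")
>>>     .option("inferSchema", "true")
>>>     .csv("s3a://bucket/input/")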
>>>
>>> Given all this, my initial hypothesis is that thousands upon thousands
>>> of exceptions are being thrown internally, only to be handled by the
>>> Univocity parser—so the user isn't even aware of what's happening.
>>>
>>>
>>> On Sun, Mar 23, 2025 at 2:40 AM, Prem Sahoo (<prem.re...@gmail.com>)
>>> wrote:
>>>
>>>> Hello,
>>>> I read a 2.7 GB CSV file with 100 columns. Converting it to Parquet with
>>>> Spark 3.2 and Hadoop 2.7.6 takes 28 seconds, but with Spark 3.5.2 and
>>>> Hadoop 3.4.1 it takes 34 seconds. That is a regression.
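>>>>
>>>> For what it's worth, a minimal sketch of the conversion being timed
>>>> (paths are placeholders; `spark` is assumed to be an existing SparkSession
>>>> and `csvSchema` a hypothetical explicit schema for the 100 columns):
>>>>
>>>>   // Rough wall-clock timing of the CSV -> Parquet conversion.
>>>>   val start = System.nanoTime()
>>>>
>>>>   spark.read
>>>>     .schema(csvSchema)
>>>>     .option("header", "true")
>>>>     .csv("s3a://bucket/input.csv")
>>>>     .write
>>>>     .mode("overwrite")
>>>>     .parquet("s3a://bucket/output_parquet/")
>>>>
>>>>   println(s"Conversion took ${(System.nanoTime() - start) / 1e9} s")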
>>>> Sent from my iPhone
>>>>
>>>> On Mar 22, 2025, at 9:21 PM, Ángel Álvarez Pascua <
>>>> angel.alvarez.pas...@gmail.com> wrote:
>>>>
>>>> Sure. I love performance challenges and mysteries!
>>>>
>>>> Please, could you provide an example project or the steps to build one?
>>>>
>>>> Thanks.
>>>>
>>>> On Sun, Mar 23, 2025, 2:17 AM, Prem Sahoo <prem.re...@gmail.com> wrote:
>>>>
>>>>> Hello Team,
>>>>> I was working with Spark 3.2 and Hadoop 2.7.6, writing to MinIO object
>>>>> storage. It was slower than writing to MapR FS with the same tech stack.
>>>>> I then moved to the upgraded stack of Spark 3.5.2 and Hadoop 3.4.1,
>>>>> started writing to MinIO with the V2 FileOutputCommitter, and checked the
>>>>> performance, which was worse than with the old tech stack. I then tried
>>>>> the magic committer, and it came out slower than V2, so with the latest
>>>>> tech stack the performance has degraded. Could someone please assist?
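>>>>>
>>>>> For context, a hedged sketch of the committer settings being compared; the
>>>>> exact keys should be double-checked against the Spark/Hadoop S3A committer
>>>>> docs for your versions, and both setups are shown in one builder only for
>>>>> illustration (normally you would pick one):
>>>>>
>>>>>   import org.apache.spark.sql.SparkSession
>>>>>
>>>>>   val spark = SparkSession.builder()
>>>>>     .appName("minio-write-test")
>>>>>     // (a) classic FileOutputCommitter, algorithm version 2
>>>>>     .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
>>>>>     // (b) S3A magic committer (needs the spark-hadoop-cloud module on the classpath)
>>>>>     .config("spark.hadoop.fs.s3a.committer.name", "magic")
>>>>>     .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
>>>>>     .config("spark.sql.sources.commitProtocolClass",
>>>>>             "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
>>>>>     .config("spark.sql.parquet.output.committer.class",
>>>>>             "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
>>>>>     .getOrCreate()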
>>>>> Sent from my iPhone
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>
>>>>>
