Without the data, it's difficult to analyze. Could you provide some synthetic data so I can investigate this further? The schema and a few sample fake rows should be sufficient.
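
Something along these lines would be enough. This is only a sketch: the column names, types, and output path below are placeholders, not your real schema, so please swap in the actual 100 columns:

// spark-shell sketch: placeholder columns, replace with the real 100-column schema
import org.apache.spark.sql.functions._

val fake = spark.range(0, 10000).select(
  col("id"),                                                  // bigint
  concat(lit("name_"), col("id").cast("string")).as("name"),  // string
  (rand() * 100).as("amount"),                                // double
  current_date().as("event_date")                             // date
)

fake.write.option("header", "true").mode("overwrite").csv("/tmp/synthetic_csv")

I've also put a quick malformed-row check at the bottom of this message, below the quoted thread, in case you want to test my exceptions hypothesis in the meantime.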
On Sun, Mar 23, 2025 at 3:17 AM, Prem Sahoo (<prem.re...@gmail.com>) wrote:

> I am providing the schema, and the schema is correct, meaning it has all
> the columns available in the CSV. So we can rule this out as the cause of
> the slowness; maybe there are other contributing factors.
>
> On Mar 22, 2025, at 10:05 PM, Ángel Álvarez Pascua
> <angel.alvarez.pas...@gmail.com> wrote:
>
> Hey, just this week I found some issues with the Univocity library that
> Spark uses internally to read CSV files.
>
> *Spark CSV Read Low Performance: EOFExceptions in Univocity Parser*
> https://issues.apache.org/jira/projects/SPARK/issues/SPARK-51579
>
> I initially assumed this issue had existed since Spark started using this
> library, but perhaps something changed between the versions you mentioned.
>
> Are you providing a schema, or are you letting Spark infer it? I've also
> noticed that when the schema doesn't match the columns in the CSV files
> (for example, a different number of columns), exceptions are thrown
> internally.
>
> Given all this, my initial hypothesis is that thousands upon thousands of
> exceptions are being thrown and then handled internally by the Univocity
> parser, so the user isn't even aware of what's happening.
>
> On Sun, Mar 23, 2025 at 2:40 AM, Prem Sahoo (<prem.re...@gmail.com>) wrote:
>
>> Hello,
>> I read a CSV file of 2.7 GB with 100 columns. Converting it to Parquet
>> takes 28 seconds with Spark 3.2 and Hadoop 2.7.6, but 34 seconds with
>> Spark 3.5.2 and Hadoop 3.4.1. That is a bad result.
>>
>> On Mar 22, 2025, at 9:21 PM, Ángel Álvarez Pascua
>> <angel.alvarez.pas...@gmail.com> wrote:
>>
>> Sure. I love performance challenges and mysteries!
>>
>> Could you please provide an example project or the steps to build one?
>>
>> Thanks.
>>
>> On Sun, Mar 23, 2025, 2:17 AM, Prem Sahoo <prem.re...@gmail.com> wrote:
>>
>>> Hello Team,
>>> I was working with Spark 3.2 and Hadoop 2.7.6, writing to MinIO object
>>> storage. It was slower than writing to MapR FS on the same stack. I then
>>> moved to the upgraded versions, Spark 3.5.2 and Hadoop 3.4.1, started
>>> writing to MinIO with the V2 FileOutputCommitter, and checked the
>>> performance, which was worse than with the old stack. I then tried the
>>> magic committer, and it came out slower than V2. So with the latest
>>> stack the performance is degraded. Could someone please assist?
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
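
P.S. While we wait for the synthetic data, here is roughly how I'd check whether malformed rows are being parsed and silently handled. The input path and the schema are placeholders for your real ones; mode and columnNameOfCorruptRecord are standard Spark CSV read options:

// spark-shell sketch: path and schema are placeholders
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.col

// Explicit schema plus an extra column that Spark fills in for rows that fail to parse.
val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType),
  StructField("amount", DoubleType),
  StructField("_corrupt_record", StringType)
))

val df = spark.read
  .option("header", "true")
  .option("mode", "PERMISSIVE")                            // keep bad rows instead of failing
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .schema(schema)
  .csv("/path/to/input.csv")

// Cache first: Spark disallows queries that reference only the corrupt-record column otherwise.
df.cache()
println(s"malformed rows: ${df.filter(col("_corrupt_record").isNotNull).count()}")

A high count here would support the theory that the parser is throwing and swallowing thousands of exceptions.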