Hey, just this week I found some issues with the Univocity library that Spark uses internally to read CSV files.
*Spark CSV Read Low Performance: EOFExceptions in Univocity Parser*
https://issues.apache.org/jira/projects/SPARK/issues/SPARK-51579

I initially assumed this issue had existed since Spark started using this
library, but perhaps something changed in the versions you mentioned.

Are you providing a schema, or are you letting Spark infer it? I've also
noticed that when the schema doesn't match the columns in the CSV files
(for example, a different number of columns), exceptions are thrown
internally.

Given all this, my initial hypothesis is that thousands upon thousands of
exceptions are being thrown internally, only to be caught and handled by
the Univocity parser, so the user isn't even aware of what's happening.
(I've added two quick sketches after the quoted thread below.)

On Sun, Mar 23, 2025 at 2:40, Prem Sahoo (<prem.re...@gmail.com>) wrote:

> Hello,
> I read a CSV file of 2.7 GB that has 100 columns. When I convert it to
> Parquet with Spark 3.2 and Hadoop 2.7.6 it takes 28 seconds, but with
> Spark 3.5.2 and Hadoop 3.4.1 it takes 34 seconds. That is a bad
> regression.
> Sent from my iPhone
>
> On Mar 22, 2025, at 9:21 PM, Ángel Álvarez Pascua <
> angel.alvarez.pas...@gmail.com> wrote:
>
> Sure. I love performance challenges and mysteries!
>
> Please, could you provide an example project or the steps to build one?
>
> Thanks.
>
> On Sun, Mar 23, 2025, 2:17, Prem Sahoo <prem.re...@gmail.com> wrote:
>
>> Hello Team,
>> I was working with Spark 3.2 and Hadoop 2.7.6 and writing to MinIO
>> object storage. It was slower than writing to MapR FS with the same
>> stack. I then moved to the upgraded versions, Spark 3.5.2 and Hadoop
>> 3.4.1, and started writing to MinIO with the V2 FileOutputCommitter;
>> the performance was worse than on the old stack. I then tried the
>> magic committer, and it came out slower than V2, so with the latest
>> stack the performance has degraded. Could someone please assist?
>> Sent from my iPhone
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
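
Sketch 1: to rule out the inference pass and to surface parsing problems,
here is a minimal Scala sketch. The schema, paths, and app name are
hypothetical placeholders; the real file has 100 columns, so the
StructType would need all of them:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object CsvToParquetRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("csv-to-parquet-repro").getOrCreate()

    // Hypothetical three-column schema; the real one would list all 100 columns.
    val schema = StructType(Seq(
      StructField("id", IntegerType),
      StructField("name", StringType),
      StructField("_corrupt_record", StringType) // collects rows that don't match the schema
    ))

    val df = spark.read
      .schema(schema)                // explicit schema: no inference pass over the file
      .option("header", "true")
      .option("mode", "PERMISSIVE")  // keep malformed rows instead of failing
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .csv("s3a://bucket/input.csv") // hypothetical input path

    spark.time {                     // prints wall-clock time for the write
      df.write.mode("overwrite").parquet("s3a://bucket/output/") // hypothetical output path
    }

    spark.stop()
  }
}

If rows land in _corrupt_record, that would support the
exceptions-behind-the-scenes hypothesis. Note that Spark disallows
filtering a raw CSV DataFrame on the corrupt-record column alone, so
cache the DataFrame before counting those rows.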
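
Sketch 2: for the MinIO side, the committer choice is just configuration,
so both variants can be compared on the same job. The endpoint below is a
hypothetical placeholder; the magic-committer settings are the ones from
the Spark cloud-integration docs and need the spark-hadoop-cloud module
on the classpath:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("minio-committer-check")
  .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000") // hypothetical endpoint
  .config("spark.hadoop.fs.s3a.path.style.access", "true")     // MinIO needs path-style access
  // Variant 1: classic FileOutputCommitter, algorithm version 2
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  // Variant 2: S3A magic committer (disable variant 1 and enable these instead)
  //.config("spark.hadoop.fs.s3a.committer.name", "magic")
  //.config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
  //.config("spark.sql.sources.commitProtocolClass",
  //  "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  //.config("spark.sql.parquet.output.committer.class",
  //  "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
  .getOrCreate()

Comparing the two runs, and checking in the logs which committer was
actually used, would tell whether the slowdown is in the commit phase at
all or in the CSV read itself.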