This is inside my current project, so I can't move the data into the public domain. But it seems something changed that is causing this slowness.
Sent from my iPhone

On Mar 22, 2025, at 10:23 PM, Ángel Álvarez Pascua <angel.alvarez.pas...@gmail.com> wrote:


Could you take three thread dumps from one of the executors while Spark is performing the conversion? You can use the Spark UI for that.
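For reference, a minimal sketch of pulling executor thread dumps through the Spark status REST API; the driver host, app id and executor id below are placeholders you would take from your own Spark UI:

import scala.io.Source

// Sketch: fetch a thread dump for one executor via the Spark status REST API.
// Replace driverUi, appId and execId with the values shown in your Spark UI.
object ThreadDumpFetcher {
  def main(args: Array[String]): Unit = {
    val driverUi = "http://driver-host:4040"   // placeholder
    val appId    = "app-20250323-0001"         // placeholder
    val execId   = "1"                         // placeholder

    val url = s"$driverUi/api/v1/applications/$appId/executors/$execId/threads"

    // Take three dumps a few seconds apart to see where the time is spent.
    (1 to 3).foreach { i =>
      val dump = Source.fromURL(url).mkString
      println(s"===== thread dump $i =====")
      println(dump)
      Thread.sleep(5000)
    }
  }
}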

On Sun, Mar 23, 2025 at 3:20 AM, Ángel Álvarez Pascua (<angel.alvarez.pas...@gmail.com>) wrote:
Without the data, it's difficult to analyze. Could you provide some synthetic data so I can investigate this further? The schema and a few sample fake rows should be sufficient.

On Sun, Mar 23, 2025 at 3:17 AM, Prem Sahoo (<prem.re...@gmail.com>) wrote:
I am providing the schema, and the schema is correct; it has all the columns available in the CSV. So we can rule that out as the cause of the slowness. Maybe there are some other contributing factors.
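For reference, a minimal sketch of the kind of explicit-schema read being described; the path and column names are placeholders standing in for the real 100-column schema:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

// Sketch: read the CSV with an explicit schema so Spark skips inference.
// The path and column names are placeholders.
val spark = SparkSession.builder().appName("csv-to-parquet").getOrCreate()

val schema = StructType(Seq(
  StructField("col1", StringType,  nullable = true),
  StructField("col2", IntegerType, nullable = true)
  // ... remaining columns of the real 100-column schema
))

val df = spark.read
  .option("header", "true")
  .schema(schema)                 // explicit schema, no inference pass
  .csv("s3a://bucket/input.csv")  // placeholder path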
Sent from my iPhone

On Mar 22, 2025, at 10:05 PM, Ángel Álvarez Pascua <angel.alvarez.pas...@gmail.com> wrote:



Hey, just this week I found some issues with the Univocity library that Spark internally uses to read CSV files.

Spark CSV Read Low Performance: EOFExceptions in Univocity Parser
https://issues.apache.org/jira/projects/SPARK/issues/SPARK-51579

I initially assumed this issue had existed since Spark started using this library, but perhaps something changed in the versions you mentioned.

Are you providing a schema, or are you letting Spark infer it? I've also noticed that when the schema doesn't match the columns in the CSV files (for example, different number of columns), exceptions are thrown internally.

Given all this, my initial hypothesis is that thousands upon thousands of exceptions are being thrown internally, only to be handled by the Univocity parser—so the user isn't even aware of what's happening.
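One way to test that hypothesis from the user side is a read with a corrupt-record column, roughly like this sketch; the path and schema are placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

// Sketch: count rows the CSV parser considers malformed, which would hint
// at exception-heavy parsing. Path and schema are placeholders.
val spark = SparkSession.builder().getOrCreate()

val schemaWithCorrupt = StructType(Seq(
  StructField("col1", StringType, nullable = true),
  // ... the rest of the real schema ...
  StructField("_corrupt_record", StringType, nullable = true)
))

val df = spark.read
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .schema(schemaWithCorrupt)
  .csv("s3a://bucket/input.csv")

// Since Spark 2.3, querying only the corrupt-record column requires caching first.
df.cache()
val badRows = df.filter(df("_corrupt_record").isNotNull).count()
println(s"Rows flagged as corrupt: $badRows")

A large count here would support the theory that many rows are being handled through the parser's exception path.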



On Sun, Mar 23, 2025 at 2:40 AM, Prem Sahoo (<prem.re...@gmail.com>) wrote:
Hello,
I am reading a 2.7 GB CSV file with 100 columns. When I convert it to Parquet with Spark 3.2 and Hadoop 2.7.6 it takes 28 seconds, but with Spark 3.5.2 and Hadoop 3.4.1 it takes 34 seconds. That is a bad result.
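For reference, a minimal sketch of how such a conversion can be timed end to end; the paths are placeholders:

import org.apache.spark.sql.SparkSession

// Sketch: time the CSV-to-Parquet conversion. Paths are placeholders.
val spark = SparkSession.builder().appName("csv-to-parquet-timing").getOrCreate()

val start = System.nanoTime()

spark.read
  .option("header", "true")
  .csv("s3a://bucket/input.csv")      // placeholder input path
  .write
  .mode("overwrite")
  .parquet("s3a://bucket/output/")    // placeholder output path

val elapsedSec = (System.nanoTime() - start) / 1e9
println(f"Conversion took $elapsedSec%.1f s")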
Sent from my iPhone

On Mar 22, 2025, at 9:21 PM, Ángel Álvarez Pascua <angel.alvarez.pas...@gmail.com> wrote:


Sure. I love performance challenges and mysteries!

Please, could you provide an example project or the steps to build one?

Thanks.

On Sun, Mar 23, 2025, 2:17 AM, Prem Sahoo <prem.re...@gmail.com> wrote:
Hello Team,
I was working with Spark 3.2 and Hadoop 2.7.6, writing to MinIO object storage. It was slower than writing to MapR FS with the same stack. I then moved to the upgraded versions, Spark 3.5.2 and Hadoop 3.4.1, started writing to MinIO with the V2 FileOutputCommitter, and checked the performance, which is worse than with the old stack. I then tried the magic committer, and it came out slower than V2, so with the latest stack the performance is degraded. Could someone please assist?
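For reference, a hedged sketch of the two committer setups being compared, expressed as Spark configs; the endpoint and paths are placeholders, and the property names should be double-checked against your exact Spark and Hadoop versions:

import org.apache.spark.sql.SparkSession

// Sketch: committer configs for writing Parquet to MinIO over s3a.
// Endpoint and paths are placeholders.
val spark = SparkSession.builder()
  .appName("minio-write-committer-test")
  .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")   // placeholder
  .config("spark.hadoop.fs.s3a.path.style.access", "true")
  // (a) classic FileOutputCommitter, algorithm version 2:
  //   .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  // (b) S3A magic committer (needs the spark-hadoop-cloud module on the classpath):
  .config("spark.hadoop.fs.s3a.committer.name", "magic")
  .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
  .config("spark.sql.sources.commitProtocolClass",
          "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  .config("spark.sql.parquet.output.committer.class",
          "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
  .getOrCreate()

spark.range(1000).write.mode("overwrite").parquet("s3a://bucket/committer-test/")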
Sent from my iPhone
---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
