One more variable: Spark 3.5.2 runs on Kubernetes while Spark 3.2.0 runs on YARN. It seems Kubernetes could be a contributing cause of the slowness too.
On Mar 24, 2025, at 7:10 PM, Prem Gmail <premprakashsa...@gmail.com> wrote:
Hello Spark dev/users, does anyone have any clue why and how a newer version could have a performance regression?
I will be happy to raise a JIRA.

On Mar 24, 2025, at 4:20 PM, Prem Sahoo <prem.re...@gmail.com> wrote:
The problem is on the writer's side. It takes longer to write to MinIO with Spark 3.5.2 and Hadoop 3.4.1, so it seems some technical change between Hadoop 2.7.6 and 3.4.1 made the write process slower.

@Prem Sahoo, could you test both versions of Spark+Hadoop by replacing your "write to MinIO" statement with write.format("noop")? This would help us determine whether the issue lies on the reader side or the writer side.
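For example, a minimal sketch of that noop test (assuming an active `spark` session; the input path and header option are placeholders for the real job):

```scala
// Read the same CSV, but replace the MinIO write with Spark's built-in
// "noop" sink, so that only the read/parse path is timed.
val df = spark.read
  .option("header", "true")
  .csv("s3a://bucket/input.csv")   // hypothetical input path

val start = System.nanoTime()
df.write
  .format("noop")                  // computes all rows, then discards them
  .mode("overwrite")
  .save()
println(s"Read-only pass took ${(System.nanoTime() - start) / 1e9} s")
```

If both stacks show similar times here, the regression is on the write/commit path rather than in CSV parsing.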
The V2 committer with Spark 3.5.2 and Hadoop 3.4.1 should be much faster than Spark 3.2.0 with Hadoop 2.7.6, but that's not the case; I also tried the magic committer option, which is again slower. So something changed internally that made this slow. May I know what?

We've seen significant performance gains in CSV going from 3.1 -> 3.5.
You've pointed out exactly the change in FileOutputCommitter: v1 (safe, serial, slow) -> v2 (unsafe on object stores without atomic rename, parallel, faster). In v1, the output files are moved serially by the driver from the staging directory to the final directory; in v2, the moves are done at the task level. It's possible the MinIO implementation is overwhelmed by concurrent inode renames, but that's not likely at 2.6 GB. V2 is much, much faster on high-performing object stores and HDFS.
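To rule out a change in defaults between the two stacks, the committer algorithm version can be pinned explicitly on both sides; a sketch using the standard Hadoop key (the app name is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("committer-test")
  // v1: the driver renames task output serially during job commit (safe, slow)
  // v2: each task renames its own output at task commit (parallel, faster,
  //     but unsafe on stores without atomic rename)
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()
```

Running the same write once with "1" and once with "2" on each stack would show how much of the gap is committer-related.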
A difference of 27 versus 34 seconds in Spark can be caused by many things, and it wouldn't have surfaced on my radar.
Probably an email for the user mailing list.
Kris

This is inside my current project; I can't move the data into the public domain. But it seems something changed that introduced this slowness.

Could you take three thread dumps from one of the executors while Spark is performing the conversion? You can use the Spark UI for that.
Without the data, it's difficult to analyze. Could you provide some synthetic data so I can investigate this further? The schema and a few sample fake rows should be sufficient.
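If real rows can't be shared, synthetic data shaped like the real file could stand in; a sketch (column count, row count, and path are placeholders sized to approximate the 2.7 GB / 100-column file):

```scala
import org.apache.spark.sql.functions.rand

// Generate ~100 columns of random ints and write them out as CSV.
val cols = (1 to 100).map(i => (rand() * 1000).cast("int").as(s"c$i"))
spark.range(20000000L)                    // tune row count until output is ~2.7 GB
  .select(cols: _*)
  .write
  .option("header", "true")
  .mode("overwrite")
  .csv("s3a://bucket/synthetic-input/")   // hypothetical path
```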
I am providing the schema, and the schema is correct, meaning it has all the columns available in the CSV, so we can rule that out as a cause of the slowness. Maybe there are other contributing factors.

Hey, just this week I found some issues with the Univocity library that Spark internally uses to read CSV files.
Spark CSV Read Low Performance: EOFExceptions in Univocity Parser
https://issues.apache.org/jira/projects/SPARK/issues/SPARK-51579
I initially assumed this issue had existed since Spark started using this library, but perhaps something changed in the versions you mentioned.
Are you providing a schema, or are you letting Spark infer it? I've also noticed that when the schema doesn't match the columns in the CSV files (for example, different number of columns), exceptions are thrown internally.
Given all this, my initial hypothesis is that thousands upon thousands of exceptions are being thrown internally, only to be handled by the Univocity parser, so the user isn't even aware of what's happening.
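One way to surface that from the user side is to read with an explicit schema plus a corrupt-record column and count the rows the parser couldn't map cleanly; a sketch (column names and types are placeholders for the real 100-column schema):

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Explicit schema plus a column that captures rows the parser rejects.
val schema = StructType(Seq(
  StructField("c1", StringType),
  StructField("c2", IntegerType)
  // ... the remaining columns of the real file
)).add("_corrupt_record", StringType)

val df = spark.read
  .schema(schema)
  .option("header", "true")
  .option("mode", "PERMISSIVE")                         // keep going, flag bad rows
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .csv("s3a://bucket/input.csv")                        // hypothetical path

// Spark disallows queries that reference only the corrupt-record column
// on the raw file scan, so cache before filtering on it.
df.cache()
println(df.filter(col("_corrupt_record").isNotNull).count())
```

A large count here would indicate schema-mismatch rows that trigger exceptions internally during parsing.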
Hello, I read a CSV file of 2.7 GB that has 100 columns. When I convert it to Parquet with Spark 3.2 and Hadoop 2.7.6 it takes 28 seconds, but with Spark 3.5.2 and Hadoop 3.4.1 it takes 34 seconds. That statistic is bad.

Sure. I love performance challenges and mysteries!
Please, could you provide an example project or the steps to build one?

Hello Team,
I was working with Spark 3.2 and Hadoop 2.7.6, writing to MinIO object storage. It was slower compared to writing to MapR FS with the same stack. I then moved to the upgraded versions, Spark 3.5.2 and Hadoop 3.4.1, which write to MinIO with the V2 FileOutputCommitter, and checked the performance, which is worse than with the old stack. I then tried the magic committer, and it came out slower than V2. So with the latest stack the performance has degraded. Could someone please assist?
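For reference, a sketch of how a magic-committer run is typically wired up (not necessarily the exact configuration used here; these are the standard S3A/cloud-committer keys, and the spark-hadoop-cloud module must be on the classpath):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("magic-committer-test")
  // Select the S3A "magic" committer, which avoids rename entirely by
  // writing task output as incomplete multipart uploads.
  .config("spark.hadoop.fs.s3a.committer.name", "magic")
  .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
  // Bind Spark's commit protocol to Hadoop's PathOutputCommitter.
  .config("spark.sql.sources.commitProtocolClass",
    "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  .config("spark.sql.parquet.output.committer.class",
    "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
  .getOrCreate()
```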