Re: Spark 3.5.2 and Hadoop 3.4.1 slow performance

2025-04-04 Thread Steve Loughran
Create the JIRA and we can look at it. If it is just write performance, then I am confident that Hadoop 3.4.1 has way faster write code, with some extra parameters available to make things faster. https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/performance.html#Options_to_T

Re: Spark 3.5.2 and Hadoop 3.4.1 slow performance

2025-03-31 Thread Steve Loughran
1. MinIO does actually have atomic object renames, but as it is file by file, task commit is non-atomic; 2. v2 task commit is also unsafe: it just writes to the destination. There is no way a committer which supports task failure can be as fast as this. Further reading https://github.

Re: Spark 3.5.2 and Hadoop 3.4.1 slow performance

2025-03-25 Thread Prem Sahoo
Just one more variable: Spark 3.5.2 runs on Kubernetes and Spark 3.2.0 runs on YARN. It seems Kubernetes can be a cause of slowness too. On Mar 24, 2025, at 7:10 PM, Prem Gmail wrote: Hello Spark Dev/users, Does anyone have any clue why and how a better version has performance iss

Re: Spark 3.5.2 and Hadoop 3.4.1 slow performance

2025-03-24 Thread Ángel Álvarez Pascua
@Prem Sahoo, could you test both versions of Spark+Hadoop by replacing your "write to MinIO" statement with write.format("noop")? This would help us determine whether the issue lies on the reader side or the writer side. On Sun, 23 Mar 2025 at 4:53, Prem Gmail wrote: > V2 writer in 3.
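
The suggested test can be sketched roughly as follows. This is a minimal sketch, not the poster's code: `time_noop_write` is a hypothetical helper name, and it assumes `df` is a Spark DataFrame whose plan already contains the CSV read. The "noop" data source runs the whole plan and discards the rows, so the measured time excludes the object-store write.

```python
import time


def time_noop_write(df):
    """Execute the DataFrame's plan but discard the output.

    `df` is assumed to be a Spark DataFrame; only its `write` attribute is used.
    Returns the elapsed wall-clock time in seconds.
    """
    start = time.perf_counter()
    # The "noop" source runs the read/transform stages and throws the rows away.
    df.write.format("noop").mode("overwrite").save()
    return time.perf_counter() - start


# Hypothetical usage; `spark` and the input path are placeholders:
# df = spark.read.csv("s3a://some-bucket/input.csv", header=True)
# print(f"read + transform only: {time_noop_write(df):.1f}s")
```

If the timings on both Spark+Hadoop combinations are close with the noop sink, the difference most likely sits in the write and commit path rather than in the CSV read.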

Re: Spark 3.5.2 and Hadoop 3.4.1 slow performance

2025-03-24 Thread Prem Sahoo
The problem is on the writer's side. It takes longer to write to MinIO with Spark 3.5.2 and Hadoop 3.4.1, so it seems there are some technical changes between Hadoop 2.7.6 and 3.4.1 which made the write process slower. On Sun, Mar 23, 2025 at 12:09 AM Ángel Álvarez Pascua < angel.alvarez.pas...@gmail.c

Re: Spark 3.5.2 and Hadoop 3.4.1 slow performance

2025-03-22 Thread Ángel Álvarez Pascua
Could you take three thread dumps from one of the executors while Spark is performing the conversion? You can use the Spark UI for that. On Sun, 23 Mar 2025 at 3:20, Ángel Álvarez Pascua (< angel.alvarez.pas...@gmail.com>) wrote: > Without the data, it's difficult to analyze. Could you prov

Re: Spark 3.5.2 and Hadoop 3.4.1 slow performance

2025-03-22 Thread Ángel Álvarez Pascua
That's why I used the words "synthetic" and "fake" when referring to data. Anyway, the most important thing might be the thread dumps. On Sun, 23 Mar 2025 at 3:29, Prem Sahoo wrote: > This is inside my current project, I can’t move data to public domain. > But it seems there is someth

Re: Spark 3.5.2 and Hadoop 3.4.1 slow performance

2025-03-22 Thread Prem Sahoo
Hello, I read a CSV file of 2.7 GB with 100 columns. When I convert it to Parquet with Spark 3.2 and Hadoop 2.7.6 it takes 28 secs, but with Spark 3.5.2 and Hadoop 3.4.1 it takes 34 secs. This stat is bad. On Mar 22, 2025, at 9:21 PM, Ángel Álvare
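
To put the two timings above in proportion, here is a small worked calculation. The helper name `relative_slowdown` is made up for illustration; the 28 s and 34 s figures are the ones quoted in the message.

```python
def relative_slowdown(old_seconds, new_seconds):
    """Fractional increase in run time going from the old setup to the new one."""
    return (new_seconds - old_seconds) / old_seconds


# 28 s on Spark 3.2 + Hadoop 2.7.6 versus 34 s on Spark 3.5.2 + Hadoop 3.4.1
slowdown = relative_slowdown(28, 34)
print(f"{slowdown:.1%}")  # prints 21.4%
```

A gap of a few seconds on a single run can also come from scheduling or cluster differences, so repeated runs on both setups would make the comparison more reliable.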

Re: Spark 3.5.2 and Hadoop 3.4.1 slow performance

2025-03-22 Thread Kristopher Kane
We've seen significant performance gains in CSV going from 3.1 -> 3.5. You've pointed out exactly the change in FileOutputCommitter: v1 (safe, serial, slow) -> v2 (object store unsafe if no atomic rename, parallel, faster). In v1, the output files are moved by the driver serially from the s
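
For comparing the two commit algorithms on the same job, the version can be pinned explicitly instead of relying on the default of whichever Hadoop release is in use. The sketch below assumes the classic FileOutputCommitter is the active committer; `committer_conf_args` is a hypothetical helper that only builds spark-submit arguments for the Hadoop property `mapreduce.fileoutputcommitter.algorithm.version`.

```python
# Hadoop property for the commit algorithm, with Spark's "spark.hadoop." prefix
# so that it is passed through to the Hadoop configuration.
COMMITTER_KEY = "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version"


def committer_conf_args(version):
    """Return spark-submit arguments selecting FileOutputCommitter v1 or v2."""
    if version not in (1, 2):
        raise ValueError("FileOutputCommitter algorithm version must be 1 or 2")
    return ["--conf", f"{COMMITTER_KEY}={version}"]


# Example: arguments to append to a spark-submit command line for a v1 run
print(" ".join(committer_conf_args(1)))
```

Running the same conversion once with each version on both clusters would show how much of the difference comes from the commit algorithm itself.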

Re: Spark 3.5.2 and Hadoop 3.4.1 slow performance

2025-03-22 Thread Ángel Álvarez Pascua
Hey, just this week I found some issues with the Univocity library that Spark internally uses to read CSV files. *Spark CSV Read Low Performance: EOFExceptions in Univocity Parser* https://issues.apache.org/jira/projects/SPARK/issues/SPARK-51579 I initially assumed this issue had existed since Sp

Re: Spark 3.5.2 and Hadoop 3.4.1 slow performance

2025-03-22 Thread Ángel Álvarez Pascua
Without the data, it's difficult to analyze. Could you provide some synthetic data so I can investigate this further? The schema and a few sample fake rows should be sufficient. On Sun, 23 Mar 2025 at 3:17, Prem Sahoo wrote: > I am providing the schema, and schema is actually correct m

Re: Spark 3.5.2 and Hadoop 3.4.1 slow performance

2025-03-22 Thread Prem Sahoo
This is inside my current project; I can’t move the data to the public domain. But it seems something changed which caused this slowness. On Mar 22, 2025, at 10:23 PM, Ángel Álvarez Pascua wrote: Could you take three thread dumps from one of the executors while Spark is perform

Re: Spark 3.5.2 and Hadoop 3.4.1 slow performance

2025-03-22 Thread Prem Sahoo
I am providing the schema, and the schema is actually correct, meaning it has all the columns available in the CSV. So we can rule this out as the cause of the slowness. Maybe there are some other contributing factors. On Mar 22, 2025, at 10:05 PM, Ángel Álvarez Pascua wrote: Hey, just this week

Re: Spark 3.5.2 and Hadoop 3.4.1 slow performance

2025-03-22 Thread Ángel Álvarez Pascua
Sure. I love performance challenges and mysteries! Please, could you provide an example project or the steps to build one? Thanks. On Sun, 23 Mar 2025, 2:17, Prem Sahoo wrote: > Hello Team, > I was working with Spark 3.2 and Hadoop 2.7.6 and writing to MinIO object > storage. It was slower