Re: Spark 3.5.2 and Hadoop 3.4.1 slow performance

2025-03-22 Thread Ángel Álvarez Pascua
Could you take three thread dumps from one of the executors while Spark is performing the conversion? You can use the Spark UI for that. On Sun, 23 Mar 2025 at 3:20, Ángel Álvarez Pascua (<angel.alvarez.pas...@gmail.com>) wrote: > Without the data, it's difficult to analyze. Could you prov

Re: Spark 3.5.2 and Hadoop 3.4.1 slow performance

2025-03-22 Thread Ángel Álvarez Pascua
That's why I used the words "synthetic" and "fake" when referring to data. Anyway, the most important thing might be the thread dumps. On Sun, 23 Mar 2025 at 3:29, Prem Sahoo wrote: > This is inside my current project, I can’t move data to public domain. > But it seems there is someth

Re: Spark 3.5.2 and Hadoop 3.4.1 slow performance

2025-03-22 Thread Prem Sahoo
Hello, I read a 2.7 GB CSV file with 100 columns. Converting it to Parquet takes 28 secs with Spark 3.2 and Hadoop 2.7.6, but 34 secs with Spark 3.5.2 and Hadoop 3.4.1. This stat is bad. Sent from my iPhone On Mar 22, 2025, at 9:21 PM, Ángel Álvare
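
For reference, the 28 secs versus 34 secs reported above works out to a slowdown of roughly 21%. The following is a minimal sketch of how such a comparison could be measured and expressed; the helper names `timed` and `relative_slowdown` are hypothetical and not part of Spark, and the job being timed is left to the caller.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def relative_slowdown(baseline_secs, new_secs):
    """Fractional slowdown of new_secs relative to baseline_secs."""
    if baseline_secs <= 0:
        raise ValueError("baseline_secs must be positive")
    return (new_secs - baseline_secs) / baseline_secs


# Figures quoted in the message above: 28 secs (old stack), 34 secs (new stack).
print(f"{relative_slowdown(28, 34):.1%}")
```

A single run on each stack is noisy, so repeating the conversion several times per stack and comparing medians would give a more reliable figure.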

Re: Spark 3.5.2 and Hadoop 3.4.1 slow performance

2025-03-22 Thread Kristopher Kane
We've seen significant performance gains in CSV going from 3.1 -> 3.5. You've pointed out the change in fileoutputcommitter very precisely. v1 (safe, serial, slow) -> v2 (object store unsafe if no atomic rename, parallel, faster). In v1, the output files are moved by the driver serially from the s
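
To compare the two committer algorithms directly, the version can be pinned per job. This is a minimal sketch, assuming PySpark is installed for `build_session`; the property `mapreduce.fileoutputcommitter.algorithm.version` is a Hadoop setting passed through Spark's `spark.hadoop.` prefix, while the helper names `committer_conf` and `build_session` are hypothetical.

```python
# Hadoop property, forwarded to the Hadoop configuration via the "spark.hadoop." prefix.
COMMITTER_KEY = "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version"


def committer_conf(version):
    """Return the Spark conf entry selecting FileOutputCommitter v1 or v2."""
    if version not in (1, 2):
        raise ValueError("FileOutputCommitter algorithm version must be 1 or 2")
    return {COMMITTER_KEY: str(version)}


def build_session(version, app_name="committer-check"):
    """Create a SparkSession with the chosen committer version (needs pyspark)."""
    from pyspark.sql import SparkSession  # imported lazily; requires pyspark

    builder = SparkSession.builder.appName(app_name)
    for key, value in committer_conf(version).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Running the same CSV-to-Parquet job once with `build_session(1)` and once with `build_session(2)` would show how much of the difference comes from the commit phase rather than from CSV parsing.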

Re: Spark 3.5.2 and Hadoop 3.4.1 slow performance

2025-03-22 Thread Ángel Álvarez Pascua
Hey, just this week I found some issues with the Univocity library that Spark internally uses to read CSV files. *Spark CSV Read Low Performance: EOFExceptions in Univocity Parser* https://issues.apache.org/jira/projects/SPARK/issues/SPARK-51579 I initially assumed this issue had existed since Sp

Re: Spark 3.5.2 and Hadoop 3.4.1 slow performance

2025-03-22 Thread Ángel Álvarez Pascua
Without the data, it's difficult to analyze. Could you provide some synthetic data so I can investigate this further? The schema and a few sample fake rows should be sufficient. On Sun, 23 Mar 2025 at 3:17, Prem Sahoo wrote: > I am providing the schema, and schema is actually correct m
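
A synthetic file of the kind requested here can be generated without exposing any project data. The sketch below is only an illustration: the column names (`col_0` ... `col_99`) and the mix of integer and string values are assumptions, since the real schema is not shown in the thread.

```python
import csv
import random


def write_synthetic_csv(out, n_rows, n_cols=100, seed=0):
    """Write a header plus n_rows of fake data to the file-like object `out`.

    Returns the list of column names that was written as the header.
    """
    rng = random.Random(seed)  # seeded so the output is reproducible
    header = [f"col_{i}" for i in range(n_cols)]  # hypothetical column names
    writer = csv.writer(out)
    writer.writerow(header)
    for _ in range(n_rows):
        # Alternate integer and short string columns; a real schema will differ.
        writer.writerow(
            [
                rng.randint(0, 10**6) if i % 2 == 0 else f"v{rng.randint(0, 9999)}"
                for i in range(n_cols)
            ]
        )
    return header
```

For example, `with open("synthetic.csv", "w", newline="") as f: write_synthetic_csv(f, n_rows=5)` produces a five-row sample; raising `n_rows` until the file approaches the 2.7 GB mentioned in the thread would give a closer reproduction.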

Re: Spark 3.5.2 and Hadoop 3.4.1 slow performance

2025-03-22 Thread Prem Sahoo
This is inside my current project, I can’t move data to public domain. But it seems something has changed that caused this slowness. Sent from my iPhone On Mar 22, 2025, at 10:23 PM, Ángel Álvarez Pascua wrote: Could you take three thread dumps from one of the executors while Spark is perform

Re: Spark 3.5.2 and Hadoop 3.4.1 slow performance

2025-03-22 Thread Prem Sahoo
I am providing the schema, and the schema is correct, meaning it has all the columns available in the CSV. So we can rule this out as the cause of the slowness. Maybe there are some other contributing factors. Sent from my iPhone On Mar 22, 2025, at 10:05 PM, Ángel Álvarez Pascua wrote: Hey, just this week

Re: [VOTE] Release Spark 4.0.0 (RC3)

2025-03-22 Thread Bobby
Thx @Ángel, I have a PR https://github.com/apache/spark/pull/50334 to fix it. Please help review. On Sat, 22 Mar 2025 at 16:27, Ángel Álvarez Pascua wrote: > Is anyone looking into this issue? If not, I'd like to try fixing it. I've > never tried out Spark Connect, so... 2x1! (way better than spending the >

Re: [VOTE] Release Spark 4.0.0 (RC3)

2025-03-22 Thread Ángel Álvarez Pascua
Great to hear that! Fortunately, I was waiting for someone to answer and hadn't looked at it yet. It seems quite a straightforward solution for an issue that shouldn't have existed if the right unit test had been implemented. What do you think about adding a test to check this issue doesn't happen

Re: Spark 3.5.2 and Hadoop 3.4.1 slow performance

2025-03-22 Thread Ángel Álvarez Pascua
Sure. I love performance challenges and mysteries! Please, could you provide an example project or the steps to build one? Thanks. On Sun, 23 Mar 2025, 2:17, Prem Sahoo wrote: > Hello Team, > I was working with Spark 3.2 and Hadoop 2.7.6 and writing to MinIO object > storage. It was slower

Spark 3.5.2 and Hadoop 3.4.1 slow performance

2025-03-22 Thread Prem Sahoo
Hello Team, I was working with Spark 3.2 and Hadoop 2.7.6 and writing to MinIO object storage. It was slower than writing to MapR FS with the same tech stack. I then upgraded to Spark 3.5.2 and Hadoop 3.4.1, which started writing to MinIO with V2 fileoutputcommitte

Re: [VOTE] SPIP: Constraints in DSv2

2025-03-22 Thread serge rielau . com
+1 (non-binding) On Mar 21, 2025, at 12:52 PM, Jules Damji wrote: +1 (non-binding) Sent from my iPhone. Pardon the dumb thumb typos :) On Mar 21, 2025, at 11:47 AM, Anton Okolnychyi wrote: Hi all, I would like to start a vote on adding support for constraints to DSv2. Discussion thread:

Re: [VOTE] SPIP: Constraints in DSv2

2025-03-22 Thread Anurag Mantripragada
+1 (non-binding) Thanks for working on this Anton! Some links to other engines that also did something similar: HIVE-13076 - https://issues.apache.org/jira/browse/HIVE-13076 IMPALA-3531 - https://issues.apache.org/jira/browse/IMPALA-3531 In fact, Spark had a very old Jira SPARK-19842 - https://

Re: [VOTE] SPIP: Constraints in DSv2

2025-03-22 Thread Yuming Wang
+1 On Sat, Mar 22, 2025 at 7:01 PM Peter Toth wrote: > +1 > > On Fri, Mar 21, 2025 at 10:24 PM Szehon Ho > wrote: > >> +1 (non binding) >> >> Agree with Anton, data sources like the open table formats define the >> requirement, and definitely need engines to write to it accordingly. >> >> Thank

Re: [VOTE] SPIP: Constraints in DSv2

2025-03-22 Thread Peter Toth
+1 On Fri, Mar 21, 2025 at 10:24 PM Szehon Ho wrote: > +1 (non binding) > > Agree with Anton, data sources like the open table formats define the > requirement, and definitely need engines to write to it accordingly. > > Thanks, > Szehon > > On Fri, Mar 21, 2025 at 1:31 PM Anton Okolnychyi > wr

Re: [VOTE] SPIP: Constraints in DSv2

2025-03-22 Thread DB Tsai
+1 Sent from my iPhone On Mar 21, 2025, at 2:25 PM, Szehon Ho wrote: +1 (non binding) Agree with Anton, data sources like the open table formats define the requirement, and definitely need engines to write to it accordingly. Thanks, Szehon On Fri, Mar 21, 2025 at 1:31 PM Anton Okolnychyi

Re: [VOTE] Release Spark 4.0.0 (RC3)

2025-03-22 Thread Ángel Álvarez Pascua
Is anyone looking into this issue? If not, I'd like to try fixing it. I've never tried out Spark Connect, so... 2x1! (way better than spending the weekend binge-watching on Netflix 😅🤣). @Bobby , thanks a lot, not only for reporting the issue but also for providing a time-saving project for testin

Re: [VOTE] SPIP: Constraints in DSv2

2025-03-22 Thread Ángel Álvarez Pascua
One thing is enforcing the quality of the data Spark is producing, and another thing entirely is defining an external data model from Spark. The proposal doesn’t necessarily facilitate data accuracy and consistency. Defining constraints does help with that, but the question remains: Is Spark trul