The problem is on the writer's side. It takes longer to write to MinIO with Spark 3.5.2 and Hadoop 3.4.1, so it seems some technical changes between Hadoop 2.7.6 and 3.4.1 made the write process slower.
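For reference, a minimal Scala sketch of the noop comparison Ángel suggests below. The bucket paths, the two-column schema, and the committer setting are placeholders for illustration, not details from this thread:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

// Placeholder two-column schema standing in for the real 100-column one.
val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("value", StringType)
))

val spark = SparkSession.builder()
  .appName("minio-write-benchmark")
  // Classic committer path: FileOutputCommitter algorithm v2 (task-level renames).
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()

val df = spark.read
  .option("header", "true")
  .schema(schema)
  .csv("s3a://input-bucket/data.csv")

// 1) "noop" discards the output, so this run measures only the read/convert cost.
spark.time { df.write.format("noop").mode("overwrite").save() }

// 2) Full Parquet write to MinIO via s3a; the extra time vs. the noop run is writer-side cost.
spark.time { df.write.mode("overwrite").parquet("s3a://output-bucket/data.parquet") }

Running the same two steps on both stacks (Spark 3.2 / Hadoop 2.7.6 and Spark 3.5.2 / Hadoop 3.4.1) shows whether the regression is entirely in the write phase.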
On Sun, Mar 23, 2025 at 12:09 AM Ángel Álvarez Pascua <angel.alvarez.pas...@gmail.com> wrote:

> @Prem Sahoo <prem.re...@gmail.com>, could you test both versions of Spark+Hadoop by replacing your "write to MinIO" statement with write.format("noop")? This would help us determine whether the issue lies on the reader side or the writer side.
>
> On Sun, Mar 23, 2025, 4:53, Prem Gmail (<premprakashsa...@gmail.com>) wrote:
>
>> The V2 writer in 3.5.2 and Hadoop 3.4.1 should be much faster than Spark 3.2.0 and Hadoop 2.7.6, but that's not the case. I tried the magic committer option, which is again slower. So internally something changed which made this slow. May I know why?
>> Sent from my iPhone
>>
>> On Mar 22, 2025, at 11:05 PM, Kristopher Kane <kk...@etsy.com> wrote:
>>
>> We've seen significant performance gains in CSV going from 3.1 -> 3.5.
>>
>> You've very exactly pointed out the change in fileoutputcommitter: v1 (safe, serial, slow) -> v2 (unsafe on object stores without atomic rename, parallel, faster). In v1, the output files are moved by the driver serially from the staging directory to the final directory. In v2 they are moved at the task level. It's possible the MinIO implementation is overwhelmed by concurrent inode renames, but that's not likely at 2.6 GB. V2 is much, much faster on high-performing object stores and HDFS.
>>
>> The difference between 27 seconds and 34 seconds in Spark can be caused by many things and wouldn't have surfaced on my radar.
>>
>> Probably an email for the user mailing list.
>>
>> Kris
>>
>> On Sat, Mar 22, 2025 at 10:30 PM Prem Sahoo <prem.re...@gmail.com> wrote:
>>
>>> This is inside my current project, so I can't move the data to the public domain. But it seems something changed which caused this slowness.
>>> Sent from my iPhone
>>>
>>> On Mar 22, 2025, at 10:23 PM, Ángel Álvarez Pascua <angel.alvarez.pas...@gmail.com> wrote:
>>>
>>> Could you take three thread dumps from one of the executors while Spark is performing the conversion? You can use the Spark UI for that.
>>>
>>> On Sun, Mar 23, 2025, 3:20, Ángel Álvarez Pascua (<angel.alvarez.pas...@gmail.com>) wrote:
>>>
>>>> Without the data, it's difficult to analyze. Could you provide some synthetic data so I can investigate this further? The schema and a few sample fake rows should be sufficient.
>>>>
>>>> On Sun, Mar 23, 2025, 3:17, Prem Sahoo (<prem.re...@gmail.com>) wrote:
>>>>
>>>>> I am providing the schema, and the schema is actually correct, meaning it has all the columns present in the CSV. So we can rule that out as the cause of the slowness. Maybe there are other contributing factors.
>>>>> Sent from my iPhone
>>>>>
>>>>> On Mar 22, 2025, at 10:05 PM, Ángel Álvarez Pascua <angel.alvarez.pas...@gmail.com> wrote:
>>>>>
>>>>> Hey, just this week I found some issues with the Univocity library that Spark internally uses to read CSV files.
>>>>>
>>>>> *Spark CSV Read Low Performance: EOFExceptions in Univocity Parser*
>>>>> https://issues.apache.org/jira/projects/SPARK/issues/SPARK-51579
>>>>>
>>>>> I initially assumed this issue had existed since Spark started using this library, but perhaps something changed in the versions you mentioned.
>>>>>
>>>>> Are you providing a schema, or are you letting Spark infer it? I've also noticed that when the schema doesn't match the columns in the CSV files (for example, a different number of columns), exceptions are thrown internally.
>>>>>
>>>>> Given all this, my initial hypothesis is that thousands upon thousands of exceptions are being thrown internally, only to be handled by the Univocity parser, so the user isn't even aware of what's happening.
>>>>>
>>>>> On Sun, Mar 23, 2025, 2:40, Prem Sahoo (<prem.re...@gmail.com>) wrote:
>>>>>
>>>>>> Hello,
>>>>>> I read a 2.7 GB CSV file with 100 columns. When I convert it to Parquet with Spark 3.2 and Hadoop 2.7.6 it takes 28 secs, but with Spark 3.5.2 and Hadoop 3.4.1 it takes 34 secs. This stat is bad.
>>>>>> Sent from my iPhone
>>>>>>
>>>>>> On Mar 22, 2025, at 9:21 PM, Ángel Álvarez Pascua <angel.alvarez.pas...@gmail.com> wrote:
>>>>>>
>>>>>> Sure. I love performance challenges and mysteries!
>>>>>>
>>>>>> Please, could you provide an example project or the steps to build one?
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> On Sun, Mar 23, 2025, 2:17, Prem Sahoo <prem.re...@gmail.com> wrote:
>>>>>>
>>>>>>> Hello Team,
>>>>>>> I was working with Spark 3.2 and Hadoop 2.7.6 and writing to MinIO object storage. It was slower compared to writing to MapR FS with the above tech stack. Then I moved to the upgraded versions, Spark 3.5.2 and Hadoop 3.4.1, which started writing to MinIO with the V2 fileoutputcommitter, and the performance I measured is worse than with the old tech stack. Then I tried using the magic committer and it came out slower than V2, so with the latest tech stack the performance is downgraded. Could someone please assist?
>>>>>>> Sent from my iPhone
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
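Since the magic committer comes up several times in the quoted thread, here is a minimal Scala sketch of how it is typically enabled against an S3-compatible store such as MinIO. It assumes the spark-hadoop-cloud module is on the classpath; the endpoint is a placeholder, not a value from this thread:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("minio-magic-committer")
  // Point s3a at the MinIO endpoint (placeholder values).
  .config("spark.hadoop.fs.s3a.endpoint", "http://minio.example.com:9000")
  .config("spark.hadoop.fs.s3a.path.style.access", "true")
  // Use the S3A "magic" committer instead of the rename-based FileOutputCommitter.
  .config("spark.hadoop.fs.s3a.committer.name", "magic")
  .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
  // Route Spark SQL writes through the cloud committer bindings from spark-hadoop-cloud.
  .config("spark.sql.sources.commitProtocolClass",
    "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  .config("spark.sql.parquet.output.committer.class",
    "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
  .getOrCreate()

The magic committer writes each task's output as pending multipart uploads and completes them at job commit, so it avoids the copy-based rename that FileOutputCommitter relies on against S3-compatible stores.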