That's why I used the words "synthetic" and "fake" when referring to the data. Anyway, the most important thing might be the thread dumps.
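If it helps, something along these lines would be enough on my side. It's only a minimal sketch: the three-column schema, the row count and the /tmp paths are made up (the real file has 100 columns), but it should be close enough to take thread dumps against:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

// Minimal repro sketch: generate fake CSV data, then time the CSV -> Parquet conversion.
// Schema, row count and paths are placeholders, not the ones from the real job.
val spark = SparkSession.builder().appName("csv-to-parquet-repro").getOrCreate()

// 1) Write some synthetic CSV rows.
spark.range(10000000L)
  .selectExpr("id", "concat('name_', cast(id AS string)) AS name", "rand() AS score")
  .write.mode("overwrite").option("header", "true").csv("/tmp/fake_csv")

// 2) Read the CSV back with an explicit schema (no inference) and convert it to Parquet.
val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType),
  StructField("score", DoubleType)))

val start = System.nanoTime()
spark.read.schema(schema).option("header", "true").csv("/tmp/fake_csv")
  .write.mode("overwrite").parquet("/tmp/fake_parquet")
println(s"Conversion took ${(System.nanoTime() - start) / 1e9} s")

While step 2 is running on both stacks, three thread dumps taken a few seconds apart from any executor (Executors tab in the Spark UI, "Thread Dump" link) should show where the extra time is going.
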
On Sun, Mar 23, 2025 at 3:29, Prem Sahoo (<prem.re...@gmail.com>) wrote:

> This is inside my current project, I can't move the data to a public
> domain. But it seems something has changed that caused this slowness.
> Sent from my iPhone
>
> On Mar 22, 2025, at 10:23 PM, Ángel Álvarez Pascua
> <angel.alvarez.pas...@gmail.com> wrote:
>
> Could you take three thread dumps from one of the executors while Spark is
> performing the conversion? You can use the Spark UI for that.
>
> On Sun, Mar 23, 2025 at 3:20, Ángel Álvarez Pascua
> (<angel.alvarez.pas...@gmail.com>) wrote:
>
>> Without the data, it's difficult to analyze. Could you provide some
>> synthetic data so I can investigate this further? The schema and a few
>> sample fake rows should be sufficient.
>>
>> On Sun, Mar 23, 2025 at 3:17, Prem Sahoo (<prem.re...@gmail.com>)
>> wrote:
>>
>>> I am providing the schema, and the schema is correct: it has all the
>>> columns available in the CSV. So we can rule that out as the cause of
>>> the slowness. Maybe there are some other contributing factors.
>>> Sent from my iPhone
>>>
>>> On Mar 22, 2025, at 10:05 PM, Ángel Álvarez Pascua
>>> <angel.alvarez.pas...@gmail.com> wrote:
>>>
>>> Hey, just this week I found some issues with the Univocity library that
>>> Spark internally uses to read CSV files.
>>>
>>> *Spark CSV Read Low Performance: EOFExceptions in Univocity Parser*
>>> https://issues.apache.org/jira/projects/SPARK/issues/SPARK-51579
>>>
>>> I initially assumed this issue had existed since Spark started using
>>> this library, but perhaps something changed in the versions you mentioned.
>>>
>>> Are you providing a schema, or are you letting Spark infer it? I've also
>>> noticed that when the schema doesn't match the columns in the CSV files
>>> (for example, a different number of columns), exceptions are thrown
>>> internally.
>>>
>>> Given all this, my initial hypothesis is that thousands upon thousands
>>> of exceptions are being thrown internally, only to be handled by the
>>> Univocity parser, so the user isn't even aware of what's happening.
>>>
>>> On Sun, Mar 23, 2025 at 2:40, Prem Sahoo (<prem.re...@gmail.com>)
>>> wrote:
>>>
>>>> Hello,
>>>> I read a CSV file of 2.7 GB with 100 columns. When I convert it to
>>>> Parquet with Spark 3.2 and Hadoop 2.7.6 it takes 28 seconds, but with
>>>> Spark 3.5.2 and Hadoop 3.4.1 it takes 34 seconds. This is a regression.
>>>> Sent from my iPhone
>>>>
>>>> On Mar 22, 2025, at 9:21 PM, Ángel Álvarez Pascua
>>>> <angel.alvarez.pas...@gmail.com> wrote:
>>>>
>>>> Sure. I love performance challenges and mysteries!
>>>>
>>>> Please, could you provide an example project or the steps to build one?
>>>>
>>>> Thanks.
>>>>
>>>> On Sun, Mar 23, 2025, 2:17, Prem Sahoo <prem.re...@gmail.com> wrote:
>>>>
>>>>> Hello Team,
>>>>> I was working with Spark 3.2 and Hadoop 2.7.6, writing to MinIO object
>>>>> storage. It was slower compared to writing to MapR FS with the same
>>>>> stack. Then I moved to the upgraded versions, Spark 3.5.2 and Hadoop
>>>>> 3.4.1, which write to MinIO with the V2 FileOutputCommitter, and
>>>>> checked the performance, which is worse than with the old stack. Then
>>>>> I tried the magic committer and it came out slower than V2, so with
>>>>> the latest stack the performance has degraded. Could someone please
>>>>> assist?
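>>>>>
>>>>> For context, the two setups being compared look roughly like this
>>>>> (just a sketch, with a placeholder endpoint rather than the real MinIO
>>>>> configuration):
>>>>>
>>>>> import org.apache.spark.sql.SparkSession
>>>>>
>>>>> // Common MinIO (S3A) settings; the endpoint below is a placeholder.
>>>>> val builder = SparkSession.builder()
>>>>>   .appName("minio-write-test")
>>>>>   .config("spark.hadoop.fs.s3a.endpoint", "http://minio.example:9000")
>>>>>   .config("spark.hadoop.fs.s3a.path.style.access", "true")
>>>>>
>>>>> // Setup 1: classic FileOutputCommitter, algorithm version 2.
>>>>> val sparkV2 = builder
>>>>>   .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
>>>>>   .getOrCreate()
>>>>>
>>>>> // Setup 2: S3A magic committer (needs the spark-hadoop-cloud module on the classpath).
>>>>> // val sparkMagic = builder
>>>>> //   .config("spark.hadoop.fs.s3a.committer.name", "magic")
>>>>> //   .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
>>>>> //   .config("spark.sql.sources.commitProtocolClass",
>>>>> //     "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
>>>>> //   .config("spark.sql.parquet.output.committer.class",
>>>>> //     "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
>>>>> //   .getOrCreate()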
>>>>> Sent from my iPhone
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>