Could you take three thread dumps from one of the executors while Spark is performing the conversion? You can use the Spark UI for that.
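If it's easier to script, here's a minimal sketch that pulls the same thread dump from Spark's monitoring REST API (the same data the Executors tab exposes). It's untested, and the UI address, application ID, and executor ID are placeholders to replace with your own:

import scala.io.Source

object FetchExecutorThreadDump {
  def main(args: Array[String]): Unit = {
    // All three values are placeholders; take them from your running job.
    val driverUi = "http://localhost:4040"    // Spark UI address of the driver
    val appId    = "app-XXXXXXXXXXXXXX-XXXX"  // application ID shown in the UI
    val execId   = "1"                        // executor ID from the Executors tab
    // Returns the stack traces of all threads in that executor, as JSON.
    val url = s"$driverUi/api/v1/applications/$appId/executors/$execId/threads"
    println(Source.fromURL(url).mkString)
  }
}

Taking it three times, a few seconds apart, should show where the executors spend their time: parsing, exception handling, or committing output.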
On Sun, Mar 23, 2025 at 3:20 AM Ángel Álvarez Pascua (<angel.alvarez.pas...@gmail.com>) wrote:

> Without the data, it's difficult to analyze. Could you provide some
> synthetic data so I can investigate this further? The schema and a few
> sample fake rows should be sufficient.
>
> On Sun, Mar 23, 2025 at 3:17 AM Prem Sahoo (<prem.re...@gmail.com>) wrote:
>
>> I am providing the schema, and the schema is correct: it has all the
>> columns present in the CSV. So we can rule that out as the cause of the
>> slowness. Maybe there are other contributing factors.
>> Sent from my iPhone
>>
>> On Mar 22, 2025, at 10:05 PM, Ángel Álvarez Pascua <angel.alvarez.pas...@gmail.com> wrote:
>>
>> Hey, just this week I found some issues with the Univocity library that
>> Spark internally uses to read CSV files.
>>
>> *Spark CSV Read Low Performance: EOFExceptions in Univocity Parser*
>> https://issues.apache.org/jira/projects/SPARK/issues/SPARK-51579
>>
>> I initially assumed this issue had existed since Spark started using
>> this library, but perhaps something changed in the versions you
>> mentioned.
>>
>> Are you providing a schema, or are you letting Spark infer it? I've
>> also noticed that when the schema doesn't match the columns in the CSV
>> files (for example, a different number of columns), exceptions are
>> thrown internally.
>>
>> Given all this, my initial hypothesis is that thousands upon thousands
>> of exceptions are being thrown internally, only to be handled by the
>> Univocity parser, so the user isn't even aware of what's happening.
>>
>> On Sun, Mar 23, 2025 at 2:40 AM Prem Sahoo (<prem.re...@gmail.com>) wrote:
>>
>>> Hello,
>>> I read a 2.7 GB CSV file with 100 columns. Converting it to Parquet
>>> takes 28 seconds with Spark 3.2 and Hadoop 2.7.6, but 34 seconds with
>>> Spark 3.5.2 and Hadoop 3.4.1. That is a regression.
>>> Sent from my iPhone
>>>
>>> On Mar 22, 2025, at 9:21 PM, Ángel Álvarez Pascua <angel.alvarez.pas...@gmail.com> wrote:
>>>
>>> Sure. I love performance challenges and mysteries!
>>>
>>> Please, could you provide an example project or the steps to build one?
>>>
>>> Thanks.
>>>
>>> On Sun, Mar 23, 2025 at 2:17 AM Prem Sahoo <prem.re...@gmail.com> wrote:
>>>
>>>> Hello Team,
>>>> I was working with Spark 3.2 and Hadoop 2.7.6, writing to MinIO
>>>> object storage. It was slower than writing to MapR FS with the same
>>>> stack. I then moved to the upgraded stack of Spark 3.5.2 and Hadoop
>>>> 3.4.1, started writing to MinIO with the V2 FileOutputCommitter, and
>>>> checked the performance, which is worse than on the old stack. I then
>>>> tried the magic committer, and it came out slower than V2. So with
>>>> the latest stack the performance is degraded. Could someone please
>>>> assist?
>>>> Sent from my iPhone
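On the hypothesis about exceptions being swallowed inside the Univocity parser: one way to check from the Spark side is to read the CSV with an explicit schema plus a corrupt-record column, so rows that don't match the schema surface as data instead of internal exceptions. A minimal sketch, with a placeholder path and placeholder field names (the real schema has 100 columns):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object CsvCorruptRecordCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("csv-corrupt-check").getOrCreate()

    // Placeholder schema; the real file has 100 columns.
    val schema = new StructType()
      .add("col1", StringType)
      .add("col2", IntegerType)
      .add("_corrupt_record", StringType)  // rows that fail to parse land here

    val df = spark.read
      .option("header", "true")
      .option("mode", "PERMISSIVE")  // keep malformed rows instead of failing
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .schema(schema)
      .csv("s3a://bucket/input/*.csv")  // placeholder path
      .cache()  // cached so the corrupt-record column can be queried on its own

    val bad = df.filter(df("_corrupt_record").isNotNull).count()
    println(s"malformed rows: $bad")
  }
}

A non-zero count would point at parser error handling, rather than I/O, as the cost.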
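For the 28 s vs. 34 s numbers, a minimal repro that runs unchanged on both stacks would make the comparison concrete; something along these lines, with placeholder paths:

import org.apache.spark.sql.SparkSession

object CsvToParquetBench {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("csv-to-parquet-bench").getOrCreate()

    val start = System.nanoTime()
    spark.read
      .option("header", "true")
      .csv("s3a://bucket/input/big.csv")       // the 2.7 GB, 100-column file
      .write
      .mode("overwrite")
      .parquet("s3a://bucket/output/parquet/")
    val secs = (System.nanoTime() - start) / 1e9
    println(f"conversion took $secs%.1f s")
  }
}

Run it a few times on each stack and compare medians, since a single six-second gap could just be noise.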
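On the committer side, these are the relevant settings as I understand them from the Hadoop S3A and Spark cloud-integration docs. The MinIO endpoint is a placeholder, and the magic-committer variant additionally needs the spark-hadoop-cloud module on the classpath:

import org.apache.spark.sql.SparkSession

object CommitterSetup {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("minio-write")
      // MinIO endpoint (placeholder host); path-style access is typical for MinIO.
      .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
      .config("spark.hadoop.fs.s3a.path.style.access", "true")
      // Variant 1: classic FileOutputCommitter, algorithm version 2.
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      // Variant 2: S3A magic committer (Hadoop 3.x only); swap in for variant 1.
      // .config("spark.hadoop.fs.s3a.committer.name", "magic")
      // .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
      // .config("spark.sql.sources.commitProtocolClass",
      //   "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
      // .config("spark.sql.parquet.output.committer.class",
      //   "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
      .getOrCreate()
  }
}

If V2 beats the magic committer here, the thread dumps should show whether the extra time goes into the S3A multipart-upload commit path.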