Without the data, it's difficult to analyze. Could you provide some synthetic data so I can investigate this further? The schema and a few sample fake rows should be sufficient.
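
Something along these lines would be enough. This is only a sketch: the column names, types, and output path below are placeholders, not your real schema, so please swap in the actual 100 columns:

// spark-shell sketch: placeholder columns, replace with the real 100-column schema
import org.apache.spark.sql.functions._

val fake = spark.range(0, 10000).select(
  col("id"),                                                  // bigint
  concat(lit("name_"), col("id").cast("string")).as("name"),  // string
  (rand() * 100).as("amount"),                                // double
  current_date().as("event_date")                             // date
)

fake.write.option("header", "true").mode("overwrite").csv("/tmp/synthetic_csv")

I've also put a quick malformed-row check at the bottom of this message, below the quoted thread, in case you want to test my exceptions hypothesis in the meantime.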
On Sun, Mar 23, 2025 at 3:17 AM, Prem Sahoo (<prem.re...@gmail.com>) wrote:

> I am providing the schema, and the schema is correct, meaning it has all
> the columns available in the CSV. So we can rule this out as the cause of
> the slowness; maybe there are other contributing factors.
>
> On Mar 22, 2025, at 10:05 PM, Ángel Álvarez Pascua
> <angel.alvarez.pas...@gmail.com> wrote:
>
> Hey, just this week I found some issues with the Univocity library that
> Spark uses internally to read CSV files.
>
> *Spark CSV Read Low Performance: EOFExceptions in Univocity Parser*
> https://issues.apache.org/jira/projects/SPARK/issues/SPARK-51579
>
> I initially assumed this issue had existed since Spark started using this
> library, but perhaps something changed between the versions you mentioned.
>
> Are you providing a schema, or are you letting Spark infer it? I've also
> noticed that when the schema doesn't match the columns in the CSV files
> (for example, a different number of columns), exceptions are thrown
> internally.
>
> Given all this, my initial hypothesis is that thousands upon thousands of
> exceptions are being thrown and then handled internally by the Univocity
> parser, so the user isn't even aware of what's happening.
>
> On Sun, Mar 23, 2025 at 2:40 AM, Prem Sahoo (<prem.re...@gmail.com>) wrote:
>
>> Hello,
>> I read a CSV file of 2.7 GB with 100 columns. Converting it to Parquet
>> takes 28 seconds with Spark 3.2 and Hadoop 2.7.6, but 34 seconds with
>> Spark 3.5.2 and Hadoop 3.4.1. That is a bad result.
>>
>> On Mar 22, 2025, at 9:21 PM, Ángel Álvarez Pascua
>> <angel.alvarez.pas...@gmail.com> wrote:
>>
>> Sure. I love performance challenges and mysteries!
>>
>> Could you please provide an example project or the steps to build one?
>>
>> Thanks.
>>
>> On Sun, Mar 23, 2025, 2:17 AM, Prem Sahoo <prem.re...@gmail.com> wrote:
>>
>>> Hello Team,
>>> I was working with Spark 3.2 and Hadoop 2.7.6, writing to MinIO object
>>> storage. It was slower than writing to MapR FS on the same stack. I then
>>> moved to the upgraded versions, Spark 3.5.2 and Hadoop 3.4.1, started
>>> writing to MinIO with the V2 FileOutputCommitter, and checked the
>>> performance, which was worse than with the old stack. I then tried the
>>> magic committer, and it came out slower than V2. So with the latest
>>> stack the performance is degraded. Could someone please assist?
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
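
P.S. While we wait for the synthetic data, here is roughly how I'd check whether malformed rows are being parsed and silently handled. The input path and the schema are placeholders for your real ones; mode and columnNameOfCorruptRecord are standard Spark CSV read options:

// spark-shell sketch: path and schema are placeholders
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.col

// Explicit schema plus an extra column that Spark fills in for rows that fail to parse.
val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType),
  StructField("amount", DoubleType),
  StructField("_corrupt_record", StringType)
))

val df = spark.read
  .option("header", "true")
  .option("mode", "PERMISSIVE")                            // keep bad rows instead of failing
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .schema(schema)
  .csv("/path/to/input.csv")

// Cache first: Spark disallows queries that reference only the corrupt-record column otherwise.
df.cache()
println(s"malformed rows: ${df.filter(col("_corrupt_record").isNotNull).count()}")

A high count here would support the theory that the parser is throwing and swallowing thousands of exceptions.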