Create the JIRA and we can look at it.
If it is just write performance, then I am confident that Hadoop 3.4.1
is way faster at writing, with some extra parameters available to make
things faster.
https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/performance.html#Options_to_T
1. MinIO does actually have atomic object renames, but as it is file by
file, task commit is non-atomic;
2. v2 task commit is also unsafe: it just writes straight to the destination.
No committer which supports task failure can be as fast as
this.
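For reference, here is a hedged sample of the kind of fs.s3a tuning options that performance page documents, as they would appear in spark-defaults.conf (the values are illustrative, not recommendations; tune against your own MinIO deployment):

```properties
# Sketch only: a few of the S3A options from the hadoop-aws performance page,
# expressed with Spark's spark.hadoop. prefix. Values are placeholders.
spark.hadoop.fs.s3a.threads.max              64
spark.hadoop.fs.s3a.connection.maximum       96
spark.hadoop.fs.s3a.fast.upload.buffer       disk
spark.hadoop.fs.s3a.multipart.size           64M
# the S3A committers avoid the rename-based commit of FileOutputCommitter
spark.hadoop.fs.s3a.committer.name           directory
```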
Further reading:
https://github.
Just one more variable: Spark 3.5.2 runs on Kubernetes and Spark 3.2.0 runs on YARN. It seems Kubernetes can be a cause of slowness too.

Sent from my iPhone

On Mar 24, 2025, at 7:10 PM, Prem Gmail wrote:
> Hello Spark Dev/users,
> Anyone has any clue why and how a better version has performance iss
@Prem Sahoo , could you test both versions of
Spark+Hadoop by replacing your "write to MinIO" statement with
write.format("noop")? This would help us determine whether the issue lies
on the reader side or the writer side.
On Sun, Mar 23, 2025 at 4:53, Prem Gmail ()
wrote:
> V2 writer in 3.
The problem is on the writer's side. It takes longer to write to MinIO with
Spark 3.5.2 and Hadoop 3.4.1, so it seems there are some tech changes
between Hadoop 2.7.6 and 3.4.1 which made the write process slower.
On Sun, Mar 23, 2025 at 12:09 AM Ángel Álvarez Pascua <
angel.alvarez.pas...@gmail.c
Could you take three thread dumps from one of the executors while Spark is
performing the conversion? You can use the Spark UI for that.
On Sun, Mar 23, 2025 at 3:20, Ángel Álvarez Pascua (<
angel.alvarez.pas...@gmail.com>) wrote:
> Without the data, it's difficult to analyze. Could you prov
That's why I used the words "synthetic" and "fake" when referring to data.
Anyway, the most important thing might be the thread dumps.
On Sun, Mar 23, 2025 at 3:29, Prem Sahoo ()
wrote:
> This is inside my current project , I can’t move data to public domain .
> But it seems there is someth
Hello,
I read a CSV file of 2.7 GB with 100 columns. When I convert it to Parquet with Spark 3.2 and Hadoop 2.7.6 it takes 28 secs, but with Spark 3.5.2 and Hadoop 3.4.1 it takes 34 secs. This stat is bad.

Sent from my iPhone

On Mar 22, 2025, at 9:21 PM, Ángel Álvare
We've seen significant performance gains in CSV going from 3.1 -> 3.5.
You've exactly pointed out the change in FileOutputCommitter: v1
(safe, serial, slow) -> v2 (unsafe on object stores without atomic rename,
parallel, faster). In v1, the output files are moved by the driver
serially from the s
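As a minimal illustration of flipping between the two committer algorithms described above, assuming the standard Hadoop MapReduce committer key (the value shown is the safe default, not a recommendation):

```properties
# Illustrative spark-defaults.conf fragment.
# v1: task output renamed into place serially at job commit -- safe, slow.
# v2: task commit renames directly into the destination -- parallel, faster,
#     but unsafe on stores without atomic rename (as discussed above).
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 1
```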
Hey, just this week I found some issues with the Univocity library that
Spark internally uses to read CSV files.
*Spark CSV Read Low Performance: EOFExceptions in Univocity Parser*
https://issues.apache.org/jira/projects/SPARK/issues/SPARK-51579
I initially assumed this issue had existed since Sp
Without the data, it's difficult to analyze. Could you provide some
synthetic data so I can investigate this further? The schema and a few
sample fake rows should be sufficient.
On Sun, Mar 23, 2025 at 3:17, Prem Sahoo ()
wrote:
> I am providing the schema , and schema is actually correct m
This is inside my current project, I can't move data to the public domain. But it seems something has changed which caused this slowness.

Sent from my iPhone

On Mar 22, 2025, at 10:23 PM, Ángel Álvarez Pascua wrote:
> Could you take three thread dumps from one of the executors while Spark is perform
I am providing the schema, and the schema is actually correct, meaning it has all the columns available in the CSV. So we can rule that out as the cause of the slowness. Maybe there are some other contributing options.

Sent from my iPhone

On Mar 22, 2025, at 10:05 PM, Ángel Álvarez Pascua wrote:
> Hey, just this week
Sure. I love performance challenges and mysteries!
Please, could you provide an example project or the steps to build one?
Thanks.
On Sun, Mar 23, 2025, 2:17, Prem Sahoo wrote:
> Hello Team,
> I was working with Spark 3.2 and Hadoop 2.7.6 and writing to MinIO object
> storage . It was slower