Could you take three thread dumps from one of the executors while Spark is
performing the conversion? You can use the Spark UI for that.
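They're on the Executors tab of the UI; if it's easier, the monitoring REST API also serves them, roughly like this sketch (host, application ID, and executor ID are placeholders):

    import scala.io.Source

    // Spark's monitoring REST API exposes a live executor's stack traces at
    // /applications/[app-id]/executors/[executor-id]/threads on the UI port.
    val url = "http://driver-host:4040/api/v1/applications/app-20250323032000-0001/executors/1/threads"
    println(Source.fromURL(url).mkString)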
On Sun, Mar 23, 2025 at 3:20, Ángel Álvarez Pascua (<angel.alvarez.pas...@gmail.com>) wrote:
> Without the data, it's difficult to analyze. Could you prov
That's why I used the words "synthetic" and "fake" when referring to data.
Anyway, the most important thing might be the thread dumps.
On Sun, Mar 23, 2025 at 3:29, Prem Sahoo () wrote:
> This is inside my current project; I can’t move the data to the public domain.
> But it seems there is someth
Hello, I read a CSV file of 2.7 GB with 100 columns. When I convert it to Parquet with Spark 3.2 and Hadoop 2.7.6 it takes 28 secs, but with Spark 3.5.2 and Hadoop 3.4.1 it takes 34 secs. That stat is bad.
Sent from my iPhone
On Mar 22, 2025, at 9:21 PM, Ángel Álvare
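For reference, a minimal sketch of the kind of conversion being timed; the paths and options are made-up placeholders, not the reporter's actual job:

    import org.apache.spark.sql.SparkSession

    object CsvToParquet {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("csv-to-parquet-benchmark")
          .getOrCreate()

        val start = System.nanoTime()

        // Read the ~2.7 GB, 100-column CSV (placeholder path and options)
        val df = spark.read
          .option("header", "true")
          .csv("/data/input/big.csv")

        // Write it back out as Parquet
        df.write.mode("overwrite").parquet("/data/output/big_parquet")

        val elapsedSec = (System.nanoTime() - start) / 1e9
        println(f"CSV -> Parquet took $elapsedSec%.1f s")

        spark.stop()
      }
    }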
We've seen significant performance gains in CSV going from 3.1 -> 3.5.
You've pointed out exactly the change in fileoutputcommitter: v1
(safe, serial, slow) -> v2 (unsafe on object stores without atomic rename,
parallel, faster). In v1, the output files are moved by the driver
serially from the s
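For anyone comparing the two, a sketch of how the committer version can be selected; the value shown is illustrative, and as noted above v2 is unsafe on stores without atomic rename:

    import org.apache.spark.sql.SparkSession

    // "mapreduce.fileoutputcommitter.algorithm.version" is the standard Hadoop
    // key; the "spark.hadoop." prefix passes it through to the Hadoop config.
    val spark = SparkSession.builder()
      .appName("committer-comparison")
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "1") // "1" safe/serial, "2" parallel
      .getOrCreate()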
Hey, just this week I found some issues with the Univocity library that
Spark internally uses to read CSV files.
*Spark CSV Read Low Performance: EOFExceptions in Univocity Parser*
https://issues.apache.org/jira/projects/SPARK/issues/SPARK-51579
I initially assumed this issue had existed since Sp
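For orientation (this is not a repro of the JIRA above), even a plain read like this sketch goes through the Univocity-backed CSV path; the file path is a placeholder and a spark-shell session is assumed:

    // Spark's CSV datasource delegates tokenizing to the bundled Univocity
    // parser, so any CSV read exercises that code path.
    val df = spark.read
      .option("header", "true")
      .csv("/tmp/sample.csv")
    df.count() // forces a full parse of the file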
Without the data, it's difficult to analyze. Could you provide some
synthetic data so I can investigate this further? The schema and a few
sample fake rows should be sufficient.
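For example, something along these lines would do; the schema and values are invented placeholders (spark-shell session assumed):

    import spark.implicits._

    // A few typed columns and fake rows are often enough to reproduce a
    // parsing or performance issue without sharing real data.
    val fake = Seq(
      ("id-001", 42, 3.14, "2025-03-23"),
      ("id-002", 7, 2.72, "2025-03-22")
    ).toDF("id", "count", "score", "event_date")

    fake.write.mode("overwrite").option("header", "true").csv("/tmp/synthetic_csv")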
On Sun, Mar 23, 2025 at 3:17, Prem Sahoo () wrote:
> I am providing the schema, and the schema is actually correct m
This is inside my current project; I can’t move the data to the public domain. But it seems something changed that made this slowness.
Sent from my iPhone
On Mar 22, 2025, at 10:23 PM, Ángel Álvarez Pascua wrote:
Could you take three thread dumps from one of the executors while Spark is perform
I am providing the schema, and the schema is actually correct, meaning it has all the columns available in the CSV. So we can rule that out as the cause of the slowness. Maybe there are some other contributing options.
Sent from my iPhone
On Mar 22, 2025, at 10:05 PM, Ángel Álvarez Pascua wrote:
Hey, just this week
Thanks @Ángel, I have a PR,
https://github.com/apache/spark/pull/50334, to fix it. Please help review.
On Sat, Mar 22, 2025 at 16:27, Ángel Álvarez Pascua wrote:
> Is anyone looking into this issue? If not, I'd like to try fixing it. I've
> never tried out Spark Connect, so... 2x1! (way better than spending the
>
Great to hear that!
Fortunately, I was waiting for someone to answer and hadn't looked at it
yet.
It seems quite a straightforward solution for an issue that shouldn't
exist, had the right unit test been implemented. What do you think about
adding a test to check that this issue doesn't happen again?
Sure. I love performance challenges and mysteries!
Please, could you provide an example project or the steps to build one?
Thanks.
On Sun, Mar 23, 2025 at 2:17, Prem Sahoo wrote:
> Hello Team,
> I was working with Spark 3.2 and Hadoop 2.7.6 and writing to MinIO object
> storage. It was slower
Hello Team,
I was working with Spark 3.2 and Hadoop 2.7.6, writing to MinIO object
storage. It was slower compared to writing to MapR FS with the same tech
stack. Then I moved on to the upgraded versions, Spark 3.5.2 and Hadoop 3.4.1,
which started writing to MinIO with the V2 fileoutputcommitter
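For context, a hedged sketch of the kind of S3A setup used to point Spark at MinIO; the endpoint, credentials, and bucket are placeholders, and the committer key is the same one discussed above:

    import org.apache.spark.sql.SparkSession

    object MinioWrite {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("minio-write")
          // MinIO speaks the S3 API, so Spark reaches it via Hadoop's S3A connector.
          .config("spark.hadoop.fs.s3a.endpoint", "http://minio.internal:9000")
          .config("spark.hadoop.fs.s3a.access.key", sys.env.getOrElse("MINIO_ACCESS_KEY", ""))
          .config("spark.hadoop.fs.s3a.secret.key", sys.env.getOrElse("MINIO_SECRET_KEY", ""))
          .config("spark.hadoop.fs.s3a.path.style.access", "true") // MinIO usually needs path-style URLs
          // The V2 committer discussed in this thread; unsafe without atomic rename.
          .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
          .getOrCreate()

        val df = spark.range(1000).toDF("id")
        df.write.mode("overwrite").parquet("s3a://my-bucket/output/")
        spark.stop()
      }
    }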
+1 (non-binding)
On Mar 21, 2025, at 12:52 PM, Jules Damji wrote:
+1 (non-binding)
—
Sent from my iPhone
Pardon the dumb thumb typos :)
On Mar 21, 2025, at 11:47 AM, Anton Okolnychyi wrote:
Hi all,
I would like to start a vote on adding support for constraints to DSv2.
Discussion thread:
+1 (non-binding)
Thanks for working on this Anton! Some links to other engines that also did
something similar:
HIVE-13076 - https://issues.apache.org/jira/browse/HIVE-13076
IMPALA-3531 - https://issues.apache.org/jira/browse/IMPALA-3531
In fact, Spark had a very old Jira
SPARK-19842 - https://
+1
On Sat, Mar 22, 2025 at 7:01 PM Peter Toth wrote:
> +1
>
> On Fri, Mar 21, 2025 at 10:24 PM Szehon Ho
> wrote:
>
>> +1 (non binding)
>>
>> Agree with Anton, data sources like the open table formats define the
>> requirement, and definitely need engines to write to it accordingly.
>>
>> Thank
+1
On Fri, Mar 21, 2025 at 10:24 PM Szehon Ho wrote:
> +1 (non binding)
>
> Agree with Anton, data sources like the open table formats define the
> requirement, and definitely need engines to write to it accordingly.
>
> Thanks,
> Szehon
>
> On Fri, Mar 21, 2025 at 1:31 PM Anton Okolnychyi
> wr
+1
Sent from my iPhone
On Mar 21, 2025, at 2:25 PM, Szehon Ho wrote:
+1 (non-binding)
Agree with Anton, data sources like the open table formats define the requirement, and definitely need engines to write to it accordingly.
Thanks,
Szehon
On Fri, Mar 21, 2025 at 1:31 PM Anton Okolnychyi
Is anyone looking into this issue? If not, I'd like to try fixing it. I've
never tried out Spark Connect, so... 2x1! (way better than spending the
weekend binge-watching on Netflix 😅🤣).
@Bobby, thanks a lot, not only for reporting the issue
but also for providing a time-saving project for testing
One thing is enforcing the quality of the data Spark is producing, and
another thing entirely is defining an external data model from Spark.
The proposal doesn’t necessarily facilitate data accuracy and consistency.
Defining constraints does help with that, but the question remains: Is
Spark trul