Re: ASF board report draft for February 2025

Ángel Thu, 06 Feb 2025 05:21:27 -0800

I guess you're right, sorry. I was only commenting on the statement:
'Apache Spark is a fast and general-purpose engine for large-scale data
processing.' I never said this was a show stopper.


On Thu, 6 Feb 2025 at 12:34, Mich Talebzadeh (<[email protected]>)
wrote:

> I don't see its relevance to the ASF board report. It is a minor
> technicality and probably tangential. It is not a show stopper and the
> Board does not need to worry about it.
>
> Best to take this discussion to its own thread.
>
> Dr Mich Talebzadeh,
> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>
> On Thu, 6 Feb 2025 at 08:05, Ángel <[email protected]> wrote:
>
>> Btw, while analyzing this issue, I've also noticed that exactly the same
>> plan got stringified several times. Not only that, but even within a plan,
>> the same nodes got stringified dozens and dozens of times. I haven't
>> reported it because I added the memoization pattern to fix both things and,
>> despite fixing it ... the root issue with performance and OOM still
>> persisted.
>>
>> PS: Some nodes got stringified thousands of times. I was ... totally in
>> shock nobody had noticed it before.
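A minimal sketch of the memoization idea described above, written in Python rather than Spark's Scala. `PlanNode` and `PlanStringifier` are hypothetical stand-ins, not Spark classes; the point is only that a node shared by several parents, or a plan rendered twice, is stringified once.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class PlanNode:
    """Hypothetical stand-in for a physical plan node (not a Spark class)."""
    name: str
    children: List["PlanNode"] = field(default_factory=list)


class PlanStringifier:
    """Renders a plan tree, caching each node's string by object identity."""

    def __init__(self) -> None:
        # Keep the node next to its text so the id() key cannot be reused
        # by a different object while the cache entry is alive.
        self._cache: Dict[int, Tuple[PlanNode, str]] = {}
        self.render_calls = 0  # counts uncached (real) renders only

    def render(self, node: PlanNode) -> str:
        hit = self._cache.get(id(node))
        if hit is not None:
            return hit[1]
        self.render_calls += 1
        if node.children:
            inner = ", ".join(self.render(child) for child in node.children)
            text = f"{node.name}({inner})"
        else:
            text = node.name
        self._cache[id(node)] = (node, text)
        return text


# The same Filter subtree is referenced twice by the Join.
scan = PlanNode("Scan")
shared = PlanNode("Filter", [scan])
root = PlanNode("Join", [shared, shared])

stringifier = PlanStringifier()
first = stringifier.render(root)
second = stringifier.render(root)  # served entirely from the cache
```

As noted in the message above, this kind of caching removes the repeated work but did not by itself resolve the reported performance and OOM problem.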
>>
>> On Thu, 6 Feb 2025 at 8:55, Ángel (<[email protected]>)
>> wrote:
>>
>>> If I'm not wrong, the events were still being generated and stored, and
>>> they contained the plans (but without the description). Maybe we could
>>> simply generate the strings "on demand", in a lazy fashion, when the
>>> user requests them in the Spark UI.
>>>
>>> I don't know if that's even possible, just thought about it while
>>> walking my dog ...🐶
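A rough sketch of what "on demand" could look like, in Python and under stated assumptions: `LazyPlanDescription`, `expensive_describe`, and the event dictionary are all hypothetical, and nothing here reflects how Spark's event or UI code is actually structured. The description is built only when first read, then reused.

```python
from typing import Callable, List, Optional


class LazyPlanDescription:
    """Holds a builder function and runs it only on first access."""

    def __init__(self, build: Callable[[], str]) -> None:
        self._build = build
        self._text: Optional[str] = None

    @property
    def text(self) -> str:
        if self._text is None:
            self._text = self._build()  # expensive step, done at most once
        return self._text

    @property
    def materialized(self) -> bool:
        return self._text is not None


calls: List[int] = []


def expensive_describe() -> str:
    """Hypothetical stand-in for rendering a full plan description."""
    calls.append(1)
    return "== Physical Plan ==\nScan"


# The event is created and stored without paying for the string.
event = {"executionId": 7, "description": LazyPlanDescription(expensive_describe)}
```

One trade-off of this design: the builder has to keep a reference to whatever it needs to render later, so the plan itself stays in memory until the description is requested or the event is dropped.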
>>>
>>> On Thu, 6 Feb 2025, 8:41, Wenchen Fan <[email protected]> wrote:
>>>
>>>> Hi Angel,
>>>>
>>>> AFAIK many people rely on the Spark UI to debug/inspect their queries
>>>> with the query plan tree and metrics, but you are right that plan string
>>>> generation is expensive, and we shouldn't do it for every AQE plan change.
>>>> Maybe we should do it only once to report the final plan for AQE? Let's
>>>> continue the discussion on the PR.
>>>>
>>>> On Thu, Feb 6, 2025 at 1:48 PM Ángel <[email protected]>
>>>> wrote:
>>>>
>>>>> I'd like to add that Spark is not as fast as it should be, primarily
>>>>> due to its internal verbosity, as reported in ticket *SPARK-50992
>>>>> <https://issues.apache.org/jira/browse/SPARK-50992>*. After
>>>>> submitting this PR <https://github.com/apache/spark/pull/49724>, I
>>>>> received some comments, which I quickly addressed, but the PR has since
>>>>> stalled.
>>>>>
>>>>> I strongly believe that Spark should prioritize performance over
>>>>> internal logging, especially when it has such a significant impact on
>>>>> execution speed and can lead to memory issues.
>>>>>
>>>>> In *GraphFrames*, the temporary workaround was to disable *AQE
>>>>> (Adaptive Query Execution)*. Just last week, I gave the same advice
>>>>> to a colleague experiencing performance issues with a *Databricks*
>>>>> notebook, and it worked. Having to disable *AQE* to improve performance
>>>>> because Spark continuously generates string descriptions of physical
>>>>> plans internally, which very likely no one is ever going to use, makes
>>>>> little sense to me.
>>>>> PS: I wish I were wrong, but I really think I am not.
>>>>> PS2: The first part of a series of articles I'm writing about this
>>>>> issue: link
>>>>> <https://medium.com/@angel.alvarez.pascua/apache-spark-wtf-i-like-it-when-a-plan-comes-together-part-i-48c52a667288>
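For reference, the workaround mentioned above amounts to setting `spark.sql.adaptive.enabled` to false. A minimal PySpark-style sketch, assuming an already created `SparkSession` is passed in; the helper name `disable_aqe` is hypothetical.

```python
AQE_KEY = "spark.sql.adaptive.enabled"


def disable_aqe(spark) -> None:
    """Turn Adaptive Query Execution off for the given session.

    `spark` is assumed to be an existing SparkSession; only its
    runtime configuration (`spark.conf.set`) is used here.
    """
    spark.conf.set(AQE_KEY, "false")


# Typical use in a notebook where `spark` already exists:
#     disable_aqe(spark)
# The same key can also be supplied when building the session:
#     SparkSession.builder.config(AQE_KEY, "false")
```

This trades away AQE's runtime re-optimization, which is why the message treats it as a temporary workaround rather than a fix.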
>>>>>
>>>>> On Thu, 6 Feb 2025 at 6:30, Adam Hobbs
>>>>> (<[email protected]>) wrote:
>>>>>
>>>>>> I'd like to add something about the failure to get any traction on
>>>>>> shepherding the Structured Streaming DRA PR. Multiple times now there
>>>>>> have been calls for help to get this initiative over the line, and the
>>>>>> response has been disappointing. The GitHub PR has been closed due to
>>>>>> inaction (https://github.com/apache/spark/pull/42352).
>>>>>>
>>>>>> This seems like a bit of a failure in the process.
>>>>>> Regards,
>>>>>>
>>>>>> Adam Hobbs
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Matei Zaharia <[email protected]>
>>>>>> Sent: Thursday, 6 February 2025 2:57 PM
>>>>>> To: Spark dev list <[email protected]>
>>>>>> Cc: [email protected]
>>>>>> Subject: ASF board report draft for February 2025
>>>>>>
>>>>>> It’s time to send our next ASF board report again on February 12th.
>>>>>> Here’s an initial draft — feel free to suggest changes:
>>>>>>
>>>>>> =====================
>>>>>>
>>>>>>
>>>>>> Description:
>>>>>>
>>>>>> Apache Spark is a fast and general purpose engine for large-scale
>>>>>> data processing. It offers high-level APIs in Java, Scala, Python, R and
>>>>>> SQL as well as a rich set of libraries including stream processing,
>>>>>> machine learning, and graph analytics.
>>>>>>
>>>>>> Issues for the board:
>>>>>>
>>>>>> - None
>>>>>>
>>>>>> Project status:
>>>>>>
>>>>>> - The Spark 4.0 branch has been cut and has entered the QA stage. We
>>>>>> encourage the community to test it out!
>>>>>> - We released Spark 3.5.4 on December 20th, 2024.
>>>>>> - The PMC voted to add one new committer (Bingkun Pan) and one new
>>>>>> PMC member (Jie Yang) to the project.
>>>>>> - The proposal to "Use plain text logs by default" was successfully
>>>>>> passed.
>>>>>>
>>>>>> Trademarks:
>>>>>>
>>>>>> - No changes since last report.
>>>>>>
>>>>>> Latest releases:
>>>>>>
>>>>>> - Spark 3.5.4 was released on Dec 20, 2024
>>>>>> - Spark 3.4.4 was released on Oct 27, 2024
>>>>>> - Spark 4.0 Preview 2 was released on Sept 26, 2024
>>>>>>
>>>>>> Committers and PMC:
>>>>>>
>>>>>> - The latest committer was added on Nov 13, 2024 (Bingkun Pan).
>>>>>> - The latest PMC member was added on Jan 21st, 2025 (Jie Yang).
>>>>>>
>>>>>> =====================
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe e-mail: [email protected]
>>>>>>
