I guess you're right, sorry. I was only commenting on the statement: 'Apache Spark is a fast and general-purpose engine for large-scale data processing.' I never said this was a stopper.
On Thu, 6 Feb 2025 at 12:34, Mich Talebzadeh (<mich.talebza...@gmail.com>) wrote:

> I don't see its relevance to the ASF board report. It is a minor technicality
> and probably tangential. It is not a show stopper and the Board does not
> need to worry about it.
>
> Best to take this discussion to its own thread.
>
> Dr Mich Talebzadeh,
> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> On Thu, 6 Feb 2025 at 08:05, Ángel <angel.alvarez.pas...@gmail.com> wrote:
>
>> Btw, while analyzing this issue, I've also noticed that exactly the same
>> plan got stringified several times. Not only that, but even within a plan,
>> the same nodes got stringified dozens and dozens of times. I haven't
>> reported it because I added the memoization pattern to fix both things and,
>> despite fixing it ... the root issue with performance and OOM still
>> persisted.
>>
>> PS: Some nodes got stringified thousands of times. I was ... totally in
>> shock that nobody had noticed it before.
>>
>> On Thu, 6 Feb 2025 at 8:55, Ángel (<angel.alvarez.pas...@gmail.com>)
>> wrote:
>>
>>> If I'm not wrong, the events were still being generated and stored, and
>>> contained the plans (but without the description). Maybe we could
>>> simply generate the strings "on demand" in a lazy fashion, when the user
>>> requests them in the Spark UI.
>>>
>>> I don't know if that's even possible, just thought about it while
>>> walking my dog ...🐶
>>>
>>> On Thu, 6 Feb 2025, 8:41, Wenchen Fan <cloud0...@gmail.com> wrote:
>>>
>>>> Hi Angel,
>>>>
>>>> AFAIK many people rely on the Spark UI to debug/inspect their queries
>>>> with the query plan tree and metrics, but you are right that plan string
>>>> generation is expensive, and we shouldn't do it for every AQE plan change.
>>>> Maybe we should do it only once to report the final plan for AQE? Let's
>>>> continue the discussion on the PR.
>>>>
>>>> On Thu, Feb 6, 2025 at 1:48 PM Ángel <angel.alvarez.pas...@gmail.com>
>>>> wrote:
>>>>
>>>>> I'd like to add that Spark is not as fast as it should be, primarily
>>>>> due to its internal verbosity, as reported in ticket *SPARK-50992
>>>>> <https://issues.apache.org/jira/browse/SPARK-50992>*. After
>>>>> submitting this PR <https://github.com/apache/spark/pull/49724>, I
>>>>> received some comments, which I quickly addressed, but the PR has since
>>>>> stalled.
>>>>>
>>>>> I strongly believe that Spark should prioritize performance over
>>>>> internal logging, especially when it has such a significant impact on
>>>>> execution speed and can lead to memory issues.
>>>>>
>>>>> In *GraphFrames*, the temporary workaround was to disable *AQE
>>>>> (Adaptive Query Execution)*. Just last week, I gave the same advice
>>>>> to a colleague experiencing performance issues with a *Databricks*
>>>>> notebook, and it worked. Having to disable *AQE* to improve performance
>>>>> because Spark continuously generates string descriptions of physical
>>>>> plans internally, which very likely no one is going to use, makes
>>>>> little sense to me.
>>>>>
>>>>> PS: I wish I was wrong, but I really think I am not.
>>>>> PS2: The first part of a series of articles I'm writing about this
>>>>> issue: link
>>>>> <https://medium.com/@angel.alvarez.pascua/apache-spark-wtf-i-like-it-when-a-plan-comes-together-part-i-48c52a667288>
>>>>>
>>>>> On Thu, 6 Feb 2025 at 6:30, Adam Hobbs
>>>>> (<adam.ho...@bendigoadelaide.com.au.invalid>) wrote:
>>>>>
>>>>>> I'd like to add something about the failure to get any traction on
>>>>>> shepherding the structured streaming DRA PR. Multiple times now there
>>>>>> have been calls for help to get this initiative over the line and the
>>>>>> response has been disappointing. The GitHub PR has been closed due to
>>>>>> inaction (https://github.com/apache/spark/pull/42352).
>>>>>>
>>>>>> This seems like a bit of a failure in the process.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Adam Hobbs
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Matei Zaharia <matei.zaha...@gmail.com>
>>>>>> Sent: Thursday, 6 February 2025 2:57 PM
>>>>>> To: Spark dev list <dev@spark.apache.org>
>>>>>> Cc: priv...@spark.apache.org
>>>>>> Subject: ASF board report draft for February 2025
>>>>>>
>>>>>> It’s time to send our next ASF board report again on February 12th.
>>>>>> Here’s an initial draft; feel free to suggest changes:
>>>>>>
>>>>>> =====================
>>>>>>
>>>>>> Description:
>>>>>>
>>>>>> Apache Spark is a fast and general purpose engine for large-scale
>>>>>> data processing. It offers high-level APIs in Java, Scala, Python, R and
>>>>>> SQL as well as a rich set of libraries including stream processing,
>>>>>> machine learning, and graph analytics.
>>>>>>
>>>>>> Issues for the board:
>>>>>>
>>>>>> - None
>>>>>>
>>>>>> Project status:
>>>>>>
>>>>>> - The Spark 4.0 branch has been cut and has entered the QA stage. We
>>>>>> encourage the community to test it out!
>>>>>> - We released Spark 3.5.4 on December 20th, 2024.
>>>>>> - The PMC voted to add one new committer (Bingkun Pan) and one new
>>>>>> PMC member (Jie Yang) to the project.
>>>>>> - The proposal to "Use plain text logs by default" was successfully
>>>>>> passed.
>>>>>>
>>>>>> Trademarks:
>>>>>>
>>>>>> - No changes since last report.
>>>>>>
>>>>>> Latest releases:
>>>>>>
>>>>>> - Spark 3.5.4 was released on Dec 20, 2024
>>>>>> - Spark 3.4.4 was released on Oct 27, 2024
>>>>>> - Spark 4.0 Preview 2 was released on Sept 26, 2024
>>>>>>
>>>>>> Committers and PMC:
>>>>>>
>>>>>> - The latest committer was added on Nov 13, 2024 (Bingkun Pan).
>>>>>> - The latest PMC member was added on Jan 21st, 2025 (Jie Yang).
>>>>>>
>>>>>> =====================
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org