Re: [DISCUSS] SPIP: Upgrade Apache Hive to 4.x

2025-06-09 Thread Mich Talebzadeh
but worth investigating it. cheers Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> On Sat, 7 Jun 2025 at 08:08, Ángel Álvarez Pascua < angel.al

Re: [DISCUSS] SPIP: Upgrade Apache Hive to 4.x

2025-06-05 Thread Mich Talebzadeh
tunately something is missing somewhere They have seen this error with postgres Hive metastore DB as well. I need to work on it when I have a chance HTH Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com

Re: [VOTE] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-06-04 Thread Mich Talebzadeh
And great effort by you Jerry to drive this proposal through. Let us see how it progresses. Will be interesting Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

Re: [VOTE] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-06-02 Thread Mich Talebzadeh
(e.g., transactional sinks) or careful custom implementation for both stateless and stateful operations etc Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-p

Re: Question Regarding Spark Dependencies in Scala

2025-05-31 Thread Mich Talebzadeh
Are you running in YARN mode and you want to put these jar files into HDFS in a distributed cluster? HTH Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-30 Thread Mich Talebzadeh
"near real-time streaming" or "interactive streaming" to accurately describe the system's capabilities and bridge the gap between academic rigor and practical industry usage. This IMO is a good suggestion to reduce ambiguity. HTH Dr Mich Talebzadeh, Architect | Data Science | Fina

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-30 Thread Mich Talebzadeh
" are typically operating on the softer end of this spectrum, providing performance crucial for applications under considerations (for example within SLAs) where delays are undesirable but not show stopper. I therefore suggest the SPIP should mention this explicitly, so we can move on

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-29 Thread Mich Talebzadeh
" Principle In summary, "Real-time Mode" seems to describe an approach that delivers low-latency processing with high reliability and ease of use, leveraging established, battle-tested components. I invite the audience to have a discussion on this. HTH Dr Mich Talebzadeh, Architect | Dat

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-28 Thread Mich Talebzadeh
" answer is simply not good enough. As a corollary, it is a fundamental concept, so it has to be treated as such in the SPIP, not as a comment. Hope this clarifies the connection in practical terms Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linked

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-28 Thread Mich Talebzadeh
of the application. If I get the right answer too slowly, it becomes useless or wrong Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> On Wed, 28 May 2025 at

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-28 Thread Mich Talebzadeh
tra low-latency execution mode. A time interval can also be specified, e.g. “300 Seconds”, to indicate how long each micro-batch should run for. " will inevitably depend on many factors. Not that simple HTH Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analys

Re: ASF Board report draft for May 2025

2025-05-10 Thread Mich Talebzadeh
Maybe you should emphasize Spark 4 (RC5) as the current state of Spark 4, undergoing extensive testing. HTH Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

Re: [VOTE] SPIP: Declarative Pipelines

2025-04-09 Thread Mich Talebzadeh
+1 Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> On Wed, 9 Apr 2025 at 20:05, Gengliang Wang wrote: > +1 > > On Wed, Apr 9, 2025 at 11:57 

Re: [DISCUSS] SPIP: Declarative Pipelines

2025-04-09 Thread Mich Talebzadeh
+1 Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> On Wed, 9 Apr 2025 at 08:07, Peter Toth wrote: > +1 > > On Wed, Apr 9, 2025 at 8:51 AM C

Re: [DISCUSS] Upgrade Hive compile time dependency to 4.0

2025-03-26 Thread Mich Talebzadeh
Because of dependencies, we need to ensure that the underlying artifact (Hive 4.0.1) is also stable enough. We should aim to establish that first, then look at release timelines and where it fits. cheers Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR

Re: [DISCUSS] Upgrade Hive compile time dependency to 4.0

2025-03-23 Thread Mich Talebzadeh
major headache. Now I just need to customise various files under $HIVE_HOME/conf and then I will have some testing underway. HTH Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-p

Re: [DISCUSS] New Spark Connect Client repository for Swift language

2025-03-16 Thread Mich Talebzadeh
+1 Sounds like a plan Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> On Sun, 16 Mar 2025 at 21:10, Martin Grund wrote: > So I was just playing with

Re: [VOTE] Technical Justification for the veto of the "Retain migration logic..." code change proposal is not valid

2025-03-16 Thread Mich Talebzadeh
open forum, then the person is expected to back it up. *I cannot see how anyone could object to the statement: if you make a claim or have a strong opinion, be prepared to prove it or debate it.* Regardless, as stated mistakes can and do happen. HTH Dr Mich Talebzadeh, Architect | Data Science

Re: [DISCUSS] Involve any hack / workaround to not include vendor name in migration logic

2025-03-16 Thread Mich Talebzadeh
Hi Jungtaek. With regard to your point below "...Hi dev, I'm really tired of the discussion which does not move forward because the argument is not backed by strict ASF policy" Regardless, we all appreciate your efforts and your tenacity. cheers Dr Mich Talebzadeh, A

Re: [VOTE] Technical Justification for the veto of the "Retain migration logic..." code change proposal is not valid

2025-03-15 Thread Mich Talebzadeh
of Compound Sentiment Scores) / (Total Messages Sent) [image: sentiment_score.png] Dongjoon sentiment seems to be pretty neutral and the rest mildly positive HTH Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://ww
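The averaging described in the snippet above can be sketched in a few lines of Python. This is only an illustration: it assumes the truncated numerator is a sum of per-message compound scores, each in the range [-1, 1], and the function name and sample values are hypothetical, not taken from the thread.

```python
def average_compound_score(scores):
    """Return (sum of compound sentiment scores) / (total messages sent).

    `scores` holds one compound score per message; the name and the
    [-1, 1] range are assumptions made for this illustration.
    """
    if not scores:
        raise ValueError("at least one message is required")
    return sum(scores) / len(scores)


# Hypothetical scores for three messages from one participant.
sample_scores = [0.5, -0.1, 0.2]
average = average_compound_score(sample_scores)
```

A value near zero would read as neutral, and a small positive value as mildly positive, which matches how the snippet interprets the chart.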

Re: [VOTE][RESULT] Retain migration logic of incorrect `spark.databricks.*` configuration in Spark 4.0.x

2025-03-15 Thread Mich Talebzadeh
This is my gist Mark from your passionate language I gather you see this as a "Code Change" veto. Your reasoning seems to be straightforward, i.e. the vote's purpose is to decide whether to add code (migration logic) to the Spark 4.0 branch. In your view, the outcome of the vote directly alters th

Re: [DISCUSS] Upgrade Hive compile time dependency to 4.0

2025-03-12 Thread Mich Talebzadeh
Agreed. A Hive upgrade is more time-consuming, as it involves backing up the Hive schema on your metastore and then running the Hive-provided schema upgrade scripts against it. That could be problematic, but it needs to be done one way or another. HTH Dr Mich Talebzadeh, Architect | Data Science

Re: [DISCUSS] Upgrade Hive compile time dependency to 4.0

2025-03-11 Thread Mich Talebzadeh
s, and bug fixes. Compiling against it would allow Spark to take advantage of these. Plus using the latest versions of both Spark and Hive is important for maintaining a secure data platform. HTH Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedi

Re: [DISCUSS] New Spark Connect Client repository for Swift language

2025-03-11 Thread Mich Talebzadeh
The first link seems to be still invalid, although the proposal itself is sound https://github.com/apache/spark-connect-swift Can someone else please confirm it? Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <ht

Re: [DISCUSS] New Spark Connect Client repository for Swift language

2025-03-10 Thread Mich Talebzadeh
Glad to see that this repository has finally been created Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> On Mon, 10 Mar 2025 at 23:37, Dongjoon Hyun

Re: [DISCUSS] New Spark Connect Client repository for Swift language

2025-03-09 Thread Mich Talebzadeh
Can you please double check the first link, I am getting 404! thanks Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> On Sun, 9 Mar 2025 at 22:31, Dongjoo

Re: Seek for consensus on landing Spark Connect implementation for transformWithState in Spark 4.0.0

2025-03-04 Thread Mich Talebzadeh
Sure we leave it as it is. No big deal Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> On Tue, 4 Mar 2025 at 23:29, Jungtaek Lim wrote: > Thanks for

Re: Seek for consensus on landing Spark Connect implementation for transformWithState in Spark 4.0.0

2025-03-04 Thread Mich Talebzadeh
ately that Spark Connect is an interface for interacting with Spark, not a replacement for the entire system. HTH .. Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

Re: Seek for consensus on landing Spark Connect implementation for transformWithState in Spark 4.0.0

2025-03-04 Thread Mich Talebzadeh
Thanks. Can you point to a link or any further documentation please? Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> On Tue, 4 Mar 2025 at 13:22, Herm

Re: Apache - GSOC'25 projects / Contributions

2025-02-24 Thread Mich Talebzadeh
more informed knowledge. HTH Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> On Mon, 24 Feb 2025 at 19:13, D. Mohith Akshay wrote: > Hello Everyone, >

Re: [VOTE] Release Spark 3.5.5 (RC1)

2025-02-23 Thread Mich Talebzadeh
+1 on the basis of Dongjoon statement which I trust HTH Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> On Mon, 24 Feb 2025 at 00:47, Dongjoon Hyun

Re: [VOTE] SPIP: Add the TIME data type

2025-02-23 Thread Mich Talebzadeh
+1 for me following my recent comments on the discussion thread on this topic as well Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> On Sun, 23 Feb 2025 at

Re: [DISCUSS] SPIP: Add the TIME data type

2025-02-23 Thread Mich Talebzadeh
thread. HTH Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> On Mon, 17 Feb 2025 at 16:07, Max Gekk wrote: > Hello Mich, > > Thank you for the pro

Re: [VOTE] Release Spark 4.0.0 (RC1)

2025-02-20 Thread Mich Talebzadeh
. - RC1 is typically followed by a sequence of additional RCs (e.g., RC2, RC3) as needed, until all blockers are resolved and the final release is ready. HTH Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <ht

Re: [VOTE] Release Spark 4.0.0 (RC1)

2025-02-19 Thread Mich Talebzadeh
+1 Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> On Wed, 19 Feb 2025 at 09:31, Wenchen Fan wrote: > Please vote on releasing the following candidate

Re: [VOTE] Release Apache Spark 3.5.5 deprecating `spark.databricks.*` configuration

2025-02-18 Thread Mich Talebzadeh
+1 Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> On Wed, 19 Feb 2025 at 06:51, Ángel wrote: > +1 (non-binding) > > El mié, 19 feb 2025, 7:

Re: Deprecating and banning `spark.databricks.*` config from Apache Spark repository

2025-02-18 Thread Mich Talebzadeh
through intermediate versions to avoid breakage. HTH Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> On Wed, 19 Feb 2025 at 00:41, Jungtaek Lim

Re: [DISCUSS] SPIP: Add the TIME data type

2025-02-13 Thread Mich Talebzadeh
program and make it work HTH Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> On Wed, 12 Feb 2025 at 19:53, Max Gekk wrote: > Hello Mich, > > >

Re: ASF board report draft for February 2025

2025-02-12 Thread Mich Talebzadeh
✅ *"Thanks, Matei. ✅ Looks like a plan!* *📌 We resurrected the old thread! * *https://lists.apache.org/thread/wwjyp1bhryvx7ytooj1lqtd8kgzxb6vq <https://lists.apache.org/thread/wwjyp1bhryvx7ytooj1lqtd8kgzxb6vq>* 🔗 Hopefully, there will be more traction this round. HTH Dr Mich

Re: [DISCUSS] SPIP: Add the TIME data type

2025-02-12 Thread Mich Talebzadeh
it to a default value. HTH Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> On Wed, 12 Feb 2025 at 18:56, Sakthi wrote: > Thanks for the proposal, Max. T

Re: ASF board report draft for February 2025

2025-02-11 Thread Mich Talebzadeh
Let us carry on on that thread. Need to catch-up HTH Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> On Tue, 11 Feb 2025 at 06:01, Pavan Kotikalapudi

Re: ASF board report draft for February 2025

2025-02-10 Thread Mich Talebzadeh
can HTH Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> On Mon, 10 Feb 2025 at 23:05, Jungtaek Lim wrote: > Let's move the discussion to the other t

Re: [PROPOSAL] Unified PySpark-Pandas API to Bridge Data Engineering and ML Workflows

2025-02-10 Thread Mich Talebzadeh
Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> On Mon, 10 Feb 2025 at 12:39, José Müller wrote: > Hi Mitch, > > All you said is well understood, but I believe you

Re: [PROPOSAL] Unified PySpark-Pandas API to Bridge Data Engineering and ML Workflows

2025-02-10 Thread Mich Talebzadeh
cluster, Have you looked at Koalas which I believe is currently integrated as pyspark.pandas? HTH Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> On Mon, 10 Fe

Re: Extending Spark with a custom ExternalClusterManager

2025-02-07 Thread Mich Talebzadeh
Well, everything is possible. Please initiate a discussion on the matter of a proposal to "Create a pluggable cluster manager" and put it to the community. See some examples here https://lists.apache.org/list.html?dev@spark.apache.org HTH Dr Mich Talebzadeh, Architect | Data Science |

Re: Extending Spark with a custom ExternalClusterManager

2025-02-07 Thread Mich Talebzadeh
YARN). 2. Implementing *full pluggability for spark-submit* would require redesign and implementation to handle the diverse requirements of different cluster managers, which I think will be a major project in itself HTH Dr Mich Talebzadeh, Architect | Data Science | Financial

Re: Extending Spark with a custom ExternalClusterManager

2025-02-07 Thread Mich Talebzadeh
mode cluster \ --name sparkArmada then modify or copy Spark-Submit code to Spark-Submit-Armanda to handle this custom URL for now for test/debugging purposes HTH Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.

Re: Extending Spark with a custom ExternalClusterManager

2025-02-07 Thread Mich Talebzadeh
Kubernetes cluster *as a separate container. which provides better resource isolation and is more suitable for this type of cluster you are using Armada Anyway you can see how it progresses in debugging mode. HTH Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR

Re: Extending Spark with a custom ExternalClusterManager

2025-02-06 Thread Mich Talebzadeh
I am familiar with some of your work in G-Research HTH Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> On Thu, 6 Feb 2025 at 23:40, Dejan Pejchev wrote: >

Re: ASF board report draft for February 2025

2025-02-06 Thread Mich Talebzadeh
I don't see its relevance to the ASF board report. It is a minor technicality and probably tangential. It is not a show stopper and the Board does not need to worry about it. Best to take this discussion on its own thread Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | For

Re: [VOTE] Publish additional Spark distribution with Spark Connect enabled

2025-02-05 Thread Mich Talebzadeh
+1 Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> On Wed, 5 Feb 2025 at 08:26, Yuming Wang wrote: > +1 > > On Wed, Feb 5, 2025 at 4:15 PM

Re: [Spark SQL]: Are SQL User-Defined Functions on the Roadmap?

2025-01-31 Thread Mich Talebzadeh
Hi Frank, I think this would be for the Spark dev team. I have added to the email. HTH Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> On Fri, 31 Jan 2025

testing

2025-01-31 Thread Mich Talebzadeh
Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

Re: Proposal to improve data skew debugging

2025-01-29 Thread Mich Talebzadeh
Hi Rob, As a matter of interest, have you got an indication of a ballpark figure for percentage of queries that end up with skewed distribution? Thanks Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.

Re: Spark 4.0 vulnerable with hive-metastore-2.3.x.jar versions

2025-01-27 Thread Mich Talebzadeh
. 1 hduser hadoop 44704 Oct 21 03:29 hive-cli-2.3.9.jar -rw-r--r--. 1 hduser hadoop 183633 Oct 21 03:29 hive-beeline-2.3.9.jar I have all these jars there, but are you implying that the potential vulnerability will be from hive-metastore-2.3.9.jar alone or all of the Hive jars? Cheers Mich Talebza

Re: Spark 4.0 vulnerable with hive-metastore-2.3.x.jar versions

2025-01-27 Thread Mich Talebzadeh
To answer your question, I did not read this CVE, but I am responding solely from my previous experiences with vulnerabilities and the thread owner implications, having used Hive in conjunction with Spark for many years. Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic

Re: Spark 4.0 vulnerable with hive-metastore-2.3.x.jar versions

2025-01-27 Thread Mich Talebzadeh
store, as they can indirectly impact the security and stability of Spark applications among other things HTH Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> On

Re: How do I repackage org.spark-project.hive-exec-1.2.1.spark2

2025-01-25 Thread Mich Talebzadeh
mv hive-exec-1.2.1.jar hive-exec-1.2.1.spark2.jar HTH Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> On Sat, 25 Jan 2025 at 08:44, 王则杰 wrote: > rename

Re: Proposal to improve data skew debugging

2025-01-24 Thread Mich Talebzadeh
OK, so the Catalyst optimizer will use this method of inline key counting to provide the Spark optimizer with prior notification, so it identifies the hot keys? What is this inline key counting based on? Likely the Count-Min Sketch algorithm! HTH Mich Talebzadeh, Architect | Data Science | Financial Crime
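The post above only guesses that the key counting is based on Count-Min Sketch; the mechanism actually proposed is not shown in this snippet. For readers unfamiliar with the technique, here is a minimal, self-contained sketch in Python. The class name, the `width` and `depth` defaults, and the use of salted BLAKE2b hashes are all assumptions made for illustration, not anything taken from Spark or from the proposal.

```python
import hashlib


class CountMinSketch:
    """Approximate per-key counters in fixed memory.

    Estimates can overcount (because of hash collisions) but never
    undercount, which is why the structure suits hot-key detection.
    """

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        # One row of counters per hash function.
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, key, row):
        # Salting the hash with the row number gives one hash function per row.
        digest = hashlib.blake2b(
            key.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key, count=1):
        for row in range(self.depth):
            self.table[row][self._index(key, row)] += count

    def estimate(self, key):
        # The smallest counter across rows is the least inflated by collisions.
        return min(
            self.table[row][self._index(key, row)] for row in range(self.depth)
        )


# Hypothetical join keys: one hot key and many rare ones.
sketch = CountMinSketch()
for _ in range(1000):
    sketch.add("hot_customer")
for i in range(100):
    sketch.add(f"customer_{i}")
```

A key whose estimate exceeds some chosen threshold could then be flagged as skewed; where to set that threshold is a separate tuning question the thread does not settle.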

Re: [DISCUSS] Ongoing projects for Spark 4.0

2025-01-22 Thread Mich Talebzadeh
e the challenges in a nutshell that you referred to? HTH, Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> On Wed, 22 Jan 2025 at 20:47, David Milicevic wrote

Re: How do I repackage org.spark-project.hive-exec-1.2.1.spark2

2025-01-22 Thread Mich Talebzadeh
Sorry I forgot to mention once you extract the JAR file, copy or symlink it to $SPARK_HOME/jars directory HTH Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

Re: FYI: A Hallucination about Spark Connect Stability in Spark 4

2025-01-22 Thread Mich Talebzadeh
A broken CI is really an operational aspect, albeit in this case it was quite temporary. We should put that aside and move on as 1) the product is sound and 2) Spark Connect is strategic for the future of Spark. HTH Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR

Re: FYI: A Hallucination about Spark Connect Stability in Spark 4

2025-01-21 Thread Mich Talebzadeh
Spark's internals as opposed to RDDs. Moreover, *maintaining backward compatibility* for the existing *RDD-based applications and libraries* is crucial during this transition window, so the timeframe is another factor for consideration. HTH Mich Talebzadeh, Architect | Data Science | Fina

Re: How do I repackage org.spark-project.hive-exec-1.2.1.spark2

2025-01-21 Thread Mich Talebzadeh
mp/apache-hive-1.2.1-src/ql/target/" HTH Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> On Tue, 21 Jan 2025 at 02:42, 王则杰 wrote: > I need to mo

Re: [DISCUSS] Support spark.ml on Spark Connect

2025-01-21 Thread Mich Talebzadeh
Given our recent discussion on using spark connect as a stable API, this will be another positive step. HTH Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

Rethinking Spark, beyond ETL

2025-01-19 Thread Mich Talebzadeh
this evolution of Spark. HTH, Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

Re: Re: Increasing Shading & Relocating for 4.0

2025-01-19 Thread Mich Talebzadeh
Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> *Disclaimer:* The information provided is correct to the best of my knowledge but of course cannot be guaranteed .

Re: Increasing Shading & Relocating for 4.0

2025-01-18 Thread Mich Talebzadeh
>>>>>> At a high level, some notable shaded prefixes included org.json, >>>>> com.google.common / protobuf, org.apache.commons, and org.antlr. Key >>>>> dependencies *not* shaded were avro, jackson, datanucleus, logging / >>>>> JRE / sc

Re: PR review

2024-12-31 Thread Mich Talebzadeh
ewing past discussions and votes on the dev list will be very helpful and informative. HTH Architect | Data Science | Financial Crime | GDPR & Compliance Specialist PhD Imperial College London London, United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-tal

Re: [DISCUSS] Pythonic approach of setting Spark SQL configurations

2024-12-29 Thread Mich Talebzadeh
On your point ...I believe there are better ways to improve the pythonic surface of PySpark. .. Can you please elaborate? HTH Mich Talebzadeh, Architect | Data Science | Financial Crime | GDPR & Compliance Specialist PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperi

Re: [DISCUSS] Pythonic approach of setting Spark SQL configurations

2024-12-27 Thread Mich Talebzadeh
ations, aligning with Python's emphasis on clarity and expressiveness (as the above link). HTH Mich Talebzadeh, Architect | Data Science | Financial Crime | GDPR & Compliance Specialist PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College London <https

Re: Increasing Shading & Relocating for 4.0

2024-12-07 Thread Mich Talebzadeh
shading will introduce more debugging and testing as packages will be renamed impacting flexibility. Case in point, things like unit and integration tests may need adjustments to account for the renamed packages. HTH Mich Talebzadeh, Architect | Data Science | Financial Crime | GDPR & Compli

Re: [DISCUSS] Deprecate GraphX OR Find new maintainers interested in GraphX OR leave it as is?

2024-11-14 Thread Mich Talebzadeh
+ 1 Mich Talebzadeh, Architect | Data Engineer | Data Science | Financial Crime PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College London <https://en.wikipedia.org/wiki/Imperial_College_London> London, United Kingdom view my Linkedin profile <https://w

Re: [DISCUSS] Deprecate GraphX OR Find new maintainers interested in GraphX OR leave it as is?

2024-11-13 Thread Mich Talebzadeh
Hm. Since it sounds like a plan, Russell, why don't you go ahead and create a SPIP for it? Then this discussion takes a formal approach and is documented. Otherwise we are just flogging a dead horse, so to speak. HTH Mich Talebzadeh, Architect | Data Engineer | Data Science | Financial Crime PhD <ht

Re: [DISCUSS] Support spark.ml on Spark Connect

2024-11-13 Thread Mich Talebzadeh
OK I added a comment to PR HTH, Mich Talebzadeh, Architect | Data Engineer | Data Science | Financial Crime PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College London <https://en.wikipedia.org/wiki/Imperial_College_London> London, United Kingdom view

Re: [DISCUSS] Deprecate GraphX OR Find new maintainers interested in GraphX OR leave it as is?

2024-11-13 Thread Mich Talebzadeh
and actively contribute. If no substantial engagement occurs within this timeframe, we may need to consider closing the project. Mich Talebzadeh, Architect | Data Engineer | Data Science | Financial Crime PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College London

Re: [ANNOUNCE] Apache Spark 3.4.4 released

2024-10-27 Thread Mich Talebzadeh
Upgraded from Spark 3.4.0 to 3.4.4 Looks good with the following versions I have tested - openjdk 11.0.8 - hadoop-3.1.0 - hive-3.1.1 - hbase-1.2.6 - GoogleBigQuery with spark-3.4-bigquery-0.41.0.jar HTH Mich Talebzadeh, Architect | Data Engineer | Data Science | Financial Crime

Re: [DISCUSS] Support spark.ml on Spark Connect

2024-10-15 Thread Mich Talebzadeh
+1 It will be a desirable feature Mich Talebzadeh, Architect | Data Engineer | Data Science | Financial Crime PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College London <https://en.wikipedia.org/wiki/Imperial_College_London> London, United Kingdom view

Re: [Question] Why driver doesn't shutdown executors gracefully on k8s?

2024-10-10 Thread Mich Talebzadeh
Hi Jay As far as I am aware in Spark 2.4.4, there is no feature to enable executor decommissioning with graceful shutdown, nor is there a way to specify a timeout for forcefully killing executors. These were introduced in Spark 3.0. HTH Mich Talebzadeh, Architect | Data Engineer | Data

Re: [Question] Why driver doesn't shutdown executors gracefully on k8s?

2024-10-10 Thread Mich Talebzadeh
to be clear are you referring to these spark.executor.decommission.enabled=true spark.executor.decommission.gracefulShutdown=true thanks Mich Talebzadeh, Architect | Data Engineer | Data Science | Financial Crime PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College

Re: [Question] Why driver doesn't shutdown executors gracefully on k8s?

2024-10-09 Thread Mich Talebzadeh
nfig("spark.executor.decommission.forceKillTimeout", "100s") \ .getOrCreate() Mich Talebzadeh, Architect | Data Engineer | Data Science | Financial Crime PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College London <https://en.wikipedia.org/wiki/Imper

Re: [Question] Why driver doesn't shutdown executors gracefully on k8s?

2024-10-09 Thread Mich Talebzadeh
}") The output Spark version: 3.4.0 spark.executor.decommission.enabled: true spark.executor.decommission.forceKillTimeout: 100s By creating a simple Spark application and verifying the configuration values, I trust it is shown that these two parameters are valid and are appl

Re: [Question] Why driver doesn't shutdown executors gracefully on k8s?

2024-10-09 Thread Mich Talebzadeh
Do you have a better recommendation? Or trying to waste time as usual. It is far easier to throw than catch. Do your homework and stop throwing spanners at work. Mich Talebzadeh, Architect | Data Engineer | Data Science | Financial Crime PhD <https://en.wikipedia.org/wiki/Doctor_of_Philoso

Re: [Question] Why driver doesn't shutdown executors gracefully on k8s?

2024-10-09 Thread Mich Talebzadeh
Before responding, what configuration parameters are you using to make this work? spark.executor.decommission.enabled=true spark.executor.decommission.gracefulShutdown=true spark.executor.decommission.forceKillTimeout=100s HTH Mich Talebzadeh, Architect | Data Engineer | Data Science

Re: [DISCUSS] Deprecate GraphX OR Find new maintainers interested in GraphX OR leave it as is?

2024-10-05 Thread Mich Talebzadeh
business and technical realities. HTH Mich Talebzadeh, Architect | Data Engineer | Data Science | Financial Crime PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College London <https://en.wikipedia.org/wiki/Imperial_College_London> London, United Kingdom view

Re: [DISCUSS] Deprecate GraphX OR Find new maintainers interested in GraphX OR leave it as is?

2024-10-04 Thread Mich Talebzadeh
graph processing in Spark. I saw someone created some documents HTH Mich Talebzadeh, *Disclaimer:* The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, quote "one test result is worth one-tho

Re: [VOTE] Single-pass Analyzer for Catalyst

2024-10-03 Thread Mich Talebzadeh
+1 on the assumption that we should phase this release on an incremental basis. Probably will take us to end of release 5. HTH Mich Talebzadeh, Architect | Data Engineer | Data Science | Financial Crime PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College London

Re: [VOTE] Single-pass Analyzer for Catalyst

2024-10-03 Thread Mich Talebzadeh
ffs of complexity, resource availability and long-term gains. HTH Mich Talebzadeh, Architect | Data Engineer | Data Science | Financial Crime PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College London <https://en.wikipedia.org/wiki/Imperial_College_London> London, United

Re: [VOTE] Officialy Deprecate GraphX in Spark 4

2024-09-30 Thread Mich Talebzadeh
+1 Mich Talebzadeh, Architect | Data Engineer | Data Science | Financial Crime PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College London <https://en.wikipedia.org/wiki/Imperial_College_London> London, United Kingdom view my Linkedin profile <https://w

Re: [DISCUSS] Deprecate GraphX OR Find new maintainers interested in GraphX OR leave it as is?

2024-09-30 Thread Mich Talebzadeh
should prioritize the health of the Spark ecosystem and ensure that we are investing resources into actively maintained components. HTH Mich Talebzadeh Architect | Data Engineer | Data Science | Financial Crime PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College London

Re: [VOTE] Document and Feature Preview via GitHub Pages

2024-09-11 Thread Mich Talebzadeh
+1 Mich Talebzadeh, Architect | Data Engineer | Data Science | Financial Crime PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College London <https://en.wikipedia.org/wiki/Imperial_College_London> London, United Kingdom view my Linkedin profile <https://w

Re: Question about Releases and EOL

2024-08-29 Thread Mich Talebzadeh
ement declaring Spark 2.4.0 as the final minor release, the fact that 2.4.8 is still being maintained suggests it might be an LTS release. This is likely due to its continued usage? HTH Mich Talebzadeh, Architect | Data Engineer | Data Science | Financial Crime PhD <https://en.

Re: Please review (ValidateExternalType should return child in error)

2024-08-25 Thread Mich Talebzadeh
ards, > Mark Andreev > > > On Wed, 21 Aug 2024 at 23:08, Mich Talebzadeh > wrote: > >> Hi Mark, >> >> You have already done that and have made the request for review. >> >> +1 for me >> >> Mich Talebzadeh, >> >> Architect |

Re: Please review (ValidateExternalType should return child in error)

2024-08-21 Thread Mich Talebzadeh
Hi Mark, You have already done that and have made the request for review. +1 for me Mich Talebzadeh, Architect | Data Engineer | Data Science | Financial Crime PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College London <https://en.wikipedia

Re: Please review (ValidateExternalType should return child in error)

2024-08-20 Thread Mich Talebzadeh
ted}." By providing this additional context, developers can more efficiently pinpoint and resolve schema mismatches. HTH Mich Talebzadeh, Architect | Data Engineer | Data Science | Financial Crime PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College L

Re: [VOTE] Archive Spark Documentations in Apache Archives

2024-08-12 Thread Mich Talebzadeh
k -f convert_sum.awk size.txt 11.88 GB Mich Talebzadeh, Architect | Data Engineer | Data Science | Financial Crime PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College London <https://en.wikipedia.org/wiki/Imperial_College_London> London, United Kingdom

Re: [VOTE] Archive Spark Documentations in Apache Archives

2024-08-12 Thread Mich Talebzadeh
Hi Kent, Can you if possible provide a heuristic estimate of space reduction your proposal is going to achieve? Thanks Mich Talebzadeh, Architect | Data Engineer | Data Science | Financial Crime PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College London

Re: [VOTE] Archive Spark Documentations in Apache Archives

2024-08-12 Thread Mich Talebzadeh
Hi Kent, Can you if possible please provide a heuristic estimate of storage reduction that will be achieved through this approach? Thanks Mich Talebzadeh, Architect | Data Engineer | Data Science | Financial Crime PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial C

Re: [VOTE] Archive Spark Documentations in Apache Archives

2024-08-12 Thread Mich Talebzadeh
achieved through this approach. Overall, the proposal offers a viable solution for managing Spark documentation while reducing storage concerns. However, addressing the potential complexity of managing older documentation versions is crucial. +1 for me Mich Talebzadeh, Architect | Data Engineer | Data

Re: [VOTE] Using Github Issues for Spark-Connect-Go _only_ issues.

2024-08-12 Thread Mich Talebzadeh
+1 for me Mich Talebzadeh, Architect | Data Engineer | Data Science | Financial Crime PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College London <https://en.wikipedia.org/wiki/Imperial_College_London> London, United Kingdom view my Linkedin pr
