Do we have a Java Client for Spark Connect which is something like PySpark?

From: Mich Talebzadeh <mich.talebza...@gmail.com>
Sent: 22 January 2025 15:05
To: Hyukjin Kwon <gurwls...@apache.org>
Cc: Martin Grund <mar...@databricks.com.invalid>; Holden Karau 
<holden.ka...@gmail.com>; Dongjoon Hyun <dongj...@apache.org>; dev 
<dev@spark.apache.org>
Subject: [EXTERNAL] Re: FYI: A Hallucination about Spark Connect Stability in 
Spark 4

A broken CI is really an operational matter, and in this case it was quite 
temporary. We should put that aside and move on, as 1) the product is sound and 
2) Spark Connect is strategic for the future of Spark.

HTH

Mich Talebzadeh,
Architect | Data Science | Financial Crime | Forensic Analysis | GDPR

 
LinkedIn profile: https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/




On Wed, 22 Jan 2025 at 09:26, Hyukjin Kwon <gurwls...@apache.org> wrote:
While it might be a bit too much to talk about its stability, it is true that 
the CI dedicated for Spark Connect compat was broken there for a couple of 
weeks, and the errors from the tests look confusing.
I agree that tests and builds could be one of the easiest measurements to tell 
the state of a project, and we should probably make sure those builds pass 
properly.

I don't particularly agree with either side (either it is unstable as CI is 
broken or, it is perfectly stable with CI broken).
The truth is that the scheduled build for Spark Connect has been broken for a 
couple of weeks, which is bad. We should fix it to keep the project healthy.

On Wed, 22 Jan 2025 at 18:13, Martin Grund <mar...@databricks.com.invalid> wrote:
I'm very confused about how we use stability in CI as a measure to discuss the 
strategy of a particular feature, particularly because we call these 
"hallucinations."

From real-world experience, I can say that we have thousands of clients using 
Spark Connect across many different versions in our infrastructure without any 
issues, and this has been the case since Spark 3.4 (e.g. Spark 3.4 clients 
talking to Spark 3.5 and above).

Second, from maintaining the Golang client: the Spark 3.5-based Golang client 
works nicely against a Spark 4 preview build without any issue.

If you look at this then we have multiple releases already where we 
successfully retained parity without any issues.

From this perspective, I think it's perfectly fine to declare Spark Connect as 
stable.

On Tue, Jan 21, 2025 at 11:40 PM Holden Karau <holden.ka...@gmail.com> wrote:
Interesting. So given that one of the features of Spark Connect should be 
simpler migrations, we should (in my mind) only declare it stable once we’ve 
gone through two releases where the previous client + its code can talk to the 
new server.

Twitter: https://twitter.com/holdenkarau
Fight Health Insurance: https://www.fighthealthinsurance.com/
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
Pronouns: she/her


On Tue, Jan 21, 2025 at 12:31 PM Dongjoon Hyun <dongj...@apache.org> wrote:
It seems that there is misinformation about the stability of Spark Connect in 
Spark 4. I would like to reduce the gap in our dev mailing list.

Frequently, some people claim `Spark Connect` is stable because it uses 
Protobuf. Yes, we standardize the interface layer. However, may I ask if it 
implies its implementation's stability?

Since Apache Spark is an open source community, you can see the stability of 
implementation in our public CI. In our CI, the PySpark Connect client has been 
technically broken most of the time.

1. https://github.com/apache/spark/actions/workflows/build_python_connect.yml
(Spark Connect Python-only in master)

In addition, the Spark 3.5 client seems to have further difficulty talking to 
a Spark 4 server.

2. https://github.com/apache/spark/actions/workflows/build_python_connect35.yml
(Spark Connect Python-only:master-server, 35-client)

3. What about the stability and feature parity across different languages? Do 
they work well with Apache Spark 4? I'm wondering if there is any evidence the 
Apache Spark community can use to assess this.

Given (1), (2), and (3), how can we make sure that `Spark Connect` is stable or 
ready in Spark 4? From my perspective, this is still actively under development 
with an open end.

The bottom line is `Spark Connect` needs more community love in order to be 
claimed as Stable in Apache Spark 4. I'm looking forward to seeing the healthy 
Spark Connect CI in Spark 4. Until then, let's clarify what is stable in `Spark 
Connect` and what is not yet.

Best Regards,
Dongjoon.

PS.
This is a separate thread from the previous flakiness issues.
https://lists.apache.org/thread/r5dzdr3w4ly0dr99k24mqvld06r4mzmq
([FYI] Known `Spark Connect` Test Suite Flakiness)
