Re: Introducing Apache Gluten(incubating), a middle layer to offload Spark to native engine

2024-04-10 Thread Holden Karau
On Wed, Apr 10, 2024 at 9:54 PM Binwei Yang wrote:
> Gluten currently already supports the Velox and ClickHouse backends. A DataFusion backend has also been proposed, but no one has worked on it yet.
>
> Gluten isn't a POC. It's under active development, and some companies already use it.

Re: Introducing Apache Gluten(incubating), a middle layer to offload Spark to native engine

2024-04-10 Thread Binwei Yang
Gluten currently already supports the Velox and ClickHouse backends. A DataFusion backend has also been proposed, but no one has worked on it yet. Gluten isn't a POC; it's under active development, and some companies already use it.

Re: SPIP: Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on Various Native Engines

2024-04-10 Thread Binwei Yang
The Gluten Java part is pretty stable now; development now happens mostly in the C++ code, the Velox code, and the ClickHouse backend. The SPIP doesn't plan to introduce the whole Gluten stack into Spark, but rather a way to serialize the Spark physical plan so it can be sent to a native backend, over JNI or gRPC.
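The proposal above amounts to a thin boundary: Spark serializes the physical plan to bytes, and a transport (JNI or gRPC) hands those bytes to the native engine. A minimal illustrative sketch — the names `NativeBackend`, `SerializedPlan`, and `StubBackend` are hypothetical placeholders, not the SPIP's actual API:

```java
// Hypothetical sketch of the plan-offload boundary; all names are
// illustrative placeholders, not the SPIP's actual API.
public class PlanOffloadSketch {

    /** Stand-in for a serialized physical plan (the thread suggests a
     *  Substrait-style byte representation). */
    static final class SerializedPlan {
        final byte[] bytes;
        SerializedPlan(byte[] bytes) { this.bytes = bytes; }
    }

    /** The backend boundary: a JNI transport and a gRPC transport would
     *  both implement this same interface. */
    interface NativeBackend {
        String execute(SerializedPlan plan);
    }

    /** A stub "native" backend that just acknowledges the plan. */
    static final class StubBackend implements NativeBackend {
        public String execute(SerializedPlan plan) {
            return "executed plan of " + plan.bytes.length + " bytes";
        }
    }

    /** In a real implementation this would traverse the SparkPlan tree;
     *  here we just encode a plan description as UTF-8 bytes. */
    static SerializedPlan serialize(String planDescription) {
        return new SerializedPlan(
            planDescription.getBytes(java.nio.charset.StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        NativeBackend backend = new StubBackend();
        System.out.println(backend.execute(serialize("Scan -> Filter -> Project")));
    }
}
```

The point of the sketch is that Spark would only need to know the serialization step and the interface; which engine sits behind it (Velox, ClickHouse, DataFusion, ...) becomes a backend detail.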

Re: SPIP: Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on Various Native Engines

2024-04-10 Thread Binwei Yang
We (the Gluten and Arrow folks) actually did plan to put the plan conversion in the substrait-java repo. But to me it makes more sense to put it in the Spark repo. Native library and accelerator support will become more and more important in the future.

Re: Introducing Apache Gluten(incubating), a middle layer to offload Spark to native engine

2024-04-10 Thread Dongjoon Hyun
I'm interested in your claim. Could you elaborate or provide some evidence for your claim, *a door for all native libraries*, Binwei? For example, is there any POC for that claim? Or maybe I missed something in that SPIP?

Dongjoon.

Re: Introducing Apache Gluten(incubating), a middle layer to offload Spark to native engine

2024-04-10 Thread Binwei Yang
The SPIP is not for the current Gluten; it opens a door for supporting all native libraries and accelerators.

Re: External Spark shuffle service for k8s

2024-04-10 Thread Arun Ravi
Hi Everyone, some time back I explored IBM's and AWS's S3 shuffle plugins. I have also explored AWS FSx for Lustre in a few of my production jobs, which have ~20TB of shuffle operations with 200-300 executors. What I have observed is that S3 and FSx behaviour was fine during the write phase; however, I ...

Re: Introducing Apache Gluten(incubating), a middle layer to offload Spark to native engine

2024-04-10 Thread Weiting Chen
Yes, the 1st Apache release (v1.2.0) of Gluten will be in September. For Spark version support, Gluten v1.1.1 currently supports Spark 3.2 and 3.3. We are planning to support Spark 3.4 and 3.5 in Gluten v1.2.0. Spark 4.0 support for Gluten depends on the release schedule in the Spark community.

Re: SPIP: Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on Various Native Engines

2024-04-10 Thread L. C. Hsieh
+1 for Wenchen's point. I don't see a strong reason to pull these transformations into Spark instead of keeping them in third-party packages/projects.

Re: Versioning of Spark Operator

2024-04-10 Thread L. C. Hsieh
This approach makes sense to me. If the Spark K8s operator were aligned with Spark versions, it would use 4.0.0 now, for example. Because these JIRA tickets are not actually targeting Spark 4.0.0, that will cause confusion and more questions, like: when we are going to cut a Spark release, should we include Spark ...

Re: Versioning of Spark Operator

2024-04-10 Thread bo yang
Cool, looks like we have two options here.

Option 1: Spark Operator and Connect Go Client versioning independent of Spark, e.g. starting with 0.1.0.
Pros: they can evolve versions independently.
Cons: people will need an extra step to decide the version when using the Spark Operator and Connect Go Client ...
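The "extra step" in Option 1 amounts to a compatibility lookup that users (or the docs) would have to consult. A hypothetical illustration — the version pairs below are invented for the example, not a real support matrix:

```java
import java.util.List;
import java.util.Map;

// Hypothetical Option-1 world: the operator versions independently of Spark,
// so a small compatibility matrix tells users which operator release to pick.
// All version pairs below are invented for illustration only.
public class OperatorCompat {
    static final Map<String, List<String>> SUPPORTED_SPARK = Map.of(
            "0.1.0", List.of("3.4", "3.5"),
            "0.2.0", List.of("3.5", "4.0"));

    /** Returns true if the given operator release supports the given Spark line. */
    static boolean supports(String operatorVersion, String sparkVersion) {
        return SUPPORTED_SPARK.getOrDefault(operatorVersion, List.of())
                              .contains(sparkVersion);
    }

    public static void main(String[] args) {
        // The extra lookup step a user would perform before installing.
        System.out.println(supports("0.1.0", "3.5"));
    }
}
```

Under Option 2 (operator versions tracking Spark versions) this table disappears, at the cost of coupling the release cadences.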

Re: SPIP: Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on Various Native Engines

2024-04-10 Thread Mich Talebzadeh
I read the SPIP. I have a number of points, if I may:

- Maturity of Gluten: as the excerpt mentions, Gluten is an incubating project, and its feature set and stability IMO are still under development. Integrating a non-core component could introduce risks if it is not fully mature.
- Complexity: integrating Gluten ...

Re: SPIP: Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on Various Native Engines

2024-04-10 Thread Wenchen Fan
It's good to reduce duplication between different native accelerators of Spark, and AFAIK there is already a project trying to solve it: https://substrait.io/. I'm not sure why we need to do this inside Spark, instead of doing the unification for a wider scope (for all engines, not only Spark).

Re: Versioning of Spark Operator

2024-04-10 Thread Dongjoon Hyun
Ya, that would work. Inevitably, I looked at the Apache Flink K8s Operator's JIRA and GitHub repo. It looks reasonable to me. Although they share the same JIRA, they chose different patterns per place:

1. In the POM file and Maven artifact, an independent version number: 1.8.0.
2. The tag is also based on th ...