Hi all,
I just created a JIRA ticket and a work in progress PR.
Here is the link to the JIRA ticket -
https://issues.apache.org/jira/browse/SPARK-52041
Here is the link to the GitHub PR -
https://github.com/apache/spark/pull/50770
I kindly ask for feedback.
Kind regards
On Wed, Feb 19, 2025 at
Hi devs,
Let me pull some spark-submit developers into this discussion.
@dongjoon-hyun @HyukjinKwon @cloud-fan
What are your thoughts on making spark-submit fully and generically
support ExternalClusterManager implementations?
The current situation is that the only way to submit a Spark job
Yes, if this becomes a need that surfaces time and again, then it's worthwhile to start a broader discussion in the form of a high-level proposal, which could trigger a favorable discussion leading to next steps.
Cheers,
Jules
Sent from my iPhone. Pardon the dumb thumb typos :)
On Feb 7, 2025, at 8:00 AM
Well, everything is possible. Please initiate a discussion on the matter of
a proposal to "Create a pluggable cluster manager" and put it to the
community.
See some examples here
https://lists.apache.org/list.html?dev@spark.apache.org
HTH
Dr Mich Talebzadeh,
Architect | Data Science |
Agreed. If the goal is to make Spark truly pluggable, the spark-submit tool
itself should be more flexible in handling different cluster managers and
their specific requirements.
1. Back in the day, Spark's initial development focused on a limited
set of cluster managers (Standalone,
This External Cluster Manager is an amazing concept and I really like the
separation.
Would it be possible to include a broader group and discuss an approach on
how to make Spark more pluggable? It is a bit far-fetched, but we would be
very much interested in working on this if this resonates well
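For anyone following along, Spark discovers ExternalClusterManager implementations through Java's ServiceLoader: a jar on the classpath carries a provider-configuration file named after the `org.apache.spark.scheduler.ExternalClusterManager` interface. Here is a minimal sketch of generating that file; the implementation class name is hypothetical:

```python
# Sketch: create the ServiceLoader provider-configuration file through which
# Spark discovers an ExternalClusterManager implementation on the classpath.
# The implementation class name below is hypothetical.
import tempfile
from pathlib import Path

INTERFACE = "org.apache.spark.scheduler.ExternalClusterManager"
IMPL = "io.example.ArmadaClusterManager"  # hypothetical implementation class

root = Path(tempfile.mkdtemp())
services = root / "src" / "main" / "resources" / "META-INF" / "services"
services.mkdir(parents=True)

# The file is named after the interface; each line lists one implementation.
(services / INTERFACE).write_text(IMPL + "\n")
print((services / INTERFACE).read_text(), end="")  # io.example.ArmadaClusterManager
```

Once that resource is packaged into a jar on the classpath, the manager's canCreate(masterURL) decides whether it handles a given --master URL.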
To me, this seems like a gap in the "pluggable cluster manager"
implementation.
What is the value of making cluster managers pluggable, if spark-submit
doesn't accept jobs on those cluster managers?
It seems to me, for pluggable cluster managers to work, you would want some
parts
Well, you can try using an environment variable and creating a custom script
that modifies the --master URL before invoking spark-submit. This script
could replace "k8s://" with another identifier of your choice (e.g.
"k8s-armada://") and then modify the SparkSubmit code to handle th
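A minimal sketch of that wrapper idea in Python; the "k8s-armada://" scheme and the argument values are illustrative, and the actual spark-submit exec is left commented out:

```python
# Hypothetical wrapper around spark-submit: rewrite a custom master-URL scheme
# back to "k8s://" before handing the arguments on. "k8s-armada://" is made up.
def rewrite_master(args):
    """Return a copy of the argument list with the custom scheme mapped back."""
    custom, stock = "k8s-armada://", "k8s://"
    return [stock + a[len(custom):] if a.startswith(custom) else a for a in args]

args = rewrite_master(["--master", "k8s-armada://https://1.2.3.4:6443",
                       "--deploy-mode", "cluster"])
print(args[1])  # k8s://https://1.2.3.4:6443
# import os; os.execvp("spark-submit", ["spark-submit"] + args)  # then exec the real tool
```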
scenario would be to edit the SparkSubmit,
which we are trying to avoid because we don't want to touch the Spark codebase.
Do you have an idea how to run in cluster deploy mode and load an external
cluster manager?
Could it be possible to submit a PR for a change in SparkSubmit?
Looking forward to
Kubernetes cluster
*as a separate container.
which provides better resource isolation and is more suitable for the type
of cluster you are using (Armada).
Anyway, you can see how it progresses in debugging mode.
HTH
Dr Mich Talebzadeh,
Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
I got it to work by running it in client mode and using the `local://*`
prefix. My external cluster manager gets injected just fine.
On Fri, Feb 7, 2025 at 12:38 AM Dejan Pejchev wrote:
Hello Spark community!
My name is Dejan Pejchev, and I am a Software Engineer working at
G-Research, and I am a maintainer of our Kubernetes multi-cluster batch
scheduler called Armada.
We are trying to build an integration with Spark, where we would like to
use the spark-submit with a master
Do we have a Java Client for Spark Connect which is something like PySpark?
From: Mich Talebzadeh
Sent: 22 January 2025 15:05
To: Hyukjin Kwon
Cc: Martin Grund; Holden Karau; Dongjoon Hyun; dev
Subject: [EXTERNAL] Re: FYI: A Hallucination about Spark Connect Stability in
Spark 4
view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
On Wed, 22 Jan 2025 at 09:26, Hyukjin Kwon wrote:
While it might be a bit too much to talk about its stability, it is true
that the CI dedicated for Spark Connect compat was broken there for a
couple of weeks, and the errors from the tests look confusing.
I agree that tests and builds could be one of the easiest measurements to
tell the state of
I'm very confused about how we use stability in CI as a measure to discuss
the strategy of a particular feature, particularly because we call these
"hallucinations."
From real-world experience, I can say that we have thousands of clients
using Spark Connect across many differe
Thanks for update and looking into it.
Excuse the thumb typos
On Tue, 21 Jan 2025 at 4:09 PM, Hyukjin Kwon wrote:
I'm passionate about and have lots of experience fixing OOMs. Contact me if
you need some help.
On Wed, Jan 22, 2025 at 1:10, Hyukjin Kwon wrote:
Thank you, Hyukjin!
Dongjoon
On Tue, Jan 21, 2025 at 16:10 Hyukjin Kwon wrote:
Just a quick note on that: the major reasons are (1) OOMs, for which we
should figure out and fix the CI environment, and (2) a structured streaming
test failure for a feature that is still in development.
I made an umbrella JIRA (https://issues.apache.org/jira/browse/SPARK-50907),
and I will work there. It should be easier to follow.
Let me take a look. It shouldn't be a major issue.
On Wed, 22 Jan 2025 at 08:31, Mich Talebzadeh
wrote:
As discussed on a thread over the weekend, we agreed among us, including
Matei, on a shift towards more stable and version-independent APIs.
Spark Connect IMO is a key enabler of this shift, allowing users and
developers to build applications and libraries that are more resilient to
changes in
Spark 4? From my perspective, this is still actively
under development with an open end.
The bottom line is `Spark Connect` needs more community love in order to
be claimed as Stable in Apache Spark 4. I'm looking forward to seeing a
healthy Spark Connect CI in Spark 4. Until then, let's clarify what is
stable in `Spark Connect` and what is not yet.
Best Regards,
Dongjoon.
PS.
This is a separate thread from the previous flakiness issues.
https://lists.apache.org/thread/r5dzdr3w4ly0dr99k24mqvld06r4mzmq
([FYI] Known `Spark Connect` Test Suite Flakiness)
Thanks for pointing it out! Based on the discussion, I’ve created a PR:
https://github.com/apache/spark/pull/49534.
Let me know what you think!
On Thu, Jan 16, 2025 at 2:25 PM Xiao Li
wrote:
> Thank you for pointing it out! Let’s update the template to exclude
> documentation changes fr
/apache/spark/pull/47756, the behavior changes
were not documented. It’s crucial for all committers to carefully review
the PR titles and descriptions to ensure they are accurate and complete
before merging.
How can we bring more attention to this issue and ensure it becomes a
consistent practice
> The original intent may well have been about behavior changes only, but
> that's not reflected in the current text of the PR template.
>
> On Jan 16, 2025, at 2:32 PM, Dongjoon Hyun wrote:
The original intent is technically a user-facing *behavior* change,
which is the same as in the Apache Spark migration guide.
If so, does it make sense to you?
Probably; since the template was kept short to be concise, it could be
interpreted in more ways than we thought.
Dongjoon.
On Thu, Jan 16, 2025
suggests that the author of this policy did mean that yes, even a
typo fix in a user-facing documentation page merits a “yes” response to this
question.
It’s strict but it’s also clear and unambiguous. No one has to think about
whether their user-facing change is big enough to merit a yes.
IMO t
I understand your concern, Nicholas. However, isn't it too strict?
For the above example, adding a new HTML page is a user-facing change.
https://github.com/apache/spark/pull/48852 (This is a new doc)
[SPARK-50309][DOCS] Document SQL Pipe Syntax
https://github.com/apache/spark/pull/49098
This is not a big deal at all, but I figure it’s worth bringing up briefly
because the pull request template does emphasize
<https://github.com/apache/spark/blob/ffb31565e5af6f9ab2f8f7b500fbd595c8bb5a58/.github/PULL_REQUEST_TEMPLATE#L34-L36>:
> ### Does this PR introduce _any_ us
+1 on this too
When I implemented "group by all", I introduced at least two subtle bugs
that many reviewers weren't able to catch; those two bugs would not have
been possible to introduce if we had a single-pass analyzer. A single pass
can make the whole framework more robust.
This sounds like a good idea!
The Analyzer is complex. The changes in the new Analyzer should not affect
the existing one. The users could add the QO rules and rely on the existing
structures and patterns of the logical plan trees generated by the current
one. The new Analyzer needs to generate
Congratulations !
From: Matei Zaharia
Date: Wednesday, August 14, 2024, 06:03
To: Wenchen Fan
Cc: Ruifeng Zheng; Martin Grund; Peter Toth; dev
Subject: [External] Re: Welcoming a new PMC member
Congrats and welcome Kent!
On Aug 13, 2024, at 7:27 AM, Wenchen Fan wrote:
Congratulations!
On Tue, Aug 13, 2024
2: Support main
datasources, ...). Running both analyzers in mixed mode may lead to
unexpected logical plan problems, because that would introduce a completely
different chain of transformations
On Wed, Aug 14, 2024 at 3:58 PM Herman van Hovell
wrote:
+1(000) on this!
This should massively reduce allocations done in the analyzer, and it is
much more efficient. I also can't count the times that I had to increase
the number of iterations. This sounds like a no-brainer to me.
I do have two questions:
- How do we ensure that we
Congratulations Kent !

Regards,
Mridul

On Mon, Aug 12, 2024 at 8:46 PM Dongjoon Hyun wrote:
> Congratulations, Kent.
>
> Dongjoon.
Congrats, Kent!
On Tue, Aug 13, 2024 at 9:06 AM Dongjoon Hyun
wrote:
Congrats, Kent!
On Tue, Aug 13, 2024 at 10:06 AM Dongjoon Hyun
wrote:
Congratulations!
Yuming Wang wrote on Tue, Aug 13, 2024 at 08:28:
Congratulations!
On Mon, Aug 12, 2024 at 5:20 PM Hyukjin Kwon wrote:
Congratulations, Kent.
Dongjoon.
On Mon, Aug 12, 2024 at 5:22 PM Xiao Li wrote:
Congratulations !
Hyukjin Kwon wrote on Mon, Aug 12, 2024 at 17:20:
Hi all,
The Spark PMC recently voted to add a new PMC member, Kent Yao. Join me in
welcoming him to his new role!
unobvious, so it’s hard to introduce changes without having the full
knowledge. By modifying one rule, the whole chain of transformations can
change in an unobvious way. Since we can hit the maximum number of
iterations, there’s no guarantee that the plan is going to be resolved. And
from a
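A toy model of the fixed-point rule driving described here (not Spark's actual Analyzer code): rules are re-applied until the plan stops changing, and if the iteration cap is hit first there is no guarantee the plan was resolved:

```python
# Toy fixed-point rule executor illustrating the iteration-cap hazard.
# "Plans" are plain integers here; in Spark they would be logical plan trees.
def run_to_fixed_point(plan, rules, max_iterations=100):
    for _ in range(max_iterations):
        new_plan = plan
        for rule in rules:
            new_plan = rule(new_plan)
        if new_plan == plan:
            return plan  # fixed point reached: no rule changed anything
        plan = new_plan
    raise RuntimeError("hit max iterations; plan may still be unresolved")

# A rule that decrements toward zero converges to a fixed point:
print(run_to_fixed_point(5, [lambda p: max(p - 1, 0)]))  # 0
```

Note that changing one rule changes every subsequent iteration's input, which is exactly the "unobvious chain of transformations" problem being described.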
Could anyone help me here?
Sent from my iPhone
> On May 7, 2024, at 4:30 PM, Prem Sahoo wrote:
Hello Folks,
in Spark I have read a file, done some transformations, and am finally
writing to HDFS.
Now I am interested in writing the same dataframe to MapRFS, but for this
Spark will execute the full DAG again (recompute all the previous steps:
the read + transformations).
I don't
On Wed, Apr 10, 2024 at 9:54 PM Binwei Yang wrote:
Gluten currently already supports the Velox backend and the ClickHouse
backend. DataFusion support has also been proposed, but no one has worked on
it yet.
Gluten isn't a POC. It's still under active development, and some companies
already use it.
On 2024/04/11 03:32:01 Dongjoon Hyun wrote:
> I'm
I'm interested in your claim.
Could you elaborate or provide some evidence for your claim, *a door for
all native libraries*, Binwei?
For example, is there any POC for that claim? Maybe, did I miss something
in that SPIP?
Dongjoon.
On Wed, Apr 10, 2024 at 8:19 PM Binwei Yang wrote:
The SPIP is not for the current Gluten, but opens a door to supporting all
native libraries and accelerators.
On 2024/04/11 00:27:43 Weiting Chen wrote:
> Yes, the 1st Apache release(v1.2.0) for Gluten will be in September.
> For Spark version support, currently Gluten v1.1.1 support Spark3.2 a
project is still under active development now, and doesn't have a
> stable release.
> > https://github.com/apache/incubator-gluten/releases/tag/v1.1.1
>
> In the Apache Spark community, Apache Spark 3.2 and 3.3 are at the end of
> support.
> And, 3.4 will have 3.4.3 next
Thank you for sharing, Weiting.
Do you think you can share the future milestone of Apache Gluten?
I'm wondering when the first stable release will come and how we can
coordinate across the ASF communities.
> This project is still under active development now, and doesn't have a
Hi all,
We are excited to introduce a new Apache incubating project called Gluten.
Gluten serves as a middleware layer designed to offload Spark to native
engines like Velox or ClickHouse.
For more detailed information, please visit the project repository at
https://github.com/apache/incubator
I concur. Whilst Databricks' (a commercial entity) Knowledge Sharing Hub
can be a useful resource for sharing knowledge and engaging with their
respective community, ASF likely prioritizes platforms and channels that
align more closely with its principles of open source, and vendor
neutr
ASF will be unhappy about this, and Stack Overflow exists. Otherwise,
Apache Confluence and LinkedIn exist; LI is the option I'd point at.
On Mon, 18 Mar 2024 at 10:59, Mich Talebzadeh
wrote:
> Some of you may be aware that Databricks community Home | Databricks
> have just launched
n entertain this idea. They
seem to have a well defined structure for hosting topics.
Let me know your thoughts
Thanks
<https://community.databricks.com/t5/knowledge-sharing-hub/bd-p/Knowledge-Sharing-Hub>
Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kin
+1 Great initiative.
QQ: Stack Overflow has a similar feature called "Collectives", but I am
not sure of the expense of creating one for Apache Spark. With SO being used
(at least before ChatGPT became quite the norm for searching questions), it
already has a lot of questions asked an
>> *From: *ashok34...@yahoo.com.INVALID
>> *Date: *Monday, March 18, 2024 at 6:36 AM
>> *To: *user @spark , Spark dev list <
>> dev@spark.apache.org>, Mich Talebzadeh
>> *Cc: *Matei Zaharia
>> *Subject: *R
> On Mon, 18 Mar 2024 at 16:23, Parsian, Mahmoud wrote:
>
>> Good idea. Will be useful
>>
>> +1
OK, thanks for the update.
What does "officially blessed" signify here? Can we have and run it as a
sister site? The reason this comes to my mind is that the interested
parties should have easy access to this site (from ISUG Spark sites) as a
reference repository. I guess the advice would be that
March 2024 at 17:26, Parsian, Mahmoud wrote:
>
>> Good idea. Will be useful
>>
>> +1
y, March 18, 2024 at 6:36 AM
> *To: *user @spark , Spark dev list <
> dev@spark.apache.org>, Mich Talebzadeh
> *Cc: *Matei Zaharia
> *Subject: *Re: A proposal for creating a Knowledge Sharing Hub for Apache
> Spark Community
>
> External message, be mindful when clicking l
Some of you may be aware that the Databricks community (Home | Databricks)
have just launched a knowledge sharing hub. I thought it would be a
good idea for the Apache Spark user group to have the same, especially
for repeat questions on Spark core, Spark SQL, Spark Structured
Streaming, Spark MLlib and
> shuffle and better memory management have been introduced, we plan to
> publish the benchmark results (at least TPC-H) in the repo.
>
> Compared to standard Spark, what kind of performance gains can be
> expected with Comet?

Currently, users could benefit from Comet in a few areas:
- Parquet read: a few improvements have been made against reading from S3
in particular, so users can expect better scan performance in this sc
Hi Chao,
As a cool feature
- Compared to standard Spark, what kind of performance gains can be
expected with Comet?
- Can one use Comet on k8s in conjunction with something like a Volcano
addon?
HTH
Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
ources but of
course cannot be guaranteed. It is essential to note that, as with any
advice, one verified and tested result holds more weight than a thousand
expert opinions.
On Thu, 15 Feb 2024 at 01:18, Chao Sun wrote:
> Hi Praveen,
>
> We will add a "Getting Started" sectio
Hi Praveen,
We will add a "Getting Started" section in the README soon, but basically
comet-spark-shell
<https://github.com/apache/arrow-datafusion-comet/blob/main/bin/comet-spark-shell>
in
the repo should provide a basic tool to build Comet and launch a Spark
shell with it.
Note
Absolutely thrilled to see the project going open-source! Huge congrats to
Chao and the entire team on this milestone!
Yufei
On Tue, Feb 13, 2024 at 12:43 PM Chao Sun wrote:
> Hi all,
>
> We are very happy to announce that Project Comet, a plugin to
> accelerate Spark query e
Sure, thanks for the clarification. I gather what you are alluding to is
this: in a distributed environment, when one does operations that involve
shuffling or repartitioning of data, the order in which this data is
processed across partitions is not guaranteed. So when repartitioning a
dataframe, the
Apologies if it wasn't clear; I meant the difficulty of debugging,
not floating-point precision :)
On Wed, Feb 14, 2024 at 2:03 AM Mich Talebzadeh
wrote:
This looks really cool :) Out of interest, what are the differences in the
approach between this and Gluten?
On Tue, Feb 13, 2024 at 12:42 PM Chao Sun wrote:
> Hi all,
>
> We are very happy to announce that Project Comet, a plugin to
> accelerate Spark query execution via leveragin
Hi all,
We are very happy to announce that Project Comet, a plugin to
accelerate Spark query execution via leveraging DataFusion and Arrow,
has now been open sourced under the Apache Arrow umbrella. Please
check the project repo
https://github.com/apache/arrow-datafusion-comet for more details if
Hi Jack,
"most SQL engines suffer from the same issue..."
Sure. This behavior is not a bug, but rather a consequence of the
limitations of floating-point precision. The numbers involved in the
example (see SPIP [SPARK-47024] Sum of floats/doubles may be incorre
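To make the point concrete, here is the effect in plain Python doubles (the same IEEE-754 arithmetic): addition is not associative, so the order in which partial sums are combined changes the result:

```python
# Floating-point addition is not associative, which is why a distributed sum
# can vary with the order in which partition results are combined.
a = (1e16 + 1.0) - 1e16   # 1.0 is absorbed: 1e16 + 1.0 rounds back to 1e16
b = (1e16 - 1e16) + 1.0   # reordered: the large terms cancel first
print(a, b)  # 0.0 1.0
```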
I may be ignorant of other debugging methods in Spark but the best success
I've had is using smaller datasets (if runs take a long time) and adding
intermediate output steps. This is quite different from application
development in non-distributed systems where a debugger is trivial to
attach