@Wenchen Fan, understood that mapping the query plan to application code
is very hard. I was wondering if we might instead handle just the mapping
from the final physical plan to the stage graph, so that, for example,
you'd be able to tell which part of the plan generated which stages. I feel
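For what it's worth, the nearest handle I'm aware of today is the formatted
explain output, which numbers each operator in the final physical plan and
annotates the whole-stage-codegen subtrees. It doesn't name shuffle stages,
but it's a start (a minimal sketch; df stands for any Dataset<Row>):

// Spark 3.0+: print the final physical plan with numbered operators and
// [codegen id : N] annotations, the closest plan-to-stage breadcrumb today.
df.explain("formatted");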
Reviving this thread to ask whether any of the Spark maintainers would
consider helping to scope a solution for this. Michal outlines the problem
in this thread, but to clarify: the issue is that for very complex Spark
applications, where the logical plans often span many pages, it is extremely
hard
Adding /another/ update to say that I'm currently planning on using a
recently introduced feature whereby calling `.repartition()` with no args
will cause the dataset to be optimised by AQE. This actually suits our
use-case perfectly!
Example:
sparkSession.conf().set("spark.sql.adaptive.enabled", "true");
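Expanded slightly, a hedged sketch of the whole pattern in Java (the app
name, paths, and the extra coalescePartitions config are my assumptions, not
from the thread):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Inside some driver method:
SparkSession sparkSession = SparkSession.builder()
        .appName("aqe-repartition-sketch") // placeholder name
        .getOrCreate();

// Enable AQE and its runtime coalescing of small shuffle partitions.
sparkSession.conf().set("spark.sql.adaptive.enabled", "true");
sparkSession.conf().set("spark.sql.adaptive.coalescePartitions.enabled", "true");

Dataset<Row> df = sparkSession.read().parquet("/path/to/input"); // placeholder path

// On versions with the feature described above, no-arg repartition()
// leaves the output partitioning for AQE to decide at runtime, so the
// many tiny partitions get merged to a sensible size.
df.repartition().write().parquet("/path/to/output"); // placeholder path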
Hi angers.zhu,
Reviving this thread to say that, while it's not ideal (as it recomputes the
last stage), I think the `SizeBasedCoaleaser` solution seems like a good
option. If you don't mind re-raising that PR, that would be great.
Alternatively, I'm happy to make the PR based on your previous PR?
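In the meantime, here's a hand-rolled approximation of the idea (emphatically
not the PR's implementation): sum the input file sizes with the Hadoop
FileSystem API, derive a partition count targeting roughly 128 MB each, and
coalesce(). The path and target size are placeholders; sparkSession and df
are assumed to exist as in the earlier snippet:

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sum the on-disk size of the input directory (placeholder path).
FileSystem fs = FileSystem.get(sparkSession.sparkContext().hadoopConfiguration());
long totalBytes = 0L;
for (FileStatus status : fs.listStatus(new Path("/path/to/input"))) {
    totalBytes += status.getLen();
}

// Aim for roughly 128 MB per output partition (tune to taste).
long targetBytesPerPartition = 128L * 1024L * 1024L;
int numPartitions = (int) Math.max(1L, totalBytes / targetBytesPerPartition);

// coalesce() narrows existing partitions without a full shuffle and
// without recomputing the last stage; the trade-off is that it merges
// partitions blindly, so a single heavy partition stays heavy.
Dataset<Row> coalesced = df.coalesce(numPartitions);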
Okay, from looking closer at some of the code, I'm not sure that what I'm
asking for in terms of adaptive execution makes much sense, as it can only
happen between stages, i.e. optimising future /stages/ based on the results
of previous stages. Thus an "on-demand" adaptive coalesce doesn't make much
sense
Hi angers.zhu,
Thanks for pointing me towards that PR. I think the main issue there is that
the coalesce operation requires an additional computation, which is
undesirable in this case. It also approximates the row size rather than
directly using the partition size. Thus it has the potential t
Hi Pol, I had considered repartitioning, but the main issue for me there is
that it will trigger a shuffle and could significantly slow down the
query/application as a result. Thanks for contributing that as an
alternative suggestion though :)
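For anyone weighing the two options, the shuffle is visible directly in the
physical plan; a quick check with any Dataset<Row> df:

// repartition(n) inserts an Exchange (a full shuffle) into the plan,
// which is the slowdown mentioned above.
df.repartition(100).explain();

// coalesce(n) is a narrow dependency when reducing the partition count:
// it merges existing partitions in place and no Exchange appears.
df.coalesce(100).explain();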
Hi all, Sending this first, before creating a JIRA issue, in an effort to
start a discussion :)
Problem:
We have a situation where we end up with a very large number (O(10K)) of
partitions, with very little data in most partitions but a lot of data in
some of them. This not only causes slow execution
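In case anyone wants to see the shape of the problem on their own data, one
way is to count rows per partition (a sketch; df stands for the skewed
dataset):

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.spark_partition_id;

// A handful of huge counts among thousands of near-empty partitions is
// exactly the distribution described above.
df.groupBy(spark_partition_id().alias("pid"))
  .count()
  .orderBy(col("count").desc())
  .show(20);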