Re: Bridging gap between Spark UI and Code

2021-05-24 Thread mhawes
@Wenchen Fan, understood that the mapping of query plan to application code is very hard. I was wondering if we might instead just handle the mapping from the final physical plan to the stage graph, so that, for example, you could tell which part of the plan generated which stages. I feel
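For reference, a minimal spark-shell sketch of what can be inspected today; the dataset names are hypothetical, and the ask in this thread is the missing next step of linking this plan to the stage graph in the UI:

    // Hypothetical datasets `orders` and `customers`; explain("formatted") (Spark 3.0+)
    // prints the final physical plan, which is the artefact this thread proposes to
    // map onto the stages shown in the UI.
    val result = orders
      .join(customers, "customer_id")
      .groupBy("region")
      .count()
    result.explain("formatted")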

Re: Bridging gap between Spark UI and Code

2021-05-21 Thread mhawes
Reviving this thread to ask whether any of the Spark maintainers would consider helping to scope a solution for this. Michal outlines the problem in this thread, but to clarify: the issue is that for very complex Spark applications, where the logical plans often span many pages, it is extremely hard

Re: [Spark Core]: Adding support for size based partition coalescing

2021-05-21 Thread mhawes
Adding /another/ update to say that I'm currently planning on using a recently introduced feature whereby calling `.repartition()` with no args will cause the dataset to be optimised by AQE. This actually suits our use-case perfectly! Example: sparkSession.conf().set("spark.sql.adaptive.e
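A minimal spark-shell sketch of the approach described above (assumes Spark 3.x with AQE available; the input and output paths are illustrative only):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("aqe-repartition-sketch").getOrCreate()

    // Enable adaptive query execution and post-shuffle coalescing so the shuffle
    // introduced by the no-arg repartition() is re-sized at runtime.
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

    val input = spark.read.parquet("/data/events")           // hypothetical path

    // repartition() with no arguments leaves the output partitioning for AQE to
    // optimise at runtime instead of fixing a partition count up front.
    val rebalanced = input.repartition()
    rebalanced.write.parquet("/data/events_rebalanced")      // hypothetical path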

Re: [Spark Core]: Adding support for size based partition coalescing

2021-05-11 Thread mhawes
Hi angers.zhu, Reviving this thread to say that while it's not ideal (as it recomputes the last stage), I think the `SizeBasedCoaleaser` solution seems like a good option. If you don't mind re-raising that PR, that would be great. Alternatively, I'm happy to make the PR based on your previous PR? Wh

Re: [Spark Core]: Adding support for size based partition coalescing

2021-03-31 Thread mhawes
Okay, from looking closer at some of the code, I'm not sure that what I'm asking for in terms of adaptive execution makes much sense, as it can only happen between stages, i.e. optimising future /stages/ based on the results of previous stages. Thus an "on-demand" adaptive coalesce doesn't make much
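A small spark-shell sketch of that constraint, with hypothetical names: AQE can only coalesce the partitions coming out of a shuffle, i.e. at a stage boundary, so without a shuffle there is nothing for it to adapt.

    import org.apache.spark.sql.functions.col

    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

    // The groupBy introduces a shuffle (a stage boundary), so AQE can look at the
    // map-side output sizes and coalesce the reduce-side partitions.
    val counts = events.groupBy("user_id").count()     // `events` is a hypothetical dataset

    // A narrow transformation introduces no shuffle, so there is no point at which
    // AQE could coalesce "on demand" within the same stage.
    val filtered = events.filter(col("bytes") > 0)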

Re: [Spark Core]: Adding support for size based partition coalescing

2021-03-31 Thread mhawes
Hi angers.zhu, Thanks for pointing me towards that PR. I think the main issue there is that the coalesce operation requires an additional computation, which in this case is undesirable. It also approximates the row size rather than directly using the partition size. Thus it has the potential t

Re: [Spark Core]: Adding support for size based partition coalescing

2021-03-30 Thread mhawes
Hi Pol, I had considered repartitioning, but the main issue for me there is that it will trigger a shuffle and could significantly slow down the query/application as a result. Thanks for contributing that as an alternative suggestion though :)
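For illustration only (the dataset name is hypothetical), the trade-off being weighed up here: repartition performs a full shuffle, while coalesce merges existing partitions without one, but only by count rather than by size.

    // `df` is a hypothetical Dataset with ~10K mostly-small partitions.
    val shuffled = df.repartition(200)   // full shuffle: evens the data out, but is expensive
    val merged   = df.coalesce(200)      // no shuffle: cheap, but merges by count, not by size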

[Spark Core]: Adding support for size based partition coalescing

2021-03-30 Thread mhawes
Hi all, Sending this first before creating a Jira issue in an effort to start a discussion :) Problem: We have a situation where we end up with a very large number (O(10K)) of partitions, with very little data in most partitions but a lot of data in some of them. This not only causes slow execution
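For illustration, a rough spark-shell sketch of the manual workaround this proposal would replace; `estimatedInputBytes` is a placeholder that would have to come from file sizes or plan statistics, and the numbers are made up:

    val estimatedInputBytes  = 64L * 1024 * 1024 * 1024   // pretend the input is ~64 GB
    val targetPartitionBytes = 128L * 1024 * 1024         // aim for roughly 128 MB per partition
    val numPartitions = math.max(1L, estimatedInputBytes / targetPartitionBytes).toInt

    // coalesce avoids a shuffle but merges partitions purely by count, not size,
    // so the skewed partitions described above stay skewed; hence this proposal.
    val resized = df.coalesce(numPartitions)               // `df` is a hypothetical dataset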