Re: AQE effectiveness

2020-09-29 Thread Koert Kuipers
i have been doing tests with iterative algorithms that do caching/uncaching at each iteration and i see improvements when i turn on AQE for cache. now i am wondering... with an iterative algo using AQE it is true that the output of every iteration can have a slightly different number of partitions

Re: AQE effectiveness

2020-08-21 Thread Maryann Xue
It would break CachedTableSuite."A cached table preserves the partitioning and ordering of its cached SparkPlan" if AQE was turned on. Anyway, the chance of this `outputPartitioning` being useful is rather low and should not justify turning off AQE for SQL cache.

On Thu, Aug 20, 2020 at 10:54 PM Ko

Re: AQE effectiveness

2020-08-20 Thread Maryann Xue
No. The worst case of enabling AQE in cached data is not losing the opportunity of using/reusing the cache, but rather just an extra shuffle if the `outputPartitioning` happens to match without AQE and not match after AQE. The chance of this happening is rather low.

On Thu, Aug 20, 2020 at 12:09 PM

Re: AQE effectiveness

2020-08-20 Thread Koert Kuipers
i see. it makes sense to maximize re-use of cached data. i didn't realize we have two potentially conflicting goals here.

On Thu, Aug 20, 2020 at 12:41 PM Maryann Xue wrote:
> AQE has been turned off deliberately so that the `outputPartitioning` of
> the cached relation won't be changed by AQE

Re: AQE effectiveness

2020-08-20 Thread Maryann Xue
AQE has been turned off deliberately so that the `outputPartitioning` of the cached relation won't be changed by AQE partition coalescing or skew join optimization, and the `outputPartitioning` can potentially be used by relations built on top of the cache. On second thought, we should probably add
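
To make the point about partition coalescing concrete, here is a small sketch in plain Python (no Spark required). It is a simplified model, not Spark's actual implementation: the function name `coalesce_partitions`, the greedy grouping rule, and the sample sizes are all assumptions made for illustration. In Spark the target size is governed by `spark.sql.adaptive.advisoryPartitionSizeInBytes`; here it is just an integer. The sketch shows that the same initial number of shuffle partitions can end up as a different number of output partitions depending on the data sizes, which is why the partitioning of a cached plan may differ with AQE on.

```python
from typing import List


def coalesce_partitions(sizes: List[int], target_size: int) -> List[List[int]]:
    """Greedily merge adjacent shuffle partitions (hypothetical helper).

    A new group is started whenever adding the next partition would push
    the current group past target_size. Returns the partition indices
    that make up each coalesced group.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for idx, size in enumerate(sizes):
        if current and current_size + size > target_size:
            groups.append(current)
            current = []
            current_size = 0
        current.append(idx)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Made-up per-partition sizes for two runs, both starting from
# 8 shuffle partitions (think of a small spark.sql.shuffle.partitions).
iteration_1 = [10, 10, 10, 10, 40, 5, 5, 5]
iteration_2 = [30, 30, 30, 5, 5, 5, 5, 5]

groups_1 = coalesce_partitions(iteration_1, target_size=40)
groups_2 = coalesce_partitions(iteration_2, target_size=40)

# The number of coalesced partitions differs between the two runs.
print(len(groups_1), groups_1)
print(len(groups_2), groups_2)
```

Under this model, a downstream operator that expected the original 8-way partitioning would see a different layout after coalescing, which corresponds to the extra-shuffle case discussed in this thread.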

AQE effectiveness

2020-08-20 Thread Koert Kuipers
we tend to have `spark.sql.shuffle.partitions` set very high by default simply because some jobs need it to be high and it's easier to then just set the default high instead of having people tune it manually per job. the main downside is lots of part files, which leads to pressure on the driver, and d