i have been doing tests with iterative algorithms that do caching/uncaching
at each iteration, and i see improvements when i turn on AQE for cache.
now i am wondering... with an iterative algo using AQE, it is true that the
output of every iteration can have a slightly different number of
partitions
It would break CachedTableSuite."A cached table preserves the partitioning
and ordering of its cached SparkPlan" if AQE was turned on.
Anyway, the chance of this outputPartitioning being useful is rather low
and should not justify turning off AQE for SQL cache.
On Thu, Aug 20, 2020 at 10:54 PM Ko
No. The worst case of enabling AQE in cached data is not losing the
opportunity of using/reusing the cache, but rather just an extra shuffle if
the outputPartitioning happens to match without AQE and not match after
AQE. The chance of this happening is rather low.
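
As a rough illustration of that worst case, here is a toy Python model. It is not Spark's planner logic: `ToyPartitioning` and `needs_shuffle` are hypothetical names, the partition counts 200 and 37 are made up, and the matching rule (same keys and same partition count, otherwise reshuffle) is a deliberate simplification.

```python
from typing import NamedTuple, Tuple


class ToyPartitioning(NamedTuple):
    """Hypothetical stand-in for a relation's output partitioning."""
    keys: Tuple[str, ...]
    num_partitions: int


def needs_shuffle(cached: ToyPartitioning, required: ToyPartitioning) -> bool:
    # Simplifying assumption: the downstream operator can skip its shuffle
    # only when keys and partition count match exactly.
    return cached != required


# What a downstream join on "id" is assumed to require.
required = ToyPartitioning(keys=("id",), num_partitions=200)

# Cached relation planned without AQE: static partition count is kept.
cached_without_aqe = ToyPartitioning(keys=("id",), num_partitions=200)

# Cached relation planned with AQE: partitions were coalesced at runtime.
cached_with_aqe = ToyPartitioning(keys=("id",), num_partitions=37)

extra_without_aqe = needs_shuffle(cached_without_aqe, required)
extra_with_aqe = needs_shuffle(cached_with_aqe, required)

# In both cases the cached data is still read; only the shuffle differs.
print("extra shuffle without AQE:", extra_without_aqe)
print("extra shuffle with AQE:", extra_with_aqe)
```

In this model the cache is reused either way; the only cost of the mismatch is one additional shuffle on top of the cached data.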
On Thu, Aug 20, 2020 at 12:09 PM
i see. it makes sense to maximize re-use of cached data. i didn't realize
we have two potentially conflicting goals here.
On Thu, Aug 20, 2020 at 12:41 PM Maryann Xue wrote:
AQE has been turned off deliberately so that the `outputPartitioning` of
the cached relation won't be changed by AQE partition coalescing or skew
join optimization and the outputPartitioning can potentially be used by
relations built on top of the cache.
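
To make the effect of partition coalescing concrete, here is a minimal sketch in plain Python. It assumes a simplified greedy rule that merges adjacent shuffle partitions up to a target size; the function name `coalesce_partitions`, the sizes, and the target are hypothetical, and Spark's real implementation has more rules than this.

```python
def coalesce_partitions(sizes, target_size):
    """Greedily group adjacent partitions so each group stays within target_size.

    Returns a list of groups, each group being a list of original partition
    indices. This is a toy rule for illustration only.
    """
    groups = []
    current = []
    current_size = 0
    for idx, size in enumerate(sizes):
        # Close the current group if adding this partition would overshoot.
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(idx)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Same static setting (8 shuffle partitions), but a different data volume
# in each iteration of a hypothetical iterative job.
iteration_1 = [10, 10, 10, 10, 10, 10, 10, 10]
iteration_2 = [30, 30, 30, 30, 30, 30, 30, 30]

parts_1 = coalesce_partitions(iteration_1, target_size=64)
parts_2 = coalesce_partitions(iteration_2, target_size=64)

print(len(parts_1), len(parts_2))  # prints "2 4"
```

The static partition count is the same in both runs, yet the resulting number of partitions depends on the data, so a relation cached after coalescing no longer advertises the partition count it was planned with.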
On a second thought, we should probably add
we tend to have spark.sql.shuffle.partitions set very high by default
simply because some jobs need it to be high and it's easier to then just
set the default high instead of having people tune it manually per job. the
main downside is lots of part files which leads to pressure on the driver,
and d