, January 7, 2020 at 12:00 AM
To: "Long, Andrew"
Cc: "dev@spark.apache.org"
Subject: Re: SortMergeJoinExec: Utilizing child partitioning when joining
1. Where can I find information on how to run standard performance
tests/benchmarks?
2. Are performance degradations to existing queries that are fixable by new
equivalent queries not allowed for a new major Spark version?
On Thu, Jan 2, 2020 at 3:05 PM Brett Marcott
wrote:
Thanks for the response, Andrew.
*1. The approach*
The approach I mentioned will not introduce any new skew, so it should only
worsen performance if the user was relying on the shuffle to fix skew
they already had.
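
To make the skew concern concrete, here is a minimal sketch in plain Python (not Spark code). It assumes a toy polynomial hash as a stand-in for Spark's actual hash function, and the names `toy_hash` and `partition_sizes` are hypothetical helpers invented for this illustration. It compares partition sizes when rows are distributed by the full join key versus a low-cardinality subset of it.

```python
from collections import Counter

def toy_hash(values):
    # Simple polynomial hash over integers. This is an assumption made for
    # illustration only; it is not the hash function Spark uses.
    h = 0
    for v in values:
        h = 31 * h + v
    return h

def partition_sizes(rows, key_indices, num_partitions):
    # Count how many rows land in each partition when hashing only the
    # columns listed in key_indices.
    counts = Counter(
        toy_hash([row[i] for i in key_indices]) % num_partitions for row in rows
    )
    return [counts.get(p, 0) for p in range(num_partitions)]

# 200 rows: column 0 has only 2 distinct values, column 1 has 100.
rows = [(a, b) for a in range(2) for b in range(100)]

full_key = partition_sizes(rows, [0, 1], 4)    # hash on both columns
subset_key = partition_sizes(rows, [0], 4)     # hash on the low-cardinality column only

print(full_key)    # [50, 50, 50, 50]
print(subset_key)  # [100, 100, 0, 0]
```

With the subset key, the number of non-empty partitions can never exceed the number of distinct values in that column, so data that is already partitioned this way stays uneven unless a shuffle on the full key redistributes it.
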
The user can address this by either not introducing their own skewed
partition in
“Thoughts on this approach?”
Just to warn you, this is a hazardous optimization without cardinality
information. Removing columns from the hash exchange reduces entropy,
potentially resulting in skew. Also keep in mind that if you reduce the number
of columns on one side of the join you need to do