Re: SortMergeJoinExec: Utilizing child partitioning when joining

2020-01-07 Thread Long, Andrew
January 7, 2020 at 12:00 AM To: "Long, Andrew" Cc: "dev@spark.apache.org" Subject: Re: SortMergeJoinExec: Utilizing child partitioning when joining 1. Where can I find information on how to run standard performance tests/benchmarks? 2. Are performance degradations to existing quer

Re: SortMergeJoinExec: Utilizing child partitioning when joining

2020-01-07 Thread Brett Marcott
1. Where can I find information on how to run standard performance tests/benchmarks? 2. Are performance degradations to existing queries that are fixable by new equivalent queries not allowed for a new major Spark version? On Thu, Jan 2, 2020 at 3:05 PM Brett Marcott wrote: > Thanks for the resp

Re: SortMergeJoinExec: Utilizing child partitioning when joining

2020-01-02 Thread Brett Marcott
Thanks for the response Andrew. *1. The approach* The approach I mentioned will not introduce any new skew, so it should only worsen performance if the user was relying on the shuffle to fix skew they had before. The user can address this by either not introducing their own skewed partition in

Re: SortMergeJoinExec: Utilizing child partitioning when joining

2020-01-02 Thread Long, Andrew
“Thoughts on this approach?” Just to warn you, this is a hazardous optimization without cardinality information. Removing columns from the hash exchange reduces entropy, potentially resulting in skew. Also keep in mind that if you reduce the number of columns on one side of the join you need to do
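
The skew concern above can be illustrated with a small sketch in plain Python, outside Spark. Everything in it is an assumption made for illustration: the column names `a` and `b`, the 3 x 1000 key distribution, the 8 partitions, and the CRC32-based `partition_of` helper, which is a stand-in and not the hash function Spark uses. It only shows that hashing on a low-cardinality subset of the join columns concentrates rows into fewer partitions.

```python
import zlib
from collections import Counter


def partition_of(key, num_partitions):
    # Stand-in hash partitioner (CRC32 of the key's repr).
    # Assumption for illustration only; this is not Spark's hash function.
    return zlib.crc32(repr(key).encode("utf-8")) % num_partitions


def partition_sizes(rows, key_columns, num_partitions):
    # Count how many rows land in each partition when hashing on key_columns.
    sizes = Counter()
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        sizes[partition_of(key, num_partitions)] += 1
    return sizes


# Hypothetical data: column "a" has 3 distinct values, "b" has 1000.
rows = [{"a": a, "b": b} for a in range(3) for b in range(1000)]

full = partition_sizes(rows, ["a", "b"], 8)   # hash on both columns
partial = partition_sizes(rows, ["a"], 8)     # hash on "a" only

print("hash(a, b): partitions used =", len(full), "largest =", max(full.values()))
print("hash(a):    partitions used =", len(partial), "largest =", max(partial.values()))
```

With only three distinct values of `a`, hashing on `a` alone can fill at most three of the eight partitions, so the largest partition holds at least 1000 of the 3000 rows. Whether that matters in practice depends on the real key cardinalities, which is the information the message says is missing.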