jonathanc-n opened a new issue, #16425:
URL: https://github.com/apache/datafusion/issues/16425

   ### Is your feature request related to a problem or challenge?
   
   This is part of #13181 for looking into different joins.
   
   ## What is a Single Join
   Single joins are similar to a regular left outer join however the build side 
can only have zero or one match with the probe side. If there are two or 
matches for one build side row, it will throw an error during runtime. The 
reason we need a separate join type for this is because other join types do not 
keep track of these number of matches. 
   
   ## What is it used for
   Single joins are used to optimize scalar subqueries. In Neumann's paper for 
unnesting subqueries 
([here](https://btw-2015.informatik.uni-hamburg.de/res/proceedings/Hauptband/Wiss/Neumann-Unnesting_Arbitrary_Querie.pdf)),
 he writes on dependent joins, which is just a join where every iteration of 
the left side will incur a search on the right side (right side dependent on 
left).
   
   Here is an example of a scalar subquery:
   ```
   SELECT
     e.name,
     (
       SELECT d.dept_name
       FROM departments d
       WHERE d.dept_id = e.dept_id
     ) AS department
   FROM employees e;
   ```
   
   Scalar subqueries act as a type of dependent join due to the subquery 
needing to be calculated for every row on the left. We can eliminate this by 
creating a hash table out of the subquery side and probing with the build side. 
This cuts the `O(n^2)` runtime to `O(n)`. If the build side matches on more 
than two rows we call an error as it would be an invalid scalar subquery.
   
   
   
   
   
   ### Describe the solution you'd like
   
   - Optimize scalar subqueries into single joins
   - Using the existing hash join execution plan we can create a path for 
single join (duckdb does something similar to this)
   - Do checks on each probe to see if it matched multiple rows; call execution 
error if it did.
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to