2010YOUY01 commented on issue #17488:
URL: https://github.com/apache/datafusion/issues/17488#issuecomment-3274655352

   Yes, this reproducer totally makes sense.
   
   DF in general is optimized for large inputs (batches of several thousand 
rows, with the input consisting of multiple batches). In such cases we can 
expect it to be an order of magnitude faster than traditional systems like 
Postgres; with small inputs, performance can degrade to roughly that of 
traditional systems.
   
   For this specific NLJ operator, its inner logic is
   ```
   for each right_batch:
       for each left_row:
           join(left_row, right_batch)
   ```
   The innermost `join()` function is optimized for large right batches with 
classical vectorization tricks. For a large batch size, the amortized per-row 
cost will be very low; if the batch has only one row, there is nothing to 
amortize.
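
   To make the amortization argument concrete, here is a toy cost model of the 
loop above. It is only a sketch: `CALL_OVERHEAD`, `PER_ROW_COST`, and the 
function names are hypothetical illustrations, not DataFusion internals or 
measured numbers.

```python
# Hypothetical cost model: every join() call pays a fixed setup overhead,
# plus a small cost per right-side row handled by the vectorized kernel.
CALL_OVERHEAD = 100.0  # assumed units of work per join() call
PER_ROW_COST = 1.0     # assumed units of work per right-side row


def nlj_cost(left_rows, right_batch_sizes):
    """Modeled total cost of: for each right_batch, for each left_row, join()."""
    total = 0.0
    for batch_size in right_batch_sizes:
        for _ in range(left_rows):
            total += CALL_OVERHEAD + PER_ROW_COST * batch_size
    return total


def cost_per_pair(left_rows, right_batch_sizes):
    """Modeled cost divided by the number of (left_row, right_row) pairs."""
    pairs = left_rows * sum(right_batch_sizes)
    return nlj_cost(left_rows, right_batch_sizes) / pairs


large = cost_per_pair(10, [8192])  # one large right batch: about 1.01 per pair
tiny = cost_per_pair(10, [1])      # one single-row right batch: 101.0 per pair
```

   With a large right batch, the fixed per-call overhead is spread over 
thousands of rows; with a single-row batch, every pair pays it in full.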
   
   The DF 49 version uses a different high-level approach that handles this 
small-right-side case very efficiently, but at other costs, such as huge memory 
usage.
   
   Besides, this NLJ operator also assumes that the left side is the smaller 
side. Workloads that violate these assumptions are typically not optimized, 
mostly due to engineering cost: it's easier to implement something fast if 
users ensure it's used under specific constraints (in this case, large inputs 
with the smaller side on the left).
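
   As a rough illustration of why the side assignment matters for this loop 
shape, the sketch below counts how many times the innermost `join()` would be 
invoked. The helper `join_calls` and the 8192-row batch size are assumptions 
made for the example, not taken from the operator's code.

```python
import math


def join_calls(left_rows, right_rows, batch_size=8192):
    """Number of join(left_row, right_batch) calls made by the nested loop.

    batch_size is an assumed value, used only for illustration.
    """
    right_batches = math.ceil(right_rows / batch_size)
    return right_batches * left_rows


# Smaller side on the left: few calls, each over a large right batch.
small_left = join_calls(left_rows=1, right_rows=100_000)  # 13 calls
# Larger side on the left: many calls, each over a one-row right batch.
large_left = join_calls(left_rows=100_000, right_rows=1)  # 100000 calls
```

   Both orientations compare the same 100,000 row pairs, but the second one 
pays the per-call overhead for every single pair.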
   
   However, after checking the code, it seems fixable with some simple rules. 
I'll give it a try later this week, but if I find that it introduces too much 
extra complexity, I might give up for the sake of long-term maintainability.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

