2010YOUY01 commented on issue #17488: URL: https://github.com/apache/datafusion/issues/17488#issuecomment-3274655352
Yes, this reproducer totally makes sense.

DF in general is optimized for large inputs (batch size is several thousand rows, and the input consists of multiple batches). In that case we can expect it to be an order of magnitude faster than traditional systems like Postgres; with small inputs, performance can degrade to something close to those systems.

For this specific NLJ operator, its inner logic is

```
for each right_batch:
    for each left_row:
        join(left_row, right_batch)
```

The innermost `join()` function is optimized for a large right batch with classical vectorization tricks. For a large batch size, the amortized per-row cost will be very low; if the batch has only one row, there is nothing to amortize. The DF 49 version uses a different high-level approach that handles this small-right-side case very efficiently, but at other costs such as huge memory usage.

Besides, this NLJ operator also assumes that the left side is the smaller side.

This kind of workload is typically not optimized, mostly due to engineering cost: it's easier to implement something fast if users ensure it's used under specific constraints (in this case, large inputs with the smaller side on the left). However, after checking the code, it seems fixable with some simple rules. I'll give it a try later this week, but if I find that it introduces too much extra complexity, I might give up on it for the sake of long-term maintainability.
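
To make the amortization point concrete, here is a minimal sketch in plain Python. It is not DataFusion code: lists stand in for record batches, and `join`, `nested_loop_join`, and `chunk` are hypothetical names invented for this illustration. It only counts how many times the per-batch `join()` step runs for the same data split into one large right batch versus many one-row batches.

```python
# Illustrative sketch only: plain lists stand in for Arrow record batches,
# and all function names are hypothetical, not DataFusion's API.

def join(left_row, right_batch, predicate):
    """Join one left row against a whole right batch (the unit of vectorized work)."""
    return [(left_row, r) for r in right_batch if predicate(left_row, r)]


def nested_loop_join(left_rows, right_batches, predicate):
    """Run the loop from the comment; return (matches, number of join() calls)."""
    matches = []
    calls = 0
    for right_batch in right_batches:
        for left_row in left_rows:
            matches.extend(join(left_row, right_batch, predicate))
            calls += 1  # each call carries fixed overhead regardless of batch length
    return matches, calls


def chunk(rows, batch_size):
    """Split a list of rows into consecutive batches of at most batch_size rows."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


left = list(range(10))      # small left side
right = list(range(1024))   # right side, batched two different ways


def equal(l, r):
    return l == r


# One large right batch: the fixed cost of each join() call is spread over 1024 rows.
big_matches, big_calls = nested_loop_join(left, chunk(right, 1024), equal)

# One-row right batches: every join() call processes a single row.
tiny_matches, tiny_calls = nested_loop_join(left, chunk(right, 1), equal)

print(big_calls, tiny_calls)
```

Both runs produce the same join result, but the one-row batching makes as many `join()` calls as there are (left row, right row) pairs, so whatever fixed setup cost a call has is paid per row instead of per batch.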