Re: [PR] Run queries in python benchmarks using only one thread [sedona-db]

via GitHub Mon, 06 Oct 2025 18:14:55 -0700


petern48 commented on PR #24:
URL: https://github.com/apache/sedona-db/pull/24#issuecomment-3374800284


   > I can see how there are two types of benchmarks that are equally valuable:
   > 
   > * Integration-style benchmarks that use the defaults and read from Parquet 
(e.g., check the perceived speed of a relatively realistic query)
   > * Unit-style benchmarks that are just a way to check if our particular 
implementation of a scalar function/iteration overhead is reasonable. Running 
these from memory on one thread (or the same number of threads) is possibly 
more comparable but forcing a single thread is maybe unrealistic because in 
practice some of our per-batch and per-item overhead is amortized over multiple 
threads and is possibly not something we should spend time optimizing yet.
   
   I agree with this. Which of these do we care more about as the default: 1) 
realistic benchmarks for "bragging" to the public, or 2) benchmarks meant for 
development and for understanding how good our function implementations are?
   
   For 2), sure, using one thread may be better for development comparisons. 
In that case, I'll mention again that pre-loading the Parquet data (instead of 
using a view) would further isolate the comparison to the function 
implementations.
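   
   A minimal sketch of what I mean by pre-loading, in plain Python rather than 
our actual benchmark code. `load_rows` and `kernel` are hypothetical stand-ins 
for "read the Parquet file" and "the function under test"; neither is a real 
SedonaDB or DuckDB call:

```python
import time


def best_of(fn, repeats=5):
    """Return the fastest wall-clock time over `repeats` calls to fn()."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


load_calls = 0


def load_rows():
    # Hypothetical stand-in for reading the Parquet file.
    global load_calls
    load_calls += 1
    return [float(i) for i in range(10_000)]


def kernel(rows):
    # Hypothetical stand-in for the scalar function being benchmarked.
    return sum(x * x for x in rows)


# View-style: every timed run pays for the read as well as the function.
view_time = best_of(lambda: kernel(load_rows()))
view_loads = load_calls

# Pre-loaded: read once outside the timed region, then time only the function.
rows = load_rows()
preloaded_time = best_of(lambda: kernel(rows))
preloaded_loads = load_calls - view_loads
```

   In the view-style run the read happens inside every timed call, so the 
measurement mixes I/O with the function cost. In the pre-loaded run the read 
happens once, before timing starts.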
   
   For 1), I don't think forcing them both onto one thread would be more 
"fair." How well an engine leverages all the threads it has access to is a 
core part of how "good" it is. OK, here comes my main point:
   
   > I did some experiments and found that it is not related to whether we are 
using a test/benchmark framework or not. It is related to the size of dataset.
   
   I interpret this to mean that DuckDB decides to use one thread because it 
thinks this will be faster than using more (correct me if I'm wrong), since 
multithreading comes with overhead. If one engine *chooses* to use only one 
thread when it's given several, that shouldn't force everyone else to use only 
one. That choice might be right, or it might be wrong. If you want perf 
numbers to showcase publicly, we could increase the dataset size until DuckDB 
uses as many threads as the others, to head off any argument that the 
benchmark is "unfair."


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
