petern48 commented on PR #24: URL: https://github.com/apache/sedona-db/pull/24#issuecomment-3374800284
> I can see how there are two types of benchmarks that are equally valuable:
>
> * Integration-style benchmarks that use the defaults and read from Parquet (e.g., check the perceived speed of a relatively realistic query)
> * Unit-style benchmarks that are just a way to check if our particular implementation of a scalar function/iteration overhead is reasonable. Running these from memory on one thread (or the same number of threads) is possibly more comparable but forcing a single thread is maybe unrealistic because in practice some of our per-batch and per-item overhead is amortized over multiple threads and is possibly not something we should spend time optimizing yet.

I agree with this. Which one of these do we care more about (to be the default)?

1. Realistic benchmarks for "bragging" to the public, or
2. Benchmarks meant for development and for understanding how good our function implementations are.

For 2), sure, maybe using one thread would be better for development comparisons. In that case, I'll mention again that pre-loading the Parquet data (instead of using a view) would further isolate the comparison to the function implementations.

For 1), I don't think forcing them both to one thread would be more "fair." How well the engines leverage all the threads they have access to is a core part of how "good" they are.

OK, here comes my main point:

> I did some experiments and found that it is not related to whether we are using a test/benchmark framework or not. It is related to the size of dataset.

I interpret this to mean that DuckDB decides to use one thread because it thinks this will be optimal compared to using more (correct me if I'm wrong), since multithreading comes with an overhead. If one engine *chooses* to only use one thread when it's given multiple, that shouldn't force everyone else to only use one. That choice might be right, it might be wrong.
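A minimal sketch of why pre-loading matters for the unit-style case, in plain Python rather than the benchmark code in this PR. The names `time_with_load`, `time_preloaded`, `load`, and `compute` are hypothetical: `load` stands in for reading the Parquet data and `compute` for the function being measured. In the first helper every timed round pays the read cost again, roughly like querying through a view; in the second the data is read once, outside the timed region.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def time_with_load(load: Callable[[], T], compute: Callable[[T], R], rounds: int = 5) -> float:
    """Best wall-clock time when each round re-reads the data (view-like)."""
    best = float("inf")
    for _ in range(rounds):
        start = time.perf_counter()
        compute(load())  # the read cost is inside the timed region
        best = min(best, time.perf_counter() - start)
    return best


def time_preloaded(load: Callable[[], T], compute: Callable[[T], R], rounds: int = 5) -> float:
    """Best wall-clock time when the data is read once, before timing starts."""
    data = load()  # the read cost is paid once, outside the timed region
    best = float("inf")
    for _ in range(rounds):
        start = time.perf_counter()
        compute(data)  # only the function under test is timed
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    # Stand-ins only: a generated list instead of a Parquet file, sum() instead
    # of a real scalar function.
    def fake_load() -> list:
        return list(range(1_000_000))

    print("with load:", time_with_load(fake_load, sum))
    print("preloaded:", time_preloaded(fake_load, sum))
```

The gap between the two numbers is the read overhead that a view-based benchmark attributes to the function itself.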
If you want perf numbers to showcase publicly, we could increase the dataset size until DuckDB uses as many threads as everyone else, to kill any potential argument that the benchmark is "unfair."

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use
the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
