Kontinuation commented on PR #24:
URL: https://github.com/apache/sedona-db/pull/24#issuecomment-3273209632
> I don't see any issues running these without the benchmark suite:
>
> ```python
> from sedonadb.testing import SedonaDB, DuckDB
>
> sedonadb = SedonaDB()
> duckdb = DuckDB()
>
> random = sedonadb.con.sql("""
> SELECT geometry FROM sd_random_geometry('{
> "geom_type": "Point",
> "target_rows": 10000000,
> "vertices_per_linestring_range": [2, 2]
> }')""")
>
>
> duckdb.create_table_arrow("random", random)
> sedonadb.create_table_arrow("random", random)
>
> %time duckdb.execute_and_collect("SELECT ST_Buffer(geometry, 0.1) FROM random")
> #> CPU times: user 1min 20s, sys: 2.04 s, total: 1min 22s
> #> Wall time: 7.53 s
>
> %time sedonadb.execute_and_collect("SELECT ST_Buffer(geometry, 0.1) FROM random")
> #> CPU times: user 1min 31s, sys: 3.05 s, total: 1min 35s
> #> Wall time: 9.75 s
> ```
>
> Maybe `duckdb` always runs single threaded under `pytest`? Probably it makes sense to roll our own benchmark suite.

I did some experiments and found that this is not related to whether we run under a test/benchmark framework; it is related to the size of the dataset. I tried this workload with various `target_rows` configurations when generating the test dataset (a rough reproduction sketch follows the list). Here are the results:
* **target_rows = 100000**: DuckDB uses a single thread, SedonaDB uses 10 threads
* **target_rows = 500000**: DuckDB uses 4 threads, SedonaDB uses 10 threads
* **target_rows = 1000000 or larger**: DuckDB uses 5 threads, SedonaDB uses 10 threads
Generally, we observe less parallelism when the base table contains less data. This is probably related to how DuckDB [estimates the thread count](https://github.com/duckdb/duckdb/blob/c11813f1b4e89aa7096b8d18ca9eb608e27168a8/src/execution/physical_operator.cpp#L56-L73). This behavior is also documented [here](https://duckdb.org/docs/stable/guides/performance/how_to_tune_workloads.html#the-effect-of-row-groups-on-parallelism).
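
The row-group cap can also be checked with plain `duckdb`, independent of the sedona-db test helpers: `PRAGMA storage_info` reports the row groups backing a table, and per the documentation linked above DuckDB will not parallelize a scan of that table beyond the number of row groups, regardless of the configured `threads`. A minimal sketch, assuming the default row group size of 122,880 rows:

```python
import duckdb

con = duckdb.connect()

# ~1M rows; with the default row group size of 122,880 rows this table is
# stored in only a handful of row groups, which caps scan parallelism.
con.execute("CREATE TABLE t AS SELECT range AS x FROM range(1000000)")

row_groups = con.execute(
    "SELECT count(DISTINCT row_group_id) FROM pragma_storage_info('t')"
).fetchone()[0]
threads = con.execute("SELECT current_setting('threads')").fetchone()[0]

# DuckDB will not use more threads on this table than it has row groups,
# even if the configured thread count is higher.
print(f"row groups: {row_groups}, configured threads: {threads}")
```
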