Kontinuation commented on PR #24:
URL: https://github.com/apache/sedona-db/pull/24#issuecomment-3273209632

   > I don't see any issues running these without the benchmark suite:
   > 
   > ```python
   > from sedonadb.testing import SedonaDB, DuckDB
   > 
   > sedonadb = SedonaDB()
   > duckdb = DuckDB()
   > 
   > random = sedonadb.con.sql("""
   >     SELECT geometry FROM sd_random_geometry('{
   >                     "geom_type": "Point",
   >                     "target_rows": 10000000,
   >                     "vertices_per_linestring_range": [2, 2]
   >     }')""")
   > 
   > 
   > duckdb.create_table_arrow("random", random)
   > sedonadb.create_table_arrow("random", random)
   > 
   > %time duckdb.execute_and_collect("SELECT ST_Buffer(geometry, 0.1) FROM random")
   > #> CPU times: user 1min 20s, sys: 2.04 s, total: 1min 22s
   > #> Wall time: 7.53 s
   > 
   > %time sedonadb.execute_and_collect("SELECT ST_Buffer(geometry, 0.1) FROM random")
   > #> CPU times: user 1min 31s, sys: 3.05 s, total: 1min 35s
   > #> Wall time: 9.75 s
   > ```
   > 
   > Maybe `duckdb` always runs single threaded under `pytest`? Probably it makes sense to roll our own benchmark suite.
   
   I ran some experiments and found that this is not related to whether we use a test/benchmark framework. It is related to the size of the dataset.
   
   I tried this workload with various `target_rows` configurations when 
generating the test dataset. Here are the results:
   
   * **target_rows = 100000**: DuckDB uses a single thread, SedonaDB uses 10 threads
   * **target_rows = 500000**: DuckDB uses 4 threads, SedonaDB uses 10 threads
   * **target_rows = 1000000 or larger**: DuckDB uses 5 threads, SedonaDB uses 10 threads
   
   Generally, DuckDB shows less parallelism when the base table contains less data. This is probably related to how DuckDB [estimates thread count](https://github.com/duckdb/duckdb/blob/c11813f1b4e89aa7096b8d18ca9eb608e27168a8/src/execution/physical_operator.cpp#L56-L73). This behavior is also documented in DuckDB's performance guide, under [the effect of row groups on parallelism](https://duckdb.org/docs/stable/guides/performance/how_to_tune_workloads.html#the-effect-of-row-groups-on-parallelism).
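
As a rough illustration of the row-group effect, here is a minimal sketch that computes how many row groups a table of a given size would occupy. It assumes the table is stored in DuckDB's native format with the documented default row group size of 122,880 rows; the helper names `row_group_count` and `max_scan_threads` are hypothetical and not part of DuckDB or SedonaDB. The result is only an upper bound on scan parallelism, since the actual thread count comes from the estimator linked above.

```python
import math

# Assumption: DuckDB's documented default row group size (122,880 rows).
DEFAULT_ROW_GROUP_SIZE = 122_880


def row_group_count(n_rows: int, row_group_size: int = DEFAULT_ROW_GROUP_SIZE) -> int:
    """Number of row groups needed to hold n_rows (hypothetical helper)."""
    if n_rows <= 0:
        return 0
    return math.ceil(n_rows / row_group_size)


def max_scan_threads(
    n_rows: int,
    available_threads: int,
    row_group_size: int = DEFAULT_ROW_GROUP_SIZE,
) -> int:
    """Upper bound on threads that can scan the table, one row group per thread."""
    return min(available_threads, row_group_count(n_rows, row_group_size))


if __name__ == "__main__":
    # The three dataset sizes tried above, with 10 available threads.
    for target_rows in (100_000, 500_000, 1_000_000):
        print(
            target_rows,
            row_group_count(target_rows),
            max_scan_threads(target_rows, available_threads=10),
        )
```

With 10 available threads, this gives bounds of 1, 5, and 9 threads for 100000, 500000, and 1000000 rows. The observed 1, 4, and 5 threads stay within those bounds, so the row group count alone explains the single-threaded case but not the exact numbers for the larger tables.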


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
