This is an automated email from the ASF dual-hosted git repository.
jiayu pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/sedona-spatialbench.git
The following commit(s) were added to refs/heads/main by this push:
new 25bd460 docs: add overview and methodology (#19)
25bd460 is described below
commit 25bd460bfa5c08bee9db4a69836799a656eedd7d
Author: Matthew Powers <[email protected]>
AuthorDate: Fri Sep 19 14:26:39 2025 -0400
docs: add overview and methodology (#19)
---
docs/overview-methodology.md | 75 +++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 74 insertions(+), 1 deletion(-)
diff --git a/docs/overview-methodology.md b/docs/overview-methodology.md
index 0a2cd02..8abb466 100644
--- a/docs/overview-methodology.md
+++ b/docs/overview-methodology.md
@@ -19,4 +19,77 @@ title: SpatialBench Overview and Methodology
under the License.
-->
-## TODO
\ No newline at end of file
+# SpatialBench Overview and Methodology
+
+SpatialBench is an open benchmark suite of representative spatial queries
designed to evaluate the performance of different engines at multiple scale
factors.
+
+The SpatialBench queries are a great way to compare the relative performance
between engines for analytical spatial workloads. You can use a small scale
factor for single-machine queries, and a large scale factor to benchmark an
engine that distributes computations in the cloud.
+
+Let’s take a deeper look at why SpatialBench is needed.
+
+## Why SpatialBench?
+
+Spatial workflows encompass spatial joins, spatial filters, and spatial-specific operations such as K-nearest-neighbor (KNN) joins.
+
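+For example, here is a minimal GeoPandas sketch of those three operation types: a spatial filter, a spatial join, and a nearest-neighbor join. The file names and columns are hypothetical and are not part of the SpatialBench datasets.
+
+```python
+# Sketch of a spatial filter, a spatial join, and a nearest-neighbor (k=1)
+# join in GeoPandas. Inputs are hypothetical GeoParquet files with point and
+# polygon geometries; requires a recent GeoPandas (0.10+).
+import geopandas as gpd
+from shapely.geometry import box
+
+trips = gpd.read_parquet("trips.parquet")   # point geometries (hypothetical)
+zones = gpd.read_parquet("zones.parquet")   # polygon geometries (hypothetical)
+
+# Spatial filter: keep only the trips that fall inside a bounding box.
+area = box(-74.05, 40.60, -73.85, 40.90)
+trips_in_area = trips[trips.geometry.within(area)]
+
+# Spatial join: attach the zone that contains each trip.
+trips_with_zone = gpd.sjoin(trips, zones, predicate="within", how="inner")
+
+# Nearest-neighbor join: find the closest zone centroid for each trip.
+centroids = zones.set_geometry(zones.geometry.centroid)
+nearest = gpd.sjoin_nearest(trips, centroids, how="left", distance_col="dist")
+```
+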
+General-purpose analytics query benchmarks don’t cover spatial queries. They
focus on analytical queries, such as joins and aggregations, on tabular data.
Here are some popular analytical benchmarks:
+
+* [TPC-H](https://www.tpc.org/tpch/)
+* [TPC-DS](https://www.tpc.org/tpcds/)
+* [ClickBench](https://benchmark.clickhouse.com/)
+* [YCSB](https://github.com/brianfrankcooper/YCSB)
+* [db-benchmark](https://duckdblabs.github.io/db-benchmark/)
+
+These benchmarks are useful for measuring analytical performance, but that doesn’t necessarily translate to spatial queries. An engine can be blazing fast for a large tabular aggregation and still perform poorly on spatial joins.
+
+SpatialBench is tailored for spatial queries, making it a strong modern option for assessing the spatial performance of an engine.
+
+## Hardware and software
+
+SpatialBench runs benchmarks on commodity hardware, with software versions
fully disclosed for each release.
+
+When comparing different runtimes, developers should make a good-faith effort to use similar hardware and software versions. It’s not meaningful to compare one runtime against another that was given far less computational power.
+
+SpatialBench benchmarks should always be presented with associated
hardware/software specifications so readers can assess the reliability of the
comparison.
+
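+As an illustration, a short script such as the one below (not part of SpatialBench; the listed packages are just examples) can capture the hardware and software details to publish alongside results.
+
+```python
+# Sketch: record basic hardware/software details to publish with benchmark
+# results. The package names queried here are examples, not requirements.
+import json
+import os
+import platform
+from importlib.metadata import PackageNotFoundError, version
+
+def package_version(name: str) -> str:
+    try:
+        return version(name)
+    except PackageNotFoundError:
+        return "not installed"
+
+environment = {
+    "os": platform.platform(),
+    "machine": platform.machine(),
+    "cpu_count": os.cpu_count(),
+    "python": platform.python_version(),
+    "geopandas": package_version("geopandas"),
+    "duckdb": package_version("duckdb"),
+}
+
+print(json.dumps(environment, indent=2))
+```
+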
+## Accurately comparing different engines
+
+It is challenging to compare fundamentally different engines, such as PostGIS
(an OLTP database), DuckDB (an OLAP database), and GeoPandas (a Python engine).
+
+For example, let’s compare how two engines execute a query differently:
+
+* PostGIS: create tables, load data into the tables, build an index (can be
expensive), run the query
+* GeoPandas: read data into memory and run a query
+
+PostGIS and GeoPandas execute queries differently, so you need to present the
query runtime with caution. For example, you can’t just ignore the time it
takes to build the PostGIS index because that can be the slowest part of the
query. That’s a critical detail for users running ad hoc queries.
+
+The SpatialBench results strive to present runtimes for all relevant portions
of the query so users are best informed about how to interpret the results.
+
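+As an example of reporting each phase separately, the sketch below times the PostGIS load, index build, and query as distinct steps. The connection string, table names, file paths, and query are hypothetical.
+
+```python
+# Sketch: time each PostGIS phase separately so the index build is not hidden
+# inside the query time. Connection details, tables, and SQL are hypothetical.
+import time
+import psycopg2
+
+def timed(cur, label, sql, results):
+    start = time.perf_counter()
+    cur.execute(sql)
+    results[label] = time.perf_counter() - start
+
+results = {}
+conn = psycopg2.connect("dbname=spatialbench user=postgres")
+conn.autocommit = True
+cur = conn.cursor()
+
+# Phase 1: load data (assumes the tables were created beforehand).
+timed(cur, "load", "COPY trips FROM '/data/trips.csv' WITH (FORMAT csv)", results)
+
+# Phase 2: build the spatial index (often the slowest part).
+timed(cur, "index", "CREATE INDEX trips_geom_idx ON trips USING GIST (geom)", results)
+
+# Phase 3: run the spatial join.
+timed(cur, "query", """
+    SELECT z.zone_id, count(*)
+    FROM trips t JOIN zones z ON ST_Contains(z.geom, t.geom)
+    GROUP BY z.zone_id
+""", results)
+
+print(results)  # report every phase, not just the query time
+```
+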
+## Engine tuning in benchmarks
+
+Engines can be tuned by configuring settings or optimizing code. For example, you can optimize Spark code by tuning the JVM, and you can optimize GeoPandas code by using its spatial index. Benchmarks that tune one engine and don’t tune the others aren’t reliable.
+
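+For the GeoPandas case, the sketch below (with a hypothetical input file and query window) contrasts a naive filter with one that goes through the spatial index.
+
+```python
+# Sketch: naive vs. index-assisted spatial filter in GeoPandas.
+# The input file and query window are hypothetical.
+import geopandas as gpd
+from shapely.geometry import box
+
+buildings = gpd.read_parquet("buildings.parquet")
+query_area = box(10.0, 50.0, 10.5, 50.5)
+
+# Naive: evaluate the exact predicate against every row.
+naive = buildings[buildings.geometry.intersects(query_area)]
+
+# Tuned: query the spatial index (an STRtree built lazily on first access);
+# bounding-box candidates are refined with the exact predicate.
+candidates = buildings.sindex.query(query_area, predicate="intersects")
+tuned = buildings.iloc[candidates]
+```
+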
+All performance tuning is fully disclosed in the SpatialBench results. Some results are presented both untuned and fully tuned to give a picture of out-of-the-box performance and of what’s possible for expert users.
+
+## Open source benchmarks vs. vendor benchmarks
+
+The SpatialBench benchmarks report results for some open source spatial
engines/databases.
+
+The SpatialBench repository does not report results for any proprietary engines or vendor runtimes. Vendors are free to use the SpatialBench data generators and run the benchmarks on their own. We ask vendors to credit SpatialBench when they run the benchmarks and to fully disclose the results so that other practitioners can reproduce them.
+
+## How to contribute
+
+There are a variety of ways to contribute to the SpatialBench project:
+
+* Submit [pull requests](https://github.com/apache/sedona-spatialbench/pulls)
to add features
+* Create [issues](https://github.com/apache/sedona-spatialbench/issues) for
bug reports
+* Reproduce results or help add new spatial engines
+* Publish vendor benchmarks
+
+Here is how you can communicate with the team:
+
+* Chat with us on the [Apache Sedona Discord](https://discord.gg/9A3k5dEBsY)
+* Start a [GitHub Discussion](https://github.com/apache/sedona/discussions)
+
+## Future work
+
+In the next release, we will add raster datasets and raster queries. These
will stress test an engine’s ability to analyze raster data. They will also
show performance when joining vector and raster datasets.