This is an automated email from the ASF dual-hosted git repository. jiayu pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/sedona-spatialbench.git
commit a0e928cb529baa76d71fa84309a36bc4ad81592e Author: Pranav Toggi <[email protected]> AuthorDate: Mon Jul 7 15:42:05 2025 -0700 update readme --- README.md | 215 ++++++++++++++++++++------------------- images/data_model.png | Bin 0 -> 124184 bytes images/spatial_distributions.png | Bin 0 -> 990093 bytes 3 files changed, 113 insertions(+), 102 deletions(-) diff --git a/README.md b/README.md index 9bc0807..7af6fbb 100644 --- a/README.md +++ b/README.md @@ -1,145 +1,156 @@ -# tpchgen-rs +# SpatialBench -[![Apache licensed][license-badge]][license-url] -[![Build Status][actions-badge]][actions-url] +SpatialBench is a high-performance geospatial benchmark for generating synthetic spatial data at scale. Inspired by the Star Schema Benchmark (SSB) and real-world mobility data like the NYC TLC dataset, SpatialBench is designed to evaluate spatial query performance in modern data platforms. -[license-badge]: https://img.shields.io/badge/license-Apache%20v2-blue.svg -[license-url]: https://github.com/clflushopt/tpchgen-rs/blob/main/LICENSE -[actions-badge]: https://github.com/clflushopt/tpchgen-rs/actions/workflows/rust.yml/badge.svg -[actions-url]: https://github.com/clflushopt/tpchgen-rs/actions?query=branch%3Amain +Built in Rust and powered by Apache Arrow, SpatialBench brings fast, scalable, and streaming-friendly data generation for spatial workloads—minimal dependencies, blazing speed. -Blazing fast [TPCH] benchmark data generator, in pure Rust with zero dependencies. +SpatialBench provides a reproducible and scalable way to evaluate the performance of spatial data engines using realistic synthetic workloads. -[TPCH]: https://www.tpc.org/tpch/ +Goals: -## Features +- Establish a fair and extensible benchmark suite for spatial data processing. +- Help users compare engines and frameworks across different data scales. +- Support open standards and foster collaboration in the spatial computing community. -1. Blazing Speed 🚀 -2. Obsessively Tested 📋 -3. Fully parallel, streaming, constant memory usage 🧠 +## Data Model -## Try it now! +SpatialBench defines a spatial star schema with the following tables: -### Install Using Python -Install this tool with Python: -```shell -pip install tpchgen-cli -``` +| Table | Type | Abbr. | Description | Spatial Attributes | Cardinality per SF | +|------------|--------------|-------|---------------------------------------------|----------------------------|--------------------------| +| Trip | Fact Table | `t_` | Individual trip records | pickup & dropoff points | 6M × SF | +| Customer | Dimension | `c_` | Trip customer info | None | 30K × SF | +| Driver | Dimension | `s_` | Trip driver info | None | 500 × SF | +| Vehicle | Dimension | `v_` | Trip vehicle info | None | 100 × SF | +| Zone | Dimension | `z_` | Administrative zones | Polygon | ~236K (fixed) | +| Building | Dimension | `b_` | Building footprints | Polygon | 20K × (1 + log₂(SF)) | -```shell -# create Scale Factor 10 (3.6GB, 8 files, 60M rows in lineitem) in 5 seconds on a modern laptop -tpchgen-cli -s 10 --format=parquet -``` -### Install Using Rust -[Install Rust](https://www.rust-lang.org/tools/install) and this tool: + -```shell -curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -cargo install tpchgen-cli -``` +## Performance + +SpatialBench inherits its speed and efficiency from the tpchgen-rs project, which is one of the fastest open-source data generators available. -```shell -# create Scale Factor 10 (3.6GB, 8 files, 60M rows in lineitem) in 5 seconds on a modern laptop -tpchgen-cli -s 10 --format=parquet +Key performance benefits: +- Zero-copy, streaming architecture: Generates data in constant memory, suitable for very large datasets. +- Multithreaded from the ground up: Leverages all CPU cores for high-throughput generation. +- Arrow-native output: Supports fast serialization to Parquet and other formats without bottlenecks. +- Fast geometry generation: The Spider module generates millions of spatial geometries per second, with deterministic output and affine transforms. + +## How is SpatialBench dbgen built? + +SpatialBench is a Rust-based fork of the tpchgen-rs project. It preserves the original’s high-performance, multi-threaded, streaming architecture, while extending it with a spatial star schema and geometry generation logic. + +You can build the SpatialBench data generator using Cargo: + +```bash +cargo build --release ``` -Or watch this [awesome demo](https://www.youtube.com/watch?v=UYIC57hlL14) recorded by [@alamb](https://github.com/alamb) -and the companion blog post in the [Datafusion blog](https://datafusion.apache.org/blog/2025/04/10/fastest-tpch-generator/). +Alternatively, install it directly using: -### Examples +```bash +cargo install --path . +``` -```shell +### Notes -# Create a scale factor 10 dataset in the native table format. -tpchgen-cli -s 10 --output-dir sf10 +- The core generator logic lives in the tpchgen crate. +- Geometry-aware logic is in tpchgen-arrow and integrated via Arrow-based schemas. +- The spatial extension modules like the Spider geometry generator reside in [spider.rs](https://github.com/wherobots/sedona-spatialbench/blob/main/tpchgen/src/spider.rs). +- The generator supports output formats like .tbl and Apache Parquet via the Arrow writer. -# Create a scale factor 1 dataset in Parquet format. -tpchgen-cli -s 1 --output-dir sf1-parquet --format=parquet +For contribution or debugging, refer to the [ARCHITECTURE.md](https://github.com/wherobots/sedona-spatialbench/blob/main/ARCHITECTURE.md) guide. -# Create a scale factor 1 (default) partitioned dataset for the region, nation, orders -# and customer tables. -tpchgen-cli --tables region,nation,orders,customer --output-dir sf1-partitioned --parts 10 --part 2 +## Usage -# Create a scale factor 1 partitioned into separate folders. -# -# Each folder will have a single partition of rows, the partition size will depend on the scale -# factor. For tables that have less rows than the minimum partition size like "nation" or "region" -# the generator will produce the same file in each part. -# -# $ md5sum part-*/{nation,region}.tbl -# 2f588e0b7fa72939b498c2abecd9fbbe part-1/nation.tbl -# 2f588e0b7fa72939b498c2abecd9fbbe part-2/nation.tbl -# c235841b00d29ad4f817771fcc851207 part-1/region.tbl -# c235841b00d29ad4f817771fcc851207 part-2/region.tbl -for PART in `seq 1 2`; do - mkdir part-$PART - tpchgen-cli --tables region,nation,orders,customer --output-dir part-$PART --parts 10 --part $PART -done +#### Generate All Tables (Scale Factor 1) + +```bash +tpchgen-cli -s 1 --format=parquet ``` -## Performance +#### Generate Individual Tables -| Scale Factor | `tpchgen-cli` | DuckDB | DuckDB (proprietary) | -| ------------ | ------------- | ---------- | -------------------- | -| 1 | `0:02.24` | `0:12.29` | `0:10.68` | -| 10 | `0:09.97` | `1:46.80` | `1:41.14` | -| 100 | `1:14.22` | `17:48.27` | `16:40.88` | -| 1000 | `10:26.26` | N/A (OOM) | N/A (OOM) | +```bash +tpchgen-cli -s 1 --format=parquet --tables trip,building --output-dir sf1-parquet +``` -- DuckDB (proprietary) is the time required to create TPCH data using the - proprietary DuckDB format -- Creating Scale Factor 1000 data in DuckDB [requires 647 GB of memory], - which is why it is not included in the table above. +#### Partitioned Output Example -[required 647 GB of memory]: https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator +```bash +for PART in $(seq 1 4); do + mkdir part-$PART + tpchgen-cli -s 10 --tables trip,building --output-dir part-$PART --parts 4 --part $PART +done +``` -Times to create TPCH tables in Parquet format using `tpchgen-cli` and `duckdb` for various scale factors. +## SedonaBench Spider Data Generator - +SpatialBench includes a synthetic spatial data generator ([spider.rs](https://github.com/wherobots/sedona-spatialbench/blob/main/tpchgen/src/spider.rs)) for creating: +- Points +- Rectangles (boxes) +- Polygons -[`tpchgen-cli`](./tpchgen-cli/README.md) is more than 10x faster than the next -fastest TPCH generator we know of. On a 2023 Mac M3 Max laptop, it easily -generates data faster than can be written to SSD. See -[BENCHMARKS.md](./benchmarks/BENCHMARKS.md) for more details on performance and -benchmarking. +This generator is inspired by techniques from the paper [SpiderWeb: A Spatial Data Generator on the Web](https://dl.acm.org/doi/10.1145/3397536.3422351) by Katiyar et al., SIGSPATIAL 2020. -## Testing +### Supported Distribution Types -This crate has extensive tests to ensure correctness and produces exactly the -same, byte-for-byte output as the original [`dbgen`] implementation. We compare -the output of this crate with [`dbgen`] as part of every checkin. See -[TESTING.md](TESTING.md) for more details on testing methodology +| Type | Description | +|--------------|---------------------------------------------------------------| +| `UNIFORM` | Uniformly distributed points in `[0,1]²` | +| `NORMAL` | 2D Gaussian distribution with configurable `mu` and `sigma` | +| `DIAGONAL` | Points clustered along a diagonal | +| `BIT` | Points in a grid with `2^digits` resolution | +| `SIERPINSKI` | Fractal pattern using Sierpinski triangle | -## Crates + -- [`tpchgen`](tpchgen): the core data generator logic for TPC-H. It has no - dependencies and is easy to embed in other Rust project. +## Configuring Spider Geometry Generation -- [`tpchgen-arrow`](tpchgen-arrow) generates TPC-H data in [Apache Arrow] - format. It depends on the arrow-rs library +SpatialBench uses a flexible and extensible SpiderConfig struct (defined in Rust) to control how spatial geometries are generated for synthetic datasets. These configurations are defined in code, often using presets in spider_preset.rs. -- [`tpchgen-cli`](tpchgen-cli) is a [`dbgen`] compatible CLI tool that generates - benchmark dataset using multiple processes. +#### SpiderConfig Fields -[Apache Arrow]: https://arrow.apache.org/ -[`dbgen`]: https://github.com/electrum/tpch-dbgen +| Field | Type | Description | +|-------|--------------------|--------------------------------------------------------------------------------| +| `dist_type` | `DistributionType` | Type of distribution to use (Uniform, Normal, Diagonal, Bit, Sierpinski, etc.) | +| `geom_type` | `GeomType` | Geometry to generate: Point, Box, or Polygon | +| `dim` | `i32` | Number of dimensions (usually 2) | +| `seed` | `u32` | Random seed for reproducibility | +| `affine` | `Option<[f64; 6]>` | Optional 2D affine transform (scale, rotate, shift) | +| `width`, `height` | `f64` | For `box` geometries, bounding box size | +| `maxseg` | `i32` | Maximum number of segments for polygon shapes | +| `polysize` | `f64` | Radius or size of the polygon | +| `params` | `DistributionParams` | Additional parameters based on distribution type | -## Contributing +#### Supported DistributionParams Variants -Pull requests are welcome. For major changes, please open an issue first for -discussion. See our [contributors guide](CONTRIBUTING.md) for more details. +| Varient | Field | Description | +|----------------|------------------------|----------------------------------------------------------------------------| +| `None` | `--` | For distributions like Uniform or Sierpinski that don’t require parameters | +| `Normal` | `mu`, `sigma` | Controls center and spread for 2D Gaussian | +| `Diagonal` | `percentage`, `buffer` | Mix of diagonal-aligned points and noisy buffer | +| `Bit` | `probability`, `digits` | Recursive binary split with resolution control | -## Architecture +#### Example: USA Mainland Mapping -Please see [architecture guide](ARCHITECTURE.md) for details on how the code -is structured. +The affine transform maps generated coordinates from the local unit square [0,1]² into real-world extents. For example, the following affine matrix maps coordinates to the continental USA bounding box: -## License +```rust +let affine = Some([ + 58.368269, 0.0, -125.244606, // scale X to ~58°, offset to ~-125° + 0.0, 25.175375, 24.006328 // scale Y to ~25°, offset to ~24° +]); +``` -The project is licensed under the [APACHE 2.0](LICENSE) license. +This maps: +- x = 0 → -125.24, x = 1 → -66.87 +- y = 0 → 24.00, y = 1 → 49.18 -## References -- The TPC-H Specification, see the specification [page](https://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp). -- The Original `dbgen` Implementation you must submit an official request to access the software `dbgen` at their official [website](https://www.tpc.org/tpch/) +## Acknowledgements +- [TPC-H](https://www.tpc.org/tpch/) +- [SpiderWeb: A Spatial Data Generator on the Web](https://dl.acm.org/doi/10.1145/3397536.3422351) +- [tpchgen-rs for inspiration and baseline performance](https://datafusion.apache.org/blog/2025/04/10/fastest-tpch-generator/) \ No newline at end of file diff --git a/images/data_model.png b/images/data_model.png new file mode 100644 index 0000000..48d6acc Binary files /dev/null and b/images/data_model.png differ diff --git a/images/spatial_distributions.png b/images/spatial_distributions.png new file mode 100644 index 0000000..f53c8f7 Binary files /dev/null and b/images/spatial_distributions.png differ
