This is an automated email from the ASF dual-hosted git repository.

jiayu pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/sedona-spatialbench.git

commit f5f2c7c215c029dd5eb05778b9a0bb06b26c9c29
Author: Jia Yu <[email protected]>
AuthorDate: Tue Aug 26 10:29:24 2025 -0700

    Refine the configuration file (#10)
    
    * Rename the file to configuration
    
    * Add detailed explanation to all config items
---
 README.md                         |   2 +-
 SPIDER.md                         | 124 -----------------------------
 spatialbench-cli/CONFIGURATION.md | 163 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 164 insertions(+), 125 deletions(-)

diff --git a/README.md b/README.md
index b07f7ff..1ff86a6 100644
--- a/README.md
+++ b/README.md
@@ -99,7 +99,7 @@ If --config is not provided, SpatialBench checks for ./spatialbench-config.yml.
 
 For reference, see the provided [spatialbench-config.yml](spatialbench-config.yml).
 
-See [SPIDER.md](SPIDER.md) for more details about spatial data generation and the full YAML schema and examples.
+See [CONFIGURATION.md](./spatialbench-cli/CONFIGURATION.md) for more details about spatial data generation and the full YAML schema and examples.
 
 ## Acknowledgements
 - [TPC-H](https://www.tpc.org/tpch/)
diff --git a/SPIDER.md b/SPIDER.md
deleted file mode 100644
index 8930bdd..0000000
--- a/SPIDER.md
+++ /dev/null
@@ -1,124 +0,0 @@
-# SpatialBench Spider Data Generator
-
-Spider module is SpatialBench’s built-in spatial geometry generator.
-It creates Points, Boxes, and Polygons using deterministic random distributions.
-
-Spider is designed for benchmark reproducibility:
-- Generates millions of geometries per second.
-- Uses seeds for deterministic output.
-- Supports affine transforms to map the unit square [0,1]² into real-world coordinates.
-
-Reference: [SpiderWeb: A Spatial Data Generator on the Web](https://dl.acm.org/doi/10.1145/3397536.3422351) by Katiyar et al., SIGSPATIAL 2020.
-
-## Supported Distribution Types
-
-| Type         | Description                                                   |
-|--------------|---------------------------------------------------------------|
-| `UNIFORM`    | Uniformly distributed points in `[0,1]²`                      |
-| `NORMAL`     | 2D Gaussian distribution with configurable `mu` and `sigma`   |
-| `DIAGONAL`   | Points clustered along a diagonal                             |
-| `BIT`        | Points in a grid with `2^digits` resolution                   |
-| `SIERPINSKI` | Fractal pattern using Sierpinski triangle                     |
-
-![image.png](images/spatial_distributions.png)
-
-## Using Spider in the CLI
-
-```bash
-spatialbench-cli -s 1 --tables trip,building --config spatialbench-config.yaml
-```
-
-If --config is omitted, SpatialBench will try a local default and then fall back to built-ins (see [Configuration Resolution & Logging](#configuration-resolution--logging)).
-
-## Expected Config File Structure
-
-At the top level, the YAML may define:
-
-```yaml
-trip:      # (optional) Config for Trip pickup points
-building:  # (optional) Config for Building polygons
-```
-
-Each entry must conform to the SpiderConfig schema:
-
-```yaml
-<name>:
-  dist_type: <string>        # uniform | normal | diagonal | bit | sierpinski | parcel
-  geom_type: <string>        # point | box | polygon
-  dim: <int>                 # usually 2
-  seed: <int>                # random seed for reproducibility
-  affine: [f64; 6]           # optional affine transform
-  width: <float>             # used if geom_type = box
-  height: <float>            # used if geom_type = box
-  maxseg: <int>              # polygon max segments
-  polysize: <float>          # polygon size or radius
-  params:                    # distribution-specific parameters
-    type: <string>           # one of: none, normal, diagonal, bit, parcel
-    ...                      # fields depend on type (see table below)
-```
-
-## Supported Distribution Parameters
-
-| Variant    | Field                  | Description                                                                |
-|------------|------------------------|----------------------------------------------------------------------------|
-| `None`     | `--`                   | For distributions like Uniform or Sierpinski that don’t require parameters |
-| `Normal`   | `mu`, `sigma`          | Controls center and spread for 2D Gaussian                                 |
-| `Diagonal` | `percentage`, `buffer` | Mix of diagonal-aligned points and noisy buffer                            |
-| `Bit`      | `probability`, `digits` | Recursive binary split with resolution control                             |
-
-## Default Configs
-
-The repository includes a ready-to-use default file:
-[`spatialbench-config.yml`](/spatialbench-config.yml).
-
-These defaults are automatically used if no `--config` is passed and the file exists in the current working directory.
-
-## Configuration Resolution & Logging
-
-When SpatialBench starts, it resolves configuration in this order:
-
-1. Explicit config: If --config <path> is provided, that file is used.
-2. Local default: If no flag is provided, SpatialBench looks for ./spatialbench-config.yml in the current directory.
-3. Built-ins: If neither is found, it uses compiled defaults from spider_defaults.rs.
-
-## Affine Transform
-
-The affine transform maps coordinates from the unit square [0,1]² into real-world ranges.
-It is expressed as an array of 6 numbers:
-
-```
-[a, b, c, d, e, f]
-```
-
-Applied as:
-
-```
-X = a*x + b*y + c
-Y = d*x + e*y + f
-```
-
-- a, e → scale factors in X and Y.
-- b, d → shear/skew (usually 0 for simple scaling).
-- c, f → translation offsets.
-
-#### How to fill it
-
-1. Decide the bounding box of your target region:
-   - Example (continental USA): [-125.24, 24.00, -66.87, 49.18] → west, south, east, north.
-2. Compute scale and offset:
-   - scale_x = (east - west)
-   - scale_y = (north - south)
-   - offset_x = west
-   - offset_y = south
-3. Plug into [a, b, c, d, e, f] with no skew:
-   - [scale_x, 0.0, offset_x, 0.0, scale_y, offset_y]
-
-#### Example: Mapping [0,1]² to Continental USA
-
-```yaml
-affine: [58.368269, 0.0, -125.244606, 0.0, 25.175375, 24.006328]
-```
-
-Which means:
-- x=0 → -125.24, x=1 → -66.87
-- y=0 → 24.00, y=1 → 49.18
\ No newline at end of file
diff --git a/spatialbench-cli/CONFIGURATION.md b/spatialbench-cli/CONFIGURATION.md
new file mode 100644
index 0000000..cc48203
--- /dev/null
+++ b/spatialbench-cli/CONFIGURATION.md
@@ -0,0 +1,163 @@
+# SpatialBench Configuration
+
+SpatialBench configuration allows you to customize spatial data generation for your benchmark workloads. The spatial geometry generation is powered by the Spider module, which creates Points, Boxes, and Polygons using deterministic random distributions.
+
+Spider is designed for benchmark reproducibility:
+- Generates millions of geometries per second.
+- Uses seeds for deterministic output.
+- Supports affine transforms to map the unit square [0,1]² into real-world coordinates.
+
+Reference: [SpiderWeb: A Spatial Data Generator on the Web](https://dl.acm.org/doi/10.1145/3397536.3422351) by Katiyar et al., SIGSPATIAL 2020.
+
+## Supported Distribution Types
+
+| Type         | Description                                                   | Implementation Details |
+|--------------|---------------------------------------------------------------|------------------------|
+| `UNIFORM`    | Uniformly distributed points in `[0,1]²`                      | Uses `rand_unit()` to generate independent random X and Y coordinates. Each coordinate is a uniform random value between 0.0 and 1.0. |
+| `NORMAL`     | 2D Gaussian distribution with configurable `mu` and `sigma`   | Uses Box-Muller transform to generate normal distributions. Both X and Y coordinates use the same `mu` and `sigma` parameters. Values are clamped to [0,1]². |
+| `DIAGONAL`   | Points clustered along a diagonal                             | With probability `percentage`, generates points exactly on y=x diagonal. Otherwise, generates points with normal noise around the diagonal using `buffer` as standard deviation. |
+| `BIT`        | Points in a grid with `2^digits` resolution                   | Uses recursive binary subdivision. Each bit position has `probability` chance of being set. Creates a deterministic grid pattern with resolution 2^digits × 2^digits. |
+| `SIERPINSKI` | Fractal pattern using Sierpinski triangle                     | Uses chaos game algorithm with 10 iterations. Randomly moves toward one of three triangle vertices (0,0), (1,0), or (0.5,√3/2). Creates fractal-like clustering patterns. |
+
+![image.png](../images/spatial_distributions.png)
+
+## Geometry Types
+
+| Type      | Description | Implementation Details |
+|-----------|-------------|------------------------|
+| `point`   | Single coordinate point | Direct output of generated coordinates after affine transform. |
+| `box`     | Rectangular polygon | Creates a rectangle centered on generated coordinates. Width and height are randomized between 0 and the configured `width`/`height` values. |
+| `polygon` | Regular polygon | Creates a polygon with 3 to `maxseg` sides, centered on generated coordinates with radius `polysize`. Number of sides is randomized. |
+
+## Using Configuration in the CLI
+
+```bash
+spatialbench-cli -s 1 --tables trip,building --config spatialbench-config.yaml
+```
+
+If --config is omitted, SpatialBench will try a local default and then fall back to built-ins (see [Configuration Resolution & Logging](#configuration-resolution--logging)).
+
+## Expected Config File Structure
+
+At the top level, the YAML may define:
+
+```yaml
+trip:      # (optional) Config for Trip pickup points
+building:  # (optional) Config for Building polygons
+```
+
+Each entry must conform to the configuration schema:
+
+```yaml
+<name>:
+  dist_type: <string>        # Distribution algorithm: uniform | normal | diagonal | bit | sierpinski
+  geom_type: <string>        # Geometry type: point | box | polygon
+  dim: <int>                 # Dimensions (always 2 for 2D spatial data)
+  seed: <int>                # Random seed for deterministic generation
+  affine: [f64; 6]           # Optional coordinate transformation matrix
+  width: <float>             # Box width (used only when geom_type = box)
+  height: <float>            # Box height (used only when geom_type = box)
+  maxseg: <int>              # Maximum polygon segments (used only when geom_type = polygon)
+  polysize: <float>          # Polygon radius/size (used only when geom_type = polygon)
+  params:                    # Distribution-specific parameters
+    type: <string>           # Parameter type: none | normal | diagonal | bit
+    ...                      # Additional fields depend on type (see table below)
+```
+
+### Configuration Field Details
+
+| Field | Type | Required | Description |
+|-------|------|----------|-------------|
+| `dist_type` | string | Yes | **Distribution Algorithm**: Controls how coordinates are generated in the unit square [0,1]² before applying affine transforms. |
+| `geom_type` | string | Yes | **Geometry Type**: Determines the final spatial geometry output format. |
+| `dim` | int | Yes | **Dimensions**: Always 2 for 2D spatial data. Controls the dimensionality of generated coordinates. |
+| `seed` | int | Yes | **Random Seed**: Ensures reproducible generation. Each record uses a deterministic hash of this seed combined with the record index. |
+| `affine` | [f64; 6] | No | **Coordinate Transform**: Maps unit square [0,1]² to real-world coordinates. Applied after coordinate generation. |
+| `width` | float | Yes | **Box Width**: Maximum width of generated boxes (in unit square coordinates). Actual width is randomized between 0 and this value. |
+| `height` | float | Yes | **Box Height**: Maximum height of generated boxes (in unit square coordinates). Actual height is randomized between 0 and this value. |
+| `maxseg` | int | Yes | **Max Polygon Segments**: Maximum number of sides for generated polygons. Minimum is 3, actual count is randomized between 3 and this value. |
+| `polysize` | float | Yes | **Polygon Size**: Radius of generated polygons from their center point (in unit square coordinates). |
+| `params` | object | Yes | **Distribution Parameters**: Specific parameters for the chosen distribution type. |
+
+## Supported Distribution Parameters
+
+| Variant    | Field                  | Type | Description |
+|------------|------------------------|------|-------------|
+| `None`     | `--`                   | `--` | **No Parameters**: Used for Uniform and Sierpinski distributions that don't require additional configuration. |
+| `Normal`   | `mu`                   | float | **Mean**: Center point of the 2D Gaussian distribution in [0,1]². Applied to both X and Y coordinates. |
+|            | `sigma`                | float | **Standard Deviation**: Spread of the 2D Gaussian distribution. Controls how clustered the points are around the mean. |
+| `Diagonal` | `percentage`           | float | **Diagonal Percentage**: Fraction of points (0.0-1.0) that lie exactly on the diagonal line y=x. |
+|            | `buffer`               | float | **Buffer Width**: Standard deviation for the normal distribution used to generate noise around the diagonal for non-diagonal points. |
+| `Bit`      | `probability`          | float | **Bit Probability**: Probability (0.0-1.0) of setting each bit in the recursive binary subdivision. Controls the density of the grid pattern. |
+|            | `digits`               | int | **Bit Digits**: Number of bits used in the recursive subdivision. Creates a 2^digits × 2^digits grid resolution. Higher values create finer grids. |
+
+## Default Configs
+
+The repository includes a ready-to-use default file:
+[`spatialbench-config.yml`](../spatialbench-config.yml).
+
+These defaults are automatically used if no `--config` is passed and the file exists in the current working directory.
+
+## Deterministic Generation
+
+SpatialBench ensures reproducible generation through deterministic seeding:
+
+- **Global Seed**: The `seed` field in configuration provides the base for all random generation
+- **Record-Specific Seeds**: Each record uses a deterministic hash combining the global seed with the record index
+- **Hash Algorithm**: Uses a SplitMix64-like algorithm for fast, high-quality deterministic hashing
+- **Reproducibility**: Same seed + same record index always produces identical output
+
+This allows for:
+- Exact reproduction of datasets across different runs
+- Parallel generation of different parts that combine correctly
+- Consistent benchmark results for performance testing
+
+## Configuration Resolution & Logging
+
+When SpatialBench starts, it resolves configuration in this order:
+
+1. Explicit config: If --config <path> is provided, that file is used.
+2. Local default: If no flag is provided, SpatialBench looks for ./spatialbench-config.yml in the current directory.
+3. Built-ins: If neither is found, it uses compiled defaults from the built-in configuration.
+
+## Affine Transform
+
+The affine transform maps coordinates from the unit square [0,1]² into real-world ranges.
+It is expressed as an array of 6 numbers:
+
+```
+[a, b, c, d, e, f]
+```
+
+Applied as:
+
+```
+X = a*x + b*y + c
+Y = d*x + e*y + f
+```
+
+- a, e → scale factors in X and Y.
+- b, d → shear/skew (usually 0 for simple scaling).
+- c, f → translation offsets.
+
+#### How to fill it
+
+1. Decide the bounding box of your target region:
+   - Example (continental USA): [-125.24, 24.00, -66.87, 49.18] → west, south, east, north.
+2. Compute scale and offset:
+   - scale_x = (east - west)
+   - scale_y = (north - south)
+   - offset_x = west
+   - offset_y = south
+3. Plug into [a, b, c, d, e, f] with no skew:
+   - [scale_x, 0.0, offset_x, 0.0, scale_y, offset_y]
+
+#### Example: Mapping [0,1]² to Continental USA
+
+```yaml
+affine: [58.368269, 0.0, -125.244606, 0.0, 25.175375, 24.006328]
+```
+
+Which means:
+- x=0 → -125.24, x=1 → -66.87
+- y=0 → 24.00, y=1 → 49.18
\ No newline at end of file
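
Note on the "How to fill it" recipe in the added CONFIGURATION.md: the bounding-box-to-affine steps can be checked with a short Python sketch. This is only an illustration of the formulas quoted in the diff (`X = a*x + b*y + c`, `Y = d*x + e*y + f`); the function names `bbox_to_affine` and `apply_affine` are hypothetical and are not part of SpatialBench.

```python
def bbox_to_affine(west, south, east, north):
    """Build [a, b, c, d, e, f] with no skew from a bounding box (hypothetical helper)."""
    scale_x = east - west    # a: scale factor in X
    scale_y = north - south  # e: scale factor in Y
    return [scale_x, 0.0, west, 0.0, scale_y, south]


def apply_affine(affine, x, y):
    """Map a unit-square point (x, y) using X = a*x + b*y + c, Y = d*x + e*y + f."""
    a, b, c, d, e, f = affine
    return (a * x + b * y + c, d * x + e * y + f)


# Rounded bounding box used only for illustration (west, south, east, north).
affine = bbox_to_affine(-125.0, 24.0, -67.0, 49.0)
corner_min = apply_affine(affine, 0.0, 0.0)  # lower-left corner of the box
corner_max = apply_affine(affine, 1.0, 1.0)  # upper-right corner of the box
```

With no skew, the unit-square corners (0,0) and (1,1) land on the south-west and north-east corners of the bounding box, which is a quick way to sanity-check an `affine` array before putting it in the YAML.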
