jesspav opened a new issue, #247:
URL: https://github.com/apache/sedona-db/issues/247
Hi folks,
I pulled together requirements and a design for an efficient, flexible
in-memory model for raster data that aligns with SedonaDB's architecture.
I also built a quick prototype of the designed schema, along with some
accessors and a few functions that use them.
Looking forward to your feedback.
## Data to Store
### Raster Metadata
Standard raster metadata fields in Sedona, Havasu, and similar systems:
- **Width**: Pixel count along X axis
- **Height**: Pixel count along Y axis
- **UpperleftX** / **UpperleftY**: Upper-left corner coordinates (CRS units)
- **ScaleX** / **ScaleY**: Cell scaling factors
- **SkewX** / **SkewY**: Cell skew parameters
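
A minimal sketch of how these fields combine, assuming the usual GDAL-style affine geotransform (the struct and method names here are illustrative, not taken from the prototype):

```rust
/// Georeferencing fields from the list above (field names are hypothetical).
#[derive(Debug, Clone, PartialEq)]
pub struct RasterMetadata {
    pub width: u64,
    pub height: u64,
    pub upper_left_x: f64,
    pub upper_left_y: f64,
    pub scale_x: f64,
    pub scale_y: f64,
    pub skew_x: f64,
    pub skew_y: f64,
}

impl RasterMetadata {
    /// Map a pixel position (column, row) to CRS coordinates.
    /// Integer inputs give the upper-left corner of that pixel.
    /// Assumption: GDAL-style affine transform, where skew_x scales the row
    /// index into X and skew_y scales the column index into Y.
    pub fn pixel_to_world(&self, col: f64, row: f64) -> (f64, f64) {
        let x = self.upper_left_x + col * self.scale_x + row * self.skew_x;
        let y = self.upper_left_y + col * self.skew_y + row * self.scale_y;
        (x, y)
    }
}
```

For a north-up raster, both skew values are zero and `scale_y` is negative, so increasing the row index moves south.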
### Bounding Box
- Optional WGS84 bounding box, essential for speeding up spatial queries
### CRS
- Full CRS info (including SRID)
### Raster Bands
Bands store metadata and, for in-db, the data itself. We may want to expand
this to include statistical data about the bands as well.
- **NoDataValue**: The no-data value
- **Storage Types**:
  - OutDb Reference: External reference
  - InDb: In-memory values on the band
  - InDb Reference: Memory pool storage (outside the Arrow array); not in the
    initial release
- **Data Types**: Standard data types including UInt8, Int32, Float32, etc.
  The initial version will not include complex types, but we would like to
  include these in the future
- **OutDb Metadata:** URL + band id
- **Compression Type:** The compression type of the band
- **Data**
**Structure:**
```
Raster
├─ Metadata
├─ BBox
├─ CRS
└─ Bands
   └─ Band
      ├─ Metadata
      ├─ Statistics (optional)
      └─ Data
```
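
As a rough illustration of that hierarchy, here is a plain-Rust sketch. All type and field names are hypothetical and do not mirror the prototype's Arrow schema; the geotransform fields and compression type are left out for brevity.

```rust
/// Pixel value types named above (complex types are out of scope for now).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum BandDataType {
    UInt8,
    Int32,
    Float32,
}

impl BandDataType {
    /// Size of one pixel value in bytes.
    pub fn byte_width(self) -> usize {
        match self {
            Self::UInt8 => 1,
            Self::Int32 | Self::Float32 => 4,
        }
    }
}

/// Where a band's pixel values live.
#[derive(Debug, Clone, PartialEq)]
pub enum StorageType {
    /// Values are stored on the band itself.
    InDb,
    /// Values live elsewhere; only a URL and band id are stored.
    OutDbRef { url: String, band_id: u32 },
}

#[derive(Debug, Clone, PartialEq)]
pub struct BandMetadata {
    pub nodata_value: Option<f64>,
    pub storage_type: StorageType,
    pub data_type: BandDataType,
}

#[derive(Debug, Clone, PartialEq)]
pub struct BandStatistics {
    pub min: f64,
    pub max: f64,
    pub mean: f64,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Band {
    pub metadata: BandMetadata,
    pub statistics: Option<BandStatistics>,
    /// Raw pixel bytes; empty for out-db bands.
    pub data: Vec<u8>,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Raster {
    pub width: usize,
    pub height: usize,
    /// Optional bounding box as [min_x, min_y, max_x, max_y].
    pub bbox: Option<[f64; 4]>,
    pub crs: Option<String>,
    pub bands: Vec<Band>,
}

impl Raster {
    /// In-db bands must hold exactly width * height values;
    /// out-db bands must carry no inline data.
    pub fn is_consistent(&self) -> bool {
        self.bands.iter().all(|band| match band.metadata.storage_type {
            StorageType::InDb => {
                band.data.len()
                    == self.width * self.height * band.metadata.data_type.byte_width()
            }
            StorageType::OutDbRef { .. } => band.data.is_empty(),
        })
    }
}
```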
## Apache Arrow Arrays
SedonaDB leverages Apache Arrow for speed and efficiency:
- Immutable: Metadata updates require array copies
- Columnar: Fast metadata queries
- Typed: Runtime validation
- Zero-Copy: Language interoperability
- Null Handling: Efficient bitmaps
- Batch/Vectorized: SIMD-ready operations
**StructArrays:**
- Struct arrays separate fields as independent child arrays, enabling
flexible queries and future-proofing.
See the schema prototype using StructArrays:
[sedona-schema/src/datatypes.rs#L368-L481](https://github.com/jesspav/sedona-db/blob/prototype_raster/rust/sedona-schema/src/datatypes.rs#L368-L481)
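
To make the "independent child arrays" point concrete without pulling in the Arrow crate, here is a std-only stand-in: one `Vec` per field, one slot per raster. The names are hypothetical, and a real implementation would use Arrow's struct arrays rather than `Vec`s.

```rust
/// A batch of rasters stored column-wise, mimicking the child arrays
/// of a struct array. Row i of every column belongs to raster i.
#[derive(Debug, Default)]
pub struct RasterColumns {
    pub width: Vec<u64>,
    pub height: Vec<u64>,
    pub band_data: Vec<Vec<u8>>,
}

impl RasterColumns {
    /// Append one raster by pushing one value onto each column.
    pub fn push(&mut self, width: u64, height: u64, band_data: Vec<u8>) {
        self.width.push(width);
        self.height.push(height);
        self.band_data.push(band_data);
    }

    /// Row indices of rasters with at least `min_pixels` pixels.
    /// Only the width and height columns are read; pixel data is never touched.
    pub fn rows_with_min_pixels(&self, min_pixels: u64) -> Vec<usize> {
        (0..self.width.len())
            .filter(|&i| self.width[i] * self.height[i] >= min_pixels)
            .collect()
    }
}
```

Because each field sits in its own contiguous column, a metadata-only query like this one scans two small arrays instead of walking over every raster's pixel payload.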
## Design Considerations
### Access Patterns
- **Loading/Writing:** `RS_FromGeoTiff`, `RS_AsGeoTiff` operate on full
raster objects
- **Aggregators:** Functions like `RS_Union` merge ArrowArrays, creating new
ones
- **Predicates:** Metadata (esp. bounding box) can be stored in dedicated
columns for fast queries
- **Array-Based Operators:** Optional per-band statistics (min, max, mean)
enable shortcut computations
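
A sketch of the bounding-box predicate idea, assuming the boxes sit in their own optional column. The `BBox` type and `candidate_rows` function are hypothetical, and antimeridian-crossing boxes are not handled.

```rust
/// Axis-aligned bounding box (WGS84 degrees in the design above).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BBox {
    pub min_x: f64,
    pub min_y: f64,
    pub max_x: f64,
    pub max_y: f64,
}

impl BBox {
    /// True if the two boxes overlap or touch.
    pub fn intersects(&self, other: &BBox) -> bool {
        self.min_x <= other.max_x
            && self.max_x >= other.min_x
            && self.min_y <= other.max_y
            && self.max_y >= other.min_y
    }
}

/// Row indices of rasters that may intersect `query`.
/// Rasters without a bounding box are kept, since they cannot be ruled out
/// from metadata alone and need a full check later.
pub fn candidate_rows(bboxes: &[Option<BBox>], query: &BBox) -> Vec<usize> {
    bboxes
        .iter()
        .enumerate()
        .filter(|(_, bbox)| match bbox {
            Some(b) => b.intersects(query),
            None => true,
        })
        .map(|(i, _)| i)
        .collect()
}
```

The filter reads only the bounding-box column, so rasters that cannot match are discarded before any band data is loaded.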
### Tiling
- Large rasters are split into smaller tiles for performance and scalability
- Each tile is a smaller raster with its own width/height, its upper-left
corner adjusted to the tile's origin, and the corresponding subset of the
band data
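
A minimal tiling sketch for a single-band, row-major `u8` raster. The `Raster` struct and `tiles` method are hypothetical, and the upper-left adjustment assumes a GDAL-style affine geotransform.

```rust
/// Single-band raster with row-major u8 pixels (illustrative only).
#[derive(Debug, Clone, PartialEq)]
pub struct Raster {
    pub width: usize,
    pub height: usize,
    pub upper_left_x: f64,
    pub upper_left_y: f64,
    pub scale_x: f64,
    pub scale_y: f64,
    pub skew_x: f64,
    pub skew_y: f64,
    pub data: Vec<u8>,
}

impl Raster {
    /// Split into tiles of at most tile_w x tile_h pixels.
    /// Edge tiles are clipped to the raster extent.
    pub fn tiles(&self, tile_w: usize, tile_h: usize) -> Vec<Raster> {
        assert!(tile_w > 0 && tile_h > 0, "tile size must be non-zero");
        assert_eq!(self.data.len(), self.width * self.height);
        let mut out = Vec::new();
        for row0 in (0..self.height).step_by(tile_h) {
            for col0 in (0..self.width).step_by(tile_w) {
                let w = tile_w.min(self.width - col0);
                let h = tile_h.min(self.height - row0);
                // Copy the tile's rows out of the parent band.
                let mut data = Vec::with_capacity(w * h);
                for row in row0..row0 + h {
                    let start = row * self.width + col0;
                    data.extend_from_slice(&self.data[start..start + w]);
                }
                // Move the upper-left corner to the tile's origin pixel.
                let (c, r) = (col0 as f64, row0 as f64);
                out.push(Raster {
                    width: w,
                    height: h,
                    upper_left_x: self.upper_left_x + c * self.scale_x + r * self.skew_x,
                    upper_left_y: self.upper_left_y + c * self.skew_y + r * self.scale_y,
                    scale_x: self.scale_x,
                    scale_y: self.scale_y,
                    skew_x: self.skew_x,
                    skew_y: self.skew_y,
                    data,
                });
            }
        }
        out
    }
}
```

Scale and skew carry over unchanged; only the size, origin, and pixel subset differ per tile.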
### GDAL Integration
- Arrow buffers can be mapped directly to slices for GDAL if the types match
and the data contains no nulls
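
The precondition can be sketched with std only. This hypothetical helper views raw band bytes as `&[f32]` without copying and refuses when any of the conditions fail. It does not call GDAL; passing the resulting slice on is left out.

```rust
use std::mem::{align_of, size_of};

/// View raw band bytes as f32 values without copying.
/// Returns None if the band has nulls, the byte length is not a whole
/// number of f32 values, or the buffer is not aligned for f32.
pub fn as_f32_slice(bytes: &[u8], null_count: usize) -> Option<&[f32]> {
    if null_count != 0 {
        return None; // nulls would need a validity mask, so no direct view
    }
    if bytes.len() % size_of::<f32>() != 0 {
        return None;
    }
    if (bytes.as_ptr() as usize) % align_of::<f32>() != 0 {
        return None;
    }
    // SAFETY: the pointer is aligned for f32, the length is a whole number
    // of f32 values inside the borrowed byte slice, and every 4-byte
    // pattern is a valid f32.
    let values = unsafe {
        std::slice::from_raw_parts(bytes.as_ptr() as *const f32, bytes.len() / size_of::<f32>())
    };
    Some(values)
}
```

When the helper returns `None`, the caller would fall back to copying or converting the values.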
### Vectorized Processing
- SIMD ops on band data; metadata in columnar layout enables rapid predicate
evaluation
- RecordBatch sizing controls row/column orientation
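
As an example of an operation over contiguous band data, here is a single-pass computation of the optional per-band statistics. The names are hypothetical; whether the loop is actually vectorized depends on the compiler and target.

```rust
/// Summary statistics for one band.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BandStats {
    pub min: f32,
    pub max: f32,
    pub mean: f64,
    pub count: usize,
}

/// Compute min, max, and mean in one pass over a contiguous band slice,
/// skipping NaN values and the no-data value if one is set.
/// Returns None when the band has no valid pixels.
pub fn band_stats(values: &[f32], nodata: Option<f32>) -> Option<BandStats> {
    let mut min = f32::INFINITY;
    let mut max = f32::NEG_INFINITY;
    let mut sum = 0.0f64;
    let mut count = 0usize;
    for &v in values {
        if v.is_nan() || Some(v) == nodata {
            continue;
        }
        min = min.min(v);
        max = max.max(v);
        sum += v as f64;
        count += 1;
    }
    if count == 0 {
        None
    } else {
        Some(BandStats { min, max, mean: sum / count as f64, count })
    }
}
```

With statistics like these stored per band, a query such as "any pixel above a threshold" can skip a band whose `max` is already below it.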
### Compression
- We expect to expand the design later to support per-band compression,
since columnar data compresses well
## Prototype
https://github.com/jesspav/sedona-db/pull/2/files
## References
- [SedonaType enum and Arrow
integration](https://github.com/apache/sedona-db/blob/main/rust/sedona-schema/src/datatypes.rs)
- [GeoArrow C
integration](https://github.com/apache/sedona-db/blob/main/c/sedona-geoarrow-c/src/geoarrow_c.rs)
- [Arrow array schema
handling](https://github.com/apache/sedona-db/blob/main/python/sedonadb/src/udf.rs)
- [Sedona Raster
Functions](https://sedona.apache.org/1.6.1/api/sql/Raster-operators/)