prantogg opened a new issue, #70: URL: https://github.com/apache/sedona-spatialbench/issues/70
**Is your feature request related to a problem? Please describe.**

When generating large-scale SpatialBench datasets (e.g., SF1000 or higher), there is currently no way to write generated data directly to S3. This creates two significant limitations:

- Local storage bottlenecks: Large-scale datasets can be hundreds of GBs or TBs in size, quickly exhausting local disk space. For example, the SF1000 Trip table alone can exceed 500 GB.
- Workflow inefficiency: The current workflow requires generating data locally first, then manually uploading it to S3 with separate tools (aws cli, rclone, etc.), which is time-consuming and error-prone.

**Describe the solution you'd like**

Add support for S3 URIs in the `--output-dir` parameter, enabling the tool to stream generated data directly to S3 without requiring local storage:

```bash
# Current workflow:
spatialbench-cli --scale-factor 1000 --output-dir ./data
# Then manually:
aws s3 cp ./data s3://my-bucket/spatialbench/sf1000 --recursive

# Proposed workflow:
spatialbench-cli --scale-factor 1000 --output-dir s3://my-bucket/spatialbench/sf1000
```

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
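One possible shape for the proposed `--output-dir` handling is to detect the `s3://` scheme up front and dispatch to an object-store writer instead of the local filesystem. The sketch below is illustrative only, assuming a hypothetical `parse_output_dir` helper that is not part of the sedona-spatialbench codebase:

```python
from urllib.parse import urlparse


def parse_output_dir(output_dir: str):
    """Classify an --output-dir value as a local path or an S3 location.

    Hypothetical helper for illustration; not from sedona-spatialbench.
    Returns ("s3", bucket, key_prefix) for s3:// URIs,
    or ("local", path, None) for anything else.
    """
    parsed = urlparse(output_dir)
    if parsed.scheme == "s3":
        bucket = parsed.netloc
        prefix = parsed.path.lstrip("/")
        if not bucket:
            raise ValueError(f"S3 URI missing bucket name: {output_dir!r}")
        return ("s3", bucket, prefix)
    # No scheme (or a plain path): treat as a local directory as today.
    return ("local", output_dir, None)
```

With this split, the generator could open multipart uploads keyed by `bucket`/`prefix` for the S3 case while the local path keeps its current behavior, so existing workflows are unaffected.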
