prantogg opened a new pull request, #59:
URL: https://github.com/apache/sedona-spatialbench/pull/59

   This pull request adds support for automatically splitting output files 
based on a maximum file size (in MB) for the spatialbench CLI, and refactors 
partitioning logic to be more flexible and robust. It introduces a new 
`--mb-per-file` option, updates the partitioning and output planning code to 
support this feature, and improves the handling of optional partition 
parameters throughout the codebase.
   
   **New Feature: Output File Size Partitioning**
   - Added a new CLI option `--mb-per-file` to `spatialbench-cli` for 
specifying the maximum output file size in MB. When it is set, the number of 
output parts is computed automatically; the option cannot be combined with 
`--parts` or `--part`.
   - Updated the partitioning logic in `OutputPlanGenerator` and 
`PartitionStrategy` to calculate the number of parts required to approximate 
the target file size, both for general tables and specifically for the Zone 
table. 
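
The part-count calculation described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `parts_for_target_size`, the use of binary megabytes (1024 * 1024 bytes), and the clamping behavior are all assumptions, and the total size is assumed to come from an estimate such as average row size times row count.

```rust
/// Hypothetical helper (not the PR's API): number of output parts needed so
/// that each file stays at or below roughly `mb_per_file` megabytes.
/// `estimated_total_bytes` is assumed to be a size estimate, for example
/// average row size multiplied by row count.
fn parts_for_target_size(estimated_total_bytes: u64, mb_per_file: u32) -> i32 {
    // Assumption: 1 MB is treated as 1024 * 1024 bytes.
    const BYTES_PER_MB: u64 = 1024 * 1024;

    // Treat a zero target as 1 MB so the division below cannot panic.
    let target_bytes = u64::from(mb_per_file.max(1)) * BYTES_PER_MB;

    // Ceiling division without risking overflow on large inputs.
    let full = estimated_total_bytes / target_bytes;
    let remainder = u64::from(estimated_total_bytes % target_bytes != 0);

    // Always emit at least one part, even for an empty table.
    let parts = (full + remainder).max(1);

    // Saturate rather than wrap if the count does not fit in i32.
    i32::try_from(parts).unwrap_or(i32::MAX)
}
```

Because the input is only an estimate, files produced this way approximate the target size rather than guaranteeing a hard cap.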
   
   **Refactoring and API Improvements**
   - Refactored partitioning and output plan APIs to use `Option<i32>` for 
`parts` and `part` parameters, improving handling of optional values and making 
the interfaces more consistent. 
   - Updated the `ZoneTableStats` and related logic to accept optional `parts`, 
and made `base_stats` public for use in partition calculations. 
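
To illustrate why `Option<i32>` is a convenient shape for these parameters, here is a small sketch of how optional `parts` and `part` values might be resolved into the list of parts to generate. The function name `parts_to_generate`, the 1-based numbering, and the default of a single part are assumptions for illustration; the PR's actual signatures may differ.

```rust
/// Hypothetical resolver (not the PR's API): decide which part numbers to
/// generate from optional `parts` and `part` values.
/// Assumptions: parts are numbered from 1, and a missing `parts` means the
/// output is a single part.
fn parts_to_generate(parts: Option<i32>, part: Option<i32>) -> Vec<i32> {
    // No explicit count means one part in total.
    let total = parts.unwrap_or(1);

    match part {
        // A specific part was requested: generate only that one.
        Some(p) => vec![p],
        // No specific part: generate every part from 1 through `total`.
        None => (1..=total).collect(),
    }
}
```

With `Option`, "not specified" is distinct from any concrete value, so callers do not need sentinel values such as `0` or `-1` to mean "use the default".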
   
   **Validation and Error Handling**
   - Enhanced validation in `ZoneDfArgs` to ensure that `--mb-per-file` cannot 
be used together with `--parts` or `--part`, and to check for valid part 
numbers when both `parts` and `part` are specified.
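
The validation rules above could look something like the sketch below. The struct `PartitionArgs`, the function `validate`, the field types, and the error messages are hypothetical stand-ins rather than the real `ZoneDfArgs` definition; part numbers are assumed to be 1-based.

```rust
/// Hypothetical stand-in for the partition-related arguments; field names
/// and types are assumptions, not the real `ZoneDfArgs` definition.
#[derive(Debug, Default)]
struct PartitionArgs {
    parts: Option<i32>,
    part: Option<i32>,
    mb_per_file: Option<u32>,
}

/// Reject argument combinations that the PR description says are invalid.
fn validate(args: &PartitionArgs) -> Result<(), String> {
    // `--mb-per-file` derives the part count itself, so it conflicts with
    // any explicit partitioning option.
    if args.mb_per_file.is_some() && (args.parts.is_some() || args.part.is_some()) {
        return Err("--mb-per-file cannot be combined with --parts or --part".to_string());
    }

    // When both values are given, the part number must fall within range
    // (assumed here to be 1-based and inclusive).
    if let (Some(parts), Some(part)) = (args.parts, args.part) {
        if parts < 1 || part < 1 || part > parts {
            return Err(format!(
                "invalid part {part} of {parts}: expected 1 <= part <= parts"
            ));
        }
    }

    Ok(())
}
```

Checking the conflict first means a user who passes both `--mb-per-file` and `--part` sees the more relevant message before any range check runs.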
   
   **Testing and Documentation**
   - Added unit tests for the new partitioning logic based on maximum file 
size, ensuring correctness for different scenarios.
   - Improved in-code documentation and debug logging for partition 
calculations and output size estimation. 
   
   **Other Improvements**
   - Updated average row size estimates and scaling logic for the Zone table in 
`OutputSize` to improve output size estimation accuracy. 
   - Propagated the new `mb_per_file` option through all relevant code paths, 
including CLI, output plan generation, and Zone table generation. 
   
   These changes let users cap the approximate size of each output file with a 
single option, and make optional partition parameters explicit throughout the 
data generation pipeline.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
