prantogg opened a new pull request, #59:
URL: https://github.com/apache/sedona-spatialbench/pull/59

This pull request adds support for automatically splitting output files based on a maximum file size (in MB) for the spatialbench CLI, and refactors partitioning logic to be more flexible and robust. It introduces a new `--mb-per-file` option, updates the partitioning and output planning code to support this feature, and improves the handling of optional partition parameters throughout the codebase.

**New Feature: Output File Size Partitioning**

- Added a new CLI option `--mb-per-file` to `spatialbench-cli` for specifying the maximum output file size in MB. When set, the number of output parts is determined automatically, and this option cannot be used with `--parts` or `--part`.
- Updated the partitioning logic in `OutputPlanGenerator` and `PartitionStrategy` to calculate the number of parts required to approximate the target file size, both for general tables and specifically for the Zone table. [[1]](diffhunk://#diff-34bb0d61fb79250978f76d168c1e15e9965b5e62114bd7ebcb69155bf6b187f7R166-R177) [[2]](diffhunk://#diff-34bb0d61fb79250978f76d168c1e15e9965b5e62114bd7ebcb69155bf6b187f7L181-R219) [[3]](diffhunk://#diff-3c05b0e32452214c0ab0d5b1d41b3e1b88f54a0081ab93908c6100772764b080R1-R21) [[4]](diffhunk://#diff-3c05b0e32452214c0ab0d5b1d41b3e1b88f54a0081ab93908c6100772764b080R31-R48)

**Refactoring and API Improvements**

- Refactored partitioning and output plan APIs to use `Option<i32>` for `parts` and `part` parameters, improving handling of optional values and making the interfaces more consistent. [[1]](diffhunk://#diff-5f6ea8facea654329a64beaae1f19b8f78acedde5dbfdc9c8c39606ccaacdd99L9-R11) [[2]](diffhunk://#diff-5f6ea8facea654329a64beaae1f19b8f78acedde5dbfdc9c8c39606ccaacdd99L19-R22) [[3]](diffhunk://#diff-5f6ea8facea654329a64beaae1f19b8f78acedde5dbfdc9c8c39606ccaacdd99R31-R58) [[4]](diffhunk://#diff-7583eb74973e2b874b179e6ac985e678284c6f155f83ab8d8af0bcb97d26524cL22-R26)
- Updated the `ZoneTableStats` and related logic to accept optional `parts`, and made `base_stats` public for use in partition calculations. [[1]](diffhunk://#diff-7583eb74973e2b874b179e6ac985e678284c6f155f83ab8d8af0bcb97d26524cL22-R26) [[2]](diffhunk://#diff-7583eb74973e2b874b179e6ac985e678284c6f155f83ab8d8af0bcb97d26524cL41-R41)

**Validation and Error Handling**

- Enhanced validation in `ZoneDfArgs` to ensure that `--mb-per-file` cannot be used together with `--parts` or `--part`, and to check for valid part numbers when both `parts` and `part` are specified.

**Testing and Documentation**

- Added unit tests for the new partitioning logic based on maximum file size, ensuring correctness for different scenarios.
- Improved in-code documentation and debug logging for partition calculations and output size estimation. [[1]](diffhunk://#diff-34bb0d61fb79250978f76d168c1e15e9965b5e62114bd7ebcb69155bf6b187f7L181-R219) [[2]](diffhunk://#diff-3c05b0e32452214c0ab0d5b1d41b3e1b88f54a0081ab93908c6100772764b080R31-R48)

**Other Improvements**

- Updated average row size estimates and scaling logic for the Zone table in `OutputSize` to improve output size estimation accuracy. [[1]](diffhunk://#diff-4edb5edd1b23bd28fce186a3cba8b185d01ed56589ebb07f82ff9f681437b092L254-R254) [[2]](diffhunk://#diff-4edb5edd1b23bd28fce186a3cba8b185d01ed56589ebb07f82ff9f681437b092L266-R268)
- Propagated the new `mb_per_file` option through all relevant code paths, including CLI, output plan generation, and Zone table generation. [[1]](diffhunk://#diff-2fac910eb7a4d463642d01eccdee261a5b548817eddd0445fd9a5b13e539d297L347-R357) [[2]](diffhunk://#diff-2fac910eb7a4d463642d01eccdee261a5b548817eddd0445fd9a5b13e539d297R391) [[3]](diffhunk://#diff-c3e9936d1af70316c9d6f46791981c2e89b731b2af40645b608f24d13afe6eaeR9-R16) [[4]](diffhunk://#diff-c3e9936d1af70316c9d6f46791981c2e89b731b2af40645b608f24d13afe6eaeL28-R32) [[5]](diffhunk://#diff-c3e9936d1af70316c9d6f46791981c2e89b731b2af40645b608f24d13afe6eaeL42-R47) [[6]](diffhunk://#diff-6797764d59c28d879d48e95d37c2a26ff52ecd3aed2a4818f6c151e53196a14eR69-R86)

These changes make it easier for users to generate output files that fit within a specified size limit, and improve the flexibility and robustness of the data generation pipeline.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
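
For readers without access to the diff, the two behaviours described in the pull request (deriving a part count from a target file size, and rejecting `--mb-per-file` together with `--parts` or `--part`) might be sketched roughly as follows. This is an illustrative sketch only, not the code from the PR: the function names `parts_for_target_size` and `validate_partition_args`, the interpretation of "MB" as 2^20 bytes, the 1-based `part` numbering, and the error messages are all assumptions.

```rust
/// Assumption: one "MB" is 2^20 bytes. The real CLI may use 10^6 instead.
const BYTES_PER_MB: u64 = 1024 * 1024;

/// Hypothetical helper: how many parts are needed so that each output file
/// is approximately at most `mb_per_file` MB, given an estimated row count
/// and an estimated average row size in bytes.
fn parts_for_target_size(estimated_rows: u64, avg_row_bytes: u64, mb_per_file: u64) -> u64 {
    // Estimated total output size; saturate rather than overflow.
    let total_bytes = estimated_rows.saturating_mul(avg_row_bytes);
    // Treat a target of 0 MB as 1 MB to avoid dividing by zero.
    let target_bytes = mb_per_file.max(1).saturating_mul(BYTES_PER_MB);
    // Ceiling division, written out so it also works on older toolchains.
    let parts = total_bytes / target_bytes + u64::from(total_bytes % target_bytes != 0);
    // Always produce at least one part, even for an empty table.
    parts.max(1)
}

/// Hypothetical validation mirroring the rules stated in the description:
/// `--mb-per-file` excludes `--parts`/`--part`, and when both `parts` and
/// `part` are given, `part` must be a valid index (assumed 1-based here).
fn validate_partition_args(
    parts: Option<i32>,
    part: Option<i32>,
    mb_per_file: Option<u32>,
) -> Result<(), String> {
    if mb_per_file.is_some() && (parts.is_some() || part.is_some()) {
        return Err("--mb-per-file cannot be used with --parts or --part".to_string());
    }
    if let (Some(parts), Some(part)) = (parts, part) {
        if parts < 1 || part < 1 || part > parts {
            return Err(format!("invalid --part {part} for --parts {parts}"));
        }
    }
    Ok(())
}
```

Under these assumptions, a table estimated at 1,000,000 rows of 100 bytes each with a 10 MB target works out to 10 parts, since 100,000,000 bytes divided by 10,485,760 bytes per file rounds up to 10.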
