satishkotha edited a comment on pull request #2048: URL: https://github.com/apache/hudi/pull/2048#issuecomment-687391466
@vinothchandar @bvaradar Addressed most of your suggestions. There are a couple of other follow-up items I need your help with:

1) You suggested removing HoodieReplaceStat. I ran into a minor implementation issue removing it. Basically, HoodieWriteClient operations return JavaRDD[WriteStatus]. SparkSqlWriter uses these WriteStatus objects to create metadata (.commit, .replace, etc.). Each WriteStatus comes with a HoodieWriteStat, which is expected to be non-null in many places and is used for many post-commit operations. So if we want to remove HoodieReplaceStat, we can either:

a) Change the signature of WriteClient operations to return a new structured object instead of just JavaRDD[WriteStatus]. This object would contain a List[HoodieFileGroupId] for tracking the file groups replaced and a JavaRDD[WriteStatus] for the newly created file groups. We would have to change post-commit operations to look at this new object instead of WriteStatus.

OR

b) Return a WriteStatus for replaced file groups too. WriteClient operations can continue to return JavaRDD[WriteStatus]. Each WriteStatus has a HoodieWriteStat, which can be a token value (null?) for replaced file groups.

Either way, this is a somewhat involved change, so I would like to get your feedback before starting implementation. What do you guys think?

2) Deleting replaced file groups during archival vs. clean. I have this deletion logic implemented in archival, per our earlier conversation. But, as I mentioned, this may lead to storage inefficiency. For example, if clean is set to retain 1 commit and archival is set to run after 24 commits, we keep all the data for replaced files until archival happens.

Let me know if you guys have any other comments.

----------------------------------------------------------------
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: [email protected]
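
Option 1(a) in the comment above could be sketched roughly as follows. This is only an illustration of the shape such a result object might take, not code from the PR: `ReplaceWriteResult` and `FileGroupId` are hypothetical names, `FileGroupId` stands in for Hudi's HoodieFileGroupId, and the type parameter `S` stands in for JavaRDD[WriteStatus] so that the sketch compiles without Spark or Hudi on the classpath.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for HoodieFileGroupId: a partition path plus a file id.
final class FileGroupId {
  private final String partitionPath;
  private final String fileId;

  FileGroupId(String partitionPath, String fileId) {
    this.partitionPath = partitionPath;
    this.fileId = fileId;
  }

  String getPartitionPath() {
    return partitionPath;
  }

  String getFileId() {
    return fileId;
  }
}

// Hypothetical result object for option 1(a). S is a placeholder for
// JavaRDD<WriteStatus>; it is left generic here to avoid a Spark dependency.
final class ReplaceWriteResult<S> {
  // Write statuses for the newly created file groups.
  private final S writeStatuses;
  // File groups replaced by this operation (empty for a regular write).
  private final List<FileGroupId> replacedFileGroups;

  ReplaceWriteResult(S writeStatuses, List<FileGroupId> replacedFileGroups) {
    this.writeStatuses = writeStatuses;
    // Defensive copy so post-commit code cannot mutate the list.
    this.replacedFileGroups =
        Collections.unmodifiableList(new ArrayList<>(replacedFileGroups));
  }

  S getWriteStatuses() {
    return writeStatuses;
  }

  List<FileGroupId> getReplacedFileGroups() {
    return replacedFileGroups;
  }

  // Post-commit code could branch on this instead of inspecting a
  // per-file-group stat object.
  boolean hasReplacedFileGroups() {
    return !replacedFileGroups.isEmpty();
  }
}
```

With a shape like this, the statuses for new file groups and the list of replaced file groups travel together, so no placeholder HoodieWriteStat is needed for replaced file groups. The cost, as the comment notes, is that every post-commit consumer of JavaRDD[WriteStatus] has to be updated to accept the new object.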
