satishkotha commented on pull request #2048:
URL: https://github.com/apache/hudi/pull/2048#issuecomment-687391466


   @vinothchandar @bvaradar  Addressed most of your suggestions. There are a 
couple of other follow-up items I need your help with:
   
   1) You suggested removing HoodieReplaceStat. I ran into a minor 
implementation issue removing it. Basically, HoodieWriteClient operations 
return JavaRDD<WriteStatus>. SparkSqlWriter uses these WriteStatus objects to 
create metadata (.commit/.replace etc). Each WriteStatus comes with a 
HoodieWriteStat (which is expected to be non-null in many places). This 
HoodieWriteStat is used for many post-commit operations. So if we want to 
remove HoodieReplaceStat, we can either
   a) Change the signature of WriteClient operations to return a new 
structured object instead of just returning JavaRDD<WriteStatus>. This object 
would contain a JavaRDD<WriteStatus> for newly created files and a 
List<HoodieFileGroupId> for tracking the replaced file groups. We would have to 
change post-commit operations to look at this new object instead of WriteStatus.
   OR
   b) Return a WriteStatus for replaced file groups too. WriteClient operations 
can continue to return JavaRDD<WriteStatus>. Each WriteStatus has a 
HoodieWriteStat, which can be a token value (null?) for replaced file groups. 
   
   Either way, this is a somewhat involved change, so I would like to get your 
feedback before starting implementation. What do you guys think?
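   
   To make option (a) a bit more concrete, here is a rough sketch of what the 
new return type could look like. The names WriteResult and FileGroupId are made 
up for illustration (FileGroupId stands in for HoodieFileGroupId), and a plain 
List stands in for JavaRDD<WriteStatus> so the snippet runs without Spark. This 
is not the actual Hudi API, just one possible shape.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for HoodieFileGroupId: a partition path plus a file id.
final class FileGroupId {
    final String partitionPath;
    final String fileId;

    FileGroupId(String partitionPath, String fileId) {
        this.partitionPath = partitionPath;
        this.fileId = fileId;
    }

    @Override
    public String toString() {
        return partitionPath + "/" + fileId;
    }
}

// Hypothetical result object for option (a).
// S stands in for WriteStatus; List<S> stands in for JavaRDD<WriteStatus>.
final class WriteResult<S> {
    private final List<S> writeStatuses;                // statuses for newly created files
    private final List<FileGroupId> replacedFileGroups; // file groups replaced by this write

    WriteResult(List<S> writeStatuses, List<FileGroupId> replacedFileGroups) {
        // Defensive copies so callers cannot mutate the result afterwards.
        this.writeStatuses = Collections.unmodifiableList(new ArrayList<>(writeStatuses));
        this.replacedFileGroups = Collections.unmodifiableList(new ArrayList<>(replacedFileGroups));
    }

    List<S> getWriteStatuses() {
        return writeStatuses;
    }

    List<FileGroupId> getReplacedFileGroups() {
        return replacedFileGroups;
    }

    // Post-commit code could branch on this to pick .commit vs .replace metadata.
    boolean hasReplacedFileGroups() {
        return !replacedFileGroups.isEmpty();
    }
}

class WriteResultDemo {
    public static void main(String[] args) {
        WriteResult<String> result = new WriteResult<>(
                Arrays.asList("status-for-new-file"),
                Arrays.asList(new FileGroupId("2020/09/04", "fg-001")));
        System.out.println("new files: " + result.getWriteStatuses());
        System.out.println("replaced:  " + result.getReplacedFileGroups());
    }
}
```

   With this shape, WriteStatus and HoodieWriteStat stay untouched for new 
files, and only the callers that build commit metadata need to learn about the 
replaced file groups.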
   
   2) Deleting replaced file groups during archival vs clean. I have this 
deletion logic implemented in archival, per our earlier conversation. But, as I 
mentioned, this may lead to storage inefficiency. For example, suppose a) clean 
is set to retain 1 commit and b) archival is done after 24 commits. We then 
keep all the data for replaced files until archival happens, well beyond the 
1-commit clean retention. 
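   
   As a back-of-the-envelope illustration of the example above, the sketch 
below counts how many extra commits the replaced data stays on storage when 
deletion waits for archival instead of clean. This is a simplified model I am 
assuming for illustration (the class and method names are hypothetical, and it 
ignores the exact trigger semantics of the cleaner and the archiver).

```java
// Simplified, hypothetical model: replaced data becomes deletable once the
// given number of later commits has happened.
final class ReplacedDataRetention {

    // Extra commits for which replaced data is held when deletion is tied to
    // archival rather than to clean.
    static int extraCommitsHeld(int cleanRetainedCommits, int archivalAfterCommits) {
        return Math.max(0, archivalAfterCommits - cleanRetainedCommits);
    }

    public static void main(String[] args) {
        // Numbers from the example: clean retains 1 commit, archival after 24.
        System.out.println(extraCommitsHeld(1, 24)); // prints 23
    }
}
```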
   
   Let me know if you guys have any other comments.



