satishkotha edited a comment on pull request #2048:
URL: https://github.com/apache/hudi/pull/2048#issuecomment-687391466


   @vinothchandar @bvaradar  I have addressed most of your suggestions. There 
are a couple of follow-up items I need your help with:
   
   1) You suggested removing HoodieReplaceStat. I ran into a minor 
implementation issue while removing it. HoodieWriteClient operations return 
JavaRDD[WriteStatus], and SparkSqlWriter uses these WriteStatus objects to 
create metadata (.commit/.replace etc.). Each WriteStatus comes with a 
HoodieWriteStat (which is expected to be non-null in many places), and this 
HoodieWriteStat is used for many post-commit operations. So if we want to 
remove HoodieReplaceStat, we can either
   a) Change the signature of WriteClient operations to return a new 
structured object instead of just JavaRDD[WriteStatus]. This object would 
contain a List[HoodieFileGroupId] tracking the replaced file groups and a 
JavaRDD[WriteStatus] for the newly created file groups. We would have to 
change post-commit operations to read this new object instead of WriteStatus.
   OR
   b) Return a WriteStatus for replaced file groups too. WriteClient operations 
can continue to return JavaRDD[WriteStatus]. Each WriteStatus has a 
HoodieWriteStat, which can be a token value (null?) for replaced file groups.
   
   Either way, this is a somewhat involved change, so I would like to get your 
feedback before starting the implementation. What do you guys think?
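
   To make the trade-off between (a) and (b) concrete, here is a minimal 
sketch in plain Java. It is not Hudi code: FileGroupId, WriteStat and 
WriteStatusStub are simplified stand-ins for HoodieFileGroupId, 
HoodieWriteStat and WriteStatus, ReplaceWriteResult and PostCommit are 
hypothetical names, and a List is used in place of JavaRDD so the example runs 
without Spark.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for HoodieFileGroupId (hypothetical, not the Hudi class). */
final class FileGroupId {
    final String partitionPath;
    final String fileId;

    FileGroupId(String partitionPath, String fileId) {
        this.partitionPath = partitionPath;
        this.fileId = fileId;
    }
}

/** Simplified stand-in for HoodieWriteStat. */
final class WriteStat {
    final long numWrites;

    WriteStat(long numWrites) {
        this.numWrites = numWrites;
    }
}

/** Simplified stand-in for WriteStatus. In option (b), a null stat marks a replaced file group. */
final class WriteStatusStub {
    final FileGroupId fileGroup;
    final WriteStat stat;

    WriteStatusStub(FileGroupId fileGroup, WriteStat stat) {
        this.fileGroup = fileGroup;
        this.stat = stat;
    }

    boolean isReplaced() {
        return stat == null;
    }
}

/** Option (a): hypothetical structured result returned instead of a bare collection. */
final class ReplaceWriteResult {
    final List<WriteStatusStub> newFileGroups;
    final List<FileGroupId> replacedFileGroups;

    ReplaceWriteResult(List<WriteStatusStub> newFileGroups, List<FileGroupId> replacedFileGroups) {
        this.newFileGroups = newFileGroups;
        this.replacedFileGroups = replacedFileGroups;
    }
}

/** Hypothetical post-commit helpers showing what each option asks of its consumers. */
final class PostCommit {
    /** Option (a): every status has a non-null stat, so no filtering is needed. */
    static long totalWrites(ReplaceWriteResult result) {
        long total = 0;
        for (WriteStatusStub status : result.newFileGroups) {
            total += status.stat.numWrites;
        }
        return total;
    }

    /** Option (b): consumers must skip marker statuses before touching the stat. */
    static long totalWrites(List<WriteStatusStub> statuses) {
        long total = 0;
        for (WriteStatusStub status : statuses) {
            if (!status.isReplaced()) {
                total += status.stat.numWrites;
            }
        }
        return total;
    }

    /** Option (b): replaced file groups are recovered from the marker statuses. */
    static List<FileGroupId> replacedFileGroups(List<WriteStatusStub> statuses) {
        List<FileGroupId> replaced = new ArrayList<>();
        for (WriteStatusStub status : statuses) {
            if (status.isReplaced()) {
                replaced.add(status.fileGroup);
            }
        }
        return replaced;
    }
}
```

   In this sketch, (a) keeps the non-null assumption on the stat intact but 
changes the return type every caller sees, while (b) keeps the return type and 
pushes a null check into each post-commit consumer.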
   
   2) Deleting replaced file groups during archival vs. clean. I have 
implemented this deletion logic in archival, per our earlier conversation. 
But, as I mentioned, this may lead to storage inefficiency. For example, if 
clean is set to retain 1 commit and archival is set to run after 24 commits, 
we keep all the data for replaced files until archival happens, well after the 
clean retention window has passed.
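
   The arithmetic behind that example can be sketched as follows. This is a 
deliberately simplified model, not Hudi's actual clean or archival scheduling: 
it assumes replaced data becomes deletable a fixed number of commits after the 
replace commit, and ReplacedFileRetention and its method names are 
hypothetical.

```java
/** Simplified model of how long replaced file groups keep occupying storage. */
final class ReplacedFileRetention {
    /**
     * Commits after the replace commit during which replaced data is still on storage,
     * depending on whether deletion is tied to clean or to archival.
     */
    static int commitsRetained(boolean deleteDuringClean, int cleanRetainCommits, int archiveAfterCommits) {
        return deleteDuringClean ? cleanRetainCommits : archiveAfterCommits;
    }

    /** Extra commits of retention incurred by deleting in archival instead of clean. */
    static int extraCommitsHeld(int cleanRetainCommits, int archiveAfterCommits) {
        return Math.max(0, archiveAfterCommits - cleanRetainCommits);
    }
}
```

   With the settings from the example (clean retains 1 commit, archival after 
24 commits), extraCommitsHeld(1, 24) gives 23 additional commits during which 
the replaced data stays around under this model.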
   
   Let me know if you guys have any other comments.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

