satishkotha commented on a change in pull request #1355: [HUDI-633] limit 
archive file block size by number of bytes
URL: https://github.com/apache/incubator-hudi/pull/1355#discussion_r384241881
 
 

 ##########
 File path: 
hudi-client/src/main/java/org/apache/hudi/io/HoodieCommitArchiveLog.java
 ##########
 @@ -245,11 +249,23 @@ public void archive(List<HoodieInstant> instants) throws 
HoodieCommitException {
       Schema wrapperSchema = HoodieArchivedMetaEntry.getClassSchema();
       LOG.info("Wrapper schema " + wrapperSchema.toString());
       List<IndexedRecord> records = new ArrayList<>();
+      long totalInMemSize = 0;
       for (HoodieInstant hoodieInstant : instants) {
         try {
-          records.add(convertToAvroRecord(commitTimeline, hoodieInstant));
-          if (records.size() >= this.config.getCommitArchivalBatchSize()) {
+          IndexedRecord record = convertToAvroRecord(commitTimeline, 
hoodieInstant);
+          totalInMemSize += this.sizeEstimator.sizeEstimate(record);
 
 Review comment:
  The size of clean files is very uneven. If we set a 500MB size limit and only sample every 4 instants, there is a possibility that 3 consecutive clean files of 200MB each break the memory limit. Estimating the size of every record avoids that.
   
  I did some analysis: the worst-case time taken is 681ms, for a really large record. For small records it is almost negligible. (Note: the max is taken over multiple runs; this is based on a combination of clean and commit files in a production dataset.)
   
   Time-taken-for-size-estimate: 1 ms. size: 584
   Time-taken-for-size-estimate: 0 ms. size: 408
   Time-taken-for-size-estimate: 0 ms. size: 400
   Time-taken-for-size-estimate: 681 ms. size: 144275608
   Time-taken-for-size-estimate: 0 ms. size: 408
   Time-taken-for-size-estimate: 0 ms. size: 400
   Time-taken-for-size-estimate: 159 ms. size: 50727808
   Time-taken-for-size-estimate: 0 ms. size: 408
   Time-taken-for-size-estimate: 0 ms. size: 400
   Time-taken-for-size-estimate: 129 ms. size: 38859144
   Time-taken-for-size-estimate: 0 ms. size: 408
   Time-taken-for-size-estimate: 0 ms. size: 400
   Time-taken-for-size-estimate: 127 ms. size: 37873704
   Time-taken-for-size-estimate: 0 ms. size: 408
   Time-taken-for-size-estimate: 0 ms. size: 400
   Time-taken-for-size-estimate: 129 ms. size: 34290648
   Time-taken-for-size-estimate: 0 ms. size: 584
   Time-taken-for-size-estimate: 3 ms. size: 921256
   
  If you think this is a concern, we could also consider using the file size as the estimate. (The disadvantage is that we would need to make an API call to HDFS to get the file size.)
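  The per-record accumulation above can be sketched as follows. This is a minimal, self-contained illustration, not Hudi's actual code: `sizeEstimate` here is a hypothetical stand-in (string length) for Hudi's `SizeEstimator`, and the flush check runs on every record so one large instant cannot blow far past the limit.

  ```java
  import java.util.ArrayList;
  import java.util.List;

  public class SizeLimitedBatcher {

      // Hypothetical stand-in for Hudi's SizeEstimator; real records
      // would be Avro IndexedRecords with an estimated in-memory size.
      static long sizeEstimate(String record) {
          return record.length();
      }

      // Accumulate records and flush a batch whenever the estimated
      // in-memory size reaches the limit, checking after every record.
      static List<List<String>> batchBySize(List<String> records, long maxBytes) {
          List<List<String>> batches = new ArrayList<>();
          List<String> current = new ArrayList<>();
          long totalInMemSize = 0;
          for (String record : records) {
              current.add(record);
              totalInMemSize += sizeEstimate(record);
              if (totalInMemSize >= maxBytes) {
                  batches.add(current);
                  current = new ArrayList<>();
                  totalInMemSize = 0;
              }
          }
          if (!current.isEmpty()) {
              batches.add(current);
          }
          return batches;
      }

      public static void main(String[] args) {
          // Sizes 4, 2, 8, 1 with a 6-byte limit: flush after "bb",
          // flush after the oversized "cccccccc", then flush the tail.
          List<List<String>> batches =
              batchBySize(List.of("aaaa", "bb", "cccccccc", "d"), 6);
          System.out.println(batches.size());
      }
  }
  ```

  Because the check happens per record rather than every N instants, a run of consecutive large clean files triggers a flush as soon as the limit is crossed.
  
  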

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 