satishkotha commented on a change in pull request #1355: [HUDI-633] limit
archive file block size by number of bytes
URL: https://github.com/apache/incubator-hudi/pull/1355#discussion_r384241881
##########
File path:
hudi-client/src/main/java/org/apache/hudi/io/HoodieCommitArchiveLog.java
##########
@@ -245,11 +249,23 @@ public void archive(List<HoodieInstant> instants) throws HoodieCommitException {
Schema wrapperSchema = HoodieArchivedMetaEntry.getClassSchema();
LOG.info("Wrapper schema " + wrapperSchema.toString());
List<IndexedRecord> records = new ArrayList<>();
+ long totalInMemSize = 0;
for (HoodieInstant hoodieInstant : instants) {
try {
- records.add(convertToAvroRecord(commitTimeline, hoodieInstant));
- if (records.size() >= this.config.getCommitArchivalBatchSize()) {
+      IndexedRecord record = convertToAvroRecord(commitTimeline, hoodieInstant);
+ totalInMemSize += this.sizeEstimator.sizeEstimate(record);
Review comment:
The size of clean files is very uneven. If we set a 500MB size limit but only
sample every 4th instant, there is a possibility of 3 consecutive clean files
of 200MB each breaking the memory limit.
I did some analysis; the time taken in the worst case is ~680ms for a really
large record. For small records, it's almost negligible. (Note: the max is
taken over multiple runs; this is based on a combination of clean and commit
files in a production dataset.)
Time-taken-for-size-estimate: 1 ms. size: 584
Time-taken-for-size-estimate: 0 ms. size: 408
Time-taken-for-size-estimate: 0 ms. size: 400
Time-taken-for-size-estimate: 681 ms. size: 144275608
Time-taken-for-size-estimate: 0 ms. size: 408
Time-taken-for-size-estimate: 0 ms. size: 400
Time-taken-for-size-estimate: 159 ms. size: 50727808
Time-taken-for-size-estimate: 0 ms. size: 408
Time-taken-for-size-estimate: 0 ms. size: 400
Time-taken-for-size-estimate: 129 ms. size: 38859144
Time-taken-for-size-estimate: 0 ms. size: 408
Time-taken-for-size-estimate: 0 ms. size: 400
Time-taken-for-size-estimate: 127 ms. size: 37873704
Time-taken-for-size-estimate: 0 ms. size: 408
Time-taken-for-size-estimate: 0 ms. size: 400
Time-taken-for-size-estimate: 129 ms. size: 34290648
Time-taken-for-size-estimate: 0 ms. size: 584
Time-taken-for-size-estimate: 3 ms. size: 921256
If you think this is a concern, we can also consider using the file size as
the estimate. (The disadvantage is that we need to make an API call to HDFS
to get the file size.)
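For context, a minimal sketch of the size-based batching idea discussed above: flush the in-memory buffer whenever the estimated total size crosses a byte threshold, instead of after a fixed record count. The class and method names below (`SizeBasedBatcher`, `flush`) are illustrative only, not the actual Hudi API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToLongFunction;

// Hypothetical sketch: accumulate an estimated in-memory size per record and
// flush once it crosses maxBatchBytes. This mirrors the totalInMemSize
// accumulation in the diff, but is not the real HoodieCommitArchiveLog code.
public class SizeBasedBatcher<T> {
  private final long maxBatchBytes;
  private final ToLongFunction<T> sizeEstimator;
  private final List<T> buffer = new ArrayList<>();
  private long totalInMemSize = 0;
  private int flushCount = 0;

  public SizeBasedBatcher(long maxBatchBytes, ToLongFunction<T> sizeEstimator) {
    this.maxBatchBytes = maxBatchBytes;
    this.sizeEstimator = sizeEstimator;
  }

  public void add(T record) {
    buffer.add(record);
    // Estimating on every record avoids the sampling problem described
    // above: a few consecutive large records cannot sneak past the limit.
    totalInMemSize += sizeEstimator.applyAsLong(record);
    if (totalInMemSize >= maxBatchBytes) {
      flush();
    }
  }

  public void flush() {
    if (!buffer.isEmpty()) {
      // In the real archive log, the buffered records would be written
      // out as a block here before clearing.
      buffer.clear();
      totalInMemSize = 0;
      flushCount++;
    }
  }

  public int getFlushCount() {
    return flushCount;
  }
}
```

The trade-off is exactly the one measured in the timings above: per-record estimation is essentially free for small records but can cost hundreds of milliseconds for very large ones.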
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services