compaction performance by reusing avro schema [hudi]

via GitHub Wed, 09 Apr 2025 05:54:02 -0700


TheR1sing3un commented on code in PR #12949:
URL: https://github.com/apache/hudi/pull/12949#discussion_r2035286585



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java:
##########
@@ -95,8 +96,8 @@ protected HoodieWriteHandle(HoodieWriteConfig config, String instantTime, String
     super(config, Option.of(instantTime), hoodieTable);
     this.partitionPath = partitionPath;
     this.fileId = fileId;
-    this.writeSchema = overriddenSchema.orElseGet(() -> getWriteSchema(config));
-    this.writeSchemaWithMetaFields = HoodieAvroUtils.addMetadataFields(writeSchema, config.allowOperationMetadataField());
+    this.writeSchema = AvroSchemaCache.intern(overriddenSchema.orElseGet(() -> getWriteSchema(config)));
+    this.writeSchemaWithMetaFields = AvroSchemaCache.intern(HoodieAvroUtils.addMetadataFields(writeSchema, config.allowOperationMetadataField()));

Review Comment:
   > The description of why this PR is needed mentions that a lot of time
was spent on unnecessary Avro schema comparisons, namely in the
`HoodieInternalRowUtils#getCachedSchema` section. However, I noticed that this
PR makes no change to the `getCachedSchema` method. So I would like to ask how
this PR achieves the speedup, or where the cached value takes effect. Thank
you.
   
   You can learn more about the background and solution of this optimization
from the description of my PR. In essence, `getCachedSchema` fetches a
`StructType` from a map with `Schema` as the key and `StructType` as the
value. The bottleneck of this method, as you can see from the flame graph, is
the key lookup in the map for the incoming schema. That lookup calls
`Schema::equals` to determine whether a matching key already exists. So
`equals` is our performance bottleneck: when the schema has many columns,
`equals` compares every field, which is very expensive. And this method is
called once per record, so the overall overhead is very high. My solution is
to avoid the field-by-field `equals` comparison when looking up keys. If the
reference to the schema passed in is the same as the reference to an existing
key, the two are trivially equal and there is no need to compare all fields.
Based on this analysis, what I need to do is make sure that only one JVM
object is created per distinct schema within the entire JVM lifecycle, so that
equal schemas always share the same object reference. That is why this change
resolves the performance issue.
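
The effect described above can be illustrated with a minimal, self-contained sketch. It is not Hudi's `AvroSchemaCache` or Avro's `Schema`: `FakeSchema`, `SchemaInterner`, and `deepComparesFor` are hypothetical names invented for this example, and the sketch assumes the cache is a plain `java.util.HashMap`. `FakeSchema` counts how often its `equals` falls through to a field-by-field comparison, so the difference between looking up an equal-but-distinct object and looking up the interned instance becomes visible.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

class SchemaInternDemo {

  /** Hypothetical stand-in for a schema whose equals() walks every field. */
  static final class FakeSchema {
    static final AtomicLong DEEP_COMPARES = new AtomicLong();
    private final List<String> fields;
    private final int hash;

    FakeSchema(List<String> fields) {
      this.fields = new ArrayList<>(fields);
      this.hash = this.fields.hashCode(); // cached, so only equals() is costly
    }

    @Override
    public int hashCode() {
      return hash;
    }

    @Override
    public boolean equals(Object o) {
      if (this == o) {
        return true; // same reference: no field comparison needed
      }
      if (!(o instanceof FakeSchema)) {
        return false;
      }
      DEEP_COMPARES.incrementAndGet(); // count each expensive comparison
      return fields.equals(((FakeSchema) o).fields);
    }
  }

  /** Hypothetical interner: returns one canonical instance per distinct schema. */
  static final class SchemaInterner {
    private static final Map<FakeSchema, FakeSchema> CANONICAL = new ConcurrentHashMap<>();

    static FakeSchema intern(FakeSchema schema) {
      FakeSchema previous = CANONICAL.putIfAbsent(schema, schema);
      return previous != null ? previous : schema;
    }
  }

  /** Stores `key` in a map, looks up `lookup` once per record, returns the deep-compare count. */
  static long deepComparesFor(FakeSchema key, FakeSchema lookup, int records) {
    Map<FakeSchema, String> cache = new HashMap<>();
    cache.put(key, "StructType placeholder");
    FakeSchema.DEEP_COMPARES.set(0);
    for (int i = 0; i < records; i++) {
      cache.get(lookup);
    }
    return FakeSchema.DEEP_COMPARES.get();
  }

  public static void main(String[] args) {
    List<String> columns = List.of("id", "name", "ts");
    FakeSchema a = new FakeSchema(columns);
    FakeSchema b = new FakeSchema(columns); // equal content, different object

    // Without interning: every lookup pays for a field-by-field comparison.
    System.out.println(deepComparesFor(a, b, 1000)); // prints 1000

    // With interning: both variables point at the same canonical object.
    FakeSchema ia = SchemaInterner.intern(a);
    FakeSchema ib = SchemaInterner.intern(b);
    System.out.println(ia == ib); // prints true
    System.out.println(deepComparesFor(ia, ib, 1000)); // prints 0
  }
}
```

In this sketch the one-time cost of `intern` (a single deep comparison when `b` is mapped to the existing instance) is paid once per handle, while the per-record lookups succeed on the reference check alone.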




Re: [PR] [HUDI-9152] Improve read/write/compaction performance by reusing avro schema [hudi]
