TheR1sing3un commented on code in PR #12949:
URL: https://github.com/apache/hudi/pull/12949#discussion_r2035286585
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java:
##########
@@ -95,8 +96,8 @@ protected HoodieWriteHandle(HoodieWriteConfig config, String instantTime, String
super(config, Option.of(instantTime), hoodieTable);
this.partitionPath = partitionPath;
this.fileId = fileId;
- this.writeSchema = overriddenSchema.orElseGet(() -> getWriteSchema(config));
- this.writeSchemaWithMetaFields = HoodieAvroUtils.addMetadataFields(writeSchema, config.allowOperationMetadataField());
+ this.writeSchema = AvroSchemaCache.intern(overriddenSchema.orElseGet(() -> getWriteSchema(config)));
+ this.writeSchemaWithMetaFields = AvroSchemaCache.intern(HoodieAvroUtils.addMetadataFields(writeSchema, config.allowOperationMetadataField()));
Review Comment:
> When the necessity of this PR was introduced, it was mentioned that a lot of
> time was spent on unnecessary Avro schema comparisons, namely in the
> `HoodieInternalRowUtils#getCachedSchema` section. However, I noticed that no
> modification was made to the `getCachedSchema` method in this PR. So I would
> like to ask how this PR achieves the speedup, or how the cached value takes
> effect. Thank you.

You can find the background and the solution for this optimization in the
description of my PR. In essence, `getCachedSchema` fetches a `StructType` from
a map that uses `Schema` as the key and `StructType` as the value. As the flame
graph shows, the bottleneck of this method is the key lookup in that map for
the incoming schema. The lookup calls `Schema::equals` to determine whether a
matching key already exists, so `equals` is our performance bottleneck: when
the schema has many columns, `equals` compares every field, which is very
expensive. This method is also called once per record, so the overall overhead
is very high.

My solution is to avoid calling `equals` when looking up keys. If the reference
to the schema passed in is the same as the reference to an existing key, the
two can be treated as equal without comparing all the fields. Based on this
analysis, what I need to do is make sure that only one JVM object is created
for a given schema within the entire JVM lifecycle, so that equal schemas
always share the same object reference. That is why this PR resolves the
performance issue even though `getCachedSchema` itself is unchanged.
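
To make the argument above concrete, here is a minimal, self-contained sketch of the idea. It does not use Avro or Hudi code: `FakeSchema`, `SchemaInterner`, and `SchemaInternSketch` are hypothetical names invented for illustration, and `FakeSchema.equals` deliberately counts its own invocations so the effect is visible. It relies on the fact that OpenJDK's `HashMap` compares key references with `==` before falling back to `equals`.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical demo class, not part of Hudi.
final class SchemaInternSketch {
    // Performs `lookups` map lookups with `key` and returns how many times
    // FakeSchema.equals was invoked while doing so.
    static long countEqualsCalls(Map<FakeSchema, String> cache, FakeSchema key, int lookups) {
        FakeSchema.EQUALS_CALLS.set(0);
        for (int i = 0; i < lookups; i++) {
            cache.get(key);
        }
        return FakeSchema.EQUALS_CALLS.get();
    }

    public static void main(String[] args) {
        List<String> columns = List.of("id", "name", "ts");
        FakeSchema first = new FakeSchema(columns);
        FakeSchema second = new FakeSchema(columns); // equal content, distinct object

        Map<FakeSchema, String> cache = new HashMap<>();
        cache.put(SchemaInterner.intern(first), "struct-type");

        // Distinct but equal key: each lookup has to fall back to equals().
        System.out.println("equals calls, not interned: "
                + countEqualsCalls(cache, second, 1000));
        // Interned key: same reference as the stored key, so == short-circuits.
        System.out.println("equals calls, interned:     "
                + countEqualsCalls(cache, SchemaInterner.intern(second), 1000));
    }
}

// Hypothetical stand-in for a wide schema whose equals() walks every field.
final class FakeSchema {
    static final AtomicLong EQUALS_CALLS = new AtomicLong();
    private final List<String> fields;
    private final int hash; // cached so hashCode() is cheap

    FakeSchema(List<String> fields) {
        this.fields = List.copyOf(fields);
        this.hash = this.fields.hashCode();
    }

    @Override
    public boolean equals(Object other) {
        EQUALS_CALLS.incrementAndGet(); // count every invocation
        if (!(other instanceof FakeSchema)) {
            return false;
        }
        return fields.equals(((FakeSchema) other).fields); // field-by-field comparison
    }

    @Override
    public int hashCode() {
        return hash;
    }
}

// Hypothetical interner: returns one canonical instance per distinct schema.
final class SchemaInterner {
    private static final Map<FakeSchema, FakeSchema> CANONICAL = new ConcurrentHashMap<>();

    static FakeSchema intern(FakeSchema schema) {
        FakeSchema previous = CANONICAL.putIfAbsent(schema, schema);
        return previous == null ? schema : previous;
    }
}
```

In this sketch the full comparison is paid once, when a new instance is interned, instead of once per record. The real `AvroSchemaCache.intern` may be implemented differently; the sketch only illustrates why sharing one reference per schema makes the lookup cheap.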