[ https://issues.apache.org/jira/browse/FLINK-33611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800303#comment-17800303 ]
Benchao Li commented on FLINK-33611: ------------------------------------ [~dsaisharath] Sure, I'll review the PR. I may respond slow since I'm participating in the community in my spare time. > Support Large Protobuf Schemas > ------------------------------ > > Key: FLINK-33611 > URL: https://issues.apache.org/jira/browse/FLINK-33611 > Project: Flink > Issue Type: Improvement > Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile) > Affects Versions: 1.18.0 > Reporter: Sai Sharath Dandi > Assignee: Sai Sharath Dandi > Priority: Major > Labels: pull-request-available > > h3. Background > Flink serializes and deserializes protobuf format data by calling the decode > or encode method in GeneratedProtoToRow_XXX.java generated by codegen to > parse byte[] data into Protobuf Java objects. FLINK-32650 has introduced the > ability to split the generated code to improve the performance for large > Protobuf schemas. However, this is still not sufficient to support some > larger protobuf schemas as the generated code exceeds the java constant pool > size [limit|https://en.wikipedia.org/wiki/Java_class_file#The_constant_pool] > and we can see errors like "Too many constants" when trying to compile the > generated code. > *Solution* > Since we already have the split code functionality already introduced, the > main proposal here is to now reuse the variable names across different split > method scopes. This will greatly reduce the constant pool size. One more > optimization is to only split the last code segment also only when the size > exceeds split threshold limit. Currently, the last segment of the generated > code is always being split which can lead to too many split methods and thus > exceed the constant pool size limit -- This message was sent by Atlassian Jira (v8.20.10#820010)