[ https://issues.apache.org/jira/browse/FLINK-33611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17799560#comment-17799560 ]
Sai Sharath Dandi edited comment on FLINK-33611 at 12/21/23 8:18 PM:
---------------------------------------------------------------------

[~libenchao], the constant pool size [limit|https://docs.oracle.com/javase/specs/jvms/se12/html/jvms-4.html#jvms-4.4] in Java is about 65536 entries, because the class file stores the entry count in a 16-bit field. The constant pool holds many kinds of entries, but for a rough estimate, count only the identifier names and assume the generated code uses 2 identifiers for each field in the schema. Under that assumption, the Protobuf schema cannot have more than roughly 65536/2 = 32768 fields. The actual limit is lower, because this estimate does not include split method names, class names, etc.

> Support Large Protobuf Schemas
> ------------------------------
>
> Key: FLINK-33611
> URL: https://issues.apache.org/jira/browse/FLINK-33611
> Project: Flink
> Issue Type: Improvement
> Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
> Affects Versions: 1.18.0
> Reporter: Sai Sharath Dandi
> Assignee: Sai Sharath Dandi
> Priority: Major
> Labels: pull-request-available
>
> h3. Background
> Flink serializes and deserializes Protobuf data by calling the decode or encode method in GeneratedProtoToRow_XXX.java, which is generated by codegen to parse byte[] data into Protobuf Java objects. FLINK-32650 introduced the ability to split the generated code to improve performance for large Protobuf schemas.
> However, this is still not sufficient for some larger Protobuf schemas: the generated code exceeds the Java constant pool size [limit|https://en.wikipedia.org/wiki/Java_class_file#The_constant_pool], and compiling it fails with errors like "Too many constants".
>
> *Solution*
> Since the split code functionality is already in place, the main proposal is to reuse variable names across the scopes of different split methods. This will greatly reduce the constant pool size. A second optimization is to split the last code segment only when its size exceeds the split threshold. Currently, the last segment of the generated code is always split, which can produce too many split methods and thus exceed the constant pool size limit.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
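
The rough estimate from the comment, and the effect of reusing variable names across split methods, can be sketched as follows. This is a simplified model and not Flink's actual codegen: the class `ConstantPoolEstimate`, the `var<i>_<n>` naming scheme, and the `fieldsPerSplit` parameter are hypothetical, and it assumes each distinct identifier name costs one constant pool entry.

```java
import java.util.HashSet;
import java.util.Set;

/** Rough model of how many distinct identifier names generated code contributes. */
public class ConstantPoolEstimate {
    // Approximate limit quoted in the comment; the class file stores the count in 16 bits.
    static final int POOL_LIMIT = 65536;

    /** Upper bound on schema fields if each field needs this many unique names. */
    static int maxFields(int identifiersPerField) {
        return POOL_LIMIT / identifiersPerField;
    }

    /** Distinct names when every field gets its own globally numbered variables. */
    static int uniqueNames(int fields, int identifiersPerField) {
        Set<String> names = new HashSet<>();
        for (int f = 0; f < fields; f++) {
            for (int i = 0; i < identifiersPerField; i++) {
                names.add("var" + i + "_" + f); // hypothetical naming scheme
            }
        }
        return names.size();
    }

    /** Distinct names when the numbering restarts inside each split method. */
    static int reusedNames(int fields, int identifiersPerField, int fieldsPerSplit) {
        Set<String> names = new HashSet<>();
        for (int f = 0; f < fields; f++) {
            int local = f % fieldsPerSplit; // index within the current split method
            for (int i = 0; i < identifiersPerField; i++) {
                names.add("var" + i + "_" + local);
            }
        }
        return names.size();
    }

    public static void main(String[] args) {
        System.out.println("max fields at 2 names each: " + maxFields(2));
        System.out.println("names without reuse: " + uniqueNames(40000, 2));
        System.out.println("names with reuse:    " + reusedNames(40000, 2, 100));
    }
}
```

In this model the number of distinct names without reuse grows with the schema size, while with reuse it is bounded by the number of fields handled per split method, which is the reason the proposal expects a much smaller constant pool.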