[jira] [Comment Edited] (FLINK-33611) Support Large Protobuf Schemas

Sai Sharath Dandi (Jira) Thu, 21 Dec 2023 12:19:04 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-33611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17799560#comment-17799560
 ]


Sai Sharath Dandi edited comment on FLINK-33611 at 12/21/23 8:18 PM:
---------------------------------------------------------------------

[~libenchao], the constant pool size 
[limit|https://docs.oracle.com/javase/specs/jvms/se12/html/jvms-4.html#jvms-4.4]
 is 65536 entries in Java. The constant pool holds many kinds of entries, but 
as a rough estimate, if we count only the identifier names and assume the 
generated code uses 2 identifiers for each field in the schema, there 
cannot be more than 65536/2 = 32768 fields in the Protobuf schema. Of course, 
the actual limit is lower than that because this estimate does not include 
split method names, class names, etc.
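
The rough estimate above can be reproduced in a few lines of Java. This is only a sketch of the arithmetic: the class and method names are hypothetical, the figure of 2 identifiers per field is the assumption made in the comment, and 65536 is the round number used there (the class file format stores the pool count in a 16-bit field, so the exact number of usable entries is slightly lower).

```java
// Hypothetical helper that mirrors the back-of-the-envelope estimate.
class ConstantPoolEstimate {
    // Round figure taken from the comment above (2^16).
    static final int CONSTANT_POOL_LIMIT = 65536;

    // Upper bound on schema fields if each field costs the given
    // number of identifier entries and nothing else is counted.
    static int maxFields(int identifiersPerField) {
        return CONSTANT_POOL_LIMIT / identifiersPerField;
    }

    public static void main(String[] args) {
        // Assumption from the comment: 2 identifiers per field.
        System.out.println("Upper bound on fields: " + maxFields(2));
    }
}
```

Raising the assumed number of identifiers per field lowers the bound proportionally, which is why the real limit sits well below this estimate.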


was (Author: JIRAUSER298466):
[~libenchao] , The constant pool size 
[limit|https://docs.oracle.com/javase/specs/jvms/se12/html/jvms-4.html#jvms-4.4]
 is 65536 entries in java. The constant pool size includes a lot of things but 
if we count only the identifier names and assume there are 2 identifiers used 
for each field in the schema for rough estimation. There cannot be more than 
65536/2 = 32768 fields in the Protobuf schema. Of course, the actual number is 
lower than that because we did not include split method names, class names 
etc.. 

> Support Large Protobuf Schemas
> ------------------------------
>
>                 Key: FLINK-33611
>                 URL: https://issues.apache.org/jira/browse/FLINK-33611
>             Project: Flink
>          Issue Type: Improvement
>          Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>    Affects Versions: 1.18.0
>            Reporter: Sai Sharath Dandi
>            Assignee: Sai Sharath Dandi
>            Priority: Major
>              Labels: pull-request-available
>
> h3. Background
> Flink serializes and deserializes protobuf format data by calling the decode 
> or encode method in GeneratedProtoToRow_XXX.java generated by codegen to 
> parse byte[] data into Protobuf Java objects. FLINK-32650 has introduced the 
> ability to split the generated code to improve the performance for large 
> Protobuf schemas. However, this is still not sufficient to support some 
> larger Protobuf schemas, as the generated code exceeds the Java constant pool 
> size [limit|https://en.wikipedia.org/wiki/Java_class_file#The_constant_pool] 
> and we see errors like "Too many constants" when trying to compile the 
> generated code. 
> h3. Solution
> Since the split code functionality has already been introduced, the 
> main proposal here is to reuse variable names across different split 
> method scopes. This will greatly reduce the constant pool size. One more 
> optimization is to split the last code segment only when its size 
> exceeds the split threshold limit. Currently, the last segment of the 
> generated code is always split, which can lead to too many split methods 
> and thus exceed the constant pool size limit.
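
The effect of reusing variable names across split method scopes can be illustrated with a toy model. This is not Flink's actual code generator: the class, the method names, and the "var" prefix are hypothetical, and the sketch only counts distinct names, assuming (as the estimate in the comment does) that each distinct name costs constant pool entries.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model: counts distinct variable names under two naming schemes.
class VariableNameReuseSketch {

    // One counter shared by all split methods: every name is unique.
    static int distinctNamesGlobal(int splitMethods, int varsPerMethod) {
        Set<String> names = new HashSet<>();
        int counter = 0;
        for (int m = 0; m < splitMethods; m++) {
            for (int v = 0; v < varsPerMethod; v++) {
                names.add("var" + counter++);
            }
        }
        return names.size();
    }

    // Counter reset for each split method: names repeat across scopes.
    static int distinctNamesPerScope(int splitMethods, int varsPerMethod) {
        Set<String> names = new HashSet<>();
        for (int m = 0; m < splitMethods; m++) {
            int counter = 0; // restart numbering inside each method scope
            for (int v = 0; v < varsPerMethod; v++) {
                names.add("var" + counter++);
            }
        }
        return names.size();
    }

    public static void main(String[] args) {
        System.out.println("global counter:    " + distinctNamesGlobal(1000, 20));
        System.out.println("per-scope counter: " + distinctNamesPerScope(1000, 20));
    }
}
```

In this model the number of distinct names grows with the total number of variables under a shared counter, but only with the largest single method when numbering restarts per scope.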



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
