Tiansu Yu created FLINK-29579:
---------------------------------

             Summary: Flink parquet reader cannot read fully optional elements in a repeated list
                 Key: FLINK-29579
                 URL: https://issues.apache.org/jira/browse/FLINK-29579
             Project: Flink
          Issue Type: Bug
          Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
    Affects Versions: 1.13.3
            Reporter: Tiansu Yu


While trying to read a Parquet file containing the following field as part of its schema,
{code:java}
 optional group attribute_values (LIST) {
    repeated group list {
      optional group element {
        optional binary attribute_key_id (STRING);
        optional binary attribute_value_id (STRING);
        optional int32 pos;
      }
    }
  } {code}
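For comparison, the only list layout the converter accepts is one where every field inside the element group is required. A hand-written sketch (not taken from the actual file) of a schema that would pass the check:

{code:java}
optional group attribute_values (LIST) {
  repeated group list {
    optional group element {
      required binary attribute_key_id (STRING);
      required binary attribute_value_id (STRING);
      required int32 pos;
    }
  }
}
{code}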
 

I encountered the following exception:

{code:java}
Exception in thread "main" java.lang.UnsupportedOperationException: List field [optional binary attribute_key_id (STRING)] in List [attribute_values] has to be required.
        at org.apache.flink.formats.parquet.utils.ParquetSchemaConverter.convertGroupElementToArrayTypeInfo(ParquetSchemaConverter.java:338)
        at org.apache.flink.formats.parquet.utils.ParquetSchemaConverter.convertParquetTypeToTypeInfo(ParquetSchemaConverter.java:271)
        at org.apache.flink.formats.parquet.utils.ParquetSchemaConverter.convertFields(ParquetSchemaConverter.java:81)
        at org.apache.flink.formats.parquet.utils.ParquetSchemaConverter.fromParquetType(ParquetSchemaConverter.java:61)
        at org.apache.flink.formats.parquet.ParquetInputFormat.<init>(ParquetInputFormat.java:120)
        at org.apache.flink.formats.parquet.ParquetRowInputFormat.<init>(ParquetRowInputFormat.java:39)
 {code}
The Flink code that raises the exception is:

{code:java}
private static ObjectArrayTypeInfo convertGroupElementToArrayTypeInfo(
        GroupType arrayFieldType, GroupType elementType) {
    for (Type type : elementType.getFields()) {
        if (!type.isRepetition(Type.Repetition.REQUIRED)) {
            throw new UnsupportedOperationException(
                    String.format(
                            "List field [%s] in List [%s] has to be required. ",
                            type.toString(), arrayFieldType.getName()));
        }
    }
    return ObjectArrayTypeInfo.getInfoFor(convertParquetTypeToTypeInfo(elementType));
}
{code}
I am not very familiar with the internals of Parquet schemas, but the problem looks to me like Flink being too restrictive about repetition types inside certain nested fields: the file above was produced with optional element fields, yet the converter insists they be required. I would love to hear some feedback on this (improvements, corrections, workarounds).
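To make the point concrete, here is a self-contained sketch (hypothetical code, not Flink's, using a toy {{Repetition}} enum instead of parquet-mr's) of the relaxed check I have in mind: only REPEATED fields inside the element group would be rejected, while OPTIONAL fields would be accepted and simply map to nullable types.

{code:java}
import java.util.List;

public class ListElementCheck {
    // Toy model of org.apache.parquet.schema.Type.Repetition.
    enum Repetition { REQUIRED, OPTIONAL, REPEATED }

    // Toy model of a Parquet field: a name plus its repetition.
    record Field(String name, Repetition repetition) {}

    /**
     * Relaxed validation sketch: REQUIRED and OPTIONAL element fields are
     * both accepted (OPTIONAL would map to a nullable type); only REPEATED
     * fields nested directly inside the element group are rejected.
     */
    static boolean elementFieldsAreValid(List<Field> elementFields) {
        for (Field f : elementFields) {
            if (f.repetition() == Repetition.REPEATED) {
                return false; // the only case that would still be rejected
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The element fields from the schema in this report, all optional.
        List<Field> fields = List.of(
                new Field("attribute_key_id", Repetition.OPTIONAL),
                new Field("attribute_value_id", Repetition.OPTIONAL),
                new Field("pos", Repetition.OPTIONAL));
        System.out.println(elementFieldsAreValid(fields)); // prints true
    }
}
{code}

With a check like this, the schema above would convert instead of throwing, and the optional columns would surface as nullable array element fields.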



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
