Hi, Xingcan

After a deep dive into the source code, I also think it is a bug.
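
The fieldConverters array is indexed by projected column position, while
record.get(i) is indexed by the field's position in the full reader schema, so
the two only line up when the projection is a prefix of the schema. A minimal
sketch of one possible fix inside AvroToRowDataConverters#createRowConverter
(untested; it assumes the projected field names can be taken from the RowType,
and the fieldNames variable below is mine):

```
// Hypothetical sketch: resolve each projected field by name in the
// record's own (full) schema instead of assuming positions match.
String[] fieldNames = rowType.getFieldNames().toArray(new String[0]);
return avroObject -> {
    IndexedRecord record = (IndexedRecord) avroObject;
    GenericRowData row = new GenericRowData(arity);
    for (int i = 0; i < arity; ++i) {
        // Look up the projected field's position in the record's full
        // schema, so that e.g. SELECT f3 reads position 2 instead of
        // blindly reading position 0.
        int pos = record.getSchema().getField(fieldNames[i]).pos();
        row.setField(i, fieldConverters[i].convert(record.get(pos)));
    }
    return row;
};
```

Alternatively, it might be cleaner to pass the projected schema to the
DataFileReader as the Avro reader schema, so that the records themselves only
contain the selected fields.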

Best,
Ron

Xingcan Cui <xingc...@gmail.com> wrote on Sat, Aug 5, 2023 at 23:27:

> Hi all,
>
> We tried to read some Avro files with Flink SQL (1.16.1) and noticed
> that the projection pushdown seems to be buggy.
>
> The Avro schema we used has 4 fields, namely f1, f2, f3, and f4. When using
> "SELECT *" or selecting the first n fields (e.g., SELECT f1 or SELECT f1, f2)
> to read the table, it works fine. However, any other projection (e.g.,
> SELECT f3) fails with a data & converter mismatch exception.
>
> After digging into the code, I figured out that `AvroFileFormatFactory`
> generates a `producedDataType` with projection pushdown applied. When
> building the AvroToRowDataConverters, it only considers the selected fields
> and generates converters for them. However, the records read by the
> DataFileReader still contain all fields.
>
> Specifically, in the following code snippet from AvroToRowDataConverters,
> fieldConverters contains only the converters for the selected fields, but
> the record contains all fields, which leads to a mismatch between converters
> and data fields. That also explains why selecting the first n fields works:
> the converters and the data fields happen to line up.
>
> ```
> return avroObject -> {
>     IndexedRecord record = (IndexedRecord) avroObject;
>     GenericRowData row = new GenericRowData(arity);
>     for (int i = 0; i < arity; ++i) {
>         // avro always deserialize successfully even though the type isn't matched
>         // so no need to throw exception about which field can't be deserialized
>         row.setField(i, fieldConverters[i].convert(record.get(i)));
>     }
>     return row;
> };
> ```
>
> Not sure if any of you have hit this before. If it's confirmed to be a bug,
> I'll file a ticket and try to fix it.
>
> Best,
> Xingcan
>