[ 
https://issues.apache.org/jira/browse/FLINK-5039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15726541#comment-15726541
 ] 

Robert Metzger commented on FLINK-5039:
---------------------------------------

Just using the Avro serializer doesn't solve the problem here. We also need to 
include the schema into the TypeInformation so that the can initialize the 
GenericRecordSerializer with it.
This also means that users can not use GenericRecords with different schemas in 
the same stream.

Doing a minor avro version bump was meant as a temporary fix to make it at 
least work correctly, before we make it fast :)  I believe a Avro dependency 
upgrade won't destroy old savepoints because avro serialization is determined 
purely by the schema.

> Avro GenericRecord support is broken
> ------------------------------------
>
>                 Key: FLINK-5039
>                 URL: https://issues.apache.org/jira/browse/FLINK-5039
>             Project: Flink
>          Issue Type: Bug
>          Components: Batch Connectors and Input/Output Formats
>    Affects Versions: 1.1.3
>            Reporter: Bruno Dumon
>            Priority: Blocker
>             Fix For: 1.2.0, 1.1.4
>
>
> Avro GenericRecord support was introduced in FLINK-3691, but it seems like 
> the GenericRecords are not properly (de)serialized.
> This can be easily seen with a program like this:
> {noformat}
>   env.createInput(new AvroInputFormat<>(new Path("somefile.avro"), 
> GenericRecord.class))
>     .first(10)
>     .print();
> {noformat}
> which will print records in which all fields have the same value:
> {noformat}
> {"foo": 1478628723066, "bar": 1478628723066, "baz": 1478628723066, ...}
> {"foo": 1478628723179, "bar": 1478628723179, "baz": 1478628723179, ...}
> {noformat}
> If I'm not mistaken, the AvroInputFormat does essentially 
> TypeExtractor.getForClass(GenericRecord.class), but GenericRecords are not 
> POJOs.
> Furthermore, each GenericRecord contains a pointer to the record schema. I 
> guess the current naive approach will serialize this schema with each record, 
> which is quite inefficient (the schema is typically more complex and much 
> larger than the data). We probably need a TypeInformation and TypeSerializer 
> specific to Avro GenericRecords, which could just use avro serialization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to