Hi David,

Any updates on the Kafka message header support? I am also interested in supporting headers with the Flink SQL formats: https://lists.apache.org/thread/spl88o63sjm2dv4l5no0ym632d2yt2o6
On Fri, Jun 14, 2024 at 6:10 AM David Radley <david_rad...@uk.ibm.com> wrote:

Hi everyone,

I have talked with Chesnay and Danny offline. Danny and I were not very happy with passing Maps around, and were looking for a neater design. Chesnay suggested that we could move the new format to the Kafka connector, then pass the Kafka record down to the deserialize logic so it can make use of the headers during deserialization and serialization.

I think this is a neat idea. This would mean:
- the Kafka connector code would need to be updated to pass down the Kafka record
- the Avro Apicurio format and SQL would live in the Kafka repository. We feel it is unlikely that anyone would want to use the Apicurio registry with files, as the Avro format could be used instead.

Unfortunately I have found that this is not so straightforward to implement, as the Avro Apicurio format uses the Avro format, which is tied to the DeserializationSchema. We were hoping to have a new decoding implementation that would pass down the Kafka record rather than the payload. This does not appear possible without an Avro format change.

Inspired by this idea, I notice that KafkaValueOnlyRecordDeserializerWrapper<T> extends KafkaValueOnlyDeserializerWrapper and does

deserializer.deserialize(record.topic(), record.value())

I am investigating whether I can add a factory/reflection mechanism to provide an alternative implementation that will pass the record content (the Kafka record is not serializable, so I will pick out what we need) down as a byte array for deserialization.

I would need to do this 4 times (value and key, for deserialization and serialization). To do this I would need to convert the record into a byte array, so it fits into the existing interface (DeserializationSchema). I think this could be a way through, to avoid using maps, avoid changing the existing Avro format, and avoid changing any core Flink interfaces.

I am going to prototype this idea. WDYT?

My thanks go to Chesnay and Danny for their support and insight around this FLIP.

Kind regards, David.


From: David Radley <david_rad...@uk.ibm.com>
Date: Wednesday, 29 May 2024 at 11:39
To: dev@flink.apache.org <dev@flink.apache.org>
Subject: [EXTERNAL] RE: [DISCUSS] FLIP-XXX Apicurio-avro format

Hi Danny,

Thank you for your feedback on this.

I agree that using maps has pros and cons. The maps are flexible, but they do require the sender and receiver to know what is in the map.

When you say "That sounds like it would fit in better, I assume we cannot just take that approach?": the motivation behind this FLIP is to support the headers, which is the usual way that Apicurio runs. We will support the "schema id in the payload" as well.

I agree with you when you say "I am not 100% happy with the solution but I cannot offer a better option" – this is a pragmatic way we have found to solve this issue. I am open to any suggestions to improve this as well.

If we are going with the maps design (which is the best we have at the moment), it would be good to have the Flink core changes in base Flink version 2.0, as this would mean we would not need to use reflection in a Flink Kafka version 2 connector to work out whether the runtime Flink has the new methods.

At this stage we only have one committer (yourself) backing this. Do you know of two other committers who would support this FLIP?

Kind regards, David.
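To make the byte-array framing idea in the June 14 message above more concrete, here is a minimal sketch, assuming a hypothetical helper class (RecordEnvelope) that is not part of Flink or the Kafka connector: the headers and value of a ConsumerRecord are packed into one byte[], which still fits the existing DeserializationSchema.deserialize(byte[]) signature and can be unpacked again inside a header-aware format.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.header.Header;

/** Hypothetical helper: packs the headers and value of a ConsumerRecord into one byte[]. */
public final class RecordEnvelope {

    /** Writes the header count, then (key, value) pairs, then the record value. */
    public static byte[] pack(ConsumerRecord<byte[], byte[]> record) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        Header[] headers = record.headers().toArray();
        out.writeInt(headers.length);
        for (Header header : headers) {
            byte[] key = header.key().getBytes(StandardCharsets.UTF_8);
            byte[] value = header.value() == null ? new byte[0] : header.value();
            out.writeInt(key.length);
            out.write(key);
            out.writeInt(value.length);
            out.write(value);
        }
        byte[] payload = record.value() == null ? new byte[0] : record.value();
        out.writeInt(payload.length);
        out.write(payload);
        out.flush();
        return bos.toByteArray();
    }

    private RecordEnvelope() {}
}

A matching unpack step inside the format would reverse this framing before handing the value bytes to the Avro decoder; the trade-off is an extra encode/decode hop in exchange for leaving the core Flink interfaces untouched.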
From: Danny Cranmer <dannycran...@apache.org>
Date: Friday, 24 May 2024 at 19:32
To: dev@flink.apache.org <dev@flink.apache.org>
Subject: [EXTERNAL] Re: [DISCUSS] FLIP-XXX Apicurio-avro format

Hello,

> I am curious what you mean by abused.

I just meant we will end up adding more and more fields to this map over time, and it may be hard to undo.

> For Apicurio it can be sent at the start of the payload like Confluent Avro does. Confluent Avro has a magic byte followed by 4 bytes of schema id at the start of the payload. Apicurio clients and SerDe libraries can be configured to not put the schema id in the headers, in which case there is a magic byte followed by an 8-byte schema id at the start of the payload. In the deserialization case, we would not need to look at the headers – though the encoding is also in the headers.

That sounds like it would fit in better, I assume we cannot just take that approach?

Thanks for the discussion. I am not 100% happy with the solution but I cannot offer a better option. I would be interested to hear if others have any suggestions. Playing devil's advocate against myself, we pass maps around to configure connectors, so it is not too far away from that.

Thanks,
Danny


On Fri, May 24, 2024 at 2:23 PM David Radley <david_rad...@uk.ibm.com> wrote:

Hi Danny,

No worries, thanks for replying. I have working prototype code that is being reviewed. It needs some cleaning up and more complete testing before it is ready, but it will give you the general idea [1][2] to help assess this approach.

I am curious what you mean by abused. I guess the choice is between a generic map mechanism and passing more particular, more granular things that might be used by another connector.

Your first question: "how would this work if the schema ID is not in the Kafka headers, as hinted to in the FLIP 'usually the global ID in a Kafka header'?"

For Apicurio it can be sent at the start of the payload like Confluent Avro does. Confluent Avro has a magic byte followed by 4 bytes of schema id at the start of the payload. Apicurio clients and SerDe libraries can be configured to not put the schema id in the headers, in which case there is a magic byte followed by an 8-byte schema id at the start of the payload. In the deserialization case, we would not need to look at the headers – though the encoding is also in the headers.

Your second question: "I am wondering if there are any other instances where the source would be aware of the schema ID and pass it through in this way?"

The examples I can think of are:
- Avro can send the complete schema in a header. This is not recommended, but in theory it fits the need for a message payload to require something else to get the structure.
- I see [2] that Apicurio Protobuf uses headers.
- It might be that other message queuing projects like RabbitMQ would need this to be able to support Apicurio Avro and Protobuf.

Kind regards, David.

[1] https://github.com/apache/flink/pull/24715
[2] https://github.com/apache/flink-connector-kafka/pull/99
[3] https://www.apicur.io/registry/docs/apicurio-registry/2.5.x/getting-started/assembly-configuring-kafka-client-serdes.html#registry-serdes-types-json_registry
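For readers following the payload-framing discussion above, here is an illustrative sketch of the two layouts as described in the thread: Confluent-style framing with a magic byte followed by a 4-byte schema id, and Apicurio-style framing (when the id is not in the headers) with a magic byte followed by an 8-byte global id. The class is hypothetical, and the exact framing should be verified against the respective SerDe libraries.

import java.nio.ByteBuffer;

/** Illustrative parser for registry-framed payloads; verify the layouts against the SerDe docs. */
public final class SchemaIdExtractor {

    private static final byte MAGIC_BYTE = 0x0;

    /** Confluent-style framing: magic byte followed by a 4-byte (int) schema id. */
    public static int readConfluentSchemaId(byte[] payload) {
        ByteBuffer buffer = ByteBuffer.wrap(payload);
        if (buffer.get() != MAGIC_BYTE) {
            throw new IllegalArgumentException("Unknown magic byte");
        }
        return buffer.getInt();
    }

    /** Apicurio-style framing when ids are not in headers: magic byte followed by an 8-byte (long) global id. */
    public static long readApicurioGlobalId(byte[] payload) {
        ByteBuffer buffer = ByteBuffer.wrap(payload);
        if (buffer.get() != MAGIC_BYTE) {
            throw new IllegalArgumentException("Unknown magic byte");
        }
        return buffer.getLong();
    }

    private SchemaIdExtractor() {}
}

In the header-based mode that the FLIP targets, the global id would instead be read from a Kafka header, and this payload parsing would not be needed.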
From: Danny Cranmer <dannycran...@apache.org>
Date: Friday, 24 May 2024 at 12:22
To: dev@flink.apache.org <dev@flink.apache.org>
Subject: [EXTERNAL] Re: [DISCUSS] FLIP-XXX Apicurio-avro format

Hello,

Apologies, I am on vacation and have limited access to email.

I can see the logic here and why you ended up where you did. I can also see there are other useful metadata fields that we might want to pass through, which might result in this Map being abused (Kafka topic, Kinesis shard, etc.).

I have a follow-up question: how would this work if the schema ID is not in the Kafka headers, as hinted at in the FLIP ("usually the global ID in a Kafka header")? I am wondering if there are any other instances where the source would be aware of the schema ID and pass it through in this way?

Thanks,
Danny


On Wed, May 22, 2024 at 3:43 PM David Radley <david_rad...@uk.ibm.com> wrote:

Hi Danny,

Did you have a chance to have a look at my responses to your feedback? I am hoping to keep the momentum going on this one.

Kind regards, David.


From: David Radley <david_rad...@uk.ibm.com>
Date: Tuesday, 14 May 2024 at 17:21
To: dev@flink.apache.org <dev@flink.apache.org>
Subject: [EXTERNAL] [DISCUSS] FLIP-XXX Apicurio-avro format

Hi Danny,

Thank you very much for the feedback and your support. I have copied your feedback from the VOTE thread to this discussion thread, so we can continue our discussions off the VOTE thread.

Your feedback:

Thanks for driving this, David. I am +1 for adding support for the new format; however, I have some questions/suggestions on the details.

1. Passing around Map<String, Object> additionalInputProperties feels a bit dirty. It looks like this is mainly for the Kafka connector. This connector already has a de/serialization schema extension to access record headers, KafkaRecordDeserializationSchema [1]; can we use this instead?

2. Can you elaborate why we need to change the SchemaCoder interface? Again, I am not a fan of adding these Map parameters.

3. I assume this integration will go into the core Flink repo under flink-formats [2], and not be a separate repository like the connectors?

My response:

Addressing 1. and 2.

I agree that sending maps around is a bit dirty. If we can see a better way, that would be great. I was looking for a way to pass this Kafka header information in a non-Kafka way – the most obvious way I could think of was a map. Here are the main considerations I saw; if I have missed anything or could improve something, I would be grateful for any further feedback.

* I see KafkaRecordDeserializationSchema is a Kafka interface that works at the Kafka record level (so it includes the headers). We need a mechanism to send the headers from the Kafka record over to Flink.
* Flink core is not aware of Kafka headers, and I did not want to add a Kafka dependency to core Flink.
* The formats are stateless, so it did not appear to be in keeping with the Flink architecture to pass through header information to stash as state in the format, waiting for deserialize to be subsequently called to pick up the header information.
* We could have used thread-local storage to stash the header content, but this would be extra state to manage, and it would seem like an obtrusive change.
* The SchemaCoder deserialize is where Confluent Avro gets the schema id from the payload, so it can look up the schema. In line with this approach, it made sense to extend deserialize so it had the header contents, so the Apicurio Avro format could look up the schema (see the sketch at the end of this message).
* I did not want to have Apicurio-specific logic in the Kafka connector; if we did, we could pull out the appropriate headers and only send over the schema ids.
* For deserialization, the schema id we are interested in is the one in the Kafka headers on the message; it is for the writer schema (an Avro format concept) currently used by the confluent-avro format in deserialize.
* For serialization, the schema ids need to be obtained from Apicurio and then passed through to Kafka.
* For serialization, there is existing logic around handling the metadata, which includes passing the headers. But the presence of the metadata would imply we have a metadata column. Maybe a change to the metadata mechanism might have allowed us to pass the headers without creating a metadata column; instead I pass the additional headers through in a map to be appended.

3.

Yes, this integration will go into the core Flink repo under flink-formats and sit next to the confluent-avro format. The Avro format has the concept of a Registry and drives the confluent-avro format. The Apicurio Avro format will use the same approach.
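To ground the considerations above, here is a rough sketch, under stated assumptions, of what the map-based design could look like on the deserialization side: the Kafka connector copies the record headers into a Map<String, Object> of additional input properties, and an Apicurio-aware schema reader pulls the global id out of that map to resolve the writer schema. The interface name, method signature, and header key below are illustrative assumptions, not the FLIP's actual proposal or an existing Flink API.

import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.util.Map;
import org.apache.avro.Schema;

/**
 * Rough sketch of the map-based design under discussion: a SchemaCoder-like reader that is
 * also given the Kafka header values (passed down as a generic map) so it can resolve the
 * writer schema by the Apicurio global id. Names and signatures are illustrative only.
 */
public interface HeaderAwareSchemaReader {

    Schema readSchema(InputStream in, Map<String, Object> additionalInputProperties) throws IOException;
}

/** Example of how an Apicurio-flavoured implementation might use the map. */
class ApicurioSchemaReader implements HeaderAwareSchemaReader {

    // Header key used by Apicurio SerDe clients for the global id (assumed; check the SerDe docs).
    private static final String GLOBAL_ID_HEADER = "apicurio.value.globalId";

    @Override
    public Schema readSchema(InputStream in, Map<String, Object> additionalInputProperties) throws IOException {
        byte[] rawId = (byte[]) additionalInputProperties.get(GLOBAL_ID_HEADER);
        if (rawId == null) {
            throw new IOException("Global id header not present; fall back to reading it from the payload");
        }
        long globalId = ByteBuffer.wrap(rawId).getLong();
        return lookupSchemaInRegistry(globalId);
    }

    // Placeholder for a call to the Apicurio registry client; not shown in this sketch.
    private Schema lookupSchemaInRegistry(long globalId) {
        throw new UnsupportedOperationException("Registry lookup not shown in this sketch");
    }
}

On the serialization side the flow would be mirrored: the format would put the global id it obtained from Apicurio into an additional-properties map, and the Kafka connector would append it to the record headers.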