Re: [DISCUSS] FLIP-XXX Apicurio-avro format

Martijn Visser Thu, 18 Jul 2024 10:31:49 -0700

Hi David,

The FLIP is updated.


Cheers, Martijn


On Tue, Jul 16, 2024 at 6:33 PM David Radley <[email protected]>
wrote:

> Hello all,
>
>
>
> I have prototyped the new design and confirmed it works. I have documented
> the design in a Google doc:
> https://docs.google.com/document/d/1J1E-cE-X2H3-kw4rNjLn71OGPQk_Yl1iGX4-eCHWLgE/edit
>
>
>
> Copying Martijn: please could you move this content over to replace the
> content of
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-454%3A+New+Apicurio+Avro+format
>
>
>
> As the design is significantly different, there will be a new period of
> discussion before going to vote.
>
>
>
> FYI, I will be on vacation after this week until the 13th of August.
>
>
>
> Kind regards, David.
>
>
>
>
>
>
>
> *From: *David Radley <[email protected]>
> *Date: *Tuesday, 9 July 2024 at 11:17
> *To: *[email protected] <[email protected]>
> *Subject: *[EXTERNAL] RE: [DISCUSS] FLIP-XXX Apicurio-avro format
>
> Hi Kevin,
> I have agreed a design with Chesnay and Danny. I am implementing a
> prototype to prove it works, and will then update the FLIP text with the new
> design. Initial testing shows it working.
>
> Here is a quick history so you can understand our current thinking.
>
>   1.  Initially we passed maps of header information from the Kafka
> connector to Flink for deserialization, and similarly for serialization.
> This was not great, because maps are not ideal and it was a big change that
> needed core Flink interface changes.
>   2.  We then moved the Avro Apicurio format to the Kafka connector and
> looked to discover a new record-based de/serialization interface, so that we
> could pass down the record (containing the headers) rather than the
> payload. This did not work, because there is a dependency on the Avro
> connector that is not aware of the new interface.
>   3.  We considered using thread-local storage to pass the headers. We did
> not like this, as there was a risk of memory leaks if we did not manage the
> thread well; also, the contract is hidden.
>   4.  We then came up with the current design, which augments the
> deserialization in the Kafka connector with a new, discovered, record-based
> deserialization. It then takes the headers out in the schema coder, leaving
> the message as it was. Similarly for serialization.
>
>
> One piece I still need to work out the details of is how to behave when
> there are 2 implementations that can be discovered, probably by using an
> augmented format name as a factory identifier.
>
> I hope to put up a new design in the FLIP by the end of next week, for
> wider review.
>     Kind regards, David.
>
>
> From: Kevin Lam <[email protected]>
> Date: Monday, 8 July 2024 at 21:16
> To: [email protected] <[email protected]>
> Subject: [EXTERNAL] Re: [DISCUSS] FLIP-XXX Apicurio-avro format
> Hi David,
>
> Any updates on the Kafka Message Header support? I am also interested in
> supporting headers with the Flink SQL Formats:
> https://lists.apache.org/thread/spl88o63sjm2dv4l5no0ym632d2yt2o6
>
> On Fri, Jun 14, 2024 at 6:10 AM David Radley <[email protected]>
> wrote:
>
> > Hi everyone,
> > I have talked with Chesnay and Danny offline. Danny and I were not very
> > happy with passing Maps around, and were looking for a neater design.
> > Chesnay suggested that we could move the new format to the Kafka connector,
> > then pass the Kafka record down to the deserialize logic so it can make use
> > of the headers during deserialization and serialization.
> >
> > I think this is a neat idea. This would mean:
> > - the Kafka connector code would need to be updated to pass down the Kafka
> > record
> > - there would be the Avro Apicurio format and SQL in the Kafka repository.
> > We feel it is unlikely that anyone would want to use the Apicurio registry
> > with files, as the Avro format could be used.
> >
> > Unfortunately I have found that this is not so straightforward to
> > implement, as the Avro Apicurio format uses the Avro format, which is tied
> > to the DeserializationSchema. We were hoping to have a new decoding
> > implementation that would pass down the Kafka record rather than the
> > payload. This does not appear possible without an Avro format change.
> >
> >
> > Inspired by this idea, I notice that
> > KafkaValueOnlyRecordDeserializerWrapper<T> extends
> > KafkaValueOnlyDeserializerWrapper
> >
> > Does
> >
> > deserializer.deserialize(record.topic(),record.value())
> >
> >
> >
> > I am investigating whether I can add a factory/reflection to provide an
> > alternative implementation that will pass the record (the Kafka record is
> > not serializable, so I will pick what we need and deserialize) as a byte
> > array.
> >
> > I would need to do this 4 times (value and key, for deserialization and
> > serialization). To do this I would need to convert the record into a byte
> > array, so it fits into the existing interface (DeserializationSchema). I
> > think this could be a way through that avoids using maps, avoids changing
> > the existing Avro format, and avoids changing any core Flink interfaces.
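The "convert the record into a byte array" idea above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the prototype code: the class name `RecordEnvelope` and its length-prefixed layout are hypothetical, and it ignores Kafka details such as null header values, duplicate header keys and null record values.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical envelope: headers plus value packed into one byte[] (not the FLIP's actual layout). */
final class RecordEnvelope {
    final Map<String, byte[]> headers;
    final byte[] value;

    RecordEnvelope(Map<String, byte[]> headers, byte[] value) {
        this.headers = headers;
        this.value = value;
    }

    /** Packs the header count, each (key, value) pair and the record value, all length-prefixed. */
    byte[] pack() {
        int size = Integer.BYTES; // header count
        for (Map.Entry<String, byte[]> h : headers.entrySet()) {
            size += Integer.BYTES + h.getKey().getBytes(StandardCharsets.UTF_8).length;
            size += Integer.BYTES + h.getValue().length;
        }
        size += Integer.BYTES + value.length;

        ByteBuffer buf = ByteBuffer.allocate(size);
        buf.putInt(headers.size());
        for (Map.Entry<String, byte[]> h : headers.entrySet()) {
            putBytes(buf, h.getKey().getBytes(StandardCharsets.UTF_8));
            putBytes(buf, h.getValue());
        }
        putBytes(buf, value);
        return buf.array();
    }

    /** Reverses pack(): the side that only receives a byte[] can recover headers and value. */
    static RecordEnvelope unpack(byte[] envelope) {
        ByteBuffer buf = ByteBuffer.wrap(envelope);
        int headerCount = buf.getInt();
        Map<String, byte[]> headers = new LinkedHashMap<>();
        for (int i = 0; i < headerCount; i++) {
            String key = new String(getBytes(buf), StandardCharsets.UTF_8);
            headers.put(key, getBytes(buf));
        }
        return new RecordEnvelope(headers, getBytes(buf));
    }

    private static void putBytes(ByteBuffer buf, byte[] bytes) {
        buf.putInt(bytes.length).put(bytes);
    }

    private static byte[] getBytes(ByteBuffer buf) {
        byte[] bytes = new byte[buf.getInt()];
        buf.get(bytes);
        return bytes;
    }
}
```

In such a scheme the connector side would call `pack()` before handing the bytes to the existing byte-array interface, and the format side would call `unpack()` to get the headers back while leaving the original value bytes untouched.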
> >
> > I am going to prototype this idea. WDYT?
> >
> > My thanks go to Chesnay and Danny for their support and insight around
> > this Flip,
> >    Kind regards, David.
> >
> >
> >
> >
> >
> >
> > From: David Radley <[email protected]>
> > Date: Wednesday, 29 May 2024 at 11:39
> > To: [email protected] <[email protected]>
> > Subject: [EXTERNAL] RE: [DISCUSS] FLIP-XXX Apicurio-avro format
> > Hi Danny,
> > Thank you for your feedback on this.
> >
> > I agree that using maps has pros and cons. The maps are flexible, but do
> > require the sender and receiver to know what is in the map.
> >
> > When you say “That sounds like it would fit in better, I assume we cannot
> > just take that approach?”: the motivation behind this FLIP is to support the
> > headers, which is the usual way that Apicurio runs. We will support the
> > “schema id in the payload” as well.
> >
> > I agree with you when you say “I am not 100% happy with the solution but I
> > cannot offer a better option.” This is a pragmatic way we have found to
> > solve this issue. I am open to any suggestions to improve this as well.
> >
> > If we are going with the maps design (which is the best we have at the
> > moment), it would be good to have the Flink core changes in base Flink
> > version 2.0, as this would mean we do not need to use reflection in a Flink
> > Kafka version 2 connector to work out if the runtime Flink has the new
> > methods.
> >
> > At this stage we only have one committer (yourself) backing this. Do you
> > know of 2 other committers who would support this FLIP?
> >
> >      Kind regards, David.
> >
> >
> >
> > From: Danny Cranmer <[email protected]>
> > Date: Friday, 24 May 2024 at 19:32
> > To: [email protected] <[email protected]>
> > Subject: [EXTERNAL] Re: [DISCUSS] FLIP-XXX Apicurio-avro format
> > Hello,
> >
> > > I am curious what you mean by abused.
> >
> > I just meant we will end up adding more and more fields to this map over
> > time, and it may be hard to undo.
> >
> > > For Apicurio it can be sent at the start of the payload like Confluent
> > > Avro does. Confluent Avro has a magic byte followed by 4 bytes of schema
> > > id at the start of the payload. Apicurio clients and SerDe libraries can
> > > be configured not to put the schema id in the headers, in which case there
> > > is a magic byte followed by an 8-byte schema id at the start of the payload.
> > > In the deserialization case, we would not need to look at the headers,
> > > though the encoding is also in the headers.
> >
> > That sounds like it would fit in better, I assume we cannot just take that
> > approach?
> >
> > Thanks for the discussion. I am not 100% happy with the solution but I
> > cannot offer a better option. I would be interested to hear if others have
> > any suggestions. Playing devil's advocate against myself, we pass maps
> > around to configure connectors so it is not too far away from that.
> >
> > Thanks,
> > Danny
> >
> >
> > On Fri, May 24, 2024 at 2:23 PM David Radley <[email protected]>
> > wrote:
> >
> > > Hi Danny,
> > > > No worries, thanks for replying. I have working prototype code that is
> > > > being reviewed. It needs some cleaning up and more complete testing before
> > > > it is ready, but will give you the general idea [1][2] to help assess
> > > > this approach.
> > >
> > >
> > > > I am curious what you mean by abused. I guess the choice is between a
> > > > generic map mechanism and more specific, more granular things being
> > > > passed that might be used by another connector.
> > >
> > > Your first question:
> > > > “how would this work if the schema ID is not in the Kafka headers, as
> > > > hinted to in the FLIP "usually the global ID in a Kafka header"?”
> > > >
> > > > For Apicurio it can be sent at the start of the payload like Confluent
> > > > Avro does. Confluent Avro has a magic byte followed by 4 bytes of schema
> > > > id at the start of the payload. Apicurio clients and SerDe libraries can
> > > > be configured not to put the schema id in the headers, in which case there
> > > > is a magic byte followed by an 8-byte schema id at the start of the payload.
> > > > In the deserialization case, we would not need to look at the headers,
> > > > though the encoding is also in the headers.
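The two payload layouts described above (a magic byte followed by a 4-byte or an 8-byte schema id) can be read with a few lines of Java. This is a minimal sketch, not code from either project: the class and method names are hypothetical, and it assumes the magic byte is `0x0` and the ids are big-endian.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper that reads a schema id prefixed to an Avro payload. */
final class PayloadSchemaId {
    // Assumption: both layouts start with a 0x0 magic byte.
    static final byte MAGIC_BYTE = 0x0;

    /** Layout with a 4-byte id: [magic byte][4-byte id][Avro bytes...]. */
    static int readFourByteId(byte[] payload) {
        ByteBuffer buf = ByteBuffer.wrap(payload); // ByteBuffer is big-endian by default
        checkMagic(buf);
        return buf.getInt();
    }

    /** Layout with an 8-byte id: [magic byte][8-byte id][Avro bytes...]. */
    static long readEightByteId(byte[] payload) {
        ByteBuffer buf = ByteBuffer.wrap(payload);
        checkMagic(buf);
        return buf.getLong();
    }

    private static void checkMagic(ByteBuffer buf) {
        if (!buf.hasRemaining() || buf.get() != MAGIC_BYTE) {
            throw new IllegalArgumentException(
                    "Payload does not start with the expected magic byte");
        }
    }
}
```

In this mode no headers are needed: the id travels with the payload, which is why it fits the existing byte-array based deserialization path.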
> > >
> > > Your second question:
> > > > “I am wondering if there are any other instances where the source would be
> > > > aware of the schema ID and pass it through in this way?”
> > > >
> > > > The examples I can think of are:
> > > > - Avro can send the complete schema in a header; this is not recommended,
> > > > but in theory fits the need for a message payload to require something else
> > > > to get the structure.
> > > > - I see [3] that Apicurio Protobuf uses headers.
> > > > - it might be that other message queuing projects like RabbitMQ would
> > > > need this to be able to support Apicurio Avro & Protobuf.
> > >
> > > Kind regards, David,
> > >
> > >
> > >
> > >
> > > [1] https://github.com/apache/flink/pull/24715
> > > [2] https://github.com/apache/flink-connector-kafka/pull/99
> > > [3]
> > >
> >
> https://www.apicur.io/registry/docs/apicurio-registry/2.5.x/getting-started/assembly-configuring-kafka-client-serdes.html#registry-serdes-types-json_registry
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > From: Danny Cranmer <[email protected]>
> > > Date: Friday, 24 May 2024 at 12:22
> > > To: [email protected] <[email protected]>
> > > Subject: [EXTERNAL] Re: [DISCUSS] FLIP-XXX Apicurio-avro format
> > > Hello,
> > >
> > > Apologies, I am on vacation and have limited access to email.
> > >
> > > > I can see the logic here and why you ended up where you did. I can also see
> > > > there are other useful metadata fields that we might want to pass through,
> > > > which might result in this Map being abused (Kafka Topic, Kinesis Shard,
> > > > etc).
> > >
> > > > I have a follow-up question: how would this work if the schema ID is not in
> > > > the Kafka headers, as hinted to in the FLIP "usually the global ID in a
> > > > Kafka header"? I am wondering if there are any other instances where the
> > > > source would be aware of the schema ID and pass it through in this way?
> > >
> > > Thanks,
> > > Danny
> > >
> > >
> > >
> > > On Wed, May 22, 2024 at 3:43 PM David Radley <[email protected]>
> > > wrote:
> > >
> > > > Hi Danny,
> > > > > Did you have a chance to have a look at my responses to your feedback? I
> > > > > am hoping to keep the momentum going on this one. Kind regards, David.
> > > >
> > > >
> > > > From: David Radley <[email protected]>
> > > > Date: Tuesday, 14 May 2024 at 17:21
> > > > To: [email protected] <[email protected]>
> > > > Subject: [EXTERNAL] [DISCUSS] FLIP-XXX Apicurio-avro format
> > > > Hi Danny,
> > > >
> > > > > Thank you very much for the feedback and your support. I have copied your
> > > > > feedback from the VOTE thread to this discussion thread, so we can continue
> > > > > our discussions off the VOTE thread.
> > > >
> > > >
> > > >
> > > > Your feedback:
> > > >
> > > > > Thanks for Driving this David. I am +1 for adding support for the new
> > > > > format, however have some questions/suggestions on the details.
> > > > >
> > > > > 1. Passing around Map<String, Object> additionalInputProperties feels a bit
> > > > > dirty. It looks like this is mainly for the Kafka connector. This connector
> > > > > already has a de/serialization schema extension to access record
> > > > > headers, KafkaRecordDeserializationSchema [1], can we use this instead?
> > > > > 2. Can you elaborate why we need to change the SchemaCoder interface? Again
> > > > > I am not a fan of adding these Map parameters
> > > > > 3. I assume this integration will go into the core Flink repo under
> > > > > flink-formats [2], and not be a separate repository like the connectors?
> > > >
> > > >
> > > >
> > > > My response:
> > > >
> > > > Addressing 1. and 2.
> > > >
> > > > > I agree that sending maps around is a bit dirty. If we can see a better
> > > > > way that would be great. I was looking for a way to pass this Kafka header
> > > > > information in a non-Kafka way; the most obvious way I could think of was as
> > > > > a map. Here are the main considerations I saw. If I have missed anything or
> > > > > could improve something, I would be grateful for any further feedback.
> > > >
> > > >
> > > >
> > > > >   *   I see KafkaRecordDeserializationSchema is a Kafka interface that
> > > > > works at the Kafka record level (so includes the headers). We need a
> > > > > mechanism to send over the headers from the Kafka record to Flink.
> > > > >   *   Flink core is not aware of Kafka headers, and I did not want to add
> > > > > a Kafka dependency to core Flink.
> > > > >   *   The formats are stateless, so it did not appear to be in keeping with
> > > > > the Flink architecture to pass through header information to stash in state
> > > > > in the format, waiting for the deserialise to be subsequently called to pick
> > > > > up the header information.
> > > > >   *   We could have used thread-local storage to stash the header content,
> > > > > but this would be extra state to manage, and this would seem like an
> > > > > obtrusive change.
> > > > >   *   The SchemaCoder deserialise is where Confluent Avro gets the schema
> > > > > id from the payload, so it can look up the schema. In line with this
> > > > > approach it made sense to extend the deserialise so it had the header
> > > > > contents, so the Apicurio Avro format could look up the schema.
> > > > >   *   I did not want to have Apicurio-specific logic in the Kafka
> > > > > connector; if we did, we could pull out the appropriate headers and only
> > > > > send over the schema ids.
> > > > >   *   For deserialise, the schema id we are interested in is the one in
> > > > > the Kafka headers on the message and is for the writer schema (an Avro
> > > > > format concept) currently used by the confluent-avro format in deserialize.
> > > > >   *   For serialise, the schema ids need to be obtained from Apicurio then
> > > > > passed through to Kafka.
> > > > >   *   For serialise, there is existing logic around handling the metadata,
> > > > > which includes passing the headers. But the presence of the metadata would
> > > > > imply we have a metadata column. Maybe a change to the metadata mechanism
> > > > > could have allowed us to pass the headers without creating a metadata
> > > > > column; instead I pass through the additional headers in a map to be
> > > > > appended.
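As an illustration of the map-based considerations above, a schema coder that receives the headers as a `Map<String, Object>` might look up the writer schema id as sketched below. This is a hedged sketch only: the class and method names are hypothetical, and both the header name `apicurio.value.globalId` and the 8-byte big-endian encoding of its value are assumptions about the Apicurio SerDe defaults rather than something stated in this thread.

```java
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.OptionalLong;

/** Hypothetical lookup of a schema id from header properties passed as a map. */
final class HeaderSchemaIdLookup {
    // Assumption: name of the header carrying the global id of the value schema.
    static final String GLOBAL_ID_HEADER = "apicurio.value.globalId";

    /**
     * Returns the global id if the map holds an 8-byte value under the assumed
     * header name, otherwise empty so that a caller could fall back to reading
     * the id from the payload.
     */
    static OptionalLong globalIdFromHeaders(Map<String, Object> additionalInputProperties) {
        Object raw = additionalInputProperties.get(GLOBAL_ID_HEADER);
        if (!(raw instanceof byte[]) || ((byte[]) raw).length != Long.BYTES) {
            return OptionalLong.empty();
        }
        // Assumption: the id is encoded as a big-endian long.
        return OptionalLong.of(ByteBuffer.wrap((byte[]) raw).getLong());
    }
}
```

The `instanceof` check and the hard-coded key show the trade-off raised in the discussion: a generic map keeps Kafka types out of core Flink, but the sender and receiver must agree out of band on what the map contains.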
> > > >
> > > >
> > > >
> > > > 3.
> > > >
> > > > > Yes, this integration will go into the core Flink repo under
> > > > > flink-formats and sit next to the confluent-avro format. The Avro format
> > > > > has the concept of a Registry and drives the confluent-avro format. The
> > > > > Apicurio Avro format will use the same approach.
> > > >
> > > > Unless otherwise stated above:
> > > >
> > > > IBM United Kingdom Limited
> > > > Registered in England and Wales with number 741598
> > > > > Registered office: PO Box 41, North Harbour, Portsmouth, Hants. PO6 3AU
> > > >
> > >
> >
>
