Re: [DISCUSS] KIP-466: Add support for List serialization and deserialization

Development Wed, 26 Jun 2019 08:33:06 -0700

Hey, 

Finally made updates to the KIP: 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-466%3A+Add+support+for+List%3CT%3E+serialization+and+deserialization
 
<https://cwiki.apache.org/confluence/display/KAFKA/KIP-466:+Add+support+for+List%3CT%3E+serialization+and+deserialization>
Sorry for the delay :)


Thank You!

Best,
Daniyar Yeralin

> On Jun 22, 2019, at 12:49 AM, Matthias J. Sax <matth...@confluent.io> wrote:
> 
> Yes, something like this. I did not think about good configuration
> parameter names yet. I am also not sure if I understand all proposed
> configs atm. But all configs should be listed and explained in the KIP
> anyway, and we can discuss further after you have updated the KIP (I can
> ask more detailed question if I have any).
> 
> 
> -Matthias
> 
> On 6/21/19 2:05 PM, Development wrote:
>> Yes, you are right. ByteSerializer is not what I need to have in a list
>> of primitives.
>> 
>> As for the default constructor and configurability, just want to make
>> sure. Is this what you have on your mind?
>> 
>> Best,
>> Daniyar Yeralin
>> 
>> 
>> 
>>> On Jun 21, 2019, at 2:51 PM, Matthias J. Sax <matth...@confluent.io
>>> <mailto:matth...@confluent.io>> wrote:
>>> 
>>> Thanks for the update!
>>> 
>>> I think that `ListDeserializer`, `ListSerializer`, and `ListSerde`
>>> should have an default constructor and it should be possible to pass in
>>> the `Class listClass` information via a configuration. Otherwise,
>>> KafkaStreams cannot use it as default serde.
>>> 
>>> 
>>> For the primitive serializers: `BytesSerializer` is not primitive IMHO,
>>> as is it for `byte[]` with variable length -- it's for arrays, not for
>>> single `byte` (note, that `Bytes` is a Kafka class wrapping `byte[]`).
>>> 
>>> 
>>> For tests, we can comment on the PR. No need to do this in the KIP
>>> discussion.
>>> 
>>> 
>>> Can you also update the KIP?
>>> 
>>> 
>>> 
>>> -Matthias
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On 6/21/19 11:29 AM, Development wrote:
>>>> I made and pushed necessary commits, so we could review the final
>>>> version under PR https://github.com/apache/kafka/pull/6592
>>>> 
>>>> I also need some advice on writing tests for this new serde. So far I
>>>> only have two test cases (roundtrip and empty payload), I’m not sure
>>>> if it is enough.
>>>> 
>>>> Thank y’all for your help in this KIP :)
>>>> 
>>>> Best,
>>>> Daniyar Yeralin
>>>> 
>>>> 
>>>>> On Jun 21, 2019, at 1:44 PM, John Roesler <j...@confluent.io
>>>>> <mailto:j...@confluent.io>> wrote:
>>>>> 
>>>>> Hey Daniyar,
>>>>> 
>>>>> Looks good to me! Thanks for considering it.
>>>>> 
>>>>> Thanks,
>>>>> -John
>>>>> 
>>>>> On Fri, Jun 21, 2019 at 9:04 AM Development <d...@yeralin.net
>>>>> <mailto:d...@yeralin.net> <mailto:d...@yeralin.net>> wrote:
>>>>> Hey John and Matthias,
>>>>> 
>>>>> Yes, now I see it all. I’m storing lots of redundant information.
>>>>> Here is my final idea. Yes, now a user should pass a list type. I
>>>>> realized that’s the type is not really needed in ListSerializer, but
>>>>> only in ListDeserializer:
>>>>> 
>>>>> 
>>>>> In ListSerializer we will start storing sizes only if serializer is
>>>>> not a primitive serializer:
>>>>> 
>>>>> 
>>>>> Then, in deserializer, we persist passed list type, so that during
>>>>> deserialization we could create an instance of it with predefined
>>>>> listSize for better performance.
>>>>> We also try to locate a primitiveSize based on passed deserializer.
>>>>> If it is not there, then primitiveSize will be null. Which means
>>>>> that each entry’s size was encoded individually.
>>>>> 
>>>>> 
>>>>> This looks much cleaner and more concise.
>>>>> 
>>>>> What do you think?
>>>>> 
>>>>> Best,
>>>>> Daniyar Yeralin
>>>>> 
>>>>>> On Jun 20, 2019, at 5:45 PM, Matthias J. Sax <matth...@confluent.io
>>>>>> <mailto:matth...@confluent.io> <mailto:matth...@confluent.io>> wrote:
>>>>>> 
>>>>>> For encoding the list-type: I see John's point about re-encoding the
>>>>>> list-type redundantly. However, I also don't like the idea that the
>>>>>> Deserializer returns a fixed type...
>>>>>> 
>>>>>> Maybe it's best allow users to specify the target list type on
>>>>>> deserialization via config?
>>>>>> 
>>>>>> Similar for the primitive types: I don't think we need to encode the
>>>>>> type size, but users could specify the type on the deserializer (via a
>>>>>> config again)?
>>>>>> 
>>>>>> 
>>>>>> About generics: nesting could be arbitrarily deep. Hence, I doubt
>>>>>> we can
>>>>>> support this and a cast will be necessary at some point in the user
>>>>>> code.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> -Matthias
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On 6/20/19 1:21 PM, John Roesler wrote:
>>>>>>> Hey Daniyar,
>>>>>>> 
>>>>>>> Thanks for looking at it!
>>>>>>> 
>>>>>>> Something like your screenshot is more along the lines of what I was
>>>>>>> thinking. Sorry, but I didn't follow what you mean, how would that not
>>>>>>> be "vanilla java"?
>>>>>>> 
>>>>>>> Unfortunately the deserializer needs more information, though. For
>>>>>>> example, what if the inner type is a Map<String,String>? The serde
>>>>>>> could
>>>>>>> only be used to produce a LinkedList<Map>, thus, we'd still need an
>>>>>>> inner serde, like you have in the KIP (Serde<T> innerSerde).
>>>>>>> 
>>>>>>> Something more like Serde<LinkedList<MyRecord>> = Serdes.listSerde(
>>>>>>>  /**list type**/ LinkedList.class,
>>>>>>>  /**inner serde**/ new MyRecordSerde()
>>>>>>> )
>>>>>>> 
>>>>>>> And in configuration, it's something like:
>>>>>>> default.key.serde: org...ListSerde
>>>>>>> default.key.list.serde.type: java.util.LinkedList
>>>>>>> default.key.list.serde.inner: com.mycompany.MyRecordSerde
>>>>>>> 
>>>>>>> 
>>>>>>> What do you think?
>>>>>>> Thanks,
>>>>>>> -John
>>>>>>> 
>>>>>>> On Thu, Jun 20, 2019 at 2:46 PM Development <d...@yeralin.net
>>>>>>> <mailto:d...@yeralin.net> <mailto:d...@yeralin.net>
>>>>>>> <mailto:d...@yeralin.net <mailto:d...@yeralin.net>>> wrote:
>>>>>>> 
>>>>>>>   Hey John,
>>>>>>> 
>>>>>>>   I gave read about TypeReference. It could work for the list serde.
>>>>>>>   However, it is not directly
>>>>>>>   supported:
>>>>>>> https://github.com/FasterXML/jackson-databind/issues/1490
>>>>>>> <https://github.com/FasterXML/jackson-databind/issues/1490>
>>>>>>>   The only way is to pass an actual class object into the constructor,
>>>>>>>   something like:
>>>>>>> 
>>>>>>>   It could be an option, but not a pretty one. What do you think of my
>>>>>>>   approach to use vanilla java and canonical class name? (As described
>>>>>>>   previously)
>>>>>>> 
>>>>>>>   Best,
>>>>>>>   Daniyar Yeralin
>>>>>>> 
>>>>>>>>   On Jun 20, 2019, at 2:45 PM, Development <d...@yeralin.net
>>>>>>>> <mailto:d...@yeralin.net> <mailto:d...@yeralin.net>
>>>>>>>>   <mailto:d...@yeralin.net <mailto:d...@yeralin.net>>> wrote:
>>>>>>>> 
>>>>>>>>   Hi John,
>>>>>>>> 
>>>>>>>>   Thank you for your input! Yes, my idea looks a little bit over
>>>>>>>>   engineered :)
>>>>>>>> 
>>>>>>>>   I also wanted to see a feedback from Mathias as well since he gave
>>>>>>>>   me an idea about storing fixed/variable size entries.
>>>>>>>> 
>>>>>>>>   Best,
>>>>>>>>   Daniyar Yeralin
>>>>>>>> 
>>>>>>>>>   On Jun 18, 2019, at 6:06 PM, John Roesler <j...@confluent.io
>>>>>>>>> <mailto:j...@confluent.io> <mailto:j...@confluent.io>
>>>>>>>>>   <mailto:j...@confluent.io <mailto:j...@confluent.io>>> wrote:
>>>>>>>>> 
>>>>>>>>>   Hi Daniyar,
>>>>>>>>> 
>>>>>>>>>   That's a very clever solution!
>>>>>>>>> 
>>>>>>>>>   One observation is that, now, this is what we might call a
>>>>>>>>>   polymorphic
>>>>>>>>>   serde. That is, you're detecting the actual concrete type and then
>>>>>>>>>   promising to produce the exact same concrete type on read.
>>>>>>>>> There are
>>>>>>>>>   some inherent problems with this approach, which in general
>>>>>>>>> require
>>>>>>>>>   some kind of  schema registry (not necessarily Schema
>>>>>>>>> Registry, just
>>>>>>>>>   any registry for schemas) to solve.
>>>>>>>>> 
>>>>>>>>>   Notice that every serialized record has quite a bit of duplicated
>>>>>>>>>   information: the concrete type as well as a byte to indicate
>>>>>>>>> whether
>>>>>>>>>   the value type is a fixed size, and, if so, an integer to
>>>>>>>>>   indicate the
>>>>>>>>>   actual size. These constitute a schema, of sorts, because they
>>>>>>>>>   tell us
>>>>>>>>>   later how exactly to deserialize the data. Unfortunately, this
>>>>>>>>>   information is completely redundant. In all likelihood, the
>>>>>>>>>   information will be exactly the same for every record in the
>>>>>>>>> topic.
>>>>>>>>>   This problem is essentially the core motivation for serializations
>>>>>>>>>   like Avro: to move the schema outside of the serialization
>>>>>>>>> itself, so
>>>>>>>>>   that the records won't contain so much redundant information.
>>>>>>>>> 
>>>>>>>>>   In this light, I'm wondering if it makes sense to go back to
>>>>>>>>>   something
>>>>>>>>>   like what you had earlier in which you don't support perfectly
>>>>>>>>>   preserving the concrete type for _this_ serde, but instead just
>>>>>>>>>   support deserializing to _some_ List. Then, you could defer full,
>>>>>>>>>   perfect, type preservation to serdes that have an external
>>>>>>>>> system in
>>>>>>>>>   which to register their type information.
>>>>>>>>> 
>>>>>>>>>   There does exist an alternative, if we really do want to
>>>>>>>>> preserve the
>>>>>>>>>   concrete type (which does seem kind of nice). You can add a
>>>>>>>>>   configuration option specifically for the serde to configure
>>>>>>>>> what the
>>>>>>>>>   list type will be, and maybe what the element type is, as well.
>>>>>>>>> 
>>>>>>>>>   As far as "related work" goes, you might be interested to take
>>>>>>>>> a look
>>>>>>>>>   at how Jackson can be configured to deserialize into a specific,
>>>>>>>>>   arbitrarily nested, generically parameterized class structure.
>>>>>>>>>   Specifically, you might find
>>>>>>>>>   
>>>>>>>>> https://fasterxml.github.io/jackson-core/javadoc/2.0.0/com/fasterxml/jackson/core/type/TypeReference.html
>>>>>>>>> <https://fasterxml.github.io/jackson-core/javadoc/2.0.0/com/fasterxml/jackson/core/type/TypeReference.html>
>>>>>>>>>   interesting.
>>>>>>>>> 
>>>>>>>>>   Thanks,
>>>>>>>>>   -John
>>>>>>>>> 
>>>>>>>>>   On Mon, Jun 17, 2019 at 12:38 PM Development <d...@yeralin.net
>>>>>>>>> <mailto:d...@yeralin.net> <mailto:d...@yeralin.net>
>>>>>>>>>   <mailto:d...@yeralin.net <mailto:d...@yeralin.net>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>   bump
>>>>> 
>>>> 
>>>> 
>>> 
>> 
>

Re: [DISCUSS] KIP-466: Add support for List serialization and deserialization

Reply via email to