GitHub user prateekm opened a pull request:

    https://github.com/apache/samza/pull/293

    Support for declaring serdes in high level API code for 
input/output/intermediate streams.

    @nickpan47 @vjagadish1989 @jmakes Please take a look.
    
    Some notes/considerations:
    * Serde interface is now Serializable. I've updated the built-in serde 
implementations to make sure they are Serializable, but custom serde 
implementations will need to be updated by users when they upgrade.
    * I've tried to preserve the number of serde objects and their usage before 
and after serialization, but don't guarantee that any references shared by 
these serde instances before serialization will be shared after 
deserialization. Ideally, serde of serdes should be done in one pass along with 
all other user provided objects in the DAG so that any shared references can be 
preserved.
    * I've left them as-is for now, but we probably need to move Serdes to 
samza-api now that users can interact with them directly.
    * Samza's JsonSerde uses Samza's custom ObjectMapper since it was probably 
meant to be an Samza internal serde. One peculiarity is that it forces the 
'dasherized-names' property naming convention instead of the (default in 
Jackson and more common) camelCase naming convention. This makes user POJO 
definitions more verbose with JsonProperty annotations, and if users don't 
remember to dasherize property names the resulting errors are somewhat 
confusing to debug. Should we create a new JsonSerde with a default 
ObjectMapper when we move it to samza-api?
    * Note that the default value for a 'defaultKey/MsgSerde` is a no-op serde 
instead of null. This helps users of Systems that do the 
serialization/deserialization themselves (e.g. Wikipedia example), but the 
downside is that it is possible to forget setting the default serde before 
creating streams and end up with a no-op serde and ClassCastExceptions at 
runtime.
    * I've made a choice here to disallow setting system/stream serdes in 
config to make configs simpler. Imho the 'serdes only in code' and 'only 2 
levels of serde' model is much simpler to understand. The risk is that there 
might be streams that aren't part of the user application code, but need their 
serdes to be set in configuration. An example is the logging topic. Do you 
think this is a reasonable choice?
    * There's a separate discussion to be had about introducing a KV type and 
specifying serdes for such KV types instead of specifying 'key' and 'msg' 
serdes separately.
    
    Tested with samza-hello-world Wikipedia application and all 
TestRepartitionWindowApp. Will add unit tests if the overall design is 
acceptable.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/prateekm/samza serde-instance

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/samza/pull/293.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #293
    
----
commit 0d399607064faecc983526c255c0b4037ae84aa0
Author: Prateek Maheshwari <pmahe...@linkedin.com>
Date:   2017-09-08T21:27:58Z

    Adding support for declaring serdes in high level API code for 
input/output/intermediate stream.

----


---

Reply via email to