Re: Handling custom types with Kryo

Alexander Alexandrov Tue, 20 Jan 2015 05:31:09 -0800

I think we are talking about the Operator I/O types, as types used
internally only by the UDFs should not be serialized.


2015-01-20 12:27 GMT+01:00 Timo Walther <[email protected]>:

> Are we talking about the types for the input/output of operators or also
> types that are used inside UDFs?
> Operator I/O type classes are known, so we don't need static code analysis
> for that. For types inside UDFs I can add that requirement to FLINK-1319.
>
>
>
> On 20.01.2015 11:51, Alexander Alexandrov wrote:
>
>> +1 for program analysis from me too...
>>
>> Should be doable also on a lower level (e.g. analysis of compiled *.class
>> files) with some off-the-shelf libraries, right?
>>
>> 2015-01-20 11:39 GMT+01:00 Till Rohrmann <[email protected]>:
>>
>>  I like the idea to automatically figure out which types are used by a
>>> program and to register them at Kryo. Thus, +1 for this idea.
>>>
>>> On Tue, Jan 20, 2015 at 11:34 AM, Robert Metzger <[email protected]>
>>> wrote:
>>>
>>>  @Stephan: Yes, you are summarizing it correctly.
>>>> I'll assign FLINK-1417 to myself and implement it as discussed here
>>>>
>>> (once I
>>>
>>>> have resolved the other issues assigned to me)
>>>>
>>>> There is one additional point we forgot in the discussion so far: We are
>>>> initializing Kryo with twitter/chill's "ScalaKryoInstantiator()". I just
>>>> checked and its registering 51 default serializers and 81 registered
>>>>
>>> types.
>>>
>>>> I just talked to Till (who implemented the KryoSerializer originally)
>>>> and
>>>> he also suggests to pool kryo instances using thread-local variables.
>>>>
>>> I'll
>>>
>>>> look into that as well once FLINK-1417 has been resolved. I think that
>>>> helps to mitigate the heavy initialization of Kryo (which is inevitable)
>>>>
>>>>
>>>> @Arvid: Our POJO/Types support does explicitly not require our users to
>>>> implement any interfaces, so that option is not feasible.
>>>> The bytecode analysis is not part of the main flink distribution because
>>>>
>>> it
>>>
>>>> is using a library with an apache incompatible license.
>>>>
>>>>
>>>>
>>>> On Tue, Jan 20, 2015 at 9:58 AM, Arvid Heise <[email protected]>
>>>> wrote:
>>>>
>>>>  An alternative way that may not be applicable in your case:
>>>>>
>>>>> For Sopremo, all types implemented a common interface. When a package
>>>>>
>>>> is
>>>
>>>> loaded, the Sopremo package manager scans the jar and looks for classes
>>>>> implementing the interfaces (quite fast, because not the entire class
>>>>>
>>>> must
>>>>
>>>>> be loaded). All types implementing the interface are automatically
>>>>>
>>>> added
>>>
>>>> to
>>>>
>>>>> Kryo with their respective annotated serializers.
>>>>>
>>>>> If you still have bytecode analysis of jobs, you can also statically
>>>>> determine all types that are used, check for default serializers, and
>>>>> maintain only a minimal set of serializers used for this specific job.
>>>>>
>>>> You
>>>>
>>>>> could already warn for unregistered types before submitting jobs.
>>>>>
>>>>> On Tue, Jan 20, 2015 at 6:38 AM, Stephan Ewen <[email protected]>
>>>>>
>>>> wrote:
>>>
>>>> Yes, I agree that the Avro serializer should be available by default.
>>>>>>
>>>>> That
>>>>>
>>>>>> is one case of a typical type that should work out of the box, given
>>>>>>
>>>>> that
>>>>
>>>>> we support Avro file formats.
>>>>>>
>>>>>> Let me summarize how I understood that suggestion:
>>>>>>
>>>>>>   - We make Avro available by default by registering a default
>>>>>>
>>>>> serializer
>>>>
>>>>> for the SpecificBase
>>>>>>
>>>>>>   - We create a library of serializers. We do not register them by
>>>>>>
>>>>> default.
>>>>>
>>>>>>   - Via FLINK-1417, we analyze the types. For any (nested) type that
>>>>>>
>>>>> we
>>>
>>>> encounter for which we have a serializer in the library, we register
>>>>>>
>>>>> that
>>>>
>>>>> serializer as the default serializer. Also, for every (nested) type
>>>>>>
>>>>> we
>>>
>>>> encounter, we register a tag at Kryo.
>>>>>>
>>>>>> I like that, it should give a nice and smooth user experience.
>>>>>>
>>>>>> Greetings,
>>>>>> Stephan
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Jan 19, 2015 at 12:32 PM, Robert Metzger <
>>>>>>
>>>>> [email protected]>
>>>
>>>> wrote:
>>>>>>
>>>>>>  Hi,
>>>>>>>
>>>>>>> thank you for putting our discussion to the mailing list. This is
>>>>>>>
>>>>>> indeed
>>>>>
>>>>>> where such discussions belong. For the others, we started
>>>>>>>
>>>>>> discussing
>>>
>>>> here:
>>>>>>
>>>>>>> https://github.com/apache/flink/pull/304
>>>>>>>
>>>>>>> I think there is one additional approach, which is probably close
>>>>>>>
>>>>>> to
>>>
>>>> (1):
>>>>>
>>>>>> We only register those serializers by default which we don't see in
>>>>>>>
>>>>>> the
>>>>
>>>>> pre-flight phase (I think right now thats only GenericData.Array
>>>>>>>
>>>>>> from
>>>
>>>> Avro).
>>>>>>> We would come across all the other classes (Jodatime, Protobuf,
>>>>>>>
>>>>>> Avro,
>>>
>>>> Thrift, ...) when traversing the class hierarchy, as proposed in
>>>>>>> FLINK-1417. With this approach, users get the best out-of-the box
>>>>>>> experience and the number of registered classes / serializers is
>>>>>>>
>>>>>> kept
>>>
>>>> at
>>>>>
>>>>>> a
>>>>>>
>>>>>>> minimum.
>>>>>>> We can still offer means to register additional serializers (I
>>>>>>>
>>>>>> think
>>>
>>>> thats
>>>>>>
>>>>>>> already merged to master).
>>>>>>>
>>>>>>> My main concern with this particular issue is a good out of the box
>>>>>>>
>>>>>> user
>>>>>
>>>>>> experience. If there is an issue with type serialization, users
>>>>>>>
>>>>>> will
>>>
>>>> notice
>>>>>>
>>>>>>> it very early. (In my experience people often have their existing
>>>>>>>
>>>>>> datatypes
>>>>>>
>>>>>>> they use with other systems, and they want to continue using them)
>>>>>>> Therefore, I want to put some effort into making it as good as
>>>>>>>
>>>>>> possible.
>>>>>
>>>>>> I
>>>>>>
>>>>>>> would actually sacrifice performance over stability/usability here.
>>>>>>>
>>>>>> Our
>>>>
>>>>> system is flexible enough to replace it later with a more efficient
>>>>>>> serialization if that becomes an issue. But maybe my suggestion
>>>>>>>
>>>>>> above
>>>
>>>> is
>>>>>
>>>>>> already sufficient.
>>>>>>>
>>>>>>> We could also think about introducing a configuration variable
>>>>>>>
>>>>>> which
>>>
>>>> allows
>>>>>>
>>>>>>> users to disable the default serializers.
>>>>>>>
>>>>>>>
>>>>>>> Regarding the second question: Is there a downside registering all
>>>>>>>
>>>>>> types
>>>>>
>>>>>> for tagging? We reduce the overall I/O which is good for
>>>>>>>
>>>>>> performance.
>>>
>>>> Best,
>>>>>>> Robert
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Jan 19, 2015 at 8:24 PM, Stephan Ewen <[email protected]>
>>>>>>>
>>>>>> wrote:
>>>>>
>>>>>> Hi all!
>>>>>>>>
>>>>>>>> We have various pending pull requests that add support for
>>>>>>>>
>>>>>>> certain
>>>
>>>> types
>>>>>>
>>>>>>> by
>>>>>>>
>>>>>>>> adding extra kryo serializers.
>>>>>>>>
>>>>>>>> I think we need to decide how we want to handle the support for
>>>>>>>>
>>>>>>> extra
>>>>
>>>>> types, because more are certainly to come.
>>>>>>>>
>>>>>>>> As I understand it, we have three broad options:
>>>>>>>>
>>>>>>>> (1)
>>>>>>>> Add as many serializers to Kryo by default as possible.
>>>>>>>>   Pro:
>>>>>>>>      - Many types work out of the box
>>>>>>>>   Contra:
>>>>>>>>      - We may eventually overload the kryo registry with
>>>>>>>>
>>>>>>> serializers
>>>
>>>>        that are not needed for most cases and suffer in
>>>>>>>>
>>>>>>> performance
>>>
>>>>      - It is hard to guess which types work out of the box
>>>>>>>>
>>>>>>> (intransparent)
>>>>>>
>>>>>>>
>>>>>>>> (2)
>>>>>>>> We create a collection of serializers and a registration util.
>>>>>>>> --------
>>>>>>>> val env = ExecutionEnvironemnt.getExecutionEnviroment()
>>>>>>>>
>>>>>>>> Serializers.registerProtoBufSerializers(env);
>>>>>>>> Serializers.registerJavaUtilSerializers(env);
>>>>>>>> ---------
>>>>>>>> Pro:
>>>>>>>>    - Easy for users
>>>>>>>>    - We can grow the set of supported types very large without
>>>>>>>>
>>>>>>> overloading
>>>>>>
>>>>>>> Kryo
>>>>>>>>    - It is transparent what gets registered
>>>>>>>>
>>>>>>>> Contra:
>>>>>>>>    - Not quite as convenient as if things just run
>>>>>>>>
>>>>>>>>
>>>>>>>> (3)
>>>>>>>> We do nothing and let the user create and register whatever is
>>>>>>>>
>>>>>>> needed.
>>>>>
>>>>>> We could have a library and utility for serializers for certain
>>>>>>>>
>>>>>>> libraries.
>>>>>>>
>>>>>>>> Users could use this to conveniently add serializers for the
>>>>>>>>
>>>>>>> libraries
>>>>>
>>>>>> they
>>>>>>>
>>>>>>>> use.
>>>>>>>> Pro:
>>>>>>>>    - Simple for us ;-)
>>>>>>>> Contra:
>>>>>>>>    - More repeated work for users
>>>>>>>>
>>>>>>>>
>>>>>>>> ========================
>>>>>>>>
>>>>>>>> For approach (1) and (2), there is an orthogonal question of
>>>>>>>>
>>>>>>> whether
>>>>
>>>>> we
>>>>>
>>>>>> want to simply register default serializers (that enable that
>>>>>>>>
>>>>>>> types
>>>
>>>> work)
>>>>>>
>>>>>>> or also register types for tags, to speed up the serialization of
>>>>>>>>
>>>>>>> those
>>>>>
>>>>>> types.
>>>>>>>>
>>>>>>>>
>>>>>>>> Greetings,
>>>>>>>> Stephan
>>>>>>>>
>>>>>>>>
>

Re: Handling custom types with Kryo

Reply via email to