Re: Handling custom types with Kryo

Timo Walther Tue, 20 Jan 2015 03:28:22 -0800

Are we talking about the types for the input/output of operators or alsotypes that are used inside UDFs?Operator I/O type classes are known, so we don't need static codeanalysis for that. For types inside UDFs I can add that requirement toFLINK-1319.


On 20.01.2015 11:51, Alexander Alexandrov wrote:

+1 for program analysis from me too...

Should be doable also on a lower level (e.g. analysis of compiled *.class
files) with some off-the-shelf libraries, right?

2015-01-20 11:39 GMT+01:00 Till Rohrmann <[email protected]>:

I like the idea to automatically figure out which types are used by a
program and to register them at Kryo. Thus, +1 for this idea.

On Tue, Jan 20, 2015 at 11:34 AM, Robert Metzger <[email protected]>
wrote:

@Stephan: Yes, you are summarizing it correctly.
I'll assign FLINK-1417 to myself and implement it as discussed here

(once I

have resolved the other issues assigned to me)

There is one additional point we forgot in the discussion so far: We are
initializing Kryo with twitter/chill's "ScalaKryoInstantiator()". I just
checked and its registering 51 default serializers and 81 registered

types.

I just talked to Till (who implemented the KryoSerializer originally) and
he also suggests to pool kryo instances using thread-local variables.

I'll

look into that as well once FLINK-1417 has been resolved. I think that
helps to mitigate the heavy initialization of Kryo (which is inevitable)


@Arvid: Our POJO/Types support does explicitly not require our users to
implement any interfaces, so that option is not feasible.
The bytecode analysis is not part of the main flink distribution because

it

is using a library with an apache incompatible license.



On Tue, Jan 20, 2015 at 9:58 AM, Arvid Heise <[email protected]>
wrote:

An alternative way that may not be applicable in your case:

For Sopremo, all types implemented a common interface. When a package

is

loaded, the Sopremo package manager scans the jar and looks for classes
implementing the interfaces (quite fast, because not the entire class

must

be loaded). All types implementing the interface are automatically

added

to

Kryo with their respective annotated serializers.

If you still have bytecode analysis of jobs, you can also statically
determine all types that are used, check for default serializers, and
maintain only a minimal set of serializers used for this specific job.

You

could already warn for unregistered types before submitting jobs.

On Tue, Jan 20, 2015 at 6:38 AM, Stephan Ewen <[email protected]>

wrote:

Yes, I agree that the Avro serializer should be available by default.

That

is one case of a typical type that should work out of the box, given

that

we support Avro file formats.

Let me summarize how I understood that suggestion:

  - We make Avro available by default by registering a default

serializer

for the SpecificBase

  - We create a library of serializers. We do not register them by

default.

  - Via FLINK-1417, we analyze the types. For any (nested) type that

we

encounter for which we have a serializer in the library, we register

that

serializer as the default serializer. Also, for every (nested) type

we

encounter, we register a tag at Kryo.

I like that, it should give a nice and smooth user experience.

Greetings,
Stephan




On Mon, Jan 19, 2015 at 12:32 PM, Robert Metzger <

[email protected]>

wrote:

Hi,

thank you for putting our discussion to the mailing list. This is

indeed

where such discussions belong. For the others, we started

discussing

here:

https://github.com/apache/flink/pull/304

I think there is one additional approach, which is probably close

to

(1):

We only register those serializers by default which we don't see in

the

pre-flight phase (I think right now thats only GenericData.Array

from

Avro).
We would come across all the other classes (Jodatime, Protobuf,

Avro,

Thrift, ...) when traversing the class hierarchy, as proposed in
FLINK-1417. With this approach, users get the best out-of-the box
experience and the number of registered classes / serializers is

kept

at

minimum.
We can still offer means to register additional serializers (I

think

thats

already merged to master).

My main concern with this particular issue is a good out of the box

user

experience. If there is an issue with type serialization, users

will

notice

it very early. (In my experience people often have their existing

datatypes

they use with other systems, and they want to continue using them)
Therefore, I want to put some effort into making it as good as

possible.

would actually sacrifice performance over stability/usability here.

Our

system is flexible enough to replace it later with a more efficient
serialization if that becomes an issue. But maybe my suggestion

above

is

already sufficient.

We could also think about introducing a configuration variable

which

allows

users to disable the default serializers.


Regarding the second question: Is there a downside registering all

types

for tagging? We reduce the overall I/O which is good for

performance.

Best,
Robert



On Mon, Jan 19, 2015 at 8:24 PM, Stephan Ewen <[email protected]>

wrote:

Hi all!

We have various pending pull requests that add support for

certain

types

by

adding extra kryo serializers.

I think we need to decide how we want to handle the support for

extra

types, because more are certainly to come.

As I understand it, we have three broad options:

(1)
Add as many serializers to Kryo by default as possible.
  Pro:
     - Many types work out of the box
  Contra:
     - We may eventually overload the kryo registry with

serializers

       that are not needed for most cases and suffer in

performance

     - It is hard to guess which types work out of the box

(intransparent)


(2)
We create a collection of serializers and a registration util.
--------
val env = ExecutionEnvironemnt.getExecutionEnviroment()

Serializers.registerProtoBufSerializers(env);
Serializers.registerJavaUtilSerializers(env);
---------
Pro:
   - Easy for users
   - We can grow the set of supported types very large without

overloading

Kryo
   - It is transparent what gets registered

Contra:
   - Not quite as convenient as if things just run


(3)
We do nothing and let the user create and register whatever is

needed.

We could have a library and utility for serializers for certain

libraries.

Users could use this to conveniently add serializers for the

libraries

they

use.
Pro:
   - Simple for us ;-)
Contra:
   - More repeated work for users


========================

For approach (1) and (2), there is an orthogonal question of

whether

we

want to simply register default serializers (that enable that

types

work)

or also register types for tags, to speed up the serialization of

those

types.


Greetings,
Stephan

Re: Handling custom types with Kryo

Reply via email to