I think we are talking about the Operator I/O types, as types used internally only by the UDFs should not be serialized.
2015-01-20 12:27 GMT+01:00 Timo Walther <twal...@apache.org>: > Are we talking about the types for the input/output of operators or also > types that are used inside UDFs? > Operator I/O type classes are known, so we don't need static code analysis > for that. For types inside UDFs I can add that requirement to FLINK-1319. > > > > On 20.01.2015 11:51, Alexander Alexandrov wrote: > >> +1 for program analysis from me too... >> >> Should be doable also on a lower level (e.g. analysis of compiled *.class >> files) with some off-the-shelf libraries, right? >> >> 2015-01-20 11:39 GMT+01:00 Till Rohrmann <till.rohrm...@gmail.com>: >> >> I like the idea to automatically figure out which types are used by a >>> program and to register them at Kryo. Thus, +1 for this idea. >>> >>> On Tue, Jan 20, 2015 at 11:34 AM, Robert Metzger <rmetz...@apache.org> >>> wrote: >>> >>> @Stephan: Yes, you are summarizing it correctly. >>>> I'll assign FLINK-1417 to myself and implement it as discussed here >>>> >>> (once I >>> >>>> have resolved the other issues assigned to me) >>>> >>>> There is one additional point we forgot in the discussion so far: We are >>>> initializing Kryo with twitter/chill's "ScalaKryoInstantiator()". I just >>>> checked and its registering 51 default serializers and 81 registered >>>> >>> types. >>> >>>> I just talked to Till (who implemented the KryoSerializer originally) >>>> and >>>> he also suggests to pool kryo instances using thread-local variables. >>>> >>> I'll >>> >>>> look into that as well once FLINK-1417 has been resolved. I think that >>>> helps to mitigate the heavy initialization of Kryo (which is inevitable) >>>> >>>> >>>> @Arvid: Our POJO/Types support does explicitly not require our users to >>>> implement any interfaces, so that option is not feasible. >>>> The bytecode analysis is not part of the main flink distribution because >>>> >>> it >>> >>>> is using a library with an apache incompatible license. >>>> >>>> >>>> >>>> On Tue, Jan 20, 2015 at 9:58 AM, Arvid Heise <arvid.he...@gmail.com> >>>> wrote: >>>> >>>> An alternative way that may not be applicable in your case: >>>>> >>>>> For Sopremo, all types implemented a common interface. When a package >>>>> >>>> is >>> >>>> loaded, the Sopremo package manager scans the jar and looks for classes >>>>> implementing the interfaces (quite fast, because not the entire class >>>>> >>>> must >>>> >>>>> be loaded). All types implementing the interface are automatically >>>>> >>>> added >>> >>>> to >>>> >>>>> Kryo with their respective annotated serializers. >>>>> >>>>> If you still have bytecode analysis of jobs, you can also statically >>>>> determine all types that are used, check for default serializers, and >>>>> maintain only a minimal set of serializers used for this specific job. >>>>> >>>> You >>>> >>>>> could already warn for unregistered types before submitting jobs. >>>>> >>>>> On Tue, Jan 20, 2015 at 6:38 AM, Stephan Ewen <se...@apache.org> >>>>> >>>> wrote: >>> >>>> Yes, I agree that the Avro serializer should be available by default. >>>>>> >>>>> That >>>>> >>>>>> is one case of a typical type that should work out of the box, given >>>>>> >>>>> that >>>> >>>>> we support Avro file formats. >>>>>> >>>>>> Let me summarize how I understood that suggestion: >>>>>> >>>>>> - We make Avro available by default by registering a default >>>>>> >>>>> serializer >>>> >>>>> for the SpecificBase >>>>>> >>>>>> - We create a library of serializers. We do not register them by >>>>>> >>>>> default. >>>>> >>>>>> - Via FLINK-1417, we analyze the types. For any (nested) type that >>>>>> >>>>> we >>> >>>> encounter for which we have a serializer in the library, we register >>>>>> >>>>> that >>>> >>>>> serializer as the default serializer. Also, for every (nested) type >>>>>> >>>>> we >>> >>>> encounter, we register a tag at Kryo. >>>>>> >>>>>> I like that, it should give a nice and smooth user experience. >>>>>> >>>>>> Greetings, >>>>>> Stephan >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On Mon, Jan 19, 2015 at 12:32 PM, Robert Metzger < >>>>>> >>>>> rmetz...@apache.org> >>> >>>> wrote: >>>>>> >>>>>> Hi, >>>>>>> >>>>>>> thank you for putting our discussion to the mailing list. This is >>>>>>> >>>>>> indeed >>>>> >>>>>> where such discussions belong. For the others, we started >>>>>>> >>>>>> discussing >>> >>>> here: >>>>>> >>>>>>> https://github.com/apache/flink/pull/304 >>>>>>> >>>>>>> I think there is one additional approach, which is probably close >>>>>>> >>>>>> to >>> >>>> (1): >>>>> >>>>>> We only register those serializers by default which we don't see in >>>>>>> >>>>>> the >>>> >>>>> pre-flight phase (I think right now thats only GenericData.Array >>>>>>> >>>>>> from >>> >>>> Avro). >>>>>>> We would come across all the other classes (Jodatime, Protobuf, >>>>>>> >>>>>> Avro, >>> >>>> Thrift, ...) when traversing the class hierarchy, as proposed in >>>>>>> FLINK-1417. With this approach, users get the best out-of-the box >>>>>>> experience and the number of registered classes / serializers is >>>>>>> >>>>>> kept >>> >>>> at >>>>> >>>>>> a >>>>>> >>>>>>> minimum. >>>>>>> We can still offer means to register additional serializers (I >>>>>>> >>>>>> think >>> >>>> thats >>>>>> >>>>>>> already merged to master). >>>>>>> >>>>>>> My main concern with this particular issue is a good out of the box >>>>>>> >>>>>> user >>>>> >>>>>> experience. If there is an issue with type serialization, users >>>>>>> >>>>>> will >>> >>>> notice >>>>>> >>>>>>> it very early. (In my experience people often have their existing >>>>>>> >>>>>> datatypes >>>>>> >>>>>>> they use with other systems, and they want to continue using them) >>>>>>> Therefore, I want to put some effort into making it as good as >>>>>>> >>>>>> possible. >>>>> >>>>>> I >>>>>> >>>>>>> would actually sacrifice performance over stability/usability here. >>>>>>> >>>>>> Our >>>> >>>>> system is flexible enough to replace it later with a more efficient >>>>>>> serialization if that becomes an issue. But maybe my suggestion >>>>>>> >>>>>> above >>> >>>> is >>>>> >>>>>> already sufficient. >>>>>>> >>>>>>> We could also think about introducing a configuration variable >>>>>>> >>>>>> which >>> >>>> allows >>>>>> >>>>>>> users to disable the default serializers. >>>>>>> >>>>>>> >>>>>>> Regarding the second question: Is there a downside registering all >>>>>>> >>>>>> types >>>>> >>>>>> for tagging? We reduce the overall I/O which is good for >>>>>>> >>>>>> performance. >>> >>>> Best, >>>>>>> Robert >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Mon, Jan 19, 2015 at 8:24 PM, Stephan Ewen <se...@apache.org> >>>>>>> >>>>>> wrote: >>>>> >>>>>> Hi all! >>>>>>>> >>>>>>>> We have various pending pull requests that add support for >>>>>>>> >>>>>>> certain >>> >>>> types >>>>>> >>>>>>> by >>>>>>> >>>>>>>> adding extra kryo serializers. >>>>>>>> >>>>>>>> I think we need to decide how we want to handle the support for >>>>>>>> >>>>>>> extra >>>> >>>>> types, because more are certainly to come. >>>>>>>> >>>>>>>> As I understand it, we have three broad options: >>>>>>>> >>>>>>>> (1) >>>>>>>> Add as many serializers to Kryo by default as possible. >>>>>>>> Pro: >>>>>>>> - Many types work out of the box >>>>>>>> Contra: >>>>>>>> - We may eventually overload the kryo registry with >>>>>>>> >>>>>>> serializers >>> >>>> that are not needed for most cases and suffer in >>>>>>>> >>>>>>> performance >>> >>>> - It is hard to guess which types work out of the box >>>>>>>> >>>>>>> (intransparent) >>>>>> >>>>>>> >>>>>>>> (2) >>>>>>>> We create a collection of serializers and a registration util. >>>>>>>> -------- >>>>>>>> val env = ExecutionEnvironemnt.getExecutionEnviroment() >>>>>>>> >>>>>>>> Serializers.registerProtoBufSerializers(env); >>>>>>>> Serializers.registerJavaUtilSerializers(env); >>>>>>>> --------- >>>>>>>> Pro: >>>>>>>> - Easy for users >>>>>>>> - We can grow the set of supported types very large without >>>>>>>> >>>>>>> overloading >>>>>> >>>>>>> Kryo >>>>>>>> - It is transparent what gets registered >>>>>>>> >>>>>>>> Contra: >>>>>>>> - Not quite as convenient as if things just run >>>>>>>> >>>>>>>> >>>>>>>> (3) >>>>>>>> We do nothing and let the user create and register whatever is >>>>>>>> >>>>>>> needed. >>>>> >>>>>> We could have a library and utility for serializers for certain >>>>>>>> >>>>>>> libraries. >>>>>>> >>>>>>>> Users could use this to conveniently add serializers for the >>>>>>>> >>>>>>> libraries >>>>> >>>>>> they >>>>>>> >>>>>>>> use. >>>>>>>> Pro: >>>>>>>> - Simple for us ;-) >>>>>>>> Contra: >>>>>>>> - More repeated work for users >>>>>>>> >>>>>>>> >>>>>>>> ======================== >>>>>>>> >>>>>>>> For approach (1) and (2), there is an orthogonal question of >>>>>>>> >>>>>>> whether >>>> >>>>> we >>>>> >>>>>> want to simply register default serializers (that enable that >>>>>>>> >>>>>>> types >>> >>>> work) >>>>>> >>>>>>> or also register types for tags, to speed up the serialization of >>>>>>>> >>>>>>> those >>>>> >>>>>> types. >>>>>>>> >>>>>>>> >>>>>>>> Greetings, >>>>>>>> Stephan >>>>>>>> >>>>>>>> >