Hi,

I like Fabian's proposal. The whole point of object reuse is the performance gain, and we should not sacrifice it. Even more important is that the rules are easy to understand!
-Matthias

On 02/17/2016 06:17 PM, Fabian Hueske wrote:
> Hi,
>
> Flink's DataSet API features a configuration parameter called
> enableObjectReuse(). If activated, Flink's runtime will create fewer
> objects which results in better performance and lower garbage collection
> overhead. Depending on whether the configuration switch is enabled or not,
> user functions may or may not perform certain operations on objects they
> receive from Flink or emit to Flink.
>
> At the moment, there are quite a few open issues and discussions going on
> about the object reuse mode, including the JIRA issues FLINK-3333,
> FLINK-1521, FLINK-3335, FLINK-3340, FLINK-3394, and FLINK-3291.
>
> IMO, the most important issue is FLINK-3333 which is about improving the
> documentation of the object reuse mode. The current version [1] is
> ambiguous and includes details about operator chaining which are hard to
> understand and to reason about for users. Hence it is not very clear which
> guarantees Flink gives for objects in user functions under which
> conditions. This documentation needs to be improved and I think this should
> happen together with the 1.0 release.
>
> Greg and Gabor proposed two new versions:
>
> 1. Greg's version [2] improves and clarifies the current documentation
> without significantly changing the semantics. It also discusses operator
> chaining, but gives more details.
> 2. Gabor's proposal [3] aims to make the discussion of object reuse
> independent of operator chaining which I think is a very good idea because
> it is not transparent to the user when function chaining happens. Gabor
> formulated four questions to answer what users can do with and expect from
> objects that they received or emitted from a function. In order to make the
> answers to these questions independent of function chaining and still keep
> the contracts as defined by the current documentation, we have to default
> to rather restrictive rules.
> For instance, functions must always emit new object instances when
> object reuse is disabled. These strict rules would, for example, also
> require DataSourceFunctions to copy all records which they receive from
> an InputFormat (see FLINK-3335). IMO, the strict guarantees make the
> disableObjectReuse mode harder to use and reason about than the
> enableObjectReuse mode whereas the opposite should be the case.
>
> I would like to suggest a third option. Like Gabor, I think the rules
> should be independent of function chaining and I would like to break them
> down into a handful of easy rules. However, I think we should loosen up the
> guarantees for user functions under disableObjectReuse mode a bit.
>
> Right now, the documentation states that under disableObjectReuse mode,
> input objects are not changed across function calls. Hence users can
> remember these objects across function calls and their value will not
> change. I propose to give this guarantee only within function calls and
> only for objects which are not emitted. Hence, this rule only applies to
> functions that can consume multiple values through an iterator such as
> GroupReduce, CoGroup, or MapPartition. In disableObjectReuse mode,
> these functions are allowed to remember the values, e.g., in a collection,
> and their value will not change when the iterator is forwarded. Once the
> function call returns, the values might change. Since functions with
> iterators cannot be directly chained, it will be safe to emit the same
> object instance several times (hence FLINK-3335 would become invalid).
>
> The difference to the current guarantees is that input objects become
> invalid after the function call has returned. Since the disableObjectReuse
> mode was mainly introduced to allow for caching objects across iterator
> calls within a GroupReduceFunction or CoGroupFunction (not across function
> calls), I think this is a reasonable restriction.
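To make the iterator case concrete, here is a minimal plain-Java sketch of what "remembering values in a collection" means with and without instance reuse. It has no Flink dependency and does not show how Flink's runtime is implemented; `Record`, `records`, and `cacheThenRead` are names invented for this illustration only.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Plain-Java simulation; none of these types are Flink classes. */
final class ObjectReuseSketch {

    /** Hypothetical mutable record type. */
    static final class Record {
        int value;
        Record(int value) { this.value = value; }
    }

    /**
     * Iterator over the given values. With reuse == true every call to
     * next() returns the same instance, overwritten with the next value.
     * With reuse == false every call returns a fresh instance.
     */
    static Iterator<Record> records(int[] values, boolean reuse) {
        return new Iterator<Record>() {
            private final Record shared = new Record(0);
            private int pos = 0;

            @Override
            public boolean hasNext() { return pos < values.length; }

            @Override
            public Record next() {
                if (pos >= values.length) throw new NoSuchElementException();
                int v = values[pos++];
                if (reuse) {
                    shared.value = v;   // overwrite the single shared instance
                    return shared;
                }
                return new Record(v);   // fresh instance per element
            }
        };
    }

    /** Mimics a group-wise user function that caches its inputs, then reads them. */
    static List<Integer> cacheThenRead(Iterator<Record> it) {
        List<Record> cached = new ArrayList<>();
        while (it.hasNext()) cached.add(it.next());   // remember references only
        List<Integer> seen = new ArrayList<>();
        for (Record r : cached) seen.add(r.value);
        return seen;
    }

    public static void main(String[] args) {
        int[] group = {1, 2, 3};
        System.out.println(cacheThenRead(records(group, false))); // [1, 2, 3]
        System.out.println(cacheThenRead(records(group, true)));  // [3, 3, 3]
    }
}
```

With fresh instances the cached values survive forwarding the iterator, which is the guarantee the proposal keeps within a single function call. With a shared instance every cached reference ends up showing the last value.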
>
> tl;dr:
>
> If we want to make the documentation of object reuse independent of
> chaining we have to
>
> - EITHER give tighter guarantees / be more restrictive than now and update
> internals, which might lead to a performance regression. This would be in
> line with the current documentation but somewhat defeat the purpose of the
> disableObjectReuse mode, IMO.
>
> - OR give weaker guarantees, which breaks with the current documentation,
> but would not affect performance and would be easier to follow for users,
> IMO.
>
> Greg and Gabor, please correct me if I did not get your points right or
> missed something.
>
> What do others think?
>
> Fabian
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/index.html#object-reuse-behavior
>
> [2]
> https://issues.apache.org/jira/browse/FLINK-3333?focusedCommentId=15139151
>
> [3]
> https://docs.google.com/document/d/1cgkuttvmj4jUonG7E2RdFVjKlfQDm_hE6gvFcgAfzXg
>
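As a rough illustration of what the weaker-guarantee option would mean for user code: if inputs may change once a function call returns, a function that wants to keep an input across calls would have to copy it. The sketch below is plain Java under that assumption; `Record`, `RememberFirst`, and `runWithSharedInstance` are hypothetical names, and the loop is only a stand-in for a runtime that hands the same instance to every call.

```java
import java.util.function.Consumer;

/** Plain-Java simulation; none of these types are Flink classes. */
final class CrossCallSketch {

    /** Hypothetical mutable record type with an explicit copy method. */
    static final class Record {
        int value;
        Record(int value) { this.value = value; }
        Record copy() { return new Record(value); }
    }

    /** Remembers the first input it sees, either by reference or as a copy. */
    static final class RememberFirst implements Consumer<Record> {
        private final boolean copyOnRetain;
        Record first;

        RememberFirst(boolean copyOnRetain) { this.copyOnRetain = copyOnRetain; }

        @Override
        public void accept(Record in) {
            if (first == null) {
                first = copyOnRetain ? in.copy() : in;
            }
        }
    }

    /** Stand-in for a runtime that passes one shared instance to every call. */
    static void runWithSharedInstance(int[] values, Consumer<Record> fn) {
        Record shared = new Record(0);
        for (int v : values) {
            shared.value = v;   // input from the previous call is overwritten
            fn.accept(shared);
        }
    }

    /** Returns what the function believes its first input was (needs >= 1 value). */
    static int firstSeen(int[] values, boolean copyOnRetain) {
        RememberFirst fn = new RememberFirst(copyOnRetain);
        runWithSharedInstance(values, fn);
        return fn.first.value;
    }

    public static void main(String[] args) {
        int[] input = {10, 20, 30};
        System.out.println(firstSeen(input, false)); // 30: retained reference was overwritten
        System.out.println(firstSeen(input, true));  // 10: the copy is unaffected
    }
}
```

The copy costs one allocation per retained record, but only in functions that actually keep state across calls, so everything else still avoids the defensive copies.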