Gabor and Greg gave some good comments on the proposal. If there is no more feedback, I'll go ahead and open a PR to update the documentation tomorrow.
Thanks, Fabian

2016-02-24 12:24 GMT+01:00 Fabian Hueske <fhue...@gmail.com>:

> Regarding the scope of the object-reuse setting, I agree with Greg.
> It would be very nice if we could specify the object-reuse mode for each
> user function.
>
> Greg, do you want to open a JIRA for that such that we can continue the
> discussion there?
>
> 2016-02-24 12:07 GMT+01:00 Fabian Hueske <fhue...@gmail.com>:
>
>> Hi everybody,
>>
>> thanks for your input.
>>
>> I sketched a proposal for updated object-reuse semantics and
>> documentation, based on Gabor's proposal (1), Greg's input, and the
>> changed semantics that I discussed earlier in this thread.
>>
>> --> https://docs.google.com/document/d/1jpPr2UuWlqq1iIDIo_1kmPL9QjA-sXAC9wkj-hE4PAc/edit#
>>
>> Looking forward to your comments.
>>
>> Fabian
>>
>> (1) https://docs.google.com/document/d/1cgkuttvmj4jUonG7E2RdFVjKlfQDm_hE6gvFcgAfzXg/edit
>>
>> 2016-02-20 13:04 GMT+01:00 Gábor Gévay <gga...@gmail.com>:
>>
>>> Thanks, Ken! I was wondering how other systems handle these issues.
>>>
>>> Fortunately, the deep copy - shallow copy problem doesn't arise in
>>> Flink: when we copy an object, it is always a deep copy (at least, I
>>> hope so :)).
>>>
>>> Best,
>>> Gábor
>>>
>>> 2016-02-19 22:29 GMT+01:00 Ken Krugler <kkrugler_li...@transpac.com>:
>>> > Not sure how useful this is, but we'd run into similar issues with
>>> > Cascading over the years.
>>> >
>>> > This wasn't an issue for input data, as Cascading "locks" the Tuple
>>> > such that attempts to modify it will fail.
>>> >
>>> > And in general Hadoop always re-uses the data container being passed
>>> > to operations, so you quickly learn to not cache those :)
>>> >
>>> > When trying to re-use a Tuple as the output in an operation, things
>>> > get a bit more complicated.
>>> >
>>> > If the Tuple only contains primitive types, then there's no issue, as
>>> > the (effectively) shallow copy created by the execution platform
>>> > doesn't create a problem.
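The difference becomes visible as soon as a non-primitive field is involved. A minimal sketch in plain Java (the `Pair` class below is a hypothetical stand-in for a framework tuple, not Cascading's actual API):

```java
import java.util.Arrays;

// Hypothetical stand-in for a framework tuple; not Cascading's real Tuple API.
class Pair {
    int primitive;  // primitive field: unaffected by a shallow copy
    int[] nested;   // non-primitive field: shared by a shallow copy

    Pair(int primitive, int[] nested) {
        this.primitive = primitive;
        this.nested = nested;
    }

    // Shallow copy: copies only the reference to `nested`, not its contents.
    Pair shallowCopy() {
        return new Pair(primitive, nested);
    }

    // Deep copy: clones the nested object as well.
    Pair deepCopy() {
        return new Pair(primitive, Arrays.copyOf(nested, nested.length));
    }
}

public class ShallowCopyHazard {
    public static void main(String[] args) {
        Pair original = new Pair(1, new int[] {42});

        Pair cachedShallow = original.shallowCopy();
        Pair cachedDeep = original.deepCopy();

        // An upstream operation mutates the nested object of the original.
        original.nested[0] = -1;

        // The shallow copy sees the mutation (the cached data is "borked");
        // the deep copy is unaffected.
        System.out.println(cachedShallow.nested[0]); // -1
        System.out.println(cachedDeep.nested[0]);    // 42
    }
}
```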
>>> > If the Tuple contains an object (e.g. a nested Tuple), then there were
>>> > situations where a deep copy would need to be made before passing the
>>> > Tuple to the operation's output collector.
>>> >
>>> > For example, if the next (chained) operation was a map-side
>>> > aggregator, then a shallow copy of the Tuple would be cached. If
>>> > there's a non-primitive object, then changes to this in the upstream
>>> > operation obviously bork the cached data.
>>> >
>>> > Net-net is that we wanted a way to find out, from inside an
>>> > operation, whether we needed to make a deep copy of the output Tuple.
>>> > But that doesn't exist (yet), so we have some utility code to check
>>> > if a deep copy is needed (non-primitive types), and if so it
>>> > auto-clones the Tuple. Which isn't very efficient, but for most of
>>> > our workflows we only have primitive types.
>>> >
>>> > -- Ken
>>> >
>>> >> From: Fabian Hueske
>>> >> Sent: February 17, 2016 9:17:27am PST
>>> >> To: dev@flink.apache.org
>>> >> Subject: Guarantees for object reuse modes and documentation
>>> >>
>>> >> Hi,
>>> >>
>>> >> Flink's DataSet API features a configuration parameter called
>>> >> enableObjectReuse(). If activated, Flink's runtime will create fewer
>>> >> objects, which results in better performance and lower garbage
>>> >> collection overhead. Depending on whether the configuration switch
>>> >> is enabled or not, user functions may or may not perform certain
>>> >> operations on objects they receive from Flink or emit to Flink.
>>> >>
>>> >> At the moment, there are quite a few open issues and discussions
>>> >> going on about the object reuse mode, including the JIRA issues
>>> >> FLINK-3333, FLINK-1521, FLINK-3335, FLINK-3340, FLINK-3394, and
>>> >> FLINK-3291.
>>> >>
>>> >> IMO, the most important issue is FLINK-3333, which is about
>>> >> improving the documentation of the object reuse mode.
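For illustration, the hazard that object reuse creates for functions that cache their input can be sketched without any Flink dependency; the iterator below merely mimics what an object-reusing runtime does, and all names are hypothetical:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ObjectReuseHazard {

    // A mutable record, as a user POJO handed to a function would be.
    static class Record {
        int value;
    }

    // Mimics an object-reusing runtime: every call to next() returns the
    // SAME Record instance, mutated in place.
    static Iterator<Record> reusingIterator(int[] values) {
        Record reused = new Record();
        return new Iterator<Record>() {
            int i = 0;
            public boolean hasNext() { return i < values.length; }
            public Record next() {
                reused.value = values[i++]; // mutate in place, no new object
                return reused;
            }
        };
    }

    public static void main(String[] args) {
        // A "user function" that naively caches the objects it receives.
        List<Record> cache = new ArrayList<>();
        Iterator<Record> it = reusingIterator(new int[] {1, 2, 3});
        while (it.hasNext()) {
            cache.add(it.next()); // BUG under reuse: all entries alias one object
        }
        // Every cached entry now holds the last value.
        for (Record r : cache) {
            System.out.println(r.value); // prints 3, 3, 3
        }
    }
}
```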
>>> >> The current version [1] is ambiguous and includes details about
>>> >> operator chaining which are hard for users to understand and reason
>>> >> about. Hence, it is not very clear which guarantees Flink gives for
>>> >> objects in user functions under which conditions. This documentation
>>> >> needs to be improved, and I think this should happen together with
>>> >> the 1.0 release.
>>> >>
>>> >> Greg and Gabor proposed two new versions:
>>> >>
>>> >> 1. Greg's version [2] improves and clarifies the current
>>> >> documentation without significantly changing the semantics. It also
>>> >> discusses operator chaining, but gives more details.
>>> >>
>>> >> 2. Gabor's proposal [3] aims to make the discussion of object reuse
>>> >> independent of operator chaining, which I think is a very good idea
>>> >> because it is not transparent to the user when function chaining
>>> >> happens. Gabor formulated four questions to answer what users can do
>>> >> with, and expect from, objects that they receive or emit in a
>>> >> function. In order to make the answers to these questions
>>> >> independent of function chaining and still keep the contracts as
>>> >> defined by the current documentation, we have to default to rather
>>> >> restrictive rules. For instance, functions must always emit new
>>> >> object instances when object reuse is disabled. These strict rules
>>> >> would, for example, also require DataSourceFunctions to copy all
>>> >> records which they receive from an InputFormat (see FLINK-3335).
>>> >> IMO, the strict guarantees make the disableObjectReuse mode harder
>>> >> to use and reason about than the enableObjectReuse mode, whereas the
>>> >> opposite should be the case.
>>> >>
>>> >> I would like to suggest a third option.
>>> >> Similar to Gabor, I think the rules should be independent of
>>> >> function chaining, and I would like to break them down into a
>>> >> handful of easy rules. However, I think we should loosen the
>>> >> guarantees for user functions under disableObjectReuse mode a bit.
>>> >>
>>> >> Right now, the documentation states that under disableObjectReuse
>>> >> mode, input objects are not changed across function calls. Hence,
>>> >> users can remember these objects across function calls and their
>>> >> value will not change. I propose to give this guarantee only within
>>> >> function calls and only for objects which are not emitted. Hence,
>>> >> this rule only applies to functions that can consume multiple values
>>> >> through an iterator, such as GroupReduce, CoGroup, or MapPartition.
>>> >> In disableObjectReuse mode, these functions are allowed to remember
>>> >> the values, e.g., in a collection, and their value will not change
>>> >> when the iterator is forwarded. Once the function call returns, the
>>> >> values might change. Since functions with iterators cannot be
>>> >> directly chained, it will be safe to emit the same object instance
>>> >> several times (hence FLINK-3335 would become invalid).
>>> >>
>>> >> The difference to the current guarantees is that input objects
>>> >> become invalid after the function call has returned. Since the
>>> >> disableObjectReuse mode was mainly introduced to allow for caching
>>> >> objects across iterator calls within a GroupReduceFunction or
>>> >> CoGroupFunction (not across function calls), I think this is a
>>> >> reasonable restriction.
>>> >>
>>> >> tl;dr:
>>> >>
>>> >> If we want to make the documentation of object reuse independent of
>>> >> chaining, we have to
>>> >>
>>> >> - EITHER give tighter guarantees / be more restrictive than now and
>>> >> update internals, which might lead to performance regression.
>>> >> This would be in line with the current documentation, but would
>>> >> somewhat defeat the purpose of the disableObjectReuse mode, IMO.
>>> >>
>>> >> - OR give weaker guarantees, which breaks with the current
>>> >> documentation, but would not affect performance and would be easier
>>> >> for users to follow, IMO.
>>> >>
>>> >> Greg and Gabor, please correct me if I did not get your points right
>>> >> or missed something.
>>> >>
>>> >> What do others think?
>>> >>
>>> >> Fabian
>>> >>
>>> >> [1] https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/index.html#object-reuse-behavior
>>> >>
>>> >> [2] https://issues.apache.org/jira/browse/FLINK-3333?focusedCommentId=15139151
>>> >>
>>> >> [3] https://docs.google.com/document/d/1cgkuttvmj4jUonG7E2RdFVjKlfQDm_hE6gvFcgAfzXg
>>> >
>>> > --------------------------
>>> > Ken Krugler
>>> > +1 530-210-6378
>>> > http://www.scaleunlimited.com
>>> > custom big data solutions & training
>>> > Hadoop, Cascading, Cassandra & Solr
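As an illustration of the proposed contract, the defensive pattern for user code would look roughly like this (plain Java, no Flink dependency; `sumGroup` only mimics an iterator-based function such as GroupReduce, and all names are hypothetical): values pulled from the iterator are stable within one call, but anything kept beyond the call should be copied.

```java
import java.util.ArrayList;
import java.util.List;

public class CopyBeforeKeeping {

    static class Record {
        int value;
        Record(int value) { this.value = value; }
        Record copy() { return new Record(value); } // deep copy (only primitives here)
    }

    // State a user wants to keep beyond a single function call.
    static final List<Record> keptAcrossCalls = new ArrayList<>();

    // Mimics one call of an iterator-based function (GroupReduce-like).
    // Under the proposed semantics, `values` are stable only WITHIN this call.
    static int sumGroup(Iterable<Record> values) {
        int sum = 0;
        for (Record r : values) {
            sum += r.value;
            keptAcrossCalls.add(r.copy()); // copy: `r` may become invalid after return
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Record> group = new ArrayList<>();
        group.add(new Record(1));
        group.add(new Record(2));
        System.out.println(sumGroup(group)); // 3

        // The runtime may now recycle the input objects; the copies are unaffected.
        group.get(0).value = -99;
        System.out.println(keptAcrossCalls.get(0).value); // still 1
    }
}
```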