Re: [DISCUSS] FLIP-54: Evolve ConfigOption and Configuration

Timo Walther Fri, 30 Aug 2019 07:04:59 -0700

Hi Becket,

1. First of all, you are totally right. The FLIP contains a bug due tothe last minute changes that Dawid suggested: by having immutableobjects created by a factory we loose the serializability of theConfiguration because the factory itself is not stored in theConfiguration. I would propose to revert the last change and stick tothe original design, which means that a object must implement theConfigurable interface and also implements serialization/deserializationmethods such that also internal fields can be persisted as yousuggested. But in general, people should not use any internal fields.Configurable objects are meant for simple little helper POJOs, notcomplex arbitrary nested data structures.

It is Map<String, Object> because Configuration stores the raw objects.If you put a Boolean option into it, it remains Boolean. This makes themap very efficient for shipping to the cluster and accessing optionsmultiple times. The same for configurable objects. We put the pureobjects into the map without any serialization/deserialization. Theprovided factory allows to convert the Object into a Configuration andwe know how to serialize/deserializise a configuration because it isjust a key/value map.

2. Yes, this is what we had in mind. It should still be the sameconfiguration option. We would like to avoid specialized option keysacross components (exec.max-para and table.exec.max-para) if they aredescribing basically the same thing. But adding some more descriptionlike "TableOptions.MAX_PARALLELISM with description_1 + description_2"does not hurt.

3. They should restore the original object given that thetoConfiguration/fromConfiguration methods have been implementedcorrectly. I will extend the example to make the logic clearer whilefixing the bug.


Thanks for the healthy discussion,
Timo


On 30.08.19 15:29, Becket Qin wrote:

Hi Timo,

Thanks again for the clarification. Please see a few more questions below.

Re: 1

Please also keep in mind that Configuration must not consist of only
strings, it manages a Map<String, Object> for efficient access. Every
map entry can have a string representation for persistence, but in most
cases consists of unserialized objects.


I'd like to understand this a bit more. The reason we have a Map<String,
Object> in Configuration was because that Object could be either a String,
a List or a Map, right? But they eventually always boil down to Strings, or
maybe one of the predefined type that we know how to serialize. In the
current design, can the Object also be Configurable?
If the value in the config Map<String, Object> can be Configurable objects,
how do we serialize them? Calling toConfiguration() seems not quite working
because there might be some other internal fields that are not part of the
configuration. The modification to those fields will be lost if we simply
use toConfiguration(). So the impact of modifying those Configurable
objects seems a little undefined. And it would be difficult to prevent
users from doing that.
If the value in the config Map<String, Object> cannot be Configurable
objects, then it seems a little weird to have all the other ConfigType
stored in the ConfigMap in their own native type and accessed via
getInteger() / getBoolean(), etc, while having ConfigurableType to be
different from others because one have to use ConfigurableFactory to get
the configurations.

Re: 2

I think about the withExtendedDescription as a helper getter in a
different place, so that the option is easier to find in a different
module from it was defined.
The MAX_PARALLELISM option in TableOptions would conceptually be equal to:
public ConfigOption getMaxParallelismOption() {
     return CoreOptions.MAX_PARALLELISM;
}

Just to make sure I understand it correctly, does that mean users will see
something like following?
  - CoreOptions.MAX_PARALLELISM with description_1;
  - TableOptions.MAX_PARALLELISM with description_1 + description_2.
  - DataStreamOptions.MAX_PARALLELISM with description_1 + description_3.
And users will only configure exactly one MAX_PARALLELISM cross the board.
So they won't be confused by setting two MAX_PARALLELISM config for two
different modules, while they are actually the same config. If that is the
case, I don't have further concern.

Re: 3
Maybe I am thinking too much. I thought toBytes() / fromBytes() actually
restore the original Object. But fromConfiguration() and toConfiguration()
does not do that, anything not in the configuration of the original object
will be lost. So it would be good to make that clear. Maybe a clear Java
doc is sufficient.

Thanks,

Jiangjie (Becket) Qin

On Fri, Aug 30, 2019 at 4:08 PM Dawid Wysakowicz <dwysakow...@apache.org>
wrote:

Hi,

Ad. 1

The advantage of our approach is that you have the type definition close
to the option definition. The only difference is that it enables
expressing simple pojos with the primitives like ConfigOption<Integer>,
ConfigOption<Long> etc. Otherwise as Timo said you will start having

the parsing logic scattered everywhere in the code base as it is now.
The string representation in our proposal is exactly the same as you
explained for those three options. The only difference is that you don't
have to parse the elements of a List, Map etc. afterwards.

Ad. 2

I think about the withExtendedDescription as a helper getter in a
different place, so that the option is easier to find in a different
module from it was defined.

The MAX_PARALLELISM option in TableOptions would conceptually be equal to:

public ConfigOption getMaxParallelismOption() {

     return CoreOptions.MAX_PARALLELISM;

}

This allows to further clarify the description of the option in the
context of a different module and end up in a seperate page in
documentation (but with a link to the original one). In the end it is
exactly the same option. It has exactly same key, type, parsing logic,
it is in the end forwarded to the same place.

Ad. 3

Not sure if I understand your concerns here. As Timo said it is in the
end sth similar to toBytes/fromBytes, but it puts itself to a
Configuration. Also just wanted to make sure we adjusted this part
slightly and now the ConfigOption takes ConfigurableFactory.

Best,

Dawid


On 30/08/2019 09:39, Timo Walther wrote:

Hi Becket,

thanks for the discussion.

1. ConfigOptions in their current design are bound to classes.
Regarding, the option is "creating some Configurable objects instead
of defining the config to create
those Configurable"? We just moved this logic to a factory, this
factory can then also be used for other purposes. However, how the
option and objects are serialized to Configuration is still not part
of the option. The option is just pure declaration.

If we would allow only List<String>, implementers would need to start
implementing own parsing and validation logic all the time. We would
like to avoid that.

Please also keep in mind that Configuration must not consist of only
strings, it manages a Map<String, Object> for efficient access. Every
map entry can have a string representation for persistence, but in
most cases consists of unserialized objects.

2. MAX_PARALLELISM is still defined just once. We don't overwrite
keys, types or default values. But different layers might want to add
helpful information. In our concrete use case for FLIP-59,
ExecutionConfig has 50 properties and many of them are not relevant
for the Table layer or have no effect at all. We would like to list
and mention the most important config options again in the Table
Configuration section, so that users are not confused, but with a
strong link to the core option. E.g.: registered kryo serializers are
also important also for Table users, we would like to add the comment
"This option allows to modify the serialization of the ANY SQL data
type.". I think we should not spam the core configuration page with
comments from all layers, connectors, or libraries but keep this in
the corresponding component documentation.

3. But it is something like fromBytes() and toBytes()? It serializes
and deserializes T from a configuration?

Regards,
Timo

On 29.08.19 19:14, Becket Qin wrote:

Hi Timo and Stephan,

Thanks for the detail explanation.

1. I agree that each config should be in a human readable format. My
concern is that the current List<Configurable> looks going a little
too far
from what the configuration is supposed to do. They are essentially
creating some Configurable objects instead of defining the config to
create
those Configurable. This mixes ConfigOption and the usage of it. API
wise
it would be good to keep the configs and their usages (such as how to
create objects using the ConfigOption) apart from each other.
I am wondering if we can just make List also only take string. For
example,
is the following definition of map and list sufficient?

A MapConfigOption is ConfigOption<Map<String, String>>. It can be
defined
as:
map_config_name: k1=v1, k2=v2, k3=v3, ...

A ListConfigOption is ConfigOption<List<String>>. It can be defined as:
list_config_name: v1, v2, v3, ...

A ListOfMapConfigOption is ConfigOption<List<Map<String, String>>. It
can
be defined as:
list_of_map_config_name: k1=v1, k2=v2; k3=v3, k4=v4;....

All the key and values in the configuration are String. This also
guarantees that the configuration is always serializable.
If we want to do one more step, we can allow the ConfigOption to set all
the primitive types and parse that for the users. So something like
List<Integer>, List<Class<?>> seems fine.

The configuration class could also have util methods to create a list of
configurable such as:
<T> List<T> <Configuration#getConfigurableInstances(ListMapConfigOption,
Class<T> clazz).
But the configuration class will not take arbitrary Configurable as the
value of its config.

2. I might have misunderstood this. But my concern on the description
extension is in the following example.

public static final ConfigOption<Integer> MAX_PARALLELISM =

CoreOptions.MAX_PARALLELISM.withExtendedDescription(
"Note: That this property means that a table program has a side-effect
XYZ.");

In this case, we will have two MAX_PARALLELISM configs now. One is
CoreOptions.MAX_PARALLELISM. The other one is defined here. I suppose
users
will see both configurations. One with an extended description and one
without. Let's say there is a third component which also users
MAX_PARALLELISM, will there be yet another MAX_PARALLELISM
ConfigOption? If
so, what would that ConfigOption's description look like?

Ideally, we would want to have just one CoreOptions.MAX_PARALLELISM
and the
description should clearly state all the usage of this ConfigOption.

3. I see, in that case, how about we name it something like
extractConfiguration()? I am just trying to see if we can make it clear
this is not something like fromBytes() and toBytes().

Thanks,

Jiangjie (Becket) Qin

On Thu, Aug 29, 2019 at 6:09 PM Timo Walther <twal...@apache.org>

wrote:

Hi Becket,

let me try to clarify some of your questions:

1. For every option, we also needed to think about how to represent it
in a human readable format. We do not want to allow arbitrary nesting
because that would easily allow to bypass the flattened hierarchy of
config options (`session.memory.min`). The current design allows to
represent every option type as a list. E.g.:

`myIntOption: 12` can be `myIntListOption: 12;12`
`myObjectOption: field=12,other=true` can be `myObjectListOption:
field=12,other=true; field=12,other=true`
`myPropertyOption: key=str0,other=str1` can be `myPropertyListOption:
key=str0,other=str1;key=str0,other=str1`

We need the atomic class for serialization/deserialization both in
binary and string format.

ConfigOption<List> is not present in the code base yet, but this
FLIP is
a preparation of making ExecutionConfig configurable. If you look into
this class or also in existing table connectors/formats, you will see
that each proposed option type has its requirements.

2. Regarding extending the description of ConfigOptions, the
semantic of
one option should be a super set of the other option. E.g. in Table API
we might use general ExecutionConfig properties. But we would like
to a)
make external options more prominent in the Table API config docs to
link people to properties they should pay attention b) notice about
side
effects. The core semantic of a property should not change.

3. The factory will not receive the entire configuration but works in a
separate key space. For `myObjectOption` above, it would receive a
configuration that consists of `field: 12` and `other: true`.

I agree. I will convert the document into a Wiki page today.

Thanks,
Timo

On 29.08.19 09:00, Stephan Ewen wrote:

@Becket One thing that may be non-obvious is that the Configuration
class
also defines serialization / persistence logic at the moment. So it
needs
to know the set of types it supports. That stands in the way of an
arbitrary generic map type.

@Timo I agree though that it seems a bit inconsistent to have one
collection orthogonal to the type (List) and another one bound to

specific

types (Map).

On Thu, Aug 29, 2019 at 8:20 AM Becket Qin <becket....@gmail.com>
wrote:

Hi Timo,

Thanks for the proposal. Sorry for the late comments, but I have a
few
questions / comments.

1. Is a new field of isList necessary in the ConfigOption?
Would it be enough to just check the atomicClass to see if it is a
List

or

not?
Also, in the ConfigOption<Map> class case, are we always assume
both key
and value types are String? Can we just apply the same to the
ConfigOption<List>?
BTW, I did a quick search in the codebase but did not find any
usage of
ConfigOption<List>.

2. The same config name, but with two ConfigOption with different

semantic

in different component seems super confusing. For example, when users

set

both configs, they may have no idea one is overriding the other.
There
might be two cases:
    - If it is just the same config used by different components to
act
accordingly, it might be better to just have one config, but describe
clearly on how that config will be used.
    - If it is actually two configurations that can be set
differently, I
think the config names should just be different.

3. Regarding the ConfigurableFactory, is the toConfiguration() method
pretty much means getConfiguration()? The toConfiguration() method

sounds

like converting an object to a configuration, which only works if the
object does not contain any state / value. I am also wondering if
there

is

a real use case of this method. Because supposedly the configurations

could

just be passed around to caller of this method.

Also, can you put the proposal into the FLIP wiki instead of in the

Google

doc before voting? The FLIP wiki allows track the modification
history

and

has a more established structure to ensure nothing is missed.

Thanks,

Jiangjie (Becket) Qin

On Tue, Aug 27, 2019 at 11:34 PM Timo Walther <twal...@apache.org>

wrote:

Hi everyone,

I updated the FLIP proposal one more time as mentioned in the voting
thread. If there are no objections, I will start a new voting thread
tomorrow at 9am Berlin time.

Thanks,
Timo


On 22.08.19 14:19, Timo Walther wrote:

Hi everyone,

thanks for all the feedback we have received online and offline. It
showed that many people support the idea of evolving the Flink
configuration functionality. I'm almost sure that this FLIP will
not
solve all issues but at least will improve the current status.

We've updated the document and replaced the Correlation part
with the
concept of a ConfigOptionGroup that can provide all available
options
of a group plus custom group validators for eager validation.
For now,
this eager group validation will only be used at certain
locations in
the Flink code but it prepares for maybe validating the entire
global
configuration before submitting a job in the future.

Please take another look if you find time. I hope we can proceed
with
the voting process if there are no objections.

Regards,
Timo

Am 19.08.19 um 12:54 schrieb Timo Walther:

Hi Stephan,

thanks for your suggestions. Let me give you some background about
the decisions made in this FLIP:

1. Goal: The FLIP is labelled "evolve" not "rework" because we did
not want to change the entire configuration infrastructure.
Both for
backwards-compatibility reasons and the amount of work that
would be
required to update all options. If our goal is to rework the
configuration option entirely, I might suggest to switch to JSON
format with JSON schema and JSON validator. However, setting
properties in a CLI or web interface becomes more tricky the more
nested structures are allowed.

2. Class-based Options: The current ConfigOption<T> class is
centered
around Java classes where T is determined by the default value.
The
FLIP just makes this more explicit by offering an explicit
`intType()` method etc. The current design of validators centered
around Java classes makes it possible to have typical domain
validators baked by generics as you suggested. If we introduce
types
such as "quantity with measure and unit" we still need to get a
class
out of this option at the end, so why changing a proven concept?

3. List Options: The `isList` prevents having arbitrary
nesting. As
Dawid mentioned, we kept human readability in mind. For every
atomic
option like "key=12" can be represented by a list "keys=12;13".
But
we don't want to go further; esp. no nesting. A dedicated list
option
would start making this more complicated such as
"ListOption(ObjectOption(ListOption(IntOption, ...),
StringOption(...)))", do we want that?

4. Correlation: The correlation part is one of the suggestions
that I
like least in the document. We can also discuss removing it
entirely,
but I think it solves the use case of relating options with each
other in a flexible way right next to the actual option.
Instead of
being hidden in some component initialization, we should put it
close
to the option to also perform validation eagerly instead of
failing
at runtime when the option is accessed the first time.

Regards,
Timo


Am 18.08.19 um 23:32 schrieb Stephan Ewen:

A "List Type" sounds like a good direction to me.

The comment on the type system was a bit brief, I agree. The
idea is
to see
if something like that can ease validation. Especially the

correlation

system seems quite complex (proxies to work around order of
initialization).

For example, let's assume we don't think primarily about "java
types" but
would define types as one of the following (just examples,
haven't
thought
all the details through):

      (a) category type: implies string, and a fix set of possible

values.

Those would be passes and naturally make it into the docs and
validation.
Maps to a String or Enum in Java.

      (b) numeric integer type: implies long (or optionally
integer,

if

we want
to automatically check overflow / underflow). would take typical

domain

validators, like non-negative, etc.

      (c) numeric real type: same as above (double or float)

      (d) numeric interval type: either defined as an interval, or
references
other parameter by key. validation by valid interval.

      (e) quantity: a measure and a unit. separately parsable. The
measure's
type could be any of the numeric types above, with same
validation
rules.

With a system like the above, would we still correlation
validators?
Are
there still cases that we need to catch early (config loading) or
are the
remaining cases sufficiently rare and runtime or setup specific,
that it is
fine to handle them in component initialization?


On Sun, Aug 18, 2019 at 6:36 PM Dawid Wysakowicz
<dwysakow...@apache.org>
wrote:

Hi Stephan,

Thank you for your opinion.

Actually list/composite types are the topics we spent the
most of

the

time. I understand that from a perspective of a full blown type
system,
a field like isList may look weird. Please let me elaborate a
bit

more

on the reason behind it though. Maybe we weren't clear enough
about

it

in the FLIP. The key feature of all the conifg options is
that they
must
have a string representation as they might come from a

configuration

file. Moreover it must be a human readable format, so that the

values

might be manually adjusted. Having that in mind we did not
want to
add a
support of an arbitrary nesting and we decided to allow for
lists

only

(and flat objects - I think though in the current design
there is a
mistake around the Configurable interface). I think though
you have

point here and it would be better to have a ListConfigOption
instead of
this field. Does it make sense to you?

As for the second part of your message. I am not sure if I

understood

it. The validators work with parse/deserialized values from
Configuration that means they can be bound to the generic
parameter

of

Configuration. You can have a RangeValidator<? extends
Comparable/Number>. I don't think the type hierarchy in the
ConfigOption
has anything to do with the validation logic. Could you
elaborate a
bit
more what did you mean?

Best,

Dawid

On 18/08/2019 16:42, Stephan Ewen wrote:

I like the idea of enhancing the configuration and to do early

validation.

I feel that some of the ideas in the FLIP seem a bit ad hoc,
though. For
example, having a boolean "isList" is a clear indication of not
having
thought through the type/category system.
Also, having a more clear category system makes validation

simpler.

For example, I have seen systems distinguishing between numeric

parameters

(valid ranges), category parameters (set of possible values),
quantities
like duration and memory size (need measure and unit), which
results in

an

elegant system for validation.


On Fri, Aug 16, 2019 at 5:22 PM JingsongLee <

lzljs3620...@aliyun.com

.invalid>

wrote:

+1 to this, thanks Timo and Dawid for the design.
This allows the currently cluttered configuration of various
     modules to be unified.
This is also first step of one of the keys to making new
unified
TableEnvironment available for production.

Previously, we did encounter complex configurations, such as
specifying the skewed values of column in DDL. The skew may
     be a single field or a combination of multiple fields.
So the
     configuration is very troublesome. We used JSON string to
     configure it.

Best,
Jingsong Lee

------------------------------------------------------------------

From:Jark Wu <imj...@gmail.com>
Send Time:2019年8月16日(星期五) 16:44
To:dev <dev@flink.apache.org>
Subject:Re: [DISCUSS] FLIP-54: Evolve ConfigOption and

Configuration

Thanks for starting this design Timo and Dawid,

Improving ConfigOption has been hovering in my mind for a long

time.

We have seen the benefit when developing blink
configurations and

connector

properties in 1.9 release.
Thanks for bringing it up and make such a detailed design.
I will leave my thoughts and comments there.

Cheers,
Jark


On Fri, 16 Aug 2019 at 22:30, Zili Chen <wander4...@gmail.com
wrote:

Hi Timo,

It looks interesting. Thanks for preparing this FLIP!

Client API enhancement benefit from this evolution which
hopefully provides a better view of configuration of Flink.
In client API enhancement, we likely make the deployment
of cluster and submission of job totally defined by

configuration.

Will take a look at the document in days.

Best,
tison.


Timo Walther <twal...@apache.org> 于2019年8月16日周五
下午10:12写道：

Hi everyone,

Dawid and I are working on making parts of
ExecutionConfig and
TableConfig configurable via config options. This is
necessary
to make
all properties also available in SQL. Additionally, with the
new SQL

DDL

based on properties as well as more connectors and formats
coming up,
unified configuration becomes more important.

We need more features around string-based configuration
in the
future,
which is why Dawid and I would like to propose FLIP-54 for
evolving

the

ConfigOption and Configuration classes:

https://docs.google.com/document/d/1IQ7nwXqmhCy900t2vQLEL3N2HIdMg-JO8vTzo1BtyKU/edit

In summary it adds:
- documented types and validation
- more common types such as memory size, duration, list
- simple non-nested object types

Looking forward to your feedback,
Timo

Re: [DISCUSS] FLIP-54: Evolve ConfigOption and Configuration

Reply via email to