Thanks Jungtaek!

On Fri, Feb 14, 2020 at 3:57 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
> OK, I agree this is going forward, as we have decided on the final goal and it seems someone has started to work on it. In the meanwhile we agree about documenting the direct relationship, and which style to use is "open".
>
> Thanks again for initiating the discussion thread - this thread led to the follow-up thread for the final goal.
>
> On Fri, Feb 14, 2020 at 1:43 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>
>> It's okay to just follow one prevailing style. The main point I would like to make is that we should *document* the direct relationship between configurations. For this case specifically, I don't think there is much point in deciding on one hard requirement to follow for the mid-term. We have the final goal to reach.
>>
>> There are always many cases that require a committer's judgement. It is impossible to put every single item into a guideline in practice. So we usually trust their judgement unless there's an explicit objection. Here, my explicit objection is about missing documentation of the direct relationship between configurations. It seems people think we should document this in general as well.
>>
>> In practice, I don't particularly mind which style is used. I vaguely guess the style isn't an issue to many other committers in general. If it is, let's open a separate thread to confirm one style - this wasn't the main issue I wanted to address in this thread.
>>
>> Shall we conclude this thread by deciding to document the direct relationship between configurations, preferably in one prevailing style?
>>
>> On Fri, Feb 14, 2020 at 11:36 AM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>
>>> Even spark.dynamicAllocation.* doesn't follow 2-2, right? It follows a mix of 1 and 2-1, though 1 is even broken there.
>>>
>>> It doesn't matter if we are just discussing a one-time decision - it may be OK not to be strict on consistency, though it's not ideal. The thing is that these kinds of "preferences" create confusion in the review phase: reviewers provide different comments and try to "educate" contributors on their preferences. Expectations for such cases depend heavily on who the reviewers of the PR are - there's no value in that kind of education.
>>>
>>> The codebase is the reference for the implicit rules/policies that apply to all contributors, including newcomers. Let's just put our best effort into being consistent in the codebase. (We should have consensus to do this anyway.)
>>>
>>> On Thu, Feb 13, 2020 at 12:44 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>>
>>>> I think it's just fine as long as we're consistent with the instances that have the description, for instance:
>>>>
>>>> When true and 'spark.xx.xx' is enabled, …
>>>>
>>>> I think this is 2-2 in most cases so far. We can reference other configuration keys in another configuration's documentation by using ADAPTIVE_EXECUTION_ENABLED.key, for example, so we don't have a duplication problem in most cases.
>>>>
>>>> Being consistent with the current codebase is the general guidance, if I am not mistaken. We have identified a problem and a good goal to reach. If we want to change, let's do it as our final goal. Otherwise, let's simplify it to reduce the overhead rather than having a policy specifically for the mid-term.
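As an illustration of the 2-2 style and the `.key` reference mentioned above, here is a minimal sketch in Scala. It assumes Spark's internal ConfigBuilder-style API; the doc strings, default values, and exact builder calls are illustrative assumptions, not the actual Spark definitions.

    import org.apache.spark.internal.config.ConfigBuilder

    // Prerequisite toggle.
    val ADAPTIVE_EXECUTION_ENABLED = ConfigBuilder("spark.sql.adaptive.enabled")
      .doc("When true, enable adaptive query execution.")
      .booleanConf
      .createWithDefault(false)

    // Dependent config: its doc interpolates the prerequisite entry's .key,
    // so the documented relationship cannot drift from the real key name.
    val REDUCE_POST_SHUFFLE_PARTITIONS_ENABLED =
      ConfigBuilder("spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled")
        .doc(s"When true and ${ADAPTIVE_EXECUTION_ENABLED.key} is true, reduce the " +
          "number of post-shuffle partitions based on map output statistics.")
        .booleanConf
        .createWithDefault(true)

With this style the direct relationship is stated explicitly in the generated documentation without duplicating the key string by hand.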
>>>> On Thu, Feb 13, 2020 at 12:24 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>>>
>>>>> I tend to agree that there should be a time to make things consistent (and I'm very happy to see the new discussion thread), and we may want to adopt some practice in the interim.
>>>>>
>>>>> But for me it is not clear what the practice in the interim is. I pointed out the problems of the existing style, and if we all agree the problems are valid then we may need to fix them before we start using it widely.
>>>>>
>>>>> For me the major question is "where" to put it - there are at least two different approaches:
>>>>>
>>>>> 1) Put it in the description of `.enabled` and describe the range of impact, e.g. put in the description of "spark.A.enabled" that it will affect the following configurations under "spark.A". (I don't think it's good to enumerate all of the affected configs; the description of `spark.dynamicAllocation.enabled` shows how fragile that is.)
>>>>>
>>>>> 2) Put it in the description of all affected configurations and describe which configuration is the prerequisite for this one to be effective, e.g. put it on all configurations under `spark.dynamicAllocation`, mentioning that `spark.dynamicAllocation.enabled` should be enabled for this to take effect.
>>>>>
>>>>> (I intentionally skip mentioning "cross reference". Let's be pragmatic.)
>>>>>
>>>>> 2) also has two ways to describe it:
>>>>>
>>>>> 2-1) Just mention it simply - like "When dynamic allocation is enabled, ..." - without pointing out the key to toggle. This hugely simplifies the description, though end users may have to do some second-guessing to find the key to toggle. (It'll be intuitive once we structurize the configurations.)
>>>>>
>>>>> 2-2) Mention the key to toggle directly - like "This is effective only if spark.A.enabled is set to true." It gets longer depending on how long the configuration key is. Personally I'd find it verbose unless the toggle key can't be inferred from the structure, but others may have different opinions.
>>>>>
>>>>> I may be missing some details, so please chime in to add them. Then we can choose the best option and have a sample form of the description. (Once we reach that point, it may be OK to add it to the contribution doc, since that would be the thing we agree on.)
>>>>>
>>>>> Without the details it's going to be some sort of "preference", which naturally invites disagreement, and it doesn't make sense for someone to be forced to do something that turns out to be a "preference".
>>>>>
>>>>> On Thu, Feb 13, 2020 at 11:10 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>>>>
>>>>>> Adding that information is already the more prevailing style at this moment, and it is usual to follow the prevailing side if there isn't a specific reason not to. If there is confusion about this, I will explicitly add it to the guide (https://spark.apache.org/contributing.html). Let me know if this is still confusing or if you disagree.
>>>>>>
>>>>>> On Thu, Feb 13, 2020 at 9:47 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>>>>>
>>>>>>> Yes, that's probably our final goal: to revisit the configurations to make them structured with clean, deduplicated documentation. +1.
>>>>>>> One point I would like to add, though, is to keep adding such information to the documentation until we actually reach our final goal, since it will probably take a while to revisit/fix the configurations and make them fully structured together with the documentation. If we go more conservatively, we can add such information to the new configurations being added from now on at least, keeping the existing configurations as they are.
>>>>>>>
>>>>>>> On Thu, 13 Feb 2020, 06:12 Dongjoon Hyun, <dongjoon.h...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Thank you for raising the issue, Hyukjin.
>>>>>>>>
>>>>>>>> According to the current status of the discussion, it seems that we are able to agree on updating the non-structured configurations and keeping the structured configurations AS-IS.
>>>>>>>>
>>>>>>>> I'm +1 for revisiting the configurations if that is our direction. If there is some mismatch in structured configurations, we may fix them together.
>>>>>>>>
>>>>>>>> Bests,
>>>>>>>> Dongjoon.
>>>>>>>>
>>>>>>>> On Wed, Feb 12, 2020 at 8:08 AM Jules Damji <dmat...@comcast.net> wrote:
>>>>>>>>>
>>>>>>>>> All are valid and valuable observations to put into practice:
>>>>>>>>>
>>>>>>>>> * structured and meaningful config names
>>>>>>>>> * explanatory text or a succinct description
>>>>>>>>> * easily accessible or searchable
>>>>>>>>>
>>>>>>>>> These are aspirational but gradually doable if we make them part of the dev and review cycle. Often meaningful config names, like security, come as an afterthought.
>>>>>>>>>
>>>>>>>>> At the AMA at the Amsterdam Spark Summit last year, a few developers lamented the difficulty of finding Spark configs - what they do; what they are used for; when and why; and what their default values are.
>>>>>>>>>
>>>>>>>>> Though one can fetch them programmatically, one still has to know what specific config one is looking for.
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>> Jules
>>>>>>>>>
>>>>>>>>> Sent from my iPhone
>>>>>>>>> Pardon the dumb thumb typos :)
>>>>>>>>>
>>>>>>>>> On Feb 12, 2020, at 5:19 AM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Yeah, that's one of my points on why I don't want to document this in the guide yet.
>>>>>>>>>
>>>>>>>>> I would like to make sure the dev people are on the same page that documenting is the better practice. I don't intend to force it as a hard requirement; however, if it's pointed out, it is better to address it.
>>>>>>>>>
>>>>>>>>> On Wed, 12 Feb 2020, 21:30 Wenchen Fan, <cloud0...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> In general I think it's better to have more detailed documents, but we don't have to force everyone to do it if the config name is structured. I would +1 to documenting the relationship if we can't tell it from the config names, e.g. spark.shuffle.service.enabled and spark.dynamicAllocation.enabled.
>>>>>>>>>>
>>>>>>>>>> On Wed, Feb 12, 2020 at 7:54 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Also, I would like to hear other people's thoughts here. Could I ask what you all think about this in general?
>>>>>>>>>>> On Wed, Feb 12, 2020 at 12:02 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> To do that, we should explicitly document such structured configuration and its implicit effect, which is currently missing. I would be more than happy if we documented such implied relationships, *and* if we were very sure all configurations are structured correctly and coherently. Until that point, I think it is more practical to simply document it for now.
>>>>>>>>>>>>
>>>>>>>>>>>> > Btw, maybe off-topic, `spark.dynamicAllocation` has another issue in practice - whether to duplicate the description between the configuration code and the doc. I have been asked to add the description in the configuration code regardless, while the existing codebase doesn't. This is a widely-used configuration.
>>>>>>>>>>>>
>>>>>>>>>>>> This is actually something we should fix too. In the SQL configurations, we no longer have such duplication as of https://github.com/apache/spark/pull/27459, since the doc is generated. We should do the same for the other configurations.
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Feb 12, 2020 at 11:47 AM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm looking into the case of `spark.dynamicAllocation`, and it seems to support my point.
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://github.com/apache/spark/blob/master/docs/configuration.md#dynamic-allocation
>>>>>>>>>>>>>
>>>>>>>>>>>>> I don't disagree with adding "This requires spark.shuffle.service.enabled to be set." to the description of `spark.dynamicAllocation.enabled`. This cannot be inferred implicitly, hence it is better to have it.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here is why I'm in favor of structured configuration & implicit effect over describing everything explicitly:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. There are 10 configurations (if the doc doesn't miss any) besides `spark.dynamicAllocation.enabled`, and only 4 of them are referred to in the description of `spark.dynamicAllocation.enabled` - the majority of config keys are missing.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2. I think it's intentional, but the table starts with `spark.dynamicAllocation.enabled`, which says implicitly but intuitively that if you disable it, nothing under dynamic allocation will work. The missing references to the majority of config keys don't make it hard to understand.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 3. Even `spark.dynamicAllocation` has a bad case - see `spark.dynamicAllocation.shuffleTracking.enabled` and `spark.dynamicAllocation.shuffleTimeout`. The latter does not respect the structure of the configuration, and I think this is worse than not mentioning the relationship explicitly in the description. Suppose the name had been `spark.dynamicAllocation.shuffleTracking.timeout` - isn't it intuitive that setting `spark.dynamicAllocation.shuffleTracking.enabled` to `false` would effectively disable `spark.dynamicAllocation.shuffleTracking.timeout`?
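To make the third point above concrete, here is a small sketch in Scala contrasting the two namings (again assuming a ConfigBuilder-style API; the doc text and default values are illustrative assumptions, not Spark's actual definitions): with the structured name the prefix alone implies the toggle, while the flat name needs the 2-2 style doc to state it.

    import java.util.concurrent.TimeUnit
    import org.apache.spark.internal.config.ConfigBuilder

    // Hypothetical structured name: the shuffleTracking prefix already implies that
    // spark.dynamicAllocation.shuffleTracking.enabled gates it, so style 2-1 suffices.
    val STRUCTURED_SHUFFLE_TIMEOUT =
      ConfigBuilder("spark.dynamicAllocation.shuffleTracking.timeout")
        .doc("When shuffle tracking is enabled, the timeout for executors holding shuffle data.")
        .timeConf(TimeUnit.MILLISECONDS)
        .createWithDefault(Long.MaxValue)

    // Flat name: nothing in the key hints at the prerequisite, so the doc has to
    // spell out the toggle key explicitly (style 2-2).
    val FLAT_SHUFFLE_TIMEOUT =
      ConfigBuilder("spark.dynamicAllocation.shuffleTimeout")
        .doc("The timeout for executors holding shuffle data. This is effective only " +
          "when spark.dynamicAllocation.shuffleTracking.enabled is true.")
        .timeConf(TimeUnit.MILLISECONDS)
        .createWithDefault(Long.MaxValue)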
>>>>>>>>>>>>> Btw, maybe off-topic, `spark.dynamicAllocation` has another issue in practice - whether to duplicate the description between the configuration code and the doc. I have been asked to add the description in the configuration code regardless, while the existing codebase doesn't. This is a widely-used configuration.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Feb 12, 2020 at 11:22 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Sure, adding "[DISCUSS]" is a good practice for labeling it. I should have done it, although it might be "redundant" :-) since anyone can give feedback on any thread on the Spark dev mailing list and discuss it.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Documenting the relationship is actually the more prevailing style, given my rough reading of the configuration files. I would like to see this missing relationship treated as a bad pattern that started from a personal preference.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> > Personally, I'd rather not assume that someone won't understand that setting `.enabled` to `false` means the functionality is disabled and that it effectively disables all sub-configurations.
>>>>>>>>>>>>>> > E.g. when `spark.sql.adaptive.enabled` is `false`, all the configurations for `spark.sql.adaptive.*` are implicitly no-op. For me this is pretty intuitive and one of the major benefits of structured configurations.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I don't think it is a good idea to assume users know such context. One might think `spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled` can partially enable the feature. It is better to document it explicitly, since for some configurations it is even difficult for users to confirm whether they are taking effect or not. For instance, one might think that setting 'spark.eventLog.rolling.maxFileSize' automatically enables rolling, and then realise later, after the file grows bigger, that the log is not rolling.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Feb 12, 2020 at 10:47 AM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm sorry if I'm missing something, but ideally this would have been started as [DISCUSS], as I haven't seen any reference establishing consensus on this practice.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For me, there are simply two different practices co-existing in the codebase, which means it's closer to an individual preference (with implicit agreement that others have different preferences), or it hasn't been discussed thoroughly.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Personally, I'd rather not assume that someone won't understand that setting `.enabled` to `false` means the functionality is disabled and that it effectively disables all sub-configurations. E.g. when `spark.sql.adaptive.enabled` is `false`, all the configurations for `spark.sql.adaptive.*` are implicitly no-op.
>>>>>>>>>>>>>>> For me this is pretty intuitive and one of the major benefits of structured configurations.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If we want to make it explicit, "all" sub-configurations would need a redundant part in their doc - and more redundant if the condition is nested. I agree this is a good step toward "being kind", but it is less pragmatic.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'd be happy to follow whatever consensus we reach in this thread. Appreciate more voices.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Feb 12, 2020 at 10:36 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> > I don't plan to document this officially yet
>>>>>>>>>>>>>>>> Just to prevent confusion, I meant I don't yet plan to document the fact that we should write the relationships between configurations as a code/review guideline in https://spark.apache.org/contributing.html
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Feb 12, 2020 at 9:57 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I happened to review some PRs and noticed that some configurations are missing some necessary information.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> To be explicit, I would like to make sure we document the direct relationship with other configurations in the documentation. For example, `spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled` can only be enabled when `spark.sql.adaptive.enabled` is enabled. That's clearly documented. We're generally good, given that we document such relationships in Apache Spark. See 'spark.task.reaper.enabled', 'spark.dynamicAllocation.enabled', 'spark.sql.parquet.filterPushdown', etc.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> However, I noticed a pattern where such information is missing in some components in general, for example, `spark.history.fs.cleaner.*`, `spark.history.kerberos.*` and `spark.history.ui.acls.*`.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I hope we all start to document such information. Logically users can't know the relationship, and I myself had to read the code to confirm it when I reviewed. I don't plan to document this officially yet because it looks like a pretty logical request to me; however, let me know if you have different opinions.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks.