To do that, we should explicitly document such structured configuration and its implicit effect, which is currently missing. I would be more than happy if we documented that implied relationship, *and* if we were very sure all configurations are structured correctly and coherently. Until that point, I think it is more practical to simply document each relationship explicitly for now.
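For example, the relationship could be stated directly in the config's doc string, so that generated documentation picks it up. A rough sketch only (it assumes `org.apache.spark.internal.config.ConfigBuilder`; the wrapper object is made up for illustration, and the doc text is the one suggested in this thread, not the actual definition in the codebase):

```scala
package org.apache.spark.internal.config

object DynAllocationDocSketch {
  // Sketch: spell out the cross-configuration dependency in `.doc()` so
  // reviewers and generated docs see it in one place.
  val DYN_ALLOCATION_ENABLED = ConfigBuilder("spark.dynamicAllocation.enabled")
    .doc("Whether to use dynamic resource allocation. " +
      "This requires spark.shuffle.service.enabled to be set.")
    .booleanConf
    .createWithDefault(false)
}
```

With the description living next to the code like this, the doc page and the code can't drift apart.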
> Btw, maybe off-topic, `spark.dynamicAllocation` has another issue in
> practice - whether to duplicate descriptions between the configuration
> code and the doc. I have been asked to add descriptions to the
> configuration code regardless, and the existing codebase doesn't have
> them. This is a widely-used configuration.

This is actually something we should fix too. In SQL configurations, we no
longer have such duplication as of
https://github.com/apache/spark/pull/27459, since the documentation is
generated from the code. We should do the same for the other configurations.

On Wed, Feb 12, 2020 at 11:47 AM, Jungtaek Lim
<kabhwan.opensou...@gmail.com> wrote:

> I'm looking into the case of `spark.dynamicAllocation`, and it seems to
> support my point.
>
> https://github.com/apache/spark/blob/master/docs/configuration.md#dynamic-allocation
>
> I don't disagree with adding "This requires spark.shuffle.service.enabled
> to be set." to the description of `spark.dynamicAllocation.enabled`. This
> cannot be inferred implicitly, so it is better to have it.
>
> Here is why I'm in favor of structured configuration & implicit effect
> over describing everything explicitly:
>
> 1. There are 10 configurations besides `spark.dynamicAllocation.enabled`
> (assuming the doc doesn't miss any), and only 4 of them are referred to in
> the description of `spark.dynamicAllocation.enabled` - the majority of the
> config keys are missing.
> 2. I think it's intentional, but the table starts with
> `spark.dynamicAllocation.enabled`, which says implicitly but intuitively
> that if you disable it, nothing in dynamic allocation will work. The
> missing references to the other config keys don't make it hard to
> understand.
> 3. Even `spark.dynamicAllocation` has a bad case - see
> `spark.dynamicAllocation.shuffleTracking.enabled` and
> `spark.dynamicAllocation.shuffleTimeout`. The latter doesn't respect the
> structure of the configuration, which I think is worse than not mentioning
> the relationship in the description. Suppose the name had been
> `spark.dynamicAllocation.shuffleTracking.timeout` - isn't it intuitive
> that setting `spark.dynamicAllocation.shuffleTracking.enabled` to `false`
> would effectively disable `spark.dynamicAllocation.shuffleTracking.timeout`?
>
> Btw, maybe off-topic, `spark.dynamicAllocation` has another issue in
> practice - whether to duplicate descriptions between the configuration
> code and the doc. I have been asked to add descriptions to the
> configuration code regardless, and the existing codebase doesn't have
> them. This is a widely-used configuration.
>
> On Wed, Feb 12, 2020 at 11:22 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
>> Sure, adding "[DISCUSS]" is a good practice to label it. I should have
>> done it, although it might be "redundant" :-) since anyone can give
>> feedback on any thread in the Spark dev mailing list and discuss.
>>
>> This is actually more prevalent, given my rough reading of the
>> configuration files. I would see this missing relationship as a bad
>> pattern that started from a personal preference.
>>
>> > Personally, I'd rather not assume someone won't understand that
>> > setting `.enabled` to `false` means the functionality is disabled and
>> > effectively disables all sub-configurations. E.g. when
>> > `spark.sql.adaptive.enabled` is `false`, all the `spark.sql.adaptive.*`
>> > configurations are implicitly no-ops. For me this is pretty intuitive
>> > and one of the major benefits of structured configurations.
>>
>> I don't think it is a good idea to assume that users know such context.
>> One might think
>> `spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled` can
>> partially enable the feature. It is better to document it explicitly,
>> since some configurations are even difficult for users to confirm
>> whether they are working or not. For instance, one might think setting
>> 'spark.eventLog.rolling.maxFileSize' automatically enables rolling.
>> They would only realise later, after the file grows bigger, that the
>> log is not rolling.
>>
>> On Wed, Feb 12, 2020 at 10:47 AM, Jungtaek Lim
>> <kabhwan.opensou...@gmail.com> wrote:
>>
>>> I'm sorry if I'm missing something, but this would ideally have been
>>> started as [DISCUSS], as I haven't seen any reference to a consensus
>>> on this practice.
>>>
>>> For me, it's just that two different practices co-exist in the
>>> codebase, meaning it's closer to individual preference (with an
>>> implicit agreement that others have different preferences), or it
>>> hasn't been discussed thoroughly.
>>>
>>> Personally, I'd rather not assume someone won't understand that
>>> setting `.enabled` to `false` means the functionality is disabled and
>>> effectively disables all sub-configurations. E.g. when
>>> `spark.sql.adaptive.enabled` is `false`, all the `spark.sql.adaptive.*`
>>> configurations are implicitly no-ops. For me this is pretty intuitive
>>> and one of the major benefits of structured configurations.
>>>
>>> If we want to make it explicit, "all" sub-configurations would have to
>>> carry the redundant part of the doc - and it gets more redundant if
>>> the condition is nested. I agree this is a good step toward "being
>>> kind", but it is less pragmatic.
>>>
>>> I'd be happy to follow whatever consensus we reach in this thread.
>>> I'd appreciate more voices.
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>> On Wed, Feb 12, 2020 at 10:36 AM Hyukjin Kwon <gurwls...@gmail.com>
>>> wrote:
>>>
>>>> > I don't plan to document this officially yet
>>>> Just to prevent confusion, I meant that I don't yet plan to document
>>>> the fact that we should write down the relationships between
>>>> configurations as a code/review guideline in
>>>> https://spark.apache.org/contributing.html
>>>>
>>>> On Wed, Feb 12, 2020 at 9:57 AM, Hyukjin Kwon <gurwls...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I happened to review some PRs and noticed that some configurations
>>>>> are missing necessary information.
>>>>>
>>>>> To be explicit, I would like to make sure we document the direct
>>>>> relationships with other configurations in the documentation. For
>>>>> example,
>>>>> `spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled` can
>>>>> only be enabled when `spark.sql.adaptive.enabled` is enabled, and
>>>>> that's clearly documented. We're in good shape in general, given
>>>>> that we usually document these in Apache Spark. See
>>>>> 'spark.task.reaper.enabled', 'spark.dynamicAllocation.enabled',
>>>>> 'spark.sql.parquet.filterPushdown', etc.
>>>>>
>>>>> However, I noticed that such information is missing in some
>>>>> components, for example, `spark.history.fs.cleaner.*`,
>>>>> `spark.history.kerberos.*` and `spark.history.ui.acls.*`
>>>>>
>>>>> I hope we all start to document such information. Users can't
>>>>> logically know the relationship otherwise, and I myself have had to
>>>>> read the code to confirm it when reviewing.
>>>>> I don't plan to document this officially yet, because it looks like
>>>>> a pretty reasonable request to me; however, let me know if you have
>>>>> different opinions.
>>>>>
>>>>> Thanks.
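As a footnote to the 'spark.eventLog.rolling.maxFileSize' example quoted above, here is a minimal sketch of the pitfall. This is my own illustration, not code from the thread: it assumes Spark 3.0+, where the `spark.eventLog.rolling.*` keys exist, and the `EventLogRollingPitfall` object is just a made-up holder.

```scala
import org.apache.spark.SparkConf

object EventLogRollingPitfall {
  // Setting only the max file size does NOT turn rolling on; the value is
  // only consulted when the guarding `.enabled` key is set.
  val misleading = new SparkConf()
    .set("spark.eventLog.enabled", "true")
    .set("spark.eventLog.rolling.maxFileSize", "128m") // silently a no-op for rolling

  // Rolling only takes effect once the parent flag is enabled as well.
  val working = new SparkConf()
    .set("spark.eventLog.enabled", "true")
    .set("spark.eventLog.rolling.enabled", "true")
    .set("spark.eventLog.rolling.maxFileSize", "128m")
}
```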