Yeah, that's one of the reasons why I don't want to document this in the guide yet.
I would like to make sure dev people are on the same page that documenting is the better practice. I don't intend to force it as a hard requirement; however, if it's pointed out in review, it would be better to address it.

On Wed, 12 Feb 2020, 21:30 Wenchen Fan, <cloud0...@gmail.com> wrote:

> In general I think it's better to have more detailed documents, but we
> don't have to force everyone to do it if the config name is structured. I
> would +1 to documenting the relationship if we can't tell it from the
> config names, e.g. spark.shuffle.service.enabled
> and spark.dynamicAllocation.enabled.
>
> On Wed, Feb 12, 2020 at 7:54 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
>> Also, I would like to hear other people's thoughts here. Could I ask
>> what you guys think about this in general?
>>
>> On Wed, Feb 12, 2020 at 12:02 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>
>>> To do that, we should explicitly document such structured configuration
>>> and its implicit effect, which is currently missing.
>>> I would be more than happy if we documented such implied relationships,
>>> *and* if we were very sure all configurations are structured correctly
>>> and coherently.
>>> Until that point, I think it is more practical to simply document
>>> the relationship for now.
>>>
>>> > Btw, maybe off-topic, `spark.dynamicAllocation` has another issue
>>> > in practice - whether to duplicate descriptions between configuration
>>> > code and docs. I have been asked to add descriptions to configuration
>>> > code regardless, while the existing codebase doesn't. This is a
>>> > widely-used configuration.
>>>
>>> This is actually something we should fix too. In SQL configuration, we
>>> no longer have such duplication as of
>>> https://github.com/apache/spark/pull/27459, since the doc is generated
>>> from the code. We should do the same for other configurations.
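The dependency Wenchen cites is a good example of a relationship that the key names alone don't reveal. A config fragment in the spark-defaults.conf style (key names are from this thread; the values are illustrative) shows why:

```properties
# spark-defaults.conf
# Dynamic allocation requires the external shuffle service below, but
# nothing in the two key names relates them - hence the suggestion to
# document the dependency explicitly in each description.
spark.shuffle.service.enabled      true
spark.dynamicAllocation.enabled    true
```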
>>>
>>>
>>> On Wed, Feb 12, 2020 at 11:47 AM Jungtaek Lim <kabhwan.opensou...@gmail.com>
>>> wrote:
>>>
>>>> I'm looking into the case of `spark.dynamicAllocation`, and it seems
>>>> to support my point.
>>>>
>>>> https://github.com/apache/spark/blob/master/docs/configuration.md#dynamic-allocation
>>>>
>>>> I don't disagree with adding "This requires
>>>> spark.shuffle.service.enabled to be set." to the description of
>>>> `spark.dynamicAllocation.enabled`. That relationship cannot be
>>>> inferred from the names, so it is better to spell it out.
>>>>
>>>> Here is why I'm in favor of structured configuration with implicit
>>>> effects over describing everything explicitly:
>>>>
>>>> 1. There are 10 configurations (if the doc doesn't miss any) besides
>>>> `spark.dynamicAllocation.enabled`, and only 4 of them are referred to
>>>> in the description of `spark.dynamicAllocation.enabled` - the majority
>>>> of config keys are missing.
>>>> 2. I think it's intentional, but the table starts with
>>>> `spark.dynamicAllocation.enabled`, which conveys implicitly but
>>>> intuitively that if you disable it, nothing in dynamic allocation
>>>> will work. The missing references to the other config keys don't make
>>>> it hard to understand.
>>>> 3. Even `spark.dynamicAllocation` has a bad case - see
>>>> `spark.dynamicAllocation.shuffleTracking.enabled` and
>>>> `spark.dynamicAllocation.shuffleTimeout`. The latter does not respect
>>>> the structure of the configuration, which I think is worse than not
>>>> mentioning the relationship in the description. Had the name been
>>>> `spark.dynamicAllocation.shuffleTracking.timeout`, wouldn't it be
>>>> intuitive that setting `spark.dynamicAllocation.shuffleTracking.enabled`
>>>> to `false` effectively disables
>>>> `spark.dynamicAllocation.shuffleTracking.timeout`?
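The "implicit effect" of structured names that Jungtaek describes can be made concrete with a small sketch. This is plain Python modeling the naming convention, not Spark's actual implementation: a sub-configuration is effective only while every enclosing `<prefix>.enabled` gate is true.

```python
def is_effective(confs: dict, key: str) -> bool:
    """Return True if `key` is not implicitly disabled by a parent
    `<prefix>.enabled` gate set to "false"."""
    parts = key.split(".")
    # Walk every ancestor prefix of the key, e.g. for
    # spark.dynamicAllocation.shuffleTracking.timeout check
    # spark.enabled, spark.dynamicAllocation.enabled, and
    # spark.dynamicAllocation.shuffleTracking.enabled.
    for i in range(1, len(parts)):
        gate = ".".join(parts[:i]) + ".enabled"
        if gate != key and confs.get(gate, "true") == "false":
            return False
    return True


confs = {
    "spark.dynamicAllocation.enabled": "false",
    "spark.dynamicAllocation.shuffleTracking.timeout": "30s",
}
# The parent gate is off, so the timeout is implicitly a no-op.
print(is_effective(confs, "spark.dynamicAllocation.shuffleTracking.timeout"))  # False
```

Under this reading, the misplaced key `spark.dynamicAllocation.shuffleTimeout` is exactly the problem: the prefix walk never reaches `spark.dynamicAllocation.shuffleTracking.enabled`, so the naming convention alone cannot express the relationship.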
>>>>
>>>> Btw, maybe off-topic, `spark.dynamicAllocation` has another issue in
>>>> practice - whether to duplicate descriptions between configuration
>>>> code and docs. I have been asked to add descriptions to configuration
>>>> code regardless, while the existing codebase doesn't. This is a
>>>> widely-used configuration.
>>>>
>>>> On Wed, Feb 12, 2020 at 11:22 AM Hyukjin Kwon <gurwls...@gmail.com>
>>>> wrote:
>>>>
>>>>> Sure, adding "[DISCUSS]" is a good practice to label it. I should
>>>>> have done it, although it might be "redundant" :-) since anyone can
>>>>> give feedback on any thread on the Spark dev mailing list and
>>>>> discuss it.
>>>>>
>>>>> This missing information is actually quite prevalent, given my rough
>>>>> reading of the configuration files. I would like to treat the missing
>>>>> relationship as a bad pattern that started from personal preference.
>>>>>
>>>>> > Personally I'd rather not assume someone won't understand that
>>>>> > setting `.enabled` to `false` means the functionality is disabled
>>>>> > and effectively disables all sub-configurations.
>>>>> > E.g. when `spark.sql.adaptive.enabled` is `false`, all the
>>>>> > configurations under `spark.sql.adaptive.*` are implicitly no-ops.
>>>>> > For me this is pretty intuitive and one of the major benefits of
>>>>> > structured configurations.
>>>>>
>>>>> I don't think it is a good idea to assume users know such context.
>>>>> One might think
>>>>> `spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled` can
>>>>> partially enable the feature. It is better to document it explicitly,
>>>>> since some configurations are even difficult for users to confirm
>>>>> whether they are working or not.
>>>>> For instance, one might think setting
>>>>> 'spark.eventLog.rolling.maxFileSize' automatically enables rolling,
>>>>> and only realise the log is not rolling later, after the file size
>>>>> has grown.
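The `spark.eventLog.rolling.maxFileSize` pitfall could even be linted for. Below is a hypothetical review-time checker (the config key names are real Spark configs; the `missing_gates` function and its `gates` table are illustrative, not anything that exists in Spark) that flags a sub-key set without its gating flag:

```python
def missing_gates(confs: dict, gates: dict) -> list:
    """Return warnings for sub-keys set while their gating config
    is absent or not "true". `gates` maps a key prefix to the
    `.enabled` config that gates everything under that prefix."""
    warnings = []
    for key in confs:
        for prefix, gate in gates.items():
            if key.startswith(prefix) and key != gate:
                if confs.get(gate, "false") != "true":
                    warnings.append(f"{key} has no effect unless {gate}=true")
    return warnings


gates = {"spark.eventLog.rolling.": "spark.eventLog.rolling.enabled"}
# The user set a rolling sub-key but never turned rolling on.
confs = {"spark.eventLog.rolling.maxFileSize": "128m"}
print(missing_gates(confs, gates))
# prints ['spark.eventLog.rolling.maxFileSize has no effect unless spark.eventLog.rolling.enabled=true']
```

This is the scenario described above: the user only discovers the missing gate after the log file has already grown, whereas an explicit note in the description (or a check like this) would surface it up front.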
>>>>>
>>>>>
>>>>> On Wed, Feb 12, 2020 at 10:47 AM Jungtaek Lim <kabhwan.opensou...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I'm sorry if I'm missing something, but this would ideally have been
>>>>>> started as [DISCUSS], as I haven't seen any reference to a consensus
>>>>>> on this practice.
>>>>>>
>>>>>> For me, there are simply two different practices co-existing in the
>>>>>> codebase, meaning it is closer to an individual preference (with the
>>>>>> implicit agreement that others may prefer differently), or it hasn't
>>>>>> been discussed thoroughly.
>>>>>>
>>>>>> Personally, I'd rather not assume someone won't understand that
>>>>>> setting `.enabled` to `false` means the functionality is disabled
>>>>>> and effectively disables all sub-configurations. E.g. when
>>>>>> `spark.sql.adaptive.enabled` is `false`, all the configurations
>>>>>> under `spark.sql.adaptive.*` are implicitly no-ops. For me this is
>>>>>> pretty intuitive and one of the major benefits of structured
>>>>>> configurations.
>>>>>>
>>>>>> If we want to make it explicit, "all" sub-configurations would need
>>>>>> a redundant part in their docs - more redundant if the condition is
>>>>>> nested. I agree this is a good step toward "be kind", but it is less
>>>>>> pragmatic.
>>>>>>
>>>>>> I'd be happy to follow whatever consensus we reach in this thread.
>>>>>> Appreciate more voices.
>>>>>>
>>>>>> Thanks,
>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>
>>>>>>
>>>>>> On Wed, Feb 12, 2020 at 10:36 AM Hyukjin Kwon <gurwls...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> > I don't plan to document this officially yet
>>>>>>>
>>>>>>> Just to prevent confusion, I meant I don't yet plan to document the
>>>>>>> fact that we should write the relationships between configurations
>>>>>>> as a code/review guideline in
>>>>>>> https://spark.apache.org/contributing.html
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Feb 12, 2020 at 9:57 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I happened to review some PRs and noticed that some configurations
>>>>>>>> are missing necessary information.
>>>>>>>>
>>>>>>>> To be explicit, I would like to make sure we document the direct
>>>>>>>> relationships with other configurations in the documentation. For
>>>>>>>> example,
>>>>>>>> `spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled`
>>>>>>>> can only take effect when `spark.sql.adaptive.enabled` is enabled,
>>>>>>>> and that's clearly documented.
>>>>>>>> We're good in general, given that we usually document them in
>>>>>>>> Apache Spark - see 'spark.task.reaper.enabled',
>>>>>>>> 'spark.dynamicAllocation.enabled',
>>>>>>>> 'spark.sql.parquet.filterPushdown', etc.
>>>>>>>>
>>>>>>>> However, I noticed a pattern where such information is missing in
>>>>>>>> some components, for example,
>>>>>>>> `spark.history.fs.cleaner.*`, `spark.history.kerberos.*` and
>>>>>>>> `spark.history.ui.acls.*`.
>>>>>>>>
>>>>>>>> I hope we all start to document such information. Logically, users
>>>>>>>> can't know the relationship, and I myself had to read the code to
>>>>>>>> confirm it while reviewing.
>>>>>>>> I don't plan to document this officially yet because it looks like
>>>>>>>> a pretty reasonable request to me; however, let me know if you
>>>>>>>> have different opinions.
>>>>>>>>
>>>>>>>> Thanks.