I think it's just fine as long as we're consistent across the instances that have the description, for instance:
"When true and ‘spark.xx.xx’ is enabled, …" I think this is 2-2 in most cases so far.
I think we can reference other configuration keys in another configuration's documentation by using, for example, ADAPTIVE_EXECUTION_ENABLED.key, so we don't have a duplication problem in most cases.

Being consistent with the current codebase is the general guidance, if I am not mistaken. We have identified a problem and a good goal to reach. If we want to change, let's do it as our final goal. Otherwise, let's simplify things to reduce the overhead rather than having a policy specifically for the mid-term.

On Thu, Feb 13, 2020 at 12:24 PM, Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:

> I tend to agree that there should be a time to make things consistent
> (and I'm very happy to see the new thread on discussion), and we may want
> to adopt some practice in the interim.
>
> But for me it is not clear what the practice in the interim is. I pointed
> out the problems of the existing style, and if we all agree the problems
> are valid then we may need to fix them before we start using the style
> widely.
>
> For me the major question is "where" to put it - there are at least two
> different approaches:
>
> 1) Put it in the description of `.enabled` and describe the range of
> impact, e.g. put in the description of "spark.A.enabled" that it will
> affect the configurations under "spark.A".
> (I don't think it's good to enumerate all of the affected configs, as
> `spark.dynamicAllocation.enabled` shows how fragile that is.)
>
> 2) Put it in the description of all affected configurations and describe
> which configuration is the prerequisite for this one to be effective,
> e.g. put it on all configurations under `spark.dynamicAllocation`,
> mentioning that `spark.dynamicAllocation.enabled` should be enabled for
> this to be effective.
>
> (I intended to skip mentioning "cross reference". Let's be pragmatic.)
> 2) also has two ways to describe it:
>
> 2-1) Just mention it simply - like "When dynamic allocation is enabled,",
> without pointing out the key to toggle. This hugely simplifies the
> description, though end users may have to second-guess to find the key to
> toggle. (It'll be intuitive once we structure the configurations.)
>
> 2-2) Mention the key to toggle directly - like "This is effective only if
> spark.A.enabled is set to true.". It's going to be longer depending on
> how long the configuration key is. Personally I'd find this verbose when
> the key to toggle is in a spot we can infer, but others may have
> different opinions.
>
> I may be missing some details, so please participate and add the details.
> Otherwise we may want to choose the best one and have a sample form of
> description. (Once we reach that point, it may be OK to add it to the
> contribution doc, as it would be a thing we agree on.)
>
> Without the details it's going to be some sort of "preference", which
> naturally invites disagreement, hence it doesn't make sense to force
> someone to do something that turns out to be a "preference".
>
> On Thu, Feb 13, 2020 at 11:10 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
>> Adding that information is already the more prevailing style at this
>> moment, and it is usual to follow the prevailing side if there isn't a
>> specific reason not to.
>> If there is confusion about this, I will explicitly add it into the
>> guide (https://spark.apache.org/contributing.html). Let me know if this
>> is still confusing or if you disagree.
>>
>> On Thu, Feb 13, 2020 at 9:47 AM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>
>>> Yes, revisiting the configurations to make them structured with cleanly
>>> deduplicated documentation is probably our final goal. +1.
>>>
>>> One point I would like to add, though, is to add such information to
>>> the documentation until we actually reach our final goal, since it will
>>> probably take a while to revisit/fix everything and make it fully
>>> structured with the documentation.
>>> If we go more conservatively, we can at least add such information to
>>> new configurations added from now on, and keep the existing
>>> configurations as they are.
>>>
>>> On Thu, 13 Feb 2020, 06:12 Dongjoon Hyun, <dongjoon.h...@gmail.com> wrote:
>>>
>>>> Thank you for raising the issue, Hyukjin.
>>>>
>>>> According to the current status of the discussion, it seems that we
>>>> are able to agree on updating the non-structured configurations and
>>>> keeping the structured configurations AS-IS.
>>>>
>>>> I'm +1 for revisiting the configurations if that is our direction.
>>>> If there is some mismatch in the structured configurations, we may fix
>>>> them together.
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>> On Wed, Feb 12, 2020 at 8:08 AM Jules Damji <dmat...@comcast.net> wrote:
>>>>
>>>>> All are valid and valuable observations to put into practice:
>>>>>
>>>>> * structured and meaningful config names
>>>>> * explanatory text or a succinct description
>>>>> * easily accessible or searchable
>>>>>
>>>>> While aspirational, these are gradually doable if we make them part
>>>>> of the dev and review cycle. Often meaningful config names, like
>>>>> security, come as an afterthought.
>>>>>
>>>>> At the AMA at the Amsterdam Spark Summit last year, a few developers
>>>>> lamented the difficulty of finding Spark configs - what they do; what
>>>>> they are used for; when and why; and what their default values are.
>>>>>
>>>>> Though one can fetch them programmatically, one still has to know
>>>>> what specific config one is looking for.
>>>>>
>>>>> Cheers
>>>>> Jules
>>>>>
>>>>> Sent from my iPhone
>>>>> Pardon the dumb thumb typos :)
>>>>>
>>>>> On Feb 12, 2020, at 5:19 AM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>>>
>>>>> Yeah, that's one of the reasons why I don't want to document this in
>>>>> the guide yet.
>>>>>
>>>>> I would like to make sure dev people are on the same page that
>>>>> documenting this is the better practice. I don't intend to force it
>>>>> as a hard requirement; however, if it's pointed out, it would be
>>>>> better to address it.
>>>>>
>>>>> On Wed, 12 Feb 2020, 21:30 Wenchen Fan, <cloud0...@gmail.com> wrote:
>>>>>
>>>>>> In general I think it's better to have more detailed documents, but
>>>>>> we don't have to force everyone to do it if the config name is
>>>>>> structured. I would +1 to documenting the relationship if we can't
>>>>>> tell it from the config names, e.g. spark.shuffle.service.enabled
>>>>>> and spark.dynamicAllocation.enabled.
>>>>>>
>>>>>> On Wed, Feb 12, 2020 at 7:54 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>>>>
>>>>>>> Also, I would like to hear other people's thoughts here. Could I
>>>>>>> ask what you guys think about this in general?
>>>>>>>
>>>>>>> On Wed, Feb 12, 2020 at 12:02 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>>>>>
>>>>>>>> To do that, we should explicitly document such structured
>>>>>>>> configuration and its implicit effect, which is currently missing.
>>>>>>>> I would be more than happy if we documented such implied
>>>>>>>> relationships, *and* if we were very sure all configurations are
>>>>>>>> structured correctly and coherently.
>>>>>>>> Until that point, I think it might be more practical to simply
>>>>>>>> document it for now.
>>>>>>>>
>>>>>>>> > Btw, maybe off-topic, `spark.dynamicAllocation` has another
>>>>>>>> issue in practice - whether to duplicate the description between
>>>>>>>> the configuration code and the doc.
>>>>>>>> I have been asked to add descriptions in the configuration code
>>>>>>>> regardless, while the existing codebase doesn't have them, and
>>>>>>>> this configuration is a widely-used one.
>>>>>>>> This is actually something we should fix too. In the SQL
>>>>>>>> configurations, we no longer have such duplication as of
>>>>>>>> https://github.com/apache/spark/pull/27459, since the doc is
>>>>>>>> generated. We should do the same for the other configurations.
>>>>>>>>
>>>>>>>> On Wed, Feb 12, 2020 at 11:47 AM, Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I'm looking into the case of `spark.dynamicAllocation`, and it
>>>>>>>>> seems to support my point.
>>>>>>>>>
>>>>>>>>> https://github.com/apache/spark/blob/master/docs/configuration.md#dynamic-allocation
>>>>>>>>>
>>>>>>>>> I don't disagree with adding "This requires
>>>>>>>>> spark.shuffle.service.enabled to be set." to the description of
>>>>>>>>> `spark.dynamicAllocation.enabled`. This cannot be inferred
>>>>>>>>> implicitly, hence it's better to have it.
>>>>>>>>>
>>>>>>>>> Here is why I'm in favor of structured configuration & implicit
>>>>>>>>> effect over describing everything explicitly:
>>>>>>>>>
>>>>>>>>> 1. There are 10 configurations (if the doc doesn't miss any)
>>>>>>>>> besides `spark.dynamicAllocation.enabled`, and only 4 of them are
>>>>>>>>> referred to in the description of
>>>>>>>>> `spark.dynamicAllocation.enabled` - the majority of config keys
>>>>>>>>> are missing.
>>>>>>>>> 2. I think it's intentional, but the table starts with
>>>>>>>>> `spark.dynamicAllocation.enabled`, which implicitly but
>>>>>>>>> intuitively conveys that if you disable it then nothing under
>>>>>>>>> dynamic allocation will work. Missing the majority of references
>>>>>>>>> to config keys doesn't make it hard to understand.
>>>>>>>>> 3.
>>>>>>>>> Even `spark.dynamicAllocation` has a bad case - see
>>>>>>>>> `spark.dynamicAllocation.shuffleTracking.enabled` and
>>>>>>>>> `spark.dynamicAllocation.shuffleTimeout`. The latter doesn't
>>>>>>>>> respect the configuration structure. I think this is worse than
>>>>>>>>> not mentioning the relationship explicitly in the description.
>>>>>>>>> Let's assume the name had been
>>>>>>>>> `spark.dynamicAllocation.shuffleTracking.timeout` - isn't it
>>>>>>>>> intuitive that setting
>>>>>>>>> `spark.dynamicAllocation.shuffleTracking.enabled` to `false`
>>>>>>>>> would effectively disable
>>>>>>>>> `spark.dynamicAllocation.shuffleTracking.timeout`?
>>>>>>>>>
>>>>>>>>> Btw, maybe off-topic, `spark.dynamicAllocation` has another issue
>>>>>>>>> in practice - whether to duplicate the description between the
>>>>>>>>> configuration code and the doc. I have been asked to add
>>>>>>>>> descriptions in the configuration code regardless, while the
>>>>>>>>> existing codebase doesn't have them, and this configuration is a
>>>>>>>>> widely-used one.
>>>>>>>>>
>>>>>>>>> On Wed, Feb 12, 2020 at 11:22 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Sure, adding "[DISCUSS]" is a good practice to label it. I
>>>>>>>>>> should have done it, although it might be "redundant" :-) since
>>>>>>>>>> anyone can give feedback on any thread on the Spark dev mailing
>>>>>>>>>> list and discuss it.
>>>>>>>>>>
>>>>>>>>>> This style is actually the more prevailing one, given my rough
>>>>>>>>>> reading of the configuration files. I would like to see the
>>>>>>>>>> missing relationship as a bad pattern that started from a
>>>>>>>>>> personal preference.
>>>>>>>>>>
>>>>>>>>>> > Personally, I doubt anyone would fail to understand that
>>>>>>>>>> > setting `.enabled` to `false` means the functionality is
>>>>>>>>>> > disabled and effectively disables all sub-configurations.
>>>>>>>>>> > E.g. when `spark.sql.adaptive.enabled` is `false`, all the
>>>>>>>>>> > configurations under `spark.sql.adaptive.*` are implicitly
>>>>>>>>>> > no-ops.
>>>>>>>>>> > For me this is pretty intuitive and one of the major benefits
>>>>>>>>>> > of the structured configurations.
>>>>>>>>>>
>>>>>>>>>> I don't think it's a good idea to assume users know such
>>>>>>>>>> context. One might think
>>>>>>>>>> `spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled`
>>>>>>>>>> can partially enable the feature. It is better to document it
>>>>>>>>>> explicitly, since for some configurations it is even difficult
>>>>>>>>>> for users to confirm whether they are working or not.
>>>>>>>>>> For instance, one might think setting
>>>>>>>>>> 'spark.eventLog.rolling.maxFileSize' automatically enables
>>>>>>>>>> rolling. Then they realise later, after the file grows bigger,
>>>>>>>>>> that the log is not rolling.
>>>>>>>>>>
>>>>>>>>>> On Wed, Feb 12, 2020 at 10:47 AM, Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I'm sorry if I'm missing something, but this would ideally have
>>>>>>>>>>> been started as [DISCUSS], as I haven't seen any reference
>>>>>>>>>>> showing consensus on this practice.
>>>>>>>>>>>
>>>>>>>>>>> For me, there are just two different practices co-existing in
>>>>>>>>>>> the codebase, meaning it's closer to individual preference
>>>>>>>>>>> (with implicit agreement that others have different
>>>>>>>>>>> preferences), or it hasn't been discussed thoughtfully.
>>>>>>>>>>>
>>>>>>>>>>> Personally, I doubt anyone would fail to understand that
>>>>>>>>>>> setting `.enabled` to `false` means the functionality is
>>>>>>>>>>> disabled and effectively disables all sub-configurations. E.g.
>>>>>>>>>>> when `spark.sql.adaptive.enabled` is `false`, all the
>>>>>>>>>>> configurations under `spark.sql.adaptive.*` are implicitly
>>>>>>>>>>> no-ops. For me this is pretty intuitive and one of the major
>>>>>>>>>>> benefits of the structured configurations.
>>>>>>>>>>>
>>>>>>>>>>> If we want to make it explicit, "all" sub-configurations would
>>>>>>>>>>> have to carry a redundant part of the doc - more redundant if
>>>>>>>>>>> the condition is nested. I agree this is a good step toward
>>>>>>>>>>> "being kind", but it's less pragmatic.
>>>>>>>>>>>
>>>>>>>>>>> I'd be happy to follow the consensus we reach in this thread.
>>>>>>>>>>> I'd appreciate more voices.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Feb 12, 2020 at 10:36 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> > I don't plan to document this officially yet
>>>>>>>>>>>> Just to prevent confusion, I meant I don't yet plan to
>>>>>>>>>>>> document the fact that we should write the relationships in
>>>>>>>>>>>> configurations as a code/review guideline in
>>>>>>>>>>>> https://spark.apache.org/contributing.html
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Feb 12, 2020 at 9:57 AM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I happened to review some PRs and noticed that some
>>>>>>>>>>>>> configurations don't have the necessary information.
>>>>>>>>>>>>>
>>>>>>>>>>>>> To be explicit, I would like to make sure we document the
>>>>>>>>>>>>> direct relationship with other configurations in the
>>>>>>>>>>>>> documentation. For example,
>>>>>>>>>>>>> `spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled`
>>>>>>>>>>>>> can only be enabled when `spark.sql.adaptive.enabled` is
>>>>>>>>>>>>> enabled. That's clearly documented.
>>>>>>>>>>>>> We're good in general, given that we document these in Apache
>>>>>>>>>>>>> Spark. See 'spark.task.reaper.enabled',
>>>>>>>>>>>>> 'spark.dynamicAllocation.enabled',
>>>>>>>>>>>>> 'spark.sql.parquet.filterPushdown', etc.
>>>>>>>>>>>>>
>>>>>>>>>>>>> However, I noticed a pattern where such information is
>>>>>>>>>>>>> missing in some components in general, for example,
>>>>>>>>>>>>> `spark.history.fs.cleaner.*`, `spark.history.kerberos.*` and
>>>>>>>>>>>>> `spark.history.ui.acls.*`.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I hope we all start to document such information. Users can't
>>>>>>>>>>>>> know the relationship logically, and I myself had to read the
>>>>>>>>>>>>> code to confirm it when reviewing.
>>>>>>>>>>>>> I don't plan to document this officially yet because it looks
>>>>>>>>>>>>> like a pretty logical request to me; however, let me know if
>>>>>>>>>>>>> you have different opinions.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks.
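The "structured configuration & implicit effect" convention debated throughout this thread can be sketched in a few lines. The snippet below is a purely illustrative Python model (the `effective` helper and the flat config dict are hypothetical, not Spark code; Spark checks each toggle inside the relevant component) showing the semantics being assumed: a `.enabled` toggle on a namespace gates everything underneath it.

```python
# Illustrative sketch only: a hypothetical helper modeling the
# "structured configuration implies effect" idea from this thread.
# NOT Spark code; real Spark resolves toggles per component.

def effective(key: str, conf: dict) -> bool:
    """Return True if `key` takes effect, i.e. no ancestor namespace
    has an `.enabled` toggle explicitly set to "false"."""
    parts = key.split(".")
    for i in range(1, len(parts)):
        toggle = ".".join(parts[:i]) + ".enabled"
        # Skip the key itself; an unset toggle counts as enabled here
        # purely for illustration (real defaults vary per config).
        if toggle != key and conf.get(toggle, "true") == "false":
            return False
    return True

conf = {
    "spark.sql.adaptive.enabled": "false",
    "spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled": "true",
}
# The sub-configuration is a no-op because the parent toggle is off.
print(effective(
    "spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled", conf))
# prints: False
```

Note that this model only works when keys actually follow the structure: a mis-structured key like `spark.dynamicAllocation.shuffleTimeout` would not be gated by `spark.dynamicAllocation.shuffleTracking.enabled` here, which is exactly the bad case raised in the thread.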