Also, I would like to hear other people's thoughts here. Could I ask what you all think about this in general?
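[Editor's note: the implicit gating that "structured configurations" provide, which the quoted thread below debates, can be sketched in a few lines of plain Python. This is an illustrative stand-in, not Spark's actual configuration resolution; the keys are real Spark config names, but the `is_effective` helper and the treat-unset-gates-as-enabled default are assumptions for the sketch.]

```python
# Illustrative sketch only (NOT Spark's implementation): how a structured
# config namespace can make sub-configurations implicit no-ops when an
# ancestor ".enabled" flag is false.

def is_effective(conf, key):
    """Return False if any ancestor namespace's '.enabled' flag is 'false'.

    For simplicity this sketch treats an unset gate as enabled; real Spark
    flags have their own per-key defaults.
    """
    parts = key.split(".")
    for i in range(1, len(parts)):
        gate = ".".join(parts[:i]) + ".enabled"
        if gate != key and conf.get(gate, "true").lower() == "false":
            return False
    return True

conf = {
    "spark.sql.adaptive.enabled": "false",
    "spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled": "true",
}

# The sub-configuration is set to true, but the parent gate makes it a no-op.
print(is_effective(conf, "spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled"))  # False
```

Note how this sketch also captures the naming concern raised below: a key named `spark.dynamicAllocation.shuffleTracking.timeout` would be gated by `spark.dynamicAllocation.shuffleTracking.enabled` automatically, whereas the actual flat key `spark.dynamicAllocation.shuffleTimeout` would not be.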
On Wed, Feb 12, 2020 at 12:02 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:

> To do that, we should explicitly document such structured configuration
> and its implicit effect, which is currently missing.
> I would be more than happy if we document such implied relationships, *and*
> if we are very sure all configurations are structured correctly and
> coherently. Until that point, I think it might be more practical to simply
> document it for now.
>
> > Btw, maybe off-topic: `spark.dynamicAllocation` has another issue in
> > practice - whether to duplicate descriptions between the configuration
> > code and the docs. I have been asked to add descriptions in the
> > configuration code regardless, while the existing codebase doesn't. This
> > is a widely used configuration.
>
> This is actually something we should fix too. In the SQL configurations, we
> no longer have such duplication as of
> https://github.com/apache/spark/pull/27459, since the doc is generated. We
> should do the same for the other configurations.
>
> On Wed, Feb 12, 2020 at 11:47 AM, Jungtaek Lim
> <kabhwan.opensou...@gmail.com> wrote:
>
>> I'm looking into the case of `spark.dynamicAllocation`, and it seems to
>> support my point:
>>
>> https://github.com/apache/spark/blob/master/docs/configuration.md#dynamic-allocation
>>
>> I don't disagree with adding "This requires spark.shuffle.service.enabled
>> to be set." to the description of `spark.dynamicAllocation.enabled`. This
>> cannot be inferred implicitly, so it is better to have it.
>>
>> Here is why I'm in favor of structured configuration and implicit effect
>> over describing everything explicitly:
>>
>> 1. There are 10 configurations besides `spark.dynamicAllocation.enabled`
>> (assuming the doc doesn't miss any), and only 4 of them are referenced in
>> the description of `spark.dynamicAllocation.enabled` - the majority of
>> config keys are missing.
>> 2. I think it's intentional, but the table starts with
>> `spark.dynamicAllocation.enabled`, which conveys implicitly but
>> intuitively that if you disable it, nothing else in dynamic allocation
>> will work. Omitting most of the config-key references doesn't make it
>> hard to understand.
>> 3. Even `spark.dynamicAllocation` has a bad case - see
>> `spark.dynamicAllocation.shuffleTracking.enabled` and
>> `spark.dynamicAllocation.shuffleTimeout`. The latter does not respect the
>> structure of the configuration. I think this is worse than not mentioning
>> the relationship explicitly in the description. Suppose the name had been
>> `spark.dynamicAllocation.shuffleTracking.timeout` - isn't it intuitive
>> that setting `spark.dynamicAllocation.shuffleTracking.enabled` to `false`
>> would effectively disable
>> `spark.dynamicAllocation.shuffleTracking.timeout`?
>>
>> Btw, maybe off-topic: `spark.dynamicAllocation` has another issue in
>> practice - whether to duplicate descriptions between the configuration
>> code and the docs. I have been asked to add descriptions in the
>> configuration code regardless, while the existing codebase doesn't. This
>> is a widely used configuration.
>>
>> On Wed, Feb 12, 2020 at 11:22 AM Hyukjin Kwon <gurwls...@gmail.com>
>> wrote:
>>
>>> Sure, adding "[DISCUSS]" is a good practice to label it. I should have
>>> done it, although it might be "redundant" :-) since anyone can give
>>> feedback on any thread on the Spark dev mailing list and discuss.
>>>
>>> This is actually quite prevalent, given my rough reading of the
>>> configuration files. I would like to see these missing relationships as
>>> a bad pattern, even if that view started from a personal preference.
>>>
>>> > Personally, I'd rather not assume someone won't understand that
>>> > setting `.enabled` to `false` means the functionality is disabled and
>>> > effectively disables all sub-configurations.
>>> > E.g. when `spark.sql.adaptive.enabled` is `false`, all the
>>> > configurations under `spark.sql.adaptive.*` are implicitly no-ops.
>>> > For me this is pretty intuitive and one of the major benefits of the
>>> > structured configurations.
>>>
>>> I don't think it is a good idea to assume users know such context. One
>>> might think
>>> `spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled` can
>>> partially enable the feature. It is better to document this explicitly,
>>> since some configurations are even difficult for users to confirm
>>> whether they are working or not.
>>> For instance, one might think that setting
>>> 'spark.eventLog.rolling.maxFileSize' automatically enables rolling, and
>>> then realise later that the log is not rolling once the file size grows.
>>>
>>> On Wed, Feb 12, 2020 at 10:47 AM, Jungtaek Lim
>>> <kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> I'm sorry if I'm missing something, but this would ideally have been
>>>> started as [DISCUSS], as I haven't seen any reference showing a
>>>> consensus on this practice.
>>>>
>>>> To me, there are simply two different practices co-existing in the
>>>> codebase, which means it is closer to an individual preference (with
>>>> the implicit understanding that others have different preferences), or
>>>> it hasn't been discussed thoroughly.
>>>>
>>>> Personally, I'd rather not assume someone won't understand that setting
>>>> `.enabled` to `false` means the functionality is disabled and
>>>> effectively disables all sub-configurations. E.g. when
>>>> `spark.sql.adaptive.enabled` is `false`, all the configurations under
>>>> `spark.sql.adaptive.*` are implicitly no-ops. For me this is pretty
>>>> intuitive and one of the major benefits of the structured
>>>> configurations.
>>>>
>>>> If we want to make it explicit, "all" sub-configurations would need a
>>>> redundant part in their doc - more redundant still if the condition is
>>>> nested. I agree this is a good step toward "being kind", but it is less
>>>> pragmatic.
>>>>
>>>> I'd be happy to follow whatever consensus we reach in this thread.
>>>> Appreciate more voices.
>>>>
>>>> Thanks,
>>>> Jungtaek Lim (HeartSaVioR)
>>>>
>>>> On Wed, Feb 12, 2020 at 10:36 AM Hyukjin Kwon <gurwls...@gmail.com>
>>>> wrote:
>>>>
>>>>> > I don't plan to document this officially yet
>>>>> Just to prevent confusion: I meant I don't yet plan to document, as a
>>>>> code/review guideline in https://spark.apache.org/contributing.html,
>>>>> the fact that we should write out the relationships between
>>>>> configurations.
>>>>>
>>>>> On Wed, Feb 12, 2020 at 9:57 AM, Hyukjin Kwon <gurwls...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I happened to review some PRs and noticed that some configurations
>>>>>> are missing necessary information.
>>>>>>
>>>>>> To be explicit, I would like to make sure we document the direct
>>>>>> relationships between configurations in the documentation. For
>>>>>> example,
>>>>>> `spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled`
>>>>>> can only be enabled when `spark.sql.adaptive.enabled` is enabled,
>>>>>> and that's clearly documented.
>>>>>> We're good in general, given that we generally document them in
>>>>>> Apache Spark.
>>>>>> See 'spark.task.reaper.enabled', 'spark.dynamicAllocation.enabled',
>>>>>> 'spark.sql.parquet.filterPushdown', etc.
>>>>>>
>>>>>> However, I noticed a pattern where such information is missing in
>>>>>> some components, for example,
>>>>>> `spark.history.fs.cleaner.*`, `spark.history.kerberos.*` and
>>>>>> `spark.history.ui.acls.*`.
>>>>>>
>>>>>> I hope we all start to document such information. Logically, users
>>>>>> can't know the relationship, and I myself had to read the code to
>>>>>> confirm it when reviewing.
>>>>>> I don't plan to document this officially yet because it looks like a
>>>>>> pretty logical request to me; however,
>>>>>> let me know if you have different opinions.
>>>>>>
>>>>>> Thanks.
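[Editor's note: the relationships the thread asks to document, such as 'spark.eventLog.rolling.maxFileSize' requiring 'spark.eventLog.rolling.enabled', could even be made machine-checkable at review time. The sketch below is hypothetical: the `DEPENDENCIES` table and `unsatisfied` helper are not part of Spark; the config keys are real names, and treating an unset prerequisite flag as disabled matches the thread's rolling example but is otherwise a simplifying assumption.]

```python
# Illustrative sketch (assumed helper, not part of Spark): warn when a
# sub-configuration is set while its documented prerequisite flag is not
# enabled - the kind of relationship this thread proposes to document.

# Hypothetical dependency table: sub-key -> prerequisite ".enabled" key.
DEPENDENCIES = {
    "spark.eventLog.rolling.maxFileSize": "spark.eventLog.rolling.enabled",
    "spark.dynamicAllocation.minExecutors": "spark.dynamicAllocation.enabled",
}

def unsatisfied(conf):
    """Return keys that are set but whose prerequisite flag is not 'true'."""
    return sorted(
        key
        for key, gate in DEPENDENCIES.items()
        if key in conf and conf.get(gate, "false").lower() != "true"
    )

# Rolling size is set, but rolling itself was never enabled - the exact
# trap described in the thread.
conf = {"spark.eventLog.rolling.maxFileSize": "128m"}
print(unsatisfied(conf))  # ['spark.eventLog.rolling.maxFileSize']
```

A table like this would also answer Jungtaek's redundancy concern: the relationship is written once in a structured form, and both the generated docs and a review-time check could be derived from it.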