Thanks Jungtaek!

On Fri, Feb 14, 2020 at 3:57 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
> OK, I agree this is going forward, as we have decided on the final goal and it seems someone has started to work on it. In the meanwhile we agree about documenting the direct relationship, and which style to use is "open".
>
> Thanks again for initiating the discussion thread - this thread led to the follow-up thread for the final goal.
>
> On Fri, Feb 14, 2020 at 1:43 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>
>> It's okay to just follow one prevailing style. The main point I would like to make is that we should *document* the direct relationship between configurations. For this case specifically, I don't think there is much point in deciding on one hard requirement to follow for the mid-term. We have the final goal to reach.
>>
>> There are always many cases that require a committer's judgement. It is impossible to put every single item into a guideline in practice. So we usually trust their judgement unless there's an explicit objection. Here, my explicit objection is about missing documentation of the direct relationship between configurations. It seems people think we should document this in general as well.
>>
>> In practice, I don't particularly mind which style is used. I vaguely guess the style isn't an issue to many other committers in general. If it is, let's open a separate thread to confirm one style - this wasn't the main issue I wanted to address in this thread.
>>
>> Shall we conclude this thread by deciding to document the direct relationship between configurations, preferably in one prevailing style?
>>
>> On Fri, Feb 14, 2020 at 11:36 AM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>
>>> Even spark.dynamicAllocation.* doesn't follow 2-2, right? It follows a mix of 1 and 2-1, though 1 is even broken there.
>>>
>>> It doesn't matter if we are just discussing a one-time decision - it may be OK not to be strict on consistency, though it's not ideal. The thing is that these kinds of "preferences" create confusion in the review phase: reviewers provide different comments and try to "educate" contributors on their preferences. Expectations for such cases depend heavily on who the reviewers of the PR are - there's no value in that kind of education.
>>>
>>> The codebase is the reference for the implicit rules/policies that apply to all contributors, including newcomers. Let's just put our best effort into being consistent in the codebase. (We should have consensus to do this anyway.)
>>>
>>> On Thu, Feb 13, 2020 at 12:44 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>>
>>>> I think it's just fine as long as we're consistent with the instances that have the description, for instance:
>>>>
>>>> When true and 'spark.xx.xx' is enabled, …
>>>>
>>>> I think this is 2-2 in most cases so far. We can reference other configuration keys in another configuration's documentation by using ADAPTIVE_EXECUTION_ENABLED.key, for example, so we don't have a duplication problem in most cases.
>>>>
>>>> Being consistent with the current codebase is the general guidance, if I am not mistaken. We have identified a problem and a good goal to reach. If we want to change, let's do it as our final goal. Otherwise, let's simplify it to reduce the overhead rather than having a policy specifically for the mid-term.
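As an illustration of the 2-2 style and the `.key` reference mentioned above, here is a minimal sketch in Scala. It assumes Spark's internal ConfigBuilder-style API; the doc strings, default values, and exact builder calls are illustrative assumptions, not the actual Spark definitions.

    import org.apache.spark.internal.config.ConfigBuilder

    // Prerequisite toggle.
    val ADAPTIVE_EXECUTION_ENABLED = ConfigBuilder("spark.sql.adaptive.enabled")
      .doc("When true, enable adaptive query execution.")
      .booleanConf
      .createWithDefault(false)

    // Dependent config: its doc interpolates the prerequisite entry's .key,
    // so the documented relationship cannot drift from the real key name.
    val REDUCE_POST_SHUFFLE_PARTITIONS_ENABLED =
      ConfigBuilder("spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled")
        .doc(s"When true and ${ADAPTIVE_EXECUTION_ENABLED.key} is true, reduce the " +
          "number of post-shuffle partitions based on map output statistics.")
        .booleanConf
        .createWithDefault(true)

With this style the direct relationship is stated explicitly in the generated documentation without duplicating the key string by hand.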
>>>> On Thu, Feb 13, 2020 at 12:24 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>>>
>>>>> I tend to agree that there should be a time to make things consistent (and I'm very happy to see the new discussion thread), and we may want to adopt some practice in the interim.
>>>>>
>>>>> But for me it is not clear what the practice in the interim is. I pointed out the problems of the existing style, and if we all agree the problems are valid then we may need to fix them before we start using it widely.
>>>>>
>>>>> For me the major question is "where" to put it - there are at least two different approaches:
>>>>>
>>>>> 1) Put it in the description of `.enabled` and describe the range of impact, e.g. put in the description of "spark.A.enabled" that it will affect the following configurations under "spark.A". (I don't think it's good to enumerate all of the affected configs; the description of `spark.dynamicAllocation.enabled` shows how fragile that is.)
>>>>>
>>>>> 2) Put it in the description of all affected configurations and describe which configuration is the prerequisite for this one to be effective, e.g. put it on all configurations under `spark.dynamicAllocation`, mentioning that `spark.dynamicAllocation.enabled` should be enabled for this to take effect.
>>>>>
>>>>> (I intentionally skip mentioning "cross reference". Let's be pragmatic.)
>>>>>
>>>>> 2) also has two ways to describe it:
>>>>>
>>>>> 2-1) Just mention it simply - like "When dynamic allocation is enabled, ..." - without pointing out the key to toggle. This hugely simplifies the description, though end users may have to do some second-guessing to find the key to toggle. (It'll be intuitive once we structurize the configurations.)
>>>>>
>>>>> 2-2) Mention the key to toggle directly - like "This is effective only if spark.A.enabled is set to true." It gets longer depending on how long the configuration key is. Personally I'd find it verbose unless the toggle key can't be inferred from the structure, but others may have different opinions.
>>>>>
>>>>> I may be missing some details, so please chime in to add them. Then we can choose the best option and have a sample form of the description. (Once we reach that point, it may be OK to add it to the contribution doc, since that would be the thing we agree on.)
>>>>>
>>>>> Without the details it's going to be some sort of "preference", which naturally invites disagreement, and it doesn't make sense for someone to be forced to do something that turns out to be a "preference".
>>>>>
>>>>> On Thu, Feb 13, 2020 at 11:10 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>>>>
>>>>>> Adding that information is already the more prevailing style at this moment, and it is usual to follow the prevailing side if there isn't a specific reason not to. If there is confusion about this, I will explicitly add it to the guide (https://spark.apache.org/contributing.html). Let me know if this is still confusing or if you disagree.
>>>>>>
>>>>>> On Thu, Feb 13, 2020 at 9:47 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>>>>>
>>>>>>> Yes, that's probably our final goal: to revisit the configurations to make them structured with clean, deduplicated documentation. +1.
>>>>>>> One point I would like to add, though, is to keep adding such information to the documentation until we actually reach our final goal, since it will probably take a while to revisit/fix the configurations and make them fully structured together with the documentation. If we go more conservatively, we can add such information to the new configurations being added from now on at least, keeping the existing configurations as they are.
>>>>>>>
>>>>>>> On Thu, 13 Feb 2020, 06:12 Dongjoon Hyun, <dongjoon.h...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Thank you for raising the issue, Hyukjin.
>>>>>>>>
>>>>>>>> According to the current status of the discussion, it seems that we are able to agree on updating the non-structured configurations and keeping the structured configurations AS-IS.
>>>>>>>>
>>>>>>>> I'm +1 for revisiting the configurations if that is our direction. If there is some mismatch in structured configurations, we may fix them together.
>>>>>>>>
>>>>>>>> Bests,
>>>>>>>> Dongjoon.
>>>>>>>>
>>>>>>>> On Wed, Feb 12, 2020 at 8:08 AM Jules Damji <dmat...@comcast.net> wrote:
>>>>>>>>>
>>>>>>>>> All are valid and valuable observations to put into practice:
>>>>>>>>>
>>>>>>>>> * structured and meaningful config names
>>>>>>>>> * explanatory text or a succinct description
>>>>>>>>> * easily accessible or searchable
>>>>>>>>>
>>>>>>>>> These are aspirational but gradually doable if we make them part of the dev and review cycle. Often meaningful config names, like security, come as an afterthought.
>>>>>>>>>
>>>>>>>>> At the AMA at the Amsterdam Spark Summit last year, a few developers lamented the difficulty of finding Spark configs - what they do; what they are used for; when and why; and what their default values are.
>>>>>>>>>
>>>>>>>>> Though one can fetch them programmatically, one still has to know what specific config one is looking for.
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>> Jules
>>>>>>>>>
>>>>>>>>> Sent from my iPhone
>>>>>>>>> Pardon the dumb thumb typos :)
>>>>>>>>>
>>>>>>>>> On Feb 12, 2020, at 5:19 AM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Yeah, that's one of my points on why I don't want to document this in the guide yet.
>>>>>>>>>
>>>>>>>>> I would like to make sure the dev people are on the same page that documenting is the better practice. I don't intend to force it as a hard requirement; however, if it's pointed out, it is better to address it.
>>>>>>>>>
>>>>>>>>> On Wed, 12 Feb 2020, 21:30 Wenchen Fan, <cloud0...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> In general I think it's better to have more detailed documents, but we don't have to force everyone to do it if the config name is structured. I would +1 to documenting the relationship if we can't tell it from the config names, e.g. spark.shuffle.service.enabled and spark.dynamicAllocation.enabled.
>>>>>>>>>>
>>>>>>>>>> On Wed, Feb 12, 2020 at 7:54 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Also, I would like to hear other people's thoughts here. Could I ask what you all think about this in general?
>>>>>>>>>>> On Wed, Feb 12, 2020 at 12:02 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> To do that, we should explicitly document such structured configuration and its implicit effect, which is currently missing. I would be more than happy if we documented such implied relationships, *and* if we were very sure all configurations are structured correctly and coherently. Until that point, I think it is more practical to simply document it for now.
>>>>>>>>>>>>
>>>>>>>>>>>> > Btw, maybe off-topic, `spark.dynamicAllocation` has another issue in practice - whether to duplicate the description between the configuration code and the doc. I have been asked to add the description in the configuration code regardless, while the existing codebase doesn't. This is a widely-used configuration.
>>>>>>>>>>>>
>>>>>>>>>>>> This is actually something we should fix too. In the SQL configurations, we no longer have such duplication as of https://github.com/apache/spark/pull/27459, since the doc is generated. We should do the same for the other configurations.
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Feb 12, 2020 at 11:47 AM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm looking into the case of `spark.dynamicAllocation`, and it seems to support my point.
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://github.com/apache/spark/blob/master/docs/configuration.md#dynamic-allocation
>>>>>>>>>>>>>
>>>>>>>>>>>>> I don't disagree with adding "This requires spark.shuffle.service.enabled to be set." to the description of `spark.dynamicAllocation.enabled`. This cannot be inferred implicitly, hence it is better to have it.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here is why I'm in favor of structured configuration & implicit effect over describing everything explicitly:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. There are 10 configurations (if the doc doesn't miss any) besides `spark.dynamicAllocation.enabled`, and only 4 of them are referred to in the description of `spark.dynamicAllocation.enabled` - the majority of config keys are missing.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2. I think it's intentional, but the table starts with `spark.dynamicAllocation.enabled`, which says implicitly but intuitively that if you disable it, nothing under dynamic allocation will work. The missing references to the majority of config keys don't make it hard to understand.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 3. Even `spark.dynamicAllocation` has a bad case - see `spark.dynamicAllocation.shuffleTracking.enabled` and `spark.dynamicAllocation.shuffleTimeout`. The latter does not respect the structure of the configuration, and I think this is worse than not mentioning the relationship explicitly in the description. Suppose the name had been `spark.dynamicAllocation.shuffleTracking.timeout` - isn't it intuitive that setting `spark.dynamicAllocation.shuffleTracking.enabled` to `false` would effectively disable `spark.dynamicAllocation.shuffleTracking.timeout`?
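To make the third point above concrete, here is a small sketch in Scala contrasting the two namings (again assuming a ConfigBuilder-style API; the doc text and default values are illustrative assumptions, not Spark's actual definitions): with the structured name the prefix alone implies the toggle, while the flat name needs the 2-2 style doc to state it.

    import java.util.concurrent.TimeUnit
    import org.apache.spark.internal.config.ConfigBuilder

    // Hypothetical structured name: the shuffleTracking prefix already implies that
    // spark.dynamicAllocation.shuffleTracking.enabled gates it, so style 2-1 suffices.
    val STRUCTURED_SHUFFLE_TIMEOUT =
      ConfigBuilder("spark.dynamicAllocation.shuffleTracking.timeout")
        .doc("When shuffle tracking is enabled, the timeout for executors holding shuffle data.")
        .timeConf(TimeUnit.MILLISECONDS)
        .createWithDefault(Long.MaxValue)

    // Flat name: nothing in the key hints at the prerequisite, so the doc has to
    // spell out the toggle key explicitly (style 2-2).
    val FLAT_SHUFFLE_TIMEOUT =
      ConfigBuilder("spark.dynamicAllocation.shuffleTimeout")
        .doc("The timeout for executors holding shuffle data. This is effective only " +
          "when spark.dynamicAllocation.shuffleTracking.enabled is true.")
        .timeConf(TimeUnit.MILLISECONDS)
        .createWithDefault(Long.MaxValue)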
>>>>>>>>>>>>> Btw, maybe off-topic, `spark.dynamicAllocation` has another issue in practice - whether to duplicate the description between the configuration code and the doc. I have been asked to add the description in the configuration code regardless, while the existing codebase doesn't. This is a widely-used configuration.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Feb 12, 2020 at 11:22 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Sure, adding "[DISCUSS]" is a good practice for labeling it. I should have done it, although it might be "redundant" :-) since anyone can give feedback on any thread on the Spark dev mailing list and discuss it.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Documenting the relationship is actually the more prevailing style, given my rough reading of the configuration files. I would like to see this missing relationship treated as a bad pattern that started from a personal preference.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> > Personally, I'd rather not assume that someone won't understand that setting `.enabled` to `false` means the functionality is disabled and that it effectively disables all sub-configurations.
>>>>>>>>>>>>>> > E.g. when `spark.sql.adaptive.enabled` is `false`, all the configurations for `spark.sql.adaptive.*` are implicitly no-op. For me this is pretty intuitive and one of the major benefits of structured configurations.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I don't think it is a good idea to assume users know such context. One might think `spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled` can partially enable the feature. It is better to document it explicitly, since for some configurations it is even difficult for users to confirm whether they are taking effect or not. For instance, one might think that setting 'spark.eventLog.rolling.maxFileSize' automatically enables rolling, and then realise later, after the file grows bigger, that the log is not rolling.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Feb 12, 2020 at 10:47 AM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm sorry if I'm missing something, but ideally this would have been started as [DISCUSS], as I haven't seen any reference establishing consensus on this practice.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For me, there are simply two different practices co-existing in the codebase, which means it's closer to an individual preference (with implicit agreement that others have different preferences), or it hasn't been discussed thoroughly.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Personally, I'd rather not assume that someone won't understand that setting `.enabled` to `false` means the functionality is disabled and that it effectively disables all sub-configurations. E.g. when `spark.sql.adaptive.enabled` is `false`, all the configurations for `spark.sql.adaptive.*` are implicitly no-op.
>>>>>>>>>>>>>>> For me this is pretty intuitive and one of the major benefits of structured configurations.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If we want to make it explicit, "all" sub-configurations would need a redundant part in their doc - and more redundant if the condition is nested. I agree this is a good step toward "being kind", but it is less pragmatic.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'd be happy to follow whatever consensus we reach in this thread. Appreciate more voices.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Feb 12, 2020 at 10:36 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> > I don't plan to document this officially yet
>>>>>>>>>>>>>>>> Just to prevent confusion, I meant I don't yet plan to document the fact that we should write the relationships between configurations as a code/review guideline in https://spark.apache.org/contributing.html
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Feb 12, 2020 at 9:57 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I happened to review some PRs and noticed that some configurations are missing some necessary information.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> To be explicit, I would like to make sure we document the direct relationship with other configurations in the documentation. For example, `spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled` can only be enabled when `spark.sql.adaptive.enabled` is enabled. That's clearly documented. We're generally good, given that we document such relationships in Apache Spark. See 'spark.task.reaper.enabled', 'spark.dynamicAllocation.enabled', 'spark.sql.parquet.filterPushdown', etc.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> However, I noticed a pattern where such information is missing in some components in general, for example, `spark.history.fs.cleaner.*`, `spark.history.kerberos.*` and `spark.history.ui.acls.*`.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I hope we all start to document such information. Logically users can't know the relationship, and I myself had to read the code to confirm it when I reviewed. I don't plan to document this officially yet because it looks like a pretty logical request to me; however, let me know if you have different opinions.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks.