Re: Request to document the direct relationship between other configurations
Thanks Jungtaek!

On Fri, Feb 14, 2020 at 3:57 PM, Jungtaek Lim wrote:
> OK, I agree this is going forward as we decided the final goal and it seems
> someone has started to work on it. In the meanwhile, we agree about
> documenting the direct relationship, and which style to use is "open".
>
> Thanks again for initiating the discussion thread - this thread led to the
> following thread for the final goal.
>
> On Fri, Feb 14, 2020 at 1:43 PM Hyukjin Kwon wrote:
>
>> It's okay to just follow one prevailing style. The main point I would
>> like to make is that we should *document* the direct relationship of
>> configurations.
>> For this case specifically, I don't think there is much point in deciding
>> one hard requirement to follow for the mid-term. We have the final goal
>> to reach.
>>
>> There are always many cases that require a committer's judgement. It is
>> impossible to put every single item into a guideline in practice.
>> So we usually trust their judgement unless there's an explicit objection.
>> Here, my explicit objection is about having no documentation for the
>> direct relationship between configurations. It seems people think we
>> should document this in general as well.
>>
>> In practice, I don't particularly mind what style is used. I vaguely
>> guess the style isn't an issue to many other committers in general.
>> If this is the case, let's open a separate thread to discuss and confirm
>> one style - this wasn't the main issue I wanted to address in this thread.
>>
>> Shall we conclude this thread by deciding to document the direct
>> relationship between configurations, preferably in one prevailing style?
>>
>> On Fri, Feb 14, 2020 at 11:36 AM, Jungtaek Lim wrote:
>>
>>> Even spark.dynamicAllocation.* doesn't follow 2-2, right? It follows a
>>> mix of 1 and 2-1, though 1 is even broken there.
>>>
>>> It doesn't matter if we're just discussing a one-time decision - it may
>>> be OK not to be strict about consistency, though it's not ideal. The
>>> thing is that these kinds of "preferences" create confusion in the
>>> review phase: reviewers provide different comments and try to "educate"
>>> contributors on their preferences. Expectations for such cases depend
>>> heavily on who the reviewers of the PR are - there's no value in that
>>> kind of education.
>>>
>>> The codebase is the reference for the implicit rules/policies which
>>> apply to all contributors, including newcomers. Let's just put our best
>>> efforts into being consistent in the codebase. (We should have consensus
>>> to do this anyway.)
>>>
>>> On Thu, Feb 13, 2020 at 12:44 PM Hyukjin Kwon wrote:

I think it's just fine as long as we're consistent with the instances having
the description, for instance: "When true and 'spark.xx.xx' is enabled, …"
I think this is 2-2 in most cases so far.

I think we can reference other configuration keys in another configuration's
documentation by using ADAPTIVE_EXECUTION_ENABLED.key as an example. So we
don't have a duplication problem in most cases.

Being consistent with the current codebase is the general guidance if I am
not mistaken. We have identified a problem and a good goal to reach. If we
want to change, let's do it as our final goal. Otherwise, let's simplify it
to reduce the overhead rather than having a policy for the mid-term
specifically.

On Thu, Feb 13, 2020 at 12:24 PM, Jungtaek Lim wrote:
> I tend to agree that there should be a time to make things consistent
> (and I'm very happy to see the new thread on discussion), and we may want
> to take some practice in the interim.
>
> But for me it is not clear what the practice in the interim is. I pointed
> out the problems of the existing style, and if we all agree the problems
> are valid then we may need to fix them before starting to use it widely.
>
> For me the major question is "where" to put it - there are at least two
> different approaches:
>
> 1) Put it in the description of `.enable` and describe the range of
> impact, e.g. put the description on "spark.A.enable" saying it will affect
> the following configurations under "spark.A".
> (I don't think it's good to enumerate all of the affected configs; as
> `spark.dynamicAllocation.enable` shows, that is fragile.)
>
> 2) Put it in the description of all affected configurations and describe
> which configuration is the prerequisite for this to be effective, e.g. put
> it on all configurations under `spark.dynamicAllocation`, mentioning that
> `spark.dynamicAllocation.enable` should be enabled for this to take effect.
>
> (I intended to skip mentioning "cross reference". Let's be pragmatic.)
>
> 2) also has two ways to describe it:
>
> 2-1) Just mention it simply - like "When dynamic allocation is enabled,",
> not pointing out the key to toggle. This hugely simplifies the description,
> though
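As a concrete illustration of the styles discussed in this thread, here is a
minimal sketch of what 2-1 and 2-2 could look like with Spark's internal
ConfigBuilder. The `spark.example.feature.*` keys and doc wording are invented
purely for illustration, not a decided convention; and since ConfigBuilder is
private[spark], a definition like this only compiles inside Spark's own source
tree (e.g. in the org.apache.spark.internal.config package object), which is
where such entries live.

// Illustrative sketch only: hypothetical keys and wording, not an agreed style.
package org.apache.spark.internal.config

import java.util.concurrent.TimeUnit

object ExampleDocStyles {

  val FEATURE_ENABLED = ConfigBuilder("spark.example.feature.enabled")
    .doc("Whether to enable the example feature.")
    .booleanConf
    .createWithDefault(false)

  // Style 2-1: describe the prerequisite in words, without naming the key.
  val FEATURE_TIMEOUT_STYLE_2_1 = ConfigBuilder("spark.example.feature.timeout")
    .doc("When the example feature is enabled, requests time out after this duration.")
    .timeConf(TimeUnit.SECONDS)
    .createWithDefaultString("60s")

  // Style 2-2: reference the prerequisite key directly (the
  // ADAPTIVE_EXECUTION_ENABLED.key pattern mentioned above), so the text
  // stays correct if the key is ever renamed.
  val FEATURE_MAX_RETRIES_STYLE_2_2 = ConfigBuilder("spark.example.feature.maxRetries")
    .doc(s"When ${FEATURE_ENABLED.key} is enabled, the maximum number of retries " +
      "before giving up.")
    .intConf
    .createWithDefault(3)
}

Either way, the point made above stands: the direct relationship is documented
on the dependent configuration itself.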
Re: SparkGraph review process
This will not be in Spark 3.0, no.

On Fri, Feb 14, 2020 at 1:12 AM kant kodali wrote:
>
> any update on this? Is spark graph going to make it into Spark or no?
>
> On Mon, Oct 14, 2019 at 12:26 PM Holden Karau wrote:
>>
>> Maybe let's ask the folks from Lightbend who helped with the previous
>> scala upgrade for their thoughts?
>>
>> On Mon, Oct 14, 2019 at 8:24 PM Xiao Li wrote:

1. On the technical side, my main concern is the runtime dependency on
org.opencypher:okapi-shade. okapi depends on several Scala libraries. We came
up with the solution to shade a few Scala libraries to avoid pollution.
However, I'm not super confident that the approach is sustainable, for two
reasons: a) there are no proper shading libraries for Scala, and b) we will
have to wait for upgrades from those Scala libraries before we can upgrade
Spark to use a newer Scala version. So it would be great if some Scala
experts could help review the current implementation and help assess the risk.

>>> This concern is valid. I think we should start the vote to ensure the
>>> whole community is aware of the risk and takes the responsibility to
>>> maintain this in the long term.
>>>
>>> Cheers,
>>>
>>> Xiao
>>>
>>> On Fri, Oct 4, 2019 at 12:27 PM, Xiangrui Meng wrote:

Hi all,

I want to clarify my role first to avoid misunderstanding. I'm an individual
contributor here. My work on the graph SPIP, as well as other Spark features
I contributed to, is not associated with my employer.

It became quite challenging for me to keep track of the graph SPIP work due
to less available time at home. In retrospect, we should have involved more
Spark devs and committers early on so there is no single point of failure,
i.e., me. Hopefully it is not too late to fix that. I summarize my thoughts
here to help onboard other reviewers:

1. On the technical side, my main concern is the runtime dependency on
org.opencypher:okapi-shade. okapi depends on several Scala libraries. We came
up with the solution to shade a few Scala libraries to avoid pollution.
However, I'm not super confident that the approach is sustainable, for two
reasons: a) there are no proper shading libraries for Scala, and b) we will
have to wait for upgrades from those Scala libraries before we can upgrade
Spark to use a newer Scala version. So it would be great if some Scala
experts could help review the current implementation and help assess the risk.

2. Overloading helper methods. MLlib used to have several overloaded helper
methods for each algorithm, which later became a major maintenance burden.
Builders and setters/getters are more maintainable. I will comment again on
the PR.

3. The proposed API partitions the graph into sub-graphs, as described in the
property graph model. It is unclear to me how it would affect query
performance, because it requires the SQL optimizer to correctly recognize
data from the same source and make execution efficient.

4. The feature, although originally targeted for Spark 3.0, should not be a
Spark 3.0 release blocker because it doesn't require breaking changes. If we
miss the code freeze deadline, we can introduce a build flag to exclude the
module from the official release/distribution, and then make it the default
once the module is ready.

5. If unfortunately we still don't see sufficient committer reviews, I think
the best option would be submitting the work to Apache Incubator instead to
unblock the work. But maybe it is too early to discuss this option.

It would be great if other committers can offer help on the review! Really
appreciated!
Best,
Xiangrui

On Fri, Oct 4, 2019 at 1:32 AM Mats Rydberg wrote:
>
> Hello dear Spark community
>
> We are the developers behind the SparkGraph SPIP, which is a project
> created out of our work on openCypher Morpheus
> (https://github.com/opencypher/morpheus). During this year we have
> collaborated with mainly Xiangrui Meng of Databricks to define and develop
> a new SparkGraph module based on our experience from working on Morpheus.
> Morpheus - formerly known as "Cypher for Apache Spark" - has been in
> development for over 3 years and has matured in its API and implementation.
>
> The SPIP work has been on hold for a period of time now, as priorities at
> Databricks have changed, which has occupied Xiangrui's time (as well as
> other happenings). As you may know, the latest API PR
> (https://github.com/apache/spark/pull/24851) is blocking us from moving
> forward with the implementation.
>
> In an attempt to not lose track of this project we now reach out to you
> to ask whether there are any Spark co
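As an aside on point 2 in Xiangrui's summary above, here is a minimal generic
sketch of why builders/setters tend to age better than overloaded helpers.
The `GraphLoader` names are hypothetical and not SparkGraph's or MLlib's
actual API; the sketch only illustrates the maintenance trade-off.

// Overload style: every optional argument multiplies the number of signatures
// that must be kept source- and binary-compatible forever.
object GraphLoaderOverloads {
  def load(path: String): Unit = load(path, "csv")
  def load(path: String, format: String): Unit = load(path, format, inferSchema = true)
  def load(path: String, format: String, inferSchema: Boolean): Unit =
    println(s"loading $path as $format (inferSchema=$inferSchema)")
}

// Builder/setter style: one entry point; new options can be added without
// touching any existing signature.
class GraphLoader private (path: String, format: String, inferSchema: Boolean) {
  def withFormat(value: String): GraphLoader = new GraphLoader(path, value, inferSchema)
  def withInferSchema(value: Boolean): GraphLoader = new GraphLoader(path, format, value)
  def load(): Unit = println(s"loading $path as $format (inferSchema=$inferSchema)")
}

object GraphLoader {
  def apply(path: String): GraphLoader = new GraphLoader(path, "csv", inferSchema = true)
}

object BuilderExample extends App {
  GraphLoaderOverloads.load("graph.csv")                  // overload style
  GraphLoader("graph.csv").withInferSchema(false).load()  // builder style
}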
[DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`
Hi, All.

I'm sending this email because the Apache Spark committers had better have a
consistent point of view for the upcoming PRs. And a community policy is the
way to lead the community members transparently and clearly for the long-term
good.

First of all, I want to emphasize that, like Apache Spark 2.0.0, Apache Spark
3.0.0 is going to achieve many improvements in SQL areas.

Especially, we invested lots of effort to improve SQL ANSI compliance and the
related test coverage for the future. One simple example is the following.

    [SPARK-28126][SQL] Support TRIM(trimStr FROM str) syntax

With this support, we are able to move away from the esoteric Spark functions
`TRIM/LTRIM/RTRIM`.
(Note that the above syntax is ANSI standard, but we also have additional
non-standard forms of these functions.)

Here is the summary of the current status.

    +------------------------+-------------+---------------+
    | SQL Processing Engine  | TRIM Syntax | TRIM Function |
    +------------------------+-------------+---------------+
    | Spark 3.0.0-preview/2  | O           | O             |
    | PostgreSQL             | O           | O             |
    | Presto                 | O           | O             |
    | MySQL                  | O(*)        | X             |
    | Oracle                 | O           | X             |
    | MsSQL                  | O           | X             |
    | Teradata               | O           | X             |
    | Hive                   | X           | X             |
    | Spark 1.6 ~ 2.2        | X           | X             |
    | Spark 2.3 ~ 2.4        | X           | O(**)         |
    +------------------------+-------------+---------------+
    *  MySQL doesn't allow multiple trim characters.
    ** Spark 2.3 ~ 2.4 have the function with a different parameter order.

Here is an illustrative example of the problem.

    postgres=# SELECT trim('yxTomxx', 'xyz');
     btrim
    -------
     Tom

    presto:default> SELECT trim('yxTomxx', 'xyz');
     _col0
    -------
     Tom

    spark-sql> SELECT trim('yxTomxx', 'xyz');
    z

Here is our history of fixing the above issue.

    [SPARK-28093][SQL] Fix TRIM/LTRIM/RTRIM function parameter order issue
    1. https://github.com/apache/spark/pull/24902
       (Merged 2019-06-18 for v3.0.0; 3.0.0-preview and 3.0.0-preview2 released.)
    2. https://github.com/apache/spark/pull/24907
       (Merged 2019-06-20 for v2.3.4, but reverted)
    3. https://github.com/apache/spark/pull/24908
       (Merged 2019-06-21 for v2.4.4, but reverted)

(2) and (3) were reverted before the releases because we didn't want to fix
that in the maintenance releases. Please see the following references for the
decision.

    https://github.com/apache/spark/pull/24908#issuecomment-504799028 (2.3)
    https://github.com/apache/spark/pull/24907#issuecomment-504799021 (2.4)

Now, there are some requests to revert SPARK-28093 and to keep these esoteric
functions for backward compatibility, with the following reasoning.

> Reordering function parameters to match another system,
> for a method that is otherwise working correctly,
> sounds exactly like a cosmetic change to me.

> How can we silently change the parameter of an existing SQL function?
> I don't think this is a correctness issue as the SQL standard
> doesn't say that the function signature has to be trim(srcStr, trimStr).

The concern and the point of view make sense.

My concerns are the following.

    1. This kind of esoteric difference is what is negatively called
       `vendor lock-in`.
       - It's difficult for new users to understand.
       - It's hard to migrate between Apache Spark and the others.
    2. Although we did our best, Apache Spark SQL has not always been good
       enough.
    3. We need to make improvements in the future releases.

In short, we can either keep the 3.0.0-preview behavior here or revert
SPARK-28093 in order to keep this vendor lock-in behavior for backward
compatibility.

What I want is to build a consistent point of view in this category for the
upcoming PR reviews.

If we have a clear policy, we can save future community efforts in many ways.

Bests,
Dongjoon.
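For reference, here is a minimal sketch of the ANSI syntax discussed above,
assuming a local SparkSession on a build where SPARK-28126 is available
(3.0.0-preview or later, per the table). It avoids the two-argument form
whose parameter order differs across the versions listed; the object and app
names are invented for the example.

// Minimal sketch: ANSI TRIM(trimStr FROM str) instead of the two-argument form.
import org.apache.spark.sql.SparkSession

object TrimSyntaxExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("trim-syntax-example")
      .getOrCreate()

    // The trim characters are named explicitly before FROM, so there is no
    // ambiguity about argument order.
    spark.sql("SELECT trim('xyz' FROM 'yxTomxx') AS trimmed").show()
    // Expected (matching the PostgreSQL/Presto examples above): Tom

    spark.stop()
  }
}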
Re: [DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`
Please note that the context is TRIM/LTRIM/RTRIM with two parameters and the
TRIM(trimStr FROM str) syntax. This thread is irrelevant to one-parameter
TRIM/LTRIM/RTRIM.

On Fri, Feb 14, 2020 at 11:35 AM Dongjoon Hyun wrote:
> Hi, All.
>
> I'm sending this email because the Apache Spark committers had better have
> a consistent point of view for the upcoming PRs. And a community policy is
> the way to lead the community members transparently and clearly for the
> long-term good.
>
> First of all, I want to emphasize that, like Apache Spark 2.0.0, Apache
> Spark 3.0.0 is going to achieve many improvements in SQL areas.
>
> Especially, we invested lots of effort to improve SQL ANSI compliance and
> the related test coverage for the future. One simple example is the
> following.
>
>     [SPARK-28126][SQL] Support TRIM(trimStr FROM str) syntax
>
> With this support, we are able to move away from the esoteric Spark
> functions `TRIM/LTRIM/RTRIM`.
> (Note that the above syntax is ANSI standard, but we also have additional
> non-standard forms of these functions.)
>
> Here is the summary of the current status.
>
>     +------------------------+-------------+---------------+
>     | SQL Processing Engine  | TRIM Syntax | TRIM Function |
>     +------------------------+-------------+---------------+
>     | Spark 3.0.0-preview/2  | O           | O             |
>     | PostgreSQL             | O           | O             |
>     | Presto                 | O           | O             |
>     | MySQL                  | O(*)        | X             |
>     | Oracle                 | O           | X             |
>     | MsSQL                  | O           | X             |
>     | Teradata               | O           | X             |
>     | Hive                   | X           | X             |
>     | Spark 1.6 ~ 2.2        | X           | X             |
>     | Spark 2.3 ~ 2.4        | X           | O(**)         |
>     +------------------------+-------------+---------------+
>     *  MySQL doesn't allow multiple trim characters.
>     ** Spark 2.3 ~ 2.4 have the function with a different parameter order.
>
> Here is an illustrative example of the problem.
>
>     postgres=# SELECT trim('yxTomxx', 'xyz');
>      btrim
>     -------
>      Tom
>
>     presto:default> SELECT trim('yxTomxx', 'xyz');
>      _col0
>     -------
>      Tom
>
>     spark-sql> SELECT trim('yxTomxx', 'xyz');
>     z
>
> Here is our history of fixing the above issue.
>
>     [SPARK-28093][SQL] Fix TRIM/LTRIM/RTRIM function parameter order issue
>     1. https://github.com/apache/spark/pull/24902
>        (Merged 2019-06-18 for v3.0.0; 3.0.0-preview and 3.0.0-preview2 released.)
>     2. https://github.com/apache/spark/pull/24907
>        (Merged 2019-06-20 for v2.3.4, but reverted)
>     3. https://github.com/apache/spark/pull/24908
>        (Merged 2019-06-21 for v2.4.4, but reverted)
>
> (2) and (3) were reverted before the releases because we didn't want to fix
> that in the maintenance releases. Please see the following references for
> the decision.
>
>     https://github.com/apache/spark/pull/24908#issuecomment-504799028 (2.3)
>     https://github.com/apache/spark/pull/24907#issuecomment-504799021 (2.4)
>
> Now, there are some requests to revert SPARK-28093 and to keep these
> esoteric functions for backward compatibility, with the following reasoning.
>
> > Reordering function parameters to match another system,
> > for a method that is otherwise working correctly,
> > sounds exactly like a cosmetic change to me.
>
> > How can we silently change the parameter of an existing SQL function?
> > I don't think this is a correctness issue as the SQL standard
> > doesn't say that the function signature has to be trim(srcStr, trimStr).
>
> The concern and the point of view make sense.
>
> My concerns are the following.
>
>     1. This kind of esoteric difference is what is negatively called
>        `vendor lock-in`.
>        - It's difficult for new users to understand.
>        - It's hard to migrate between Apache Spark and the others.
>     2. Although we did our best, Apache Spark SQL has not always been good
>        enough.
>     3. We need to make improvements in the future releases.
>
> In short, we can either keep the 3.0.0-preview behavior here or revert
> SPARK-28093 in order to keep this vendor lock-in behavior for backward
> compatibility.
>
> What I want is to build a consistent point of view in this category for
> the upcoming PR reviews.
>
> If we have a clear policy, we can save future community efforts in many
> ways.
>
> Bests,
> Dongjoon.
>