Re: [MLlib][Test] Smoke and Metamorphic Testing of MLlib
Dear Matei,

thanks for the feedback!

I used the setSeed option for all randomized classifiers and always used the same seeds for training, in the hope that this deals with the non-determinism. I did not run any significance tests, because I was considering this from a functional perspective, assuming that the non-determinism would be dealt with once I fix the seed values. The test results contain how many instances were classified differently. Sometimes these are only 1 or 2 out of 100 instances, i.e., almost certainly not significant. Other cases seem more interesting. For example, 20/100 instances were classified differently by the linear SVM for informative, uniformly distributed data if we added 1 to each feature value.

I know that these problems should sometimes be expected. However, I was actually not sure what to expect, especially after I started to compare the results for different ML libraries. The random forests are a good example: I expected them to depend on feature/instance order. However, they do not in Weka, only in scikit-learn and Spark MLlib. There are more such examples, like logistic regression, which exhibits different behavior in all three libraries. Thus, I decided to just give my results to the people who know what to expect from their implementations, i.e., the devs.

I will probably expand my test generator to allow more detailed specifications of the expectations of the algorithms in the future. This seems to be a "must" for potentially productive use by projects. Relaxing the assertions to only react if the differences are significant would be another possible change. This could be a command-line option to allow different strictness of testing.

Best,
Steffen

On 22.08.2018 at 23:27, Matei Zaharia wrote:
> Hi Steffen,
>
> Thanks for sharing your results about MLlib — this sounds like a useful tool. However, I wanted to point out that some of the results may be expected for certain machine learning algorithms, so it might be good to design those tests with that in mind. For example:
>
>> - The classification of LogisticRegression, DecisionTree, and RandomForest were not inverted when all binary class labels are flipped.
>> - The classification of LogisticRegression, DecisionTree, GBT, and RandomForest sometimes changed when the features are reordered.
>> - The classification of LogisticRegression, RandomForest, and LinearSVC sometimes changed when the instances are reordered.
>
> All of these things might occur because the algorithms are nondeterministic. Were the effects large or small? Or, for example, was the final difference in accuracy statistically significant? Many ML algorithms are trained using randomized algorithms like stochastic gradient descent, so you can’t expect exactly the same results under these changes.
>
>> - The classification of NaïveBayes and the LinearSVC sometimes changed if one is added to each feature value.
>
> This might be due to nondeterminism as above, but it might also be due to regularization or nonlinear effects for some algorithms. For example, some algorithms might look at the relative values of features, in which case adding 1 to each feature value transforms the data. Other algorithms might require that data be centered around a mean of 0 to work best.
>
> I haven’t read the paper in detail, but basically it would be good to account for randomized algorithms as well as various model assumptions, and make sure the differences in results in these tests are statistically significant.
>
> Matei

--
Dr. Steffen Herbold
Institute of Computer Science
University of Goettingen
Goldschmidtstraße 7
37077 Göttingen, Germany
mailto. herb...@cs.uni-goettingen.de
tel. +49 551 39-172037
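The point about adding a constant to every feature can be made concrete with a small numpy sketch (an illustration only, not code from MLlib or from the test generator): a linear decision rule without an intercept is not invariant to a constant shift of the features, while centering the data to mean 0 restores that invariance.

```python
import numpy as np

# A fixed linear decision rule without an intercept term: sign(w . x).
w = np.array([1.0, -1.0, 0.5])

def predict(X):
    return np.sign(X @ w)

X = np.array([[0.2, 0.5, 0.1],
              [1.0, 0.3, 0.2]])

# Shifting every feature by 1 adds sum(w) = 0.5 to every score,
# so instances near the decision boundary can flip class.
shifted = X + 1.0
flipped = int((predict(X) != predict(shifted)).sum())

# Centering each feature to mean 0 removes any constant shift:
# the centered views of X and X + 1 are identical arrays.
center = lambda M: M - M.mean(axis=0)
invariant = np.allclose(center(X), center(shifted))
```

Here the first instance sits close to the boundary and flips class under the shift, while the second does not, which mirrors the "some instances classified differently" pattern in the test results.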
Re: [MLlib][Test] Smoke and Metamorphic Testing of MLlib
Behaviors at this level of detail, across different ML implementations, are highly unlikely to ever align exactly. Seemingly small changes in logic, such as "<" versus "<=", or differences in random number generators, etc. (to say nothing of different implementation languages), will accumulate over training to yield different models, even if their overall performance should be similar.

> The random forests are a good example. I expected them to be dependent on feature/instance order. However, they are not in Weka, only in scikit-learn and Spark MLlib. There are more such examples, like logistic regression that exhibits different behavior in all three libraries.
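The "<" versus "<=" point is easy to demonstrate with a toy decision stump (a hypothetical sketch, not code from any of the libraries discussed): two implementations that differ only in the comparison operator disagree exactly on instances that sit on the split threshold.

```python
# Two decision stumps splitting feature 0 at the same threshold,
# differing only in whether the boundary value goes left or right.
def stump_strict(x, threshold=0.5):
    return 0 if x[0] < threshold else 1

def stump_inclusive(x, threshold=0.5):
    return 0 if x[0] <= threshold else 1

instances = [[0.2], [0.5], [0.9]]
labels_strict = [stump_strict(x) for x in instances]
labels_inclusive = [stump_inclusive(x) for x in instances]

# Only the instance exactly at the threshold is classified differently;
# in a tree ensemble, such tie-breaking differences change which splits
# are chosen next and compound into visibly different models.
disagreements = sum(a != b for a, b in zip(labels_strict, labels_inclusive))
```

A single disagreeing training instance at one node is enough to send two otherwise identical implementations down different paths during training.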
Porting or explicitly linking project style in Apache Spark based on https://github.com/databricks/scala-style-guide
Hi all,

I usually follow https://github.com/databricks/scala-style-guide for Apache Spark's style, which in practice is generally consistent with Spark's code base. The thing is, we don't explicitly mention this within Apache Spark as far as I can tell.

Can we explicitly mention this guide or port it? It doesn't necessarily mean hard requirements for PRs or code changes, but we could at least encourage people to read it.
Re: Porting or explicitly linking project style in Apache Spark based on https://github.com/databricks/scala-style-guide
Seems OK to me. The style is pretty standard Scala style anyway. My guidance is always to follow the code around the code you're changing.
Re: Porting or explicitly linking project style in Apache Spark based on https://github.com/databricks/scala-style-guide
There’s already a code style guide listed on http://spark.apache.org/contributing.html. Maybe it’s the same? We should decide which one we actually want and update this page if it’s wrong.

Matei
Re: Porting or explicitly linking project style in Apache Spark based on https://github.com/databricks/scala-style-guide
If you meant "Code Style Guide", many of those rules are missing there, and it refers to https://docs.scala-lang.org/style/, not https://github.com/databricks/scala-style-guide (please correct me if I misunderstood). For instance, I recently advised 2-space indents for line continuation, but found that this is actually not in the official guide (although it is rather usual in Spark's code base as far as I can tell, FWIW).

Can we just leave a link there instead?
Re: Porting or explicitly linking project style in Apache Spark based on https://github.com/databricks/scala-style-guide
I wrote both the Spark one and later the Databricks one. The latter had a lot more work put into it and is consistent with the Spark style. I'd just use the second one and link to it, if possible.
Re: [MLlib][Test] Smoke and Metamorphic Testing of MLlib
Yes, that makes sense, but just to be clear, using the same seed does *not* imply that the algorithm should produce “equivalent” results by some definition of equivalent if you change the input data. For example, in SGD, the random seed might be used to select the next minibatch of examples, but if you reorder the data or change the labels, this will result in a different gradient being computed. Just because the dataset transformation seems to preserve the ML problem at a high abstraction level does not mean that even a deterministic ML algorithm (MLlib with a seed) will give the same result. Maybe other libraries do, but it doesn’t necessarily mean that MLlib is doing something wrong here.

Basically, I’m just saying that as an ML library developer I wouldn’t be super concerned about these particular test results (especially if just a few instances change classification). I would be much more interested, however, in results like the following:

- The algorithm’s evaluation metrics (loss, accuracy, etc.) change in a statistically significant way if you change these properties of the data. This probably requires you to run multiple times with different seeds.
- MLlib’s evaluation metrics for a problem differ in a statistically significant way from those of other ML libraries, for algorithms configured with equivalent hyperparameters. (Sometimes libraries have different definitions for hyperparameters, though.)

The second one is definitely something we’ve tested for informally in the past, though it is not in unit tests as far as I know.

Matei
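One way to implement the "only react if the differences are significant" idea is McNemar's test on the paired results of two runs. The sketch below is an assumption about how such a check could look, not part of the actual test generator; it uses only the Python standard library, since for one degree of freedom the chi-square tail probability reduces to erfc(sqrt(x/2)).

```python
import math

def mcnemar_p(b, c):
    """McNemar's test with continuity correction.

    b and c are the off-diagonal counts of the 2x2 contingency table
    pairing the two runs' per-instance correctness: b = instances the
    first run got right and the second got wrong, c = the reverse.
    Returns the p-value for the null hypothesis that both runs have
    the same error rate.
    """
    if b + c == 0:
        return 1.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Chi-square survival function with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2))

# Illustrative counts only: a 2-vs-1 disagreement is clearly noise,
# while a 20-vs-5 asymmetry is significant at the 0.05 level.
p_small = mcnemar_p(2, 1)
p_large = mcnemar_p(20, 5)
```

A metamorphic assertion could then fire only when the p-value falls below a configurable threshold, matching the proposed command-line option for test strictness.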
Re: Porting or explicitly linking project style in Apache Spark based on https://github.com/databricks/scala-style-guide
Will make a fix to the site. Thanks all.