Dear Matei,
Thanks for the feedback!
I used the setSeed option for all randomized classifiers and always
trained with the same seeds, in the hope that this takes care of the
non-determinism. I did not run any significance tests, because I was
looking at this from a functional perspective and assumed that fixing
the seed values would deal with the non-determinism. The test results
report how many instances were classified differently.
Sometimes this is only 1 or 2 out of 100 instances, i.e., almost
certainly not significant. Other cases seem more interesting. For
example, 20/100 instances were classified differently by the linear SVM
on informative, uniformly distributed data after adding 1 to each
feature value.
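In essence, the seed fixing looks like this (a minimal sketch with an
arbitrary seed value; shown for scikit-learn, with the Spark MLlib and
Weka equivalents only indicated in comments):

    # Sketch: fix the seed so that repeated training runs are reproducible.
    # The concrete value 42 is arbitrary; the point is that every run reuses it.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import SGDClassifier

    SEED = 42

    rf = RandomForestClassifier(random_state=SEED)   # scikit-learn
    sgd = SGDClassifier(random_state=SEED)
    # Spark MLlib: RandomForestClassifier(seed=SEED)
    # Weka:        RandomForest with setSeed(SEED), i.e., the -S option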
I know that some of these problems should be expected. However, I was
actually not sure what to expect, especially once I started to compare
the results across different ML libraries. The random forests are a
good example: I expected them to depend on the feature/instance order,
but they only do in scikit-learn and Spark MLlib, not in Weka. There
are more such examples, such as logistic regression, which exhibits
different behavior in all three libraries. Thus, I decided to simply
hand my results to the people who know what to expect from their
implementations, i.e., the developers.
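To illustrate, the feature-order check essentially boils down to
something like this sketch (simplified, with synthetic data instead of
the generated test data):

    # Sketch: train on the original data and on data with permuted feature
    # columns, then count how many test instances are classified differently.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(42)
    X_train = rng.uniform(size=(100, 10))
    y_train = (X_train[:, 0] > 0.5).astype(int)   # informative first feature
    X_test = rng.uniform(size=(100, 10))

    perm = rng.permutation(X_train.shape[1])      # reorder the feature columns

    clf_orig = RandomForestClassifier(random_state=42).fit(X_train, y_train)
    clf_perm = RandomForestClassifier(random_state=42).fit(X_train[:, perm], y_train)

    pred_orig = clf_orig.predict(X_test)
    pred_perm = clf_perm.predict(X_test[:, perm])

    print("instances classified differently:", int(np.sum(pred_orig != pred_perm)))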
In the future, I will probably expand my test generator to allow more
detailed specifications of what is expected of each algorithm. This
seems to be a "must" for productive use by projects. Another possible
change would be to relax the assertions so that they only fire if the
differences are statistically significant; this could become a command
line option that allows different levels of strictness.
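Such a relaxed assertion could look roughly like the following sketch
(using McNemar's test here; the concrete test and the significance
level are exactly what such a command line option would control):

    # Sketch: only fail the test if the two sets of predictions disagree
    # with the true labels in a statistically significant way.
    import numpy as np
    from statsmodels.stats.contingency_tables import mcnemar

    def assert_no_significant_difference(y_true, pred_a, pred_b, alpha=0.05):
        correct_a = pred_a == y_true
        correct_b = pred_b == y_true
        # 2x2 contingency table of correct/incorrect decisions of both runs
        table = [
            [np.sum(correct_a & correct_b), np.sum(correct_a & ~correct_b)],
            [np.sum(~correct_a & correct_b), np.sum(~correct_a & ~correct_b)],
        ]
        result = mcnemar(table, exact=True)
        assert result.pvalue >= alpha, (
            f"predictions differ significantly (p={result.pvalue:.4f})"
        )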
Best,
Steffen
On 22.08.2018 at 23:27, Matei Zaharia wrote:
Hi Steffen,
Thanks for sharing your results about MLlib — this sounds like a useful tool.
However, I wanted to point out that some of the results may be expected for
certain machine learning algorithms, so it might be good to design those tests
with that in mind. For example:
- The classification of LogisticRegression, DecisionTree, and RandomForest were
not inverted when all binary class labels are flipped.
- The classification of LogisticRegression, DecisionTree, GBT, and RandomForest
sometimes changed when the features are reordered.
- The classification of LogisticRegression, RandomForest, and LinearSVC
sometimes changed when the instances are reordered.
All of these things might occur because the algorithms are nondeterministic.
Were the effects large or small? Or, for example, was the final difference in
accuracy statistically significant? Many ML algorithms are trained using
randomized algorithms like stochastic gradient descent, so you can’t expect
exactly the same results under these changes.
- The classification of NaïveBayes and the LinearSVC sometimes changed if one
is added to each feature value.
This might be due to nondeterminism as above, but it might also be due to
regularization or nonlinear effects for some algorithms. For example, some
algorithms might look at the relative values of features, in which case adding
1 to each feature value transforms the data. Other algorithms might require
that data be centered around a mean of 0 to work best.
I haven’t read the paper in detail, but basically it would be good to account
for randomized algorithms as well as various model assumptions, and make sure
the differences in results in these tests are statistically significant.
Matei
--
Dr. Steffen Herbold
Institute of Computer Science
University of Goettingen
Goldschmidtstraße 7
37077 Göttingen, Germany
mailto. herb...@cs.uni-goettingen.de
tel. +49 551 39-172037