The loss function given for logistic regression here
<https://spark.apache.org/docs/1.6.0/mllib-linear-methods.html#mjx-eqn-eqregPrimal>
is confusing. It seems to imply that Spark uses only -1 and 1 as class
labels. However, spark.mllib uses 0 and 1, as the very inconspicuous note
quoted below (under Classification) says. We need to make this point more
visible to avoid confusion.

Better yet, we should replace the loss function listed with that for 0, 1
no matter how mathematically inconvenient, since that is what is actually
implemented in Spark.
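
To illustrate what the 0/1 form would look like, here is a minimal sketch
comparing the two per-example formulations. It is plain Python, not Spark
code; the function names and the variable `margin` (standing for the dot
product of weights and features) are made up for illustration. It only
shows that the two forms give the same value once labels are mapped with
y' = 2y - 1.

```python
import math

def logistic_loss_pm1(margin, y):
    """Per-example loss for labels y in {-1, +1}: log(1 + exp(-y * margin))."""
    return math.log1p(math.exp(-y * margin))

def logistic_loss_01(margin, y):
    """Per-example loss for labels y in {0, 1}: negative Bernoulli log-likelihood."""
    p = 1.0 / (1.0 + math.exp(-margin))  # predicted probability of class 1
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def to_pm1(y01):
    """Map a 0/1 label to the -1/+1 convention (hypothetical helper)."""
    return 2 * y01 - 1

# The two forms agree for every label and margin tried here.
for margin in (-2.0, 0.0, 0.5, 3.0):
    for y01 in (0, 1):
        a = logistic_loss_01(margin, y01)
        b = logistic_loss_pm1(margin, to_pm1(y01))
        assert abs(a - b) < 1e-9
```

So the documentation could state either form, as long as the label
convention next to the formula matches the one users actually pass in.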

More problematic, the loss function (even in this "convenient" form) is
actually incorrect. This is because it is missing either a summation
(sigma) outside the log or a product (pi) inside the log, as the loss for
logistic regression is the negative log-likelihood over all training
examples. So there are multiple problems with the documentation. Please
advise on steps to fix this across all versions of the documentation, or
whether a fix is already in place.
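
For reference, this is the standard identity I have in mind, written for
n training examples with labels in {-1, +1} and assuming independent
examples. The symbols w, x_i, y_i follow the usual convention and are not
copied from the guide, so treat this as a sketch of the expected form
rather than the exact notation the documentation should use.

```latex
-\log \prod_{i=1}^{n} \frac{1}{1 + \exp(-y_i \mathbf{w}^{T} \mathbf{x}_i)}
  \;=\; \sum_{i=1}^{n} \log\bigl(1 + \exp(-y_i \mathbf{w}^{T} \mathbf{x}_i)\bigr)
```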

"Note that, in the mathematical formulation in this guide, a binary label
y is denoted as either +1 (positive) or −1 (negative), which is convenient
for the formulation. *However*, the negative label is represented by 0 in
spark.mllib instead of −1, to be consistent with multiclass labeling."
