Github user srowen commented on the pull request:
https://github.com/apache/spark/pull/3702#issuecomment-68295668
@jkbradley You'll end up with one `BinaryLabelCounter` per partition _per
distinct key_ though. That's where the problem may occur.
I think this would definitely be used to pick thresholds, and for that you
don't need to download the curve, just find the optimal point. In that case I'd
say you simply don't bin at all, since binning makes the curve approximate and
down-sampled, and there's probably not much value in an approximation there.
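
To make the threshold-picking case concrete, here is a minimal sketch in plain Python, not Spark and not code from this PR. It scans every distinct score without binning and returns the one with the best F1. The function name `best_f1_threshold`, the choice of F1 as the criterion, and the in-memory list of (score, label) pairs are all assumptions for illustration.

```python
def best_f1_threshold(scores_and_labels):
    """Return (threshold, f1) maximizing F1 over every distinct score.

    scores_and_labels: iterable of (score, label), label in {0, 1}.
    Hypothetical helper: an in-memory illustration, not the MLlib API.
    """
    data = sorted(scores_and_labels, key=lambda sl: sl[0], reverse=True)
    total_pos = sum(label for _, label in data)
    best = (None, -1.0)
    tp = fp = 0
    i = 0
    while i < len(data):
        score = data[i][0]
        # Consume all examples sharing this score so ties move together.
        while i < len(data) and data[i][0] == score:
            if data[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fn = total_pos - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best[1]:
            best = (score, f1)
    return best
```

Only the single best point is returned, so nothing the size of the full curve has to be collected.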
I'm not terribly wedded to the change. It helps in the niche use case that
one does want to down-sample. It adds some complexity though, and complexity
adds up.
It's also possible to down-sample the final curve later. That could just
be a utility function somewhere instead of being injected here. Would that be
better? I think you lose a bit of information that way, but it's an
approximation to begin with.
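
As a rough sketch of that post-hoc option (again plain Python, not code from this PR), a utility could keep every k-th point of an already-computed curve plus the final point. The name `downsample_curve` and the `num_bins` parameter are hypothetical, and the curve is assumed to be a list of (x, y) pairs already in curve order.

```python
import math


def downsample_curve(points, num_bins):
    """Keep roughly num_bins points of a curve, always retaining both ends.

    points: sequence of (x, y) pairs in curve order (assumed input shape).
    num_bins: target number of points; 0 or less means no down-sampling.
    """
    n = len(points)
    if num_bins <= 0 or n <= num_bins:
        return list(points)
    # Stride chosen so the strided pass yields at most num_bins points.
    stride = math.ceil(n / num_bins)
    sampled = list(points[::stride])
    # Re-append the last point if the stride skipped it, so the curve
    # still ends where the original one does.
    if sampled[-1] != points[-1]:
        sampled.append(points[-1])
    return sampled
```

This simply drops the intermediate points rather than aggregating them, which is the "lose a bit of information" trade-off mentioned above.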