GitHub user srowen commented on the pull request:

    https://github.com/apache/spark/pull/3702#issuecomment-68295668
  
    @jkbradley You'll end up with one `BinaryLabelCounter` per partition _per 
distinct key_ though. That's where the problem may occur.
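
    To make the counting concrete, here is a rough sketch in plain Python rather than Spark. The function name `aggregate_by_score` and the toy data are made up for illustration, and a `(positives, negatives)` tuple stands in for a `BinaryLabelCounter`. Each partition keeps one counter per distinct score it sees, so the total is the sum of distinct scores over partitions, not the number of partitions.

```python
def aggregate_by_score(partitions):
    """Build one (positives, negatives) counter per distinct score, per partition.

    `partitions` is a hypothetical list of lists of (score, label) pairs,
    standing in for the partitions of an RDD.
    """
    per_partition = []
    for part in partitions:
        counters = {}
        for score, label in part:
            pos, neg = counters.get(score, (0, 0))
            if label > 0.5:
                counters[score] = (pos + 1, neg)
            else:
                counters[score] = (pos, neg + 1)
        per_partition.append(counters)
    return per_partition


# Two toy partitions; the score 0.1 appears in both of them.
partitions = [
    [(0.1, 0), (0.4, 1), (0.4, 0)],
    [(0.1, 1), (0.8, 1), (0.9, 1)],
]
per_partition = aggregate_by_score(partitions)

# 2 counters in the first partition plus 3 in the second, even though
# there are only 4 distinct scores overall.
total_counters = sum(len(counters) for counters in per_partition)
```

    With mostly unique scores, that total approaches the number of records, which is where the cost shows up.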
    
    I think this would definitely be used to pick thresholds, and for that you 
don't need to download the curve, just find the optimal point. In that case I'd 
say you simply don't bin at all: binning makes the curve approximate and 
down-sampled, and there's probably not much value in an approximation there.
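
    As a sketch of picking a threshold without materializing the curve: the plain-Python function below walks the distinct scores once and keeps only the best point. Using F1 as the notion of "optimal" is my assumption here, and `best_threshold_by_f1` and the sample data are hypothetical, not anything in the PR.

```python
from collections import defaultdict


def best_threshold_by_f1(scored_labels):
    """Return (threshold, f1) maximizing F1, predicting positive when score >= threshold.

    Assumption: F1 is the criterion; any other per-threshold metric could be
    substituted in the loop.
    """
    # counts[score] = [number of positives, number of negatives] at that exact score
    counts = defaultdict(lambda: [0, 0])
    for score, label in scored_labels:
        counts[score][0 if label > 0.5 else 1] += 1

    total_pos = sum(c[0] for c in counts.values())
    best_threshold, best_f1 = None, -1.0
    tp = fp = 0
    # Walk thresholds from highest to lowest, accumulating counts as we go,
    # so no list of curve points is ever built.
    for score in sorted(counts, reverse=True):
        tp += counts[score][0]
        fp += counts[score][1]
        fn = total_pos - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_threshold, best_f1 = score, f1
    return best_threshold, best_f1


data = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.3, 0), (0.2, 0)]
threshold, f1 = best_threshold_by_f1(data)
```

    Nothing here is binned, so the chosen threshold is exact for the data it was given.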
    
    I'm not terribly wedded to the change. It helps in the niche use case where 
one does want to down-sample. It adds some complexity though, and complexity 
adds up.
    
    It's also possible to down-sample the final curve, later. That could just 
be a utility function somewhere instead of being injected here. Would that be 
better? I think you lose a bit of information that way but it's an 
approximation to begin with.
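
    For what the standalone utility might look like, here is a minimal sketch in plain Python. The name `downsample_curve` and the `max_points` parameter are hypothetical, and it assumes the curve has already been collected as an ordered list of `(x, y)` points. It keeps every k-th point plus the final one.

```python
import math


def downsample_curve(points, max_points):
    """Return at most `max_points` points from an ordered curve.

    Keeps the first point, evenly strided points, and always the last point.
    """
    if max_points < 2:
        raise ValueError("max_points must be at least 2")
    n = len(points)
    if n <= max_points:
        return list(points)
    # Stride chosen so the strided part has at most max_points - 1 entries,
    # leaving room for the final point.
    step = math.ceil((n - 1) / (max_points - 1))
    sampled = list(points[0:n - 1:step])
    sampled.append(points[-1])
    return sampled


# An 11-point toy curve reduced to 5 points.
curve = [(i / 10, (i / 10) ** 0.5) for i in range(11)]
small = downsample_curve(curve, 5)
```

    This drops the intermediate points outright rather than aggregating counts within a bin, which is the bit of information lost by doing it after the fact.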

