Re: [SPARK ML] Minhash integer overflow

2018-07-06 Thread Sean Owen
I think it probably still does its job; the hash value can just be negative. It is likely to be very slightly biased, though. Because the intent doesn't seem to be to allow the overflow, it's worth changing to use longs for the calculation. On Fri, Jul 6, 2018, 8:36 PM jiayuanm wrote: > Hi everyon
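The difference between the Int and Long calculation can be illustrated with a minimal Java sketch (Java and Scala share the same 32-bit Int semantics on the JVM). It assumes a hash of the form ((1 + elem) * a + b) % prime; that exact form, the prime constant, the class and method names, and the sample values are assumptions made for illustration and are not quoted from this thread.

```java
// Hypothetical sketch, not the Spark implementation.
class MinHashOverflowSketch {
    // Assumed prime modulus below 2^31 (illustrative value).
    static final int HASH_PRIME = 2038074743;

    // 32-bit arithmetic: (1 + elem) * a can wrap around silently,
    // so the remainder may come out negative.
    static int hashWithInt(int elem, int a, int b) {
        return ((1 + elem) * a + b) % HASH_PRIME;
    }

    // 64-bit arithmetic: 1L promotes the whole expression to long, so for
    // non-negative int inputs the intermediate value cannot overflow and
    // the result stays in [0, HASH_PRIME).
    static long hashWithLong(int elem, int a, int b) {
        return ((1L + elem) * a + b) % HASH_PRIME;
    }

    public static void main(String[] args) {
        // Hypothetical coefficients chosen so that the Int product wraps.
        int elem = 1, a = 1_500_000_000, b = 7;
        System.out.println("Int:  " + hashWithInt(elem, a, b));
        System.out.println("Long: " + hashWithLong(elem, a, b));
    }
}
```

With these sample values the Int version yields a negative hash while the Long version stays non-negative, which matches the observation above that the result is still usable as a hash but is no longer confined to the intended range.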

Re: [SPARK ML] Minhash integer overflow

2018-07-06 Thread jiayuanm
Sure. JIRA ticket is here: https://issues.apache.org/jira/browse/SPARK-24754. I'll create the PR.

Re: [SPARK ML] Minhash integer overflow

2018-07-06 Thread Kazuaki Ishizaki
Thank you for reporting this issue. I think this is a bug regarding integer overflow. IMHO, it would be good to compute hashes with Long. Would it be possible to create a JIRA entry? Do you want to submit a pull request, too? Regards, Kazuaki Ishizaki From: jiayuanm To: dev@spark.apa

[SPARK ML] Minhash integer overflow

2018-07-06 Thread jiayuanm
Hi everyone, I was playing around with the LSH/MinHash module from Spark ML. I noticed that the hash computation is done with Int (see https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala#L69). Since "a" and "b" are from a uniform distributio

code freeze and branch cut for Apache Spark 2.4

2018-07-06 Thread Reynold Xin
FYI 6 mo is coming up soon since the last release. We will cut the branch and code freeze on Aug 1st in order to get 2.4 out on time.

Opentrace in ASF projects

2018-07-06 Thread Steve Loughran
FYI, there's some initial exploration of what it would take to move the HDFS wire protocol from HTrace to OpenTrace for tracing, and to wire up the other stores too: https://issues.apache.org/jira/browse/HADOOP-15566 If anyone has any input/insight or code review capacity, it'd be welcome.