Re: groupBy gives non deterministic results

2014-09-10 Thread redocpot
Hi Xianjin, I checked user@spark.apache.org and found my post there: http://mail-archives.apache.org/mod_mbox/spark-user/201409.mbox/browser I am using Nabble to send this mail, which indicates that the mail will be sent from my email address to the u...@spark.incubator.apache.org mailing list.

Re: groupBy gives non deterministic results

2014-09-10 Thread redocpot
Ah, thank you. I did not notice that.

Re: groupBy gives non deterministic results

2014-09-10 Thread redocpot
Hi, I am using Spark 1.0.0. The bug is fixed in 1.0.1. Hao

Re: groupBy gives non deterministic results

2014-09-09 Thread redocpot
Thank you for your replies. More details here: the program is executed in local mode (single node), with default env params. The test code and the result are in this gist: https://gist.github.com/coderh/0147467f0b185462048c Here are the first 10 lines of the data: 3 fields each row, the delimiter i

Re: groupBy gives non deterministic results

2014-09-08 Thread redocpot
Update: Just tested with HashPartitioner(8) and counted rows on each partition: List((0,657824), (1,658549), (2,659199), (3,658684), (4,659394), *(5,657591*), (*6,658327*), (*7,658434*)), List((0,657824), (1,658549), (2,659199), (3,658684), (4,659394), *(5,657594)*, (6,658326), (*7,658434*)), List((0,65
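Not from the original post: a minimal sketch of how per-partition counts like those above can be produced with HashPartitioner(8) and mapPartitionsWithIndex. The toy data stands in for the real table.

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
    import org.apache.spark.SparkContext._

    val sc = new SparkContext(new SparkConf().setAppName("PartitionCounts").setMaster("local"))
    // Toy key-value pairs standing in for the real data set.
    val pairs = sc.parallelize(1 to 100000).map(i => (i.toString, i))
    val partitioned = pairs.partitionBy(new HashPartitioner(8))
    // Emit (partitionIndex, rowCount) for every partition, then collect.
    val counts = partitioned
      .mapPartitionsWithIndex((idx, iter) => Iterator((idx, iter.size)))
      .collect()
      .toList
    println(counts)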

groupBy gives non deterministic results

2014-09-08 Thread redocpot
Hi, I have a key-value RDD called rdd below. After a groupBy, I tried to count rows, but the result is not unique, somehow non-deterministic. Here is the test code:

val step1 = ligneReceipt_cleTable.persist
val step2 = step1.groupByKey
val s1size = step1.count
val s2size = step2.count
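A self-contained sketch of the same test, with hypothetical generated data standing in for ligneReceipt_cleTable; on an affected Spark version, the count after groupByKey can vary between runs.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._

    object GroupByCountTest {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("GroupByCountTest").setMaster("local"))
        // Hypothetical key-value data; the original used a real receipt table.
        val rdd = sc.parallelize(1 to 1000000).map(i => (i % 1000, i))
        val step1 = rdd.persist()
        val step2 = step1.groupByKey()
        println("before groupByKey: " + step1.count() + ", keys after: " + step2.count())
        sc.stop()
      }
    }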

Environment Variables question

2014-08-01 Thread redocpot
Hi, According to the configuration guide (http://spark.apache.org/docs/latest/configuration.html#environment-variables), "Certain Spark settings can be configured through environment variables, which are read from the conf/spark-env.sh script in the directory where Spark is installed (or conf/spa
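Not part of the original question, for completeness: most spark.* settings can also be set programmatically on a SparkConf instead of through conf/spark-env.sh. A minimal sketch, with a hypothetical app name:

    import org.apache.spark.{SparkConf, SparkContext}

    // Programmatic alternative to values that could otherwise come from
    // environment variables read by conf/spark-env.sh.
    val conf = new SparkConf()
      .setAppName("env-config-example")
      .setMaster("local[2]")
      .set("spark.executor.memory", "2g")
    val sc = new SparkContext(conf)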

set spark.local.dir on driver program doesn't take effect

2014-07-31 Thread redocpot
Hi, When running Spark on an EC2 cluster, I find that setting spark.local.dir in the driver program doesn't take effect. INFO:
- standalone mode
- cluster launched via python script along with spark
- instance type R3.large
- ebs attached (using persistent-hdfs)
- spark version: 1.0.0 prebuilt-hadoop1, sbt do
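A sketch of the driver-side setting being described (the path is hypothetical). Note that since Spark 1.0, on a standalone cluster the executors' scratch space comes from each worker's SPARK_LOCAL_DIRS environment variable (set in conf/spark-env.sh on the workers), which overrides a spark.local.dir set by the application, so a driver-side value only affects the driver's own scratch dirs.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("local-dir-test")
      .set("spark.local.dir", "/mnt/spark-scratch") // hypothetical path
    val sc = new SparkContext(conf) // executors may still use SPARK_LOCAL_DIRS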

Re: sbt directory missed

2014-07-28 Thread redocpot
Thank you for your reply. I need sbt for packaging my project and then submitting it. Could you tell me how to run a Spark project on the 1.0 AMI without sbt? I don't understand why 1.0 only contains the prebuilt packages. I don't think it makes sense, since sbt is essential. A user has to download sbt or

Re: sbt directory missed

2014-07-28 Thread redocpot
Update: Just checked the python launch script; when retrieving Spark, it refers to this script: https://github.com/mesos/spark-ec2/blob/v3/spark/init.sh where each version number is mapped to a tar file: 0.9.2) if [[ "$HADOOP_MAJOR_VERSION" == "1" ]]; then wget http://s3.am

sbt directory missed

2014-07-28 Thread redocpot
Hi, I have started an EC2 cluster using Spark by running the spark-ec2 script. Just a little confused: I cannot find the sbt/ directory under /spark. I have checked the Spark version; it's 1.0.0 (default). When I was working with 0.9.x, sbt/ was there. Was the script changed in 1.0.x? I cannot find any c

Re: implicit ALS dataSet

2014-06-23 Thread redocpot
Hi, The real-world dataset is a bit larger, so I tested on the MovieLens data set and found the same results:

alpha  lambda  rank  top1  top5  EPR_in   EPR_out
40     0.001   50    297   559   0.05855

Re: implicit ALS dataSet

2014-06-19 Thread redocpot
One thing that needs to be mentioned is that, in fact, the schema is (userId, itemId, nbPurchase), where nbPurchase is equivalent to a rating. I found that there are many one-timers, i.e. pairs whose nbPurchase = 1. The number of these pairs is about 85% of all positive observations. As the p
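Not from the thread: a minimal sketch of measuring the share of one-timers, runnable in spark-shell (where sc is predefined); the sample triples are a toy stand-in for the real purchase records.

    // Toy (userId, itemId, nbPurchase) triples.
    val purchases = sc.parallelize(Seq((1, 10, 1), (1, 11, 3), (2, 10, 1), (2, 12, 1)))
    val total = purchases.count()
    val oneTimers = purchases.filter { case (_, _, nb) => nb == 1 }.count()
    println("one-timer share: " + oneTimers.toDouble / total)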

Re: implicit ALS dataSet

2014-06-19 Thread redocpot
Hi, Recently, I have launched an implicit ALS test on a real-world data set. Initially, we have 2 data sets: one is the purchase record of the past 3 years (training set), and the other is the record of the 6 months just after those 3 years (test set). It's a database with 1060080 users and 23880 items.

Re: implicit ALS dataSet

2014-06-05 Thread redocpot
Thank you for your quick reply. As far as I know, the update does not require negative observations, because the update rule x_u = (Y^T C^u Y + λI)^-1 Y^T C^u p(u) can be simplified by taking advantage of its algebraic structure, so negative observations are not needed. This is what I think at the firs
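For reference, the simplification being alluded to, as given in Hu, Koren and Volinsky's implicit-feedback ALS paper (the one MLlib's implementation is based on):

\[
x_u = \left(Y^\top C^u Y + \lambda I\right)^{-1} Y^\top C^u\, p(u),
\qquad
Y^\top C^u Y = Y^\top Y + Y^\top \left(C^u - I\right) Y .
\]

Y^T Y is the same for every user and can be precomputed once, and C^u - I is nonzero only on the items user u actually observed, so the zero-preference (unobserved) pairs never have to be enumerated.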

implicit ALS dataSet

2014-06-05 Thread redocpot
Hi, According to the paper on which MLlib's ALS is based, the model should take all user-item preferences as input, including those which are not related to any input observation (zero preference). My question is: with all positive observations in hand (similar to an explicit feedback data set),
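Not from the original post: a minimal sketch of how MLlib's implicit ALS is typically called with positive observations only, runnable in spark-shell. The (userId, itemId, nbPurchase) triples are a hypothetical stand-in; rank, lambda and alpha echo the MovieLens test earlier in the thread. Zero-preference pairs are handled inside the algorithm and are not enumerated by the caller.

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // Hypothetical (userId, itemId, nbPurchase) triples.
    val purchases = sc.parallelize(Seq((1, 10, 1), (1, 11, 3), (2, 10, 2)))
    val ratings = purchases.map { case (u, i, nb) => Rating(u, i, nb.toDouble) }
    // trainImplicit(ratings, rank, iterations, lambda, alpha)
    val model = ALS.trainImplicit(ratings, 50, 10, 0.001, 40.0)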