Hi, Xianjin
I checked user@spark.apache.org, and found my post there:
http://mail-archives.apache.org/mod_mbox/spark-user/201409.mbox/browser
I am using Nabble to send this mail, and it indicates that the mail will be
sent from my email address to the u...@spark.incubator.apache.org mailing
list.
Ah, thank you. I did not notice that.
---
Hi,
I am using Spark 1.0.0. The bug is fixed in 1.0.1.
Hao
---
Thank you for your replies.
More details here:
The program is executed in local mode (single node). Default environment
parameters are used.
The test code and the result are in this gist:
https://gist.github.com/coderh/0147467f0b185462048c
Here are the first 10 lines of the data: 3 fields per row, the delimiter i
Update:
Just tested with HashPartitioner(8) and counted on each partition:
List((0,657824), (1,658549), (2,659199), (3,658684), (4,659394),
(5,657591), (6,658327), (7,658434)),
List((0,657824), (1,658549), (2,659199), (3,658684), (4,659394),
(5,657594), (6,658326), (7,658434)),
List((0,65
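For reference, a minimal sketch (spark-shell style, so sc is assumed to exist; the sample data is made up, the real test uses the data set from the gist above) of the kind of per-partition count check used here:

import org.apache.spark.SparkContext._   // pair-RDD implicits in Spark 1.x
import org.apache.spark.HashPartitioner

// Hypothetical key-value data standing in for the real data set.
val pairs = sc.parallelize(Seq(("k1", "v1"), ("k2", "v2"), ("k1", "v3")))

// Repartition with HashPartitioner(8), then count the records landing in each partition.
val perPartitionCounts = pairs
  .partitionBy(new HashPartitioner(8))
  .mapPartitionsWithIndex((idx, iter) => Iterator((idx, iter.size)))
  .collect()
  .toList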
Hi,
I have a key-value RDD (called ligneReceipt_cleTable in the test code below).
After a groupByKey, I tried to count the rows, but the result is not unique;
it is somehow non-deterministic.
Here is the test code:
val step1 = ligneReceipt_cleTable.persist()  // cache the (key, value) RDD
val step2 = step1.groupByKey()               // group the values by key
val s1size = step1.count()                   // number of (key, value) pairs
val s2size = step2.count()                   // number of distinct keys
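A small sketch (reusing step1 and step2 from above) of how one might repeat the counts to check whether they really vary between runs:

// Re-run both counts a few times; if everything were deterministic,
// every run should return the same pair of numbers.
val repeated = (1 to 5).map(_ => (step1.count(), step2.count()))
println(repeated.distinct)  // more than one distinct pair indicates non-deterministic counts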
Hi,
According to the configuration guide
(http://spark.apache.org/docs/latest/configuration.html#environment-variables),
"Certain Spark settings can be configured through environment variables,
which are read from the conf/spark-env.sh script in the directory where
Spark is installed (or conf/spa
Hi,
When running Spark on an EC2 cluster, I find that setting spark.local.dir in
the driver program doesn't take effect.
INFO:
- standalone mode
- cluster launched via the Python script shipped with Spark
- instance type: r3.large
- EBS attached (using persistent-hdfs)
- Spark version: 1.0.0 prebuilt-hadoop1, sbt do
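In case it helps to reproduce, here is a minimal sketch (the app name and the directory path are made up, not taken from the original setup) of how the driver-side setting is applied via SparkConf; my understanding is that in standalone mode a SPARK_LOCAL_DIRS value from the workers' conf/spark-env.sh can still take precedence over it:

import org.apache.spark.{SparkConf, SparkContext}

// Driver-side scratch-directory setting; illustrative only.
val conf = new SparkConf()
  .setAppName("local-dir-test")                 // hypothetical app name
  .set("spark.local.dir", "/vol/spark-local")   // hypothetical path on the attached EBS volume
val sc = new SparkContext(conf)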
Thank you for your reply.
I need sbt to package my project and then submit it.
Could you tell me how to run a Spark project on the 1.0 AMI without sbt?
I don't understand why 1.0 only contains the prebuilt packages. I don't think
it makes sense, since sbt is essential.
The user has to download sbt or
Update:
Just checked the Python launch script; when retrieving Spark, it refers to
this script:
https://github.com/mesos/spark-ec2/blob/v3/spark/init.sh
where each version number is mapped to a tar file:
0.9.2)
  if [[ "$HADOOP_MAJOR_VERSION" == "1" ]]; then
    wget http://s3.am
Hi,
I have started an EC2 cluster using Spark by running the spark-ec2 script.
I am a little confused: I cannot find the sbt/ directory under /spark.
I have checked the Spark version; it's 1.0.0 (the default). When I was working
with 0.9.x, sbt/ was there.
Has the script changed in 1.0.x? I cannot find any c
Hi,
The real-world data set is fairly large, so I tested on the MovieLens data
set and found the same results:
alpha   lambda   rank   top1   top5   EPR_in    EPR_out
40      0.001    50     297    559    0.05855
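For context, a minimal sketch (not the poster's actual code; the file path, input format and iteration count are assumptions) of running MLlib's implicit ALS with the parameters in the row above:

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// One Rating per observed (user, item, rating) triple; MovieLens-style "::"-separated input.
val ratings = sc.textFile("movielens/ratings.dat")   // made-up path
  .map(_.split("::"))
  .map(f => Rating(f(0).toInt, f(1).toInt, f(2).toDouble))

// rank = 50, iterations = 10 (a guess), lambda = 0.001, alpha = 40, matching the row above.
val model = ALS.trainImplicit(ratings, 50, 10, 0.001, 40.0)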
One thing that needs to be mentioned is that the schema is actually (userId,
itemId, nbPurchase), where nbPurchase plays the role of the rating. I found
that there are many one-timers, i.e. pairs with nbPurchase = 1; these pairs
make up about 85% of all positive observations.
As the p
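A small sketch (the input RDD and its field names are assumptions based on the schema described above) of turning such triples into MLlib Ratings and measuring the share of one-timers:

import org.apache.spark.mllib.recommendation.Rating

// Assumed input: (userId, itemId, nbPurchase) triples; the sample values are made up.
val purchases = sc.parallelize(Seq((1, 10, 1), (1, 11, 3), (2, 10, 1)))

val ratings = purchases.map { case (u, i, n) => Rating(u, i, n.toDouble) }

// Fraction of one-timers, i.e. pairs with nbPurchase == 1.
val oneTimerShare = purchases.filter(_._3 == 1).count().toDouble / purchases.count()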
Hi,
Recently, I launched an implicit ALS test on a real-world data set.
Initially, we have two data sets: one is the purchase record of the past 3
years (training set), and the other covers the 6 months just after those 3
years (test set).
It's a database with 1060080 users and 23880 items.
Thank you for your quick reply.
As far as I know, the update does not require negative observations, because
the update rule
X_u = (Y^T C^u Y + λI)^-1 Y^T C^u p(u)
can be simplified by taking advantage of its algebraic structure. This is what
I think at the firs
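For reference, the simplification being referred to is presumably the one from the Hu, Koren & Volinsky paper that MLlib's implicit ALS is based on:

Y^T C^u Y = Y^T Y + Y^T (C^u - I) Y

Y^T Y does not depend on u and can be precomputed once, and (C^u - I) has non-zero
(diagonal) entries only for the items user u actually interacted with, so the
zero-preference pairs never have to be materialized.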
Hi,
According to the paper on which MLlib's ALS is based, the model should take
all user-item preferences as input, including those that are not tied to any
observation (zero preference).
My question is:
With all positive observations in hand (similar to an explicit feedback data
set),