Yes, in addition, I think Xiangrui updated the examples anyhow to use a different form that does not rely on zip:
test.map(v => (model.predict(v.features), v.label))

It avoids evaluating test twice, and avoids the zip. Although I suppose you have to bear in mind it now calls predict() on each element, not the whole RDD.

On Tue, Jul 29, 2014 at 5:26 AM, Xiangrui Meng <men...@gmail.com> wrote:
> Are you using 1.0.0? There was a bug, which was fixed in 1.0.1 and
> master. If you don't want to switch to 1.0.1 or master, try to cache
> and count test first. -Xiangrui
>
> On Mon, Jul 28, 2014 at 6:07 PM, SK <skrishna...@gmail.com> wrote:
>> Hi,
>>
>> In order to evaluate the ML classification accuracy, I am zipping up the
>> prediction and test labels as follows and then comparing the pairs in
>> predictionAndLabel:
>>
>> val prediction = model.predict(test.map(_.features))
>> val predictionAndLabel = prediction.zip(test.map(_.label))
>>
>> However, I am finding that predictionAndLabel.count() has fewer elements
>> than test.count(). For example, my test vector has 43 elements, but
>> predictionAndLabel has only 38 pairs. I have tried other samples and
>> always get fewer elements after zipping.
>>
>> Does zipping the two vectors cause any compression? Or is this because
>> of the distributed nature of the algorithm? (I am running it in local
>> mode on a single machine.) In order to get the correct accuracy, I need
>> the above comparison to be done by a single node on the entire test data
>> (my data is quite small). How can I ensure that?
>>
>> thanks
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/evaluating-classification-accuracy-tp10822.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
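[For anyone landing on this thread later: the map-based form discussed above can be sketched in plain Scala, without a Spark cluster. This is a minimal illustration, not the MLlib API itself — the `LabeledPoint` case class and `predict` function below are toy stand-ins for the trained model and test RDD in the thread; on an actual RDD the same `map` expression applies unchanged.]

```scala
// Spark-free sketch of the per-element evaluation pattern from the thread.
// LabeledPoint and predict are hypothetical stand-ins for the MLlib model.
object AccuracySketch {
  case class LabeledPoint(label: Double, features: Array[Double])

  // Toy "model": predicts 1.0 when the first feature is positive.
  def predict(features: Array[Double]): Double =
    if (features(0) > 0) 1.0 else 0.0

  def main(args: Array[String]): Unit = {
    val test = Seq(
      LabeledPoint(1.0, Array(2.0)),
      LabeledPoint(0.0, Array(-1.0)),
      LabeledPoint(1.0, Array(-3.0))
    )

    // One pass: pair each prediction with its label directly,
    // instead of zipping two separately derived collections.
    val predictionAndLabel = test.map(v => (predict(v.features), v.label))

    // Every element yields exactly one pair, so no rows can be lost.
    val accuracy =
      predictionAndLabel.count { case (p, l) => p == l }.toDouble / test.size

    println(s"pairs=${predictionAndLabel.size} accuracy=$accuracy")
  }
}
```

Because each element is mapped to exactly one (prediction, label) pair, `predictionAndLabel` always has the same number of elements as `test`, which is the property the zip-based version failed to guarantee on the buggy 1.0.0 release.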