Are you using 1.0.0? There was a bug, which was fixed in 1.0.1 and master. If you don't want to switch to 1.0.1 or master, try caching and counting `test` first. -Xiangrui
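For reference, the accuracy computation itself is just a count of matching pairs over the total. A minimal sketch on plain Scala collections (hypothetical sample data standing in for the real prediction and label RDDs; with a stable, equal-length zip the RDD version behaves analogously):

```scala
// Hypothetical predictions and ground-truth labels (stand-ins for the RDDs).
val predictions = Seq(1.0, 0.0, 1.0, 1.0, 0.0)
val labels      = Seq(1.0, 0.0, 0.0, 1.0, 0.0)

// Zipping equal-length, aligned sequences loses nothing: one pair per element.
val predictionAndLabel = predictions.zip(labels)
assert(predictionAndLabel.size == labels.size)

// Accuracy = fraction of pairs where prediction == label.
val accuracy =
  predictionAndLabel.count { case (p, l) => p == l }.toDouble / labels.size
```

On the Spark side, the suggested workaround amounts to `test.cache(); test.count()` before calling `model.predict`, so that both sides of the zip are computed from the same materialized RDD.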
On Mon, Jul 28, 2014 at 6:07 PM, SK <skrishna...@gmail.com> wrote:
> Hi,
>
> In order to evaluate the ML classification accuracy, I am zipping up the
> predictions and test labels as follows and then comparing the pairs in
> predictionAndLabel:
>
>     val prediction = model.predict(test.map(_.features))
>     val predictionAndLabel = prediction.zip(test.map(_.label))
>
> However, I am finding that predictionAndLabel.count() has fewer elements
> than test.count(). For example, my test vector has 43 elements, but
> predictionAndLabel has only 38 pairs. I have tried other samples and
> always get fewer elements after zipping.
>
> Does zipping the two RDDs cause any compression, or is this because of
> the distributed nature of the algorithm? (I am running it in local mode
> on a single machine.) In order to get the correct accuracy, I need the
> above comparison to be done by a single node on the entire test data (my
> data is quite small). How can I ensure that?
>
> thanks
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/evaluating-classification-accuracy-tp10822.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.