Yes, in addition, I think Xiangrui updated the examples anyhow to use
a different form that does not rely on zip:

test.map(v => (model.predict(v.features), v.label))

It avoids evaluating test twice and avoids the zip, though bear in mind
it now calls predict() on each element rather than on the whole RDD.
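To make the per-element pattern concrete, here is a sketch of how the full accuracy computation might look with it. It assumes a trained MLlib classification model named `model` and an `RDD[LabeledPoint]` named `test`; those identifiers, and the surrounding setup, are illustrative rather than taken from the thread.

    // Pair each prediction with its true label in one pass over test;
    // no zip, and test is only evaluated once per action.
    val predictionAndLabel = test.map(v => (model.predict(v.features), v.label))

    // Accuracy = fraction of pairs where prediction matches label.
    val correct = predictionAndLabel.filter { case (p, l) => p == l }.count()
    val accuracy = correct.toDouble / test.count()

Since predict() here is the per-Vector overload, this works the same in local mode and on a cluster, and predictionAndLabel.count() is guaranteed to equal test.count().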

On Tue, Jul 29, 2014 at 5:26 AM, Xiangrui Meng <men...@gmail.com> wrote:
> Are you using 1.0.0? There was a bug, which was fixed in 1.0.1 and
> master. If you don't want to switch to 1.0.1 or master, try to cache
> and count test first. -Xiangrui
>
> On Mon, Jul 28, 2014 at 6:07 PM, SK <skrishna...@gmail.com> wrote:
>> Hi,
>>
>> In order to evaluate the ML classification accuracy, I am zipping up the
>> prediction and test labels as follows and then comparing the pairs in
>> predictionAndLabel:
>>
>> val prediction = model.predict(test.map(_.features))
>> val predictionAndLabel = prediction.zip(test.map(_.label))
>>
>>
>> However, I am finding that predictionAndLabel.count() is smaller than
>> test.count(). For example, my test set has 43 elements, but
>> predictionAndLabel has only 38 pairs. I have tried other samples and
>> always get fewer elements after zipping.
>>
>> Does zipping the two vectors cause any compression? Or is this because
>> of the distributed nature of the algorithm? (I am running it in local
>> mode on a single machine.) To get the correct accuracy, I need the
>> above comparison to be done on the entire test data by a single node
>> (my data is quite small). How can I ensure that?
>>
>> thanks
>>
>> --
>> View this message in context: 
>> http://apache-spark-user-list.1001560.n3.nabble.com/evaluating-classification-accuracy-tp10822.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
