Hi Sean, I think the simplest way is to have
*org.apache.spark.ml.linalg.SparseVector#asBreeze* return a
*breeze.linalg.HashVector*, and to cache that vector in a lazy val, since
constructing a *breeze.linalg.HashVector* has some extra up-front cost.

The code would look something like this:

class SparseVector @Since("2.0.0") (
    override val size: Int,
    @Since("2.0.0") val indices: Array[Int],
    @Since("2.0.0") val values: Array[Double]) extends Vector {

  // Cache the converted vector so the one-time O(nnz) construction cost
  // is paid at most once, and only if asBreeze is actually called.
  private lazy val breezeVector: BV[Double] = {
    val hv = breeze.linalg.HashVector.zeros[Double](size)
    var i = 0
    while (i < indices.length) {
      hv(indices(i)) = values(i)
      i += 1
    }
    hv
  }

  .....

  private[spark] override def asBreeze: BV[Double] = breezeVector

  ....

}
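
To see why this should help, here is a small self-contained sketch (plain Scala, no Breeze or Spark dependency; the object and method names are made up for illustration) of the two lookup strategies: a sparse lookup binary-searches the sorted index array, which is essentially what breeze.linalg.SparseVector#apply does, while a hash-backed lookup is a single expected-O(1) map probe.

```scala
// Hypothetical sketch of the two lookup strategies; not Spark/Breeze code.
object LookupSketch {
  val indices = Array(1, 4, 7, 19)          // sorted non-zero indices
  val values  = Array(0.5, 1.5, 2.5, 3.5)   // corresponding values

  // Roughly what a sparse vector's apply does: O(log nnz) per access.
  def sparseApply(i: Int): Double = {
    val pos = java.util.Arrays.binarySearch(indices, i)
    if (pos >= 0) values(pos) else 0.0
  }

  // Roughly what a hash-backed lookup does: expected O(1) per access.
  val asMap: Map[Int, Double] = indices.zip(values).toMap
  def hashApply(i: Int): Double = asMap.getOrElse(i, 0.0)
}
```

With the many repeated random accesses a tree ensemble makes per prediction, the per-access difference adds up, which matches the 2~3x speedup Vincent saw with dense vectors.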



I'm not sure this is the best approach, so I'll file an issue for further
discussion.

Thanks,
Huafeng


Sean Owen <sro...@gmail.com> wrote on Mon, Jul 2, 2018 at 11:38 AM:

> This could be a good optimization. But can it be done without changing
> any APIs or slowing anything else down? if so this could be worth a
> pull request.
> On Sun, Jul 1, 2018 at 9:21 PM Vincent Wang <fvunic...@gmail.com> wrote:
> >
> > Hi there,
> >
> > I'm using GBTClassifier to do some classification jobs and found that
> the performance of the scoring stage is not quite satisfying. The trained
> model has about 160 trees, and the input feature vector is sparse with a
> size of about 20+.
> >
> > After some digging, I found the model repeatedly and randomly accesses
> features in the SparseVector when predicting an input vector, which
> eventually calls breeze.linalg.SparseVector#apply. That function uses a
> binary search to locate the corresponding index, so the complexity is
> O(log numNonZero).
> >
> > Then I tried converting my feature vectors to dense vectors before
> inference, and the result shows the inference stage speeds up by about
> 2~3 times. (Random access in a DenseVector is O(1).)
> >
> > So my question is: why not use breeze.linalg.HashVector when randomly
> accessing values in a SparseVector? According to Breeze's documentation
> its access complexity is O(1), much better than SparseVector in such a
> case.
> >
> > Thanks,
> > Vincent
>
