This could be a good optimization. But can it be done without changing any APIs or slowing anything else down? If so, this could be worth a pull request.

On Sun, Jul 1, 2018 at 9:21 PM Vincent Wang <fvunic...@gmail.com> wrote:
>
> Hi there,
>
> I'm using GBTClassifier to do some classification jobs and find the performance
> of the scoring stage is not quite satisfying. The trained model has about 160
> trees and the input feature vectors are sparse, with a size of about 20+.
>
> After some digging, I found that the model repeatedly and randomly accesses
> features in the SparseVector when predicting an input vector, which eventually
> calls breeze.linalg.SparseVector#apply. That function uses a binary search to
> locate the corresponding index, so the complexity is O(log numNonZero).
>
> Then I tried converting my feature vectors to dense vectors before inference,
> and the result shows the inference stage speeds up by about 2~3x.
> (Random access in a DenseVector is O(1).)
>
> So my question is: why not use breeze.linalg.HashVector when randomly accessing
> values in a SparseVector, since the complexity is O(1) according to Breeze's
> documentation, which is much better than SparseVector in this case?
>
> Thanks,
> Vincent
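For anyone hitting the same slowdown, the workaround Vincent describes (densifying the feature column before scoring) needs no API change on the Spark side. A minimal sketch, assuming a trained GBTClassificationModel named `model` and a DataFrame `df` with a sparse "features" column (both names are hypothetical placeholders):

    import org.apache.spark.ml.linalg.Vector
    import org.apache.spark.sql.functions.{col, udf}

    // Convert each (possibly sparse) feature vector to its dense form so the
    // per-tree feature lookups during prediction are O(1) array indexing
    // instead of O(log numNonZero) binary search.
    val toDense = udf { v: Vector => (v.toDense): Vector }
    val denseDf = df.withColumn("features", toDense(col("features")))

    // Score with the existing model; only the input representation changes.
    val scored = model.transform(denseDf)

This trades extra memory (20+ doubles per row here, which is cheap) for faster lookups; whether it pays off for much wider, very sparse feature vectors would need measuring.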