Re: sampling in spark

Davies Liu Tue, 28 Oct 2014 00:44:38 -0700

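        # Build the cumulative sums of p (this assumes p and y are local lists).
        _cumm = [p[0]]
        for i in range(1, len(p)):
            _cumm.append(_cumm[-1] + p[i])
        # Draw k row indices; the set drops duplicates, so fewer than k may remain.
        index = set([bisect(_cumm, random.random()) for i in range(k)])

        # Keep the rows of X and the entries of y whose position was sampled.
        chosed_x = X.zipWithIndex().filter(lambda vi: vi[1] in index).map(lambda vi: vi[0])
        chosed_y = [v for i, v in enumerate(y) if i in index]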
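
For anyone who wants to try the idea without a cluster, here is a minimal
local sketch of the same cumulative-sum sampling. It is an illustration and
not code from this thread: the names sample_indices, select_rows, X_local and
y_local are hypothetical, and it assumes p, X and y all fit in memory as
plain Python lists. It scales the uniform draw by the total of p, so p does
not have to sum to exactly 1.

```python
import random
from bisect import bisect
from itertools import accumulate


def sample_indices(p, k, rng=random):
    """Draw k indices (with replacement) with probability proportional to p."""
    cumm = list(accumulate(p))          # running totals: the unnormalised CDF
    total = cumm[-1]
    last = len(p) - 1
    # Scale the uniform draw by the total so p need not sum exactly to 1,
    # and clamp to the last index to guard against floating-point round-off.
    return [min(bisect(cumm, rng.random() * total), last) for _ in range(k)]


def select_rows(rows, y, index):
    """Keep the entries of rows and y whose position is in index."""
    chosen = set(index)                 # duplicates collapse, as in the reply
    chosen_x = [v for i, v in enumerate(rows) if i in chosen]
    chosen_y = [v for i, v in enumerate(y) if i in chosen]
    return chosen_x, chosen_y


rng = random.Random(0)                  # seeded only to make the run repeatable
p = [0.1, 0.2, 0.3, 0.4]
X_local = [[1, 2], [3, 4], [5, 6], [7, 8]]
y_local = [10, 20, 30, 40]
idx = sample_indices(p, 3, rng)
print(select_rows(X_local, y_local, idx))
```

Because the indices are collapsed into a set before filtering, the result can
hold fewer than k rows when the same index is drawn more than once; keep the
list of indices instead if repeated rows are wanted.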


On Tue, Oct 28, 2014 at 12:26 AM, Chengi Liu <[email protected]> wrote:
> Oops, the reference for the above code:
> http://stackoverflow.com/questions/26583462/selecting-corresponding-k-rows-from-matrix-and-vector/26583945#26583945
>
> On Tue, Oct 28, 2014 at 12:26 AM, Chengi Liu <[email protected]> wrote:
>>
>> Hi,
>>   I have three RDDs: X, y and p.
>> X is a matrix RDD (m x n), y is an (m x 1) vector,
>> and p is an (m x 1) probability vector.
>> I am trying to sample k rows from X, and the corresponding entries in y,
>> according to the probability vector p.
>> Here is the Python implementation:
>>
>> import random
>> from bisect import bisect
>> from operator import itemgetter
>>
>> def sample(population, k, prob):
>>
>>     def cdf(population, k, prob):
>>         population = map(itemgetter(1), sorted(zip(prob, population)))
>>         cumm = [prob[0]]
>>         for i in range(1, len(prob)):
>>             cumm.append(cumm[-1] + prob[i])
>>         return [population[bisect(cumm, random.random())] for i in range(k)]
>>
>>     return cdf(population, k, prob)
>
>

