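# Build the cumulative distribution from p (indexed here as a local list).
_cumm = [p[0]]
for i in range(1, len(p)):
    _cumm.append(_cumm[-1] + p[i])
# Draw k indices by inverse-CDF lookup; the set drops repeated draws,
# so fewer than k distinct rows may be selected.
index = set([bisect(_cumm, random.random()) for i in range(k)])
# Keep only the rows of X whose position is in the sampled index set.
chosed_x = X.zipWithIndex().filter(lambda vi: vi[1] in index).map(lambda vi: vi[0])
chosed_y = [v for i, v in enumerate(y) if i in index]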
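
To try the idea without a cluster, here is a minimal local sketch. It assumes p, X and y are plain Python lists rather than RDDs; the helper name sample_indices and the toy data are made up for illustration and are not part of any Spark API. It scales each draw by the total of p and clamps the result, so floating-point rounding in the running sum should not produce an out-of-range index.

```python
import random
from bisect import bisect
from itertools import accumulate


def sample_indices(p, k, rng=random):
    """Draw k indices by inverse-CDF lookup and return the distinct ones.

    Index i is chosen on each draw with probability p[i] / sum(p).
    """
    cumm = list(accumulate(p))   # running sum of the weights
    total = cumm[-1]
    # Scale the uniform draw by the total so p need not sum to exactly 1.
    draws = (bisect(cumm, rng.random() * total) for _ in range(k))
    # Clamp as a guard against rounding at the upper edge.
    return set(min(i, len(p) - 1) for i in draws)


# Toy stand-ins for the matrix X, the vector y and the probabilities p.
X = [[1, 2], [3, 4], [5, 6], [7, 8]]
y = [10, 20, 30, 40]
p = [0.0, 0.5, 0.5, 0.0]

index = sample_indices(p, 3, random.Random(0))
chosen_x = [row for i, row in enumerate(X) if i in index]
chosen_y = [v for i, v in enumerate(y) if i in index]
```

With X as an RDD, the same index set could presumably be handed to the zipWithIndex filter above instead of the list comprehension.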
On Tue, Oct 28, 2014 at 12:26 AM, Chengi Liu <[email protected]> wrote:
> Oops, the reference for the above code:
> http://stackoverflow.com/questions/26583462/selecting-corresponding-k-rows-from-matrix-and-vector/26583945#26583945
>
> On Tue, Oct 28, 2014 at 12:26 AM, Chengi Liu <[email protected]>
> wrote:
>>
>> Hi,
>> I have three RDDs: X, y and p.
>> X is a matrix RDD (m x n), y is an (m x 1) vector,
>> and p is an (m x 1) probability vector.
>> Now, I am trying to sample k rows from X and the corresponding entries in y
>> based on the probability vector p.
>> Here is the Python implementation:
>>
>> import random
>> from bisect import bisect
>> from operator import itemgetter
>>
>> def sample(population, k, prob):
>>
>>     def cdf(population, k, prob):
>>         population = list(map(itemgetter(1), sorted(zip(prob, population))))
>>         # Sort the probabilities too, so they stay aligned with population.
>>         prob = sorted(prob)
>>         cumm = [prob[0]]
>>         for i in range(1, len(prob)):
>>             cumm.append(cumm[-1] + prob[i])
>>         return [population[bisect(cumm, random.random())] for i in range(k)]
>>
>>     return cdf(population, k, prob)
>
>