Hi,
I have three RDDs: X, y and p.
X is a matrix RDD (m x n), y is an (m x 1) vector,
and p is an (m x 1) probability vector.
Now, I am trying to sample k rows from X, along with the corresponding
entries in y, based on the probability vector p.
Here is the Python implementation:
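import random
from bisect import bisect
from operator import itemgetter

def sample(population, k, prob):
    def cdf(population, k, prob):
        # Sort items and probabilities together so the CDF matches the item order.
        pairs = sorted(zip(prob, population), key=itemgetter(0))
        population = list(map(itemgetter(1), pairs))
        prob = list(map(itemgetter(0), pairs))
        cumm = [prob[0]]
        for i in range(1, len(prob)):
            cumm.append(cumm[-1] + prob[i])
        # Draw k items (with replacement) by inverting the CDF.
        return [population[bisect(cumm, random.random())] for i in range(k)]
    return cdf(population, k, prob)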
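
To keep each sampled row of X paired with its entry in y, one option is to sample row indices rather than the rows themselves. The following is only a minimal local sketch in plain Python, not a Spark implementation: it assumes X, y and p fit in ordinary lists, and the name sample_rows and its rng parameter are hypothetical. It draws with replacement, like the code above.

```python
import random
from bisect import bisect
from itertools import accumulate

def sample_rows(X, y, p, k, rng=random):
    """Draw k (row, label) pairs with replacement, weighted by p."""
    cumm = list(accumulate(p))  # running sum of the probabilities
    total = cumm[-1]
    picks = []
    for _ in range(k):
        # Scale the draw by the total so p need not sum exactly to 1.
        i = bisect(cumm, rng.random() * total)
        i = min(i, len(cumm) - 1)  # guard against floating-point edge cases
        picks.append((X[i], y[i]))
    return picks

# Toy data (made up for illustration): all probability mass is on the second row.
X = [[1, 2], [3, 4], [5, 6]]
y = [0, 1, 0]
p = [0.0, 1.0, 0.0]
print(sample_rows(X, y, p, k=2))  # prints [([3, 4], 1), ([3, 4], 1)]
```

Sorting by probability is not needed here, because the cumulative sums are built in the same order as the rows they index.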