As MLlib doesn't have nearest-neighbors functionality, I'm trying to use Annoy <https://github.com/spotify/annoy> for approximate nearest neighbors. I tried broadcasting the Annoy object and passing it to the workers; however, it does not work as expected.
Below is complete code for reproducibility. The problem is highlighted in the difference seen when using Annoy with vs without Spark.

    from annoy import AnnoyIndex
    import random
    random.seed(42)
    f = 40
    t = AnnoyIndex(f)  # Length of item vector that will be indexed
    allvectors = []
    for i in xrange(20):
        v = [random.gauss(0, 1) for z in xrange(f)]
        t.add_item(i, v)
        allvectors.append((i, v))
    t.build(10)  # 10 trees

    # Use Annoy with Spark
    sparkvectors = sc.parallelize(allvectors)
    bct = sc.broadcast(t)
    x = sparkvectors.map(lambda x: bct.value.get_nns_by_vector(vector=x[1], n=5))
    print "Five closest neighbors for first vector with Spark:",
    print x.first()

    # Use Annoy without Spark
    print "Five closest neighbors for first vector without Spark:",
    print(t.get_nns_by_vector(vector=allvectors[0][1], n=5))

Output seen:

    Five closest neighbors for first vector with Spark: None
    Five closest neighbors for first vector without Spark: [0, 13, 12, 6, 4]
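For comparison, here is a rough sketch of one workaround that might be worth trying. It assumes (this is not confirmed by the output above) that the index contents do not survive the serialization Spark applies to broadcast variables. Instead of broadcasting the object, the index is saved to a file and each partition loads its own copy. The helper names `load_annoy_index` and `find_neighbors`, the `loader` parameter, and the path `/tmp/index.ann` are made up for illustration; `save`, `load`, `sc.addFile`, and `SparkFiles.get` are existing Annoy and PySpark calls.

```python
def load_annoy_index(path, dim):
    """Load a saved Annoy index from disk (hypothetical helper, runs on the worker)."""
    from annoy import AnnoyIndex  # imported lazily so it resolves on each worker
    index = AnnoyIndex(dim, "angular")  # assumes the index was built with the angular metric
    index.load(path)
    return index


def find_neighbors(rows, path, dim, n=5, loader=load_annoy_index):
    """Yield (item_id, neighbor_ids) for every (item_id, vector) row in one partition."""
    index = loader(path, dim)  # loaded once per partition, not once per row
    for item_id, vector in rows:
        yield item_id, index.get_nns_by_vector(vector, n)


# Hypothetical driver-side wiring, assuming `t`, `f`, `sc` and `sparkvectors`
# from the code above:
#   t.save("/tmp/index.ann")
#   sc.addFile("/tmp/index.ann")
#   from pyspark import SparkFiles
#   x = sparkvectors.mapPartitions(
#       lambda rows: find_neighbors(rows, SparkFiles.get("index.ann"), f))
```

Using `mapPartitions` rather than `map` is a deliberate choice in this sketch: the index file is opened once per partition instead of once per vector.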