Re: Query a Dataframe in rdd.map()

2015-05-21 Thread Ram Sriharsha
Never mind... I didn't realize you were referring to the first table as df. So you want to do a join between the first table and an RDD? The right way to do it within the data frame construct is to think of it as a join: map the second RDD to a data frame and do an inner join on ip. On Thu, May 21
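A rough sketch of the join described in this reply, assuming `spark` is a SparkSession (or an older SQLContext), `df` is the reference table with columns `ip` and `freq`, and `rdd` holds bare IP strings. The function name `join_on_ip` is made up for illustration and is not from the thread:

```python
def join_on_ip(spark, df, rdd):
    """Turn an RDD of IP strings into a one-column DataFrame and
    inner-join it with the reference table on the ip column."""
    # Wrap each IP in a 1-tuple so it becomes a row with a single "ip" column.
    ip_df = spark.createDataFrame(rdd.map(lambda ip: (ip,)), ["ip"])
    # Inner join keeps only the IPs that appear in the reference table.
    return ip_df.join(df, on="ip", how="inner")
```

IPs that are missing from the reference table drop out of an inner join; a left join would keep them with a null freq instead.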

Re: Query a Dataframe in rdd.map()

2015-05-21 Thread Ram Sriharsha
Your original code snippet seems incomplete, and there isn't enough information to figure out what problem you actually ran into. In your original snippet there is an rdd variable, which is well defined, and a df variable that is not defined in the code you sent. One way to make thi

Re: Query a Dataframe in rdd.map()

2015-05-21 Thread ping yan
Thanks. I suspected that, but figured that df query inside a map sounds so intuitive that I don't just want to give up. I've tried join and even better with a DStream.transform() and it works! freqs = testips.transform(lambda rdd: rdd.join(kvrdd).map(lambda (x,y): y[1])) Thank you for the help!
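The `lambda (x,y): ...` form in the one-liner above relies on tuple parameter unpacking, which Python 3 removed (PEP 3113). A sketch of the same join without it, assuming `testips` is a DStream of `(ip, value)` pairs and `kvrdd` is an RDD of `(ip, freq)` pairs (the message does not spell out either shape); the helper names are illustrative only:

```python
def extract_freq(joined):
    """Pull freq out of one joined record of the form (ip, (value, freq))."""
    _ip, (_value, freq) = joined
    return freq


def freqs_from_stream(testips, kvrdd):
    """Join each micro-batch of the stream against the static frequency RDD."""
    return testips.transform(lambda rdd: rdd.join(kvrdd).map(extract_freq))
```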

Re: Query a Dataframe in rdd.map()

2015-05-21 Thread Holden Karau
So DataFrames, like RDDs, can only be accessed from the driver. If your IP Frequency table is small enough you could collect it and distribute it as a hashmap with broadcast or you could also join your rdd with the ip frequency table. Hope that helps :) On Thursday, May 21, 2015, ping yan wrote:
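A minimal sketch of the broadcast option mentioned above, assuming the frequency table is small enough to collect on the driver, that `df` has exactly two columns (`ip`, `freq`), and that `rdd` holds bare IP strings. The helper names and the default of 0 for unknown IPs are assumptions for illustration, not something from the thread:

```python
def build_freq_map(rows):
    """Turn (ip, freq) rows, e.g. the result of df.collect(), into a dict."""
    return {ip: freq for ip, freq in rows}


def lookup_freq(ip, freq_map, default=0):
    """Return the frequency for ip, or default if it is not in the table."""
    return freq_map.get(ip, default)


def freqs_with_broadcast(sc, df, rdd):
    """Ship the collected table to the executors once, then look up in map()."""
    bc = sc.broadcast(build_freq_map(df.collect()))
    # Only the broadcast handle is captured by the closure, not the DataFrame.
    return rdd.map(lambda ip: (ip, lookup_freq(ip, bc.value)))
```

The closure passed to `map()` touches only a plain dict, so nothing inside it needs the driver-side DataFrame.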

Query a Dataframe in rdd.map()

2015-05-21 Thread ping yan
I have a dataframe as a reference table for IP frequencies, e.g.:

ip               freq
10.226.93.67     1
10.226.93.69     1
161.168.251.101  4
10.236.70.2      1
161.168.251.105  14

All I need is to query the df in a map.

rdd = sc.parallelize(['208.51.22.18', '31.207.6.17