Never mind... I didn't realize you were referring to the first table as df.
So you want to do a join between the first table and an RDD?
The right way to do it within the DataFrame construct is to think of it as
a join: map the second RDD to a DataFrame and do an inner join on ip.
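A minimal sketch of that join in PySpark, assuming `session` is a SparkSession (or SQLContext), `df` has the columns `ip` and `freq` from the table in the original question, and `ip_rdd` is an RDD of plain IP strings. The function names are made up for illustration.

```python
def ip_to_row(ip):
    """Wrap an IP string in a 1-tuple so it becomes a one-column row."""
    return (ip,)


def freqs_by_join(session, df, ip_rdd):
    """Inner-join an RDD of IP strings against the reference DataFrame.

    Assumes `session` exposes createDataFrame and `df` has an `ip` column.
    """
    # Turn the RDD of strings into a single-column DataFrame named "ip".
    ips_df = session.createDataFrame(ip_rdd.map(ip_to_row), ["ip"])
    # IPs that are missing from df are dropped by the inner join.
    return ips_df.join(df, on="ip", how="inner")
```

The result is itself a DataFrame, so it can be filtered or collected like any other.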
On Thu, May 21
Your original code snippet seems incomplete, and there isn't enough
information to figure out what problem you actually ran into.
In your original code snippet there is an rdd variable that is well
defined and a df variable that is not defined in the snippet of code you
sent.
one way to make thi
Thanks. I suspected that, but figured that a df query inside a map sounds so
intuitive that I didn't want to just give up.
I've tried a join, and even better, a join inside DStream.transform(), and it works!
freqs = testips.transform(lambda rdd: rdd.join(kvrdd).map(lambda (x,y): y[1]))
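The tuple-unpacking lambda above only parses on Python 2. Here is a sketch of the same transform that also runs on Python 3, assuming `testips` is a DStream of (ip, value) pairs and `kvrdd` is an RDD of (ip, freq) pairs. The helper names are made up for illustration.

```python
def pick_freq(pair):
    """rdd.join(kvrdd) yields (ip, (left_value, freq)); return freq."""
    return pair[1][1]


def lookup_freqs(testips, kvrdd):
    """Join each batch of the DStream against the (ip, freq) RDD."""
    return testips.transform(lambda rdd: rdd.join(kvrdd).map(pick_freq))
```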
Thank you for the help!
So DataFrames, like RDDs, can only be accessed from the driver. If your IP
frequency table is small enough, you could collect it and distribute it as a
hashmap with broadcast, or you could join your RDD with the IP
frequency table. Hope that helps :)
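A rough sketch of the broadcast option, assuming `sc` is the SparkContext, `df` has columns `ip` and `freq`, and `ip_rdd` is an RDD of IP strings. The function names and the default of 0 for unknown IPs are assumptions, not something from the thread.

```python
def freq_table(rows):
    """Build an ip -> freq dict from collected rows (Row or dict-like)."""
    return {row["ip"]: row["freq"] for row in rows}


def freqs_by_broadcast(sc, df, ip_rdd, default=0):
    """Look up each IP in a broadcast copy of the small reference table."""
    # collect() pulls the whole table to the driver, so it must be small.
    table = sc.broadcast(freq_table(df.collect()))
    # Executors read the broadcast dict; df itself is never touched in map.
    return ip_rdd.map(lambda ip: (ip, table.value.get(ip, default)))
```

This avoids a shuffle, at the cost of holding the full table in memory on every executor.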
On Thursday, May 21, 2015, ping yan wrote:
I have a dataframe as a reference table for IP frequencies.
e.g.,
ip freq
10.226.93.67 1
10.226.93.69 1
161.168.251.101 4
10.236.70.2 1
161.168.251.105 14
All I need is to query the df in a map.
rdd = sc.parallelize(['208.51.22.18', '31.207.6.17