Hello All,

I am new to Spark and Python and have a question; please advise.

I have a CSV file with two columns, "user_id" and "status".
I read it into an RDD and removed the header row. Then I split each record
on "," (comma) and generated a pair RDD, on which I call groupByKey. When I
try to gather only the values from the grouped RDD into a list, I get
exceptions. How can I get just the values from the grouped RDD and store
them? I am trying to extract them using x[1]. Here is my PySpark code:

data = sc.textFile('file:///home/cloudera/LDA-Model/Pyspark/test1.csv')
header = data.first()  # extract header
data = data.filter(lambda x: x != header)  # filter out header
pairs = data.map(lambda x: (x.split(",")[0], x))  # generate pair rdd (key, value)
grouped = pairs.groupByKey()  # group values by key
grouped_val = grouped.map(lambda x: list(x[1])).collect()
print(grouped_val)
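
To show the result I am after, here is a minimal sketch of the same steps in plain Python, without Spark. The sample rows and the helper name group_statuses are made up for illustration and are not from my real file. I assume that on the grouped RDD something like mapValues(list) would give the equivalent shape, but I have not confirmed that against my data.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for test1.csv (header row first).
lines = ["user_id,status", "u1,active", "u2,inactive", "u1,blocked"]


def group_statuses(rows):
    """Group the status column by user_id, skipping the header row."""
    header = rows[0]
    grouped = defaultdict(list)
    for row in rows:
        if row == header:
            continue  # same effect as the filter on the RDD
        user_id, status = row.split(",")
        grouped[user_id].append(status)  # key -> list of values
    return dict(grouped)


grouped = group_statuses(lines)
# Keep only the value lists, one list per user_id.
grouped_val = list(grouped.values())
print(grouped_val)
```

Each inner list holds the status values for one user_id, which is the structure I want to get back from the grouped RDD.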



Thanks in Advance,
Sincerely,
Abhishek
