Hello All,
I am new to Spark and Python and have a question. I have a CSV file with two columns, "user_id" and "status". I read it into an RDD and removed the header line. Then I split each record on "," (comma) to generate a pair RDD, and I call groupByKey on that RDD. When I try to gather only the values from the grouped RDD into a list (using x[1]), I get exceptions. How can I extract just the values from the grouped RDD and store them?

Here is my PySpark code:

    data = sc.textFile('file:///home/cloudera/LDA-Model/Pyspark/test1.csv')
    header = data.first()  # extract the header line
    data = data.filter(lambda x: x != header)  # filter out the header
    pairs = data.map(lambda x: (x.split(",")[0], x))  # pair RDD: (user_id, full line)
    grouped = pairs.groupByKey()  # group the values by key
    grouped_val = grouped.map(lambda x: list(x[1])).collect()  # keep only the values
    print(grouped_val)

Thanks in advance,
Sincerely,
Abhishek
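
For illustration, the intended result can be sketched in plain Python, without Spark, so that it runs anywhere. This is only a sketch under assumptions: the sample rows are made up (the real contents of test1.csv are not shown), and it assumes the value wanted for each key is the "status" column rather than the whole line. The comments name the PySpark RDD calls that each step roughly corresponds to.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for test1.csv (real contents unknown).
lines = [
    "user_id,status",
    "u1,active",
    "u2,inactive",
    "u1,pending",
]

# Roughly: header = data.first(); data = data.filter(lambda x: x != header)
header = lines[0]
rows = [line for line in lines if line != header]

# Roughly: data.map(lambda x: (x.split(",")[0], x.split(",")[1]))
# Assumption: the value of interest is the status column, not the full line.
pairs = [tuple(line.split(",", 1)) for line in rows]

# Roughly: pairs.groupByKey().mapValues(list)
grouped = defaultdict(list)
for user_id, status in pairs:
    grouped[user_id].append(status)

# Roughly: grouped.values().collect(), giving one list of statuses per user_id
grouped_val = list(grouped.values())
print(grouped_val)
```

In this plain-Python version the groups come out in first-seen key order. In Spark, the order of keys after groupByKey (and of values within a group) is not guaranteed, so results should not be compared by position.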