Hi Mishra, 

I haven’t tested anything, but:

> grouped_val= grouped.map(lambda x : (list(x[1]))).collect()

What is x[1] here?
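To make the question concrete, here is a small plain-Python sketch (no Spark needed) of what indexing with [1] returns at each step. The sample record "u1,active" is made up for illustration; the real contents of test1.csv are not shown in this thread.

```python
# Hypothetical sample record (an assumption, not taken from the real file).
line = "u1,active"

# Indexing the raw string returns its second character, not the status column.
second_char = line[1]

# Splitting on the comma first gives the two columns.
user_id, status = line.split(",")

# The original code builds (user_id, whole line), so after groupByKey
# x[1] iterates over whole lines for that user, not over bare statuses.
pair = (user_id, line)

print(second_char)
print(status)
print(pair[1])
```

So whether x[1] is useful depends on what was put into the pair as the value.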


data = sc.textFile('file:///home/cloudera/LDA-Model/Pyspark/test1.csv')
header = data.first()  # extract header
data = data.filter(lambda x: x != header)  # filter out header
# only pass the status as the value (split first; x[1] on the raw line is just its second character)
pairs = data.map(lambda x: (x.split(",")[0], x.split(",")[1]))
grouped = pairs.groupByKey()  # each element is (user_id, iterable of statuses for that user)
print grouped.mapValues(list).collect()  # turn each iterable into a list before printing

Is this what you want?
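If it helps to check the expected shape without a cluster, the same steps can be sketched in plain Python. This is only a local stand-in, not Spark itself: the rows are invented, and a defaultdict plays the role of groupByKey.

```python
from collections import defaultdict

# Hypothetical rows standing in for test1.csv (header plus three records).
lines = ["user_id,status", "u1,active", "u2,idle", "u1,away"]

header = lines[0]
rows = [l for l in lines if l != header]  # same idea as the filter step

# Build (user_id, status) pairs: split the line first, then index the columns.
pairs = [(r.split(",")[0], r.split(",")[1]) for r in rows]

# Local stand-in for groupByKey: collect every value under its key.
grouped = defaultdict(list)
for user_id, status in pairs:
    grouped[user_id].append(status)

# Counterpart of grouped.map(lambda x: list(x[1])).collect()
grouped_val = [list(v) for v in grouped.values()]
print(grouped_val)
```

On a real RDD the order of the keys after groupByKey is not guaranteed, so the lists may come back in a different order.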



Jorge Machado
www.jmachado.me





> On 23/02/2016, at 20:26, Mishra, Abhishek <[email protected]> wrote:
> 
> Hello All,
>  
>  
> I am new to Spark and Python and have a question; please advise.
> 
> I have a CSV file with two columns, “user_id” and “status”. I read it into an
> RDD and removed the header, then split each record on “,” (comma) to build a
> pair RDD, and called groupByKey on that RDD. When I try to pull only the
> values out of the grouped RDD into a list, I get exceptions. How can I get
> just the values from the grouped RDD and store them? I am trying to extract
> them using x[1]. The PySpark code is below:
>  
> data = sc.textFile('file:///home/cloudera/LDA-Model/Pyspark/test1.csv')
> header = data.first() #extract header
> data = data.filter(lambda x:x !=header)    #filter out header
> pairs = data.map(lambda x: (x.split(",")[0], x))#.collect()#generate pair rdd key value
> grouped=pairs.groupByKey()#grouping values as per key 
> grouped_val= grouped.map(lambda x : (list(x[1]))).collect()
> print grouped_val
>  
>  
>  
> Thanks in Advance,
> Sincerely,
> Abhishek
