Hi Mishra,
I haven’t tested anything, but:
> grouped_val= grouped.map(lambda x : (list(x[1]))).collect()
what is x[1]?
data = sc.textFile('file:///home/cloudera/LDA-Model/Pyspark/test1.csv')
header = data.first() #extract header
data = data.filter(lambda x:x !=header) #filter out header
pairs = data.map(lambda x: (x.split(",")[0], x.split(",")[1])) # <----- pass only the status as the value
grouped = pairs.groupByKey() # <----- each element is (user_id, iterable of statuses for that user)
print grouped.mapValues(list).collect() # materialise each group as a list
Is this what you want?
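Untested as well, but here is the same logic in plain Python (no Spark), so you can see what the grouping should produce. The sample rows are made up for illustration:

```python
# A minimal sketch of the pipeline above without Spark, assuming a tiny
# CSV with a header and two columns: user_id,status (sample data invented).
rows = [
    "user_id,status",
    "u1,active",
    "u1,idle",
    "u2,active",
]

header = rows[0]
data = [r for r in rows if r != header]                      # filter out header
pairs = [(r.split(",")[0], r.split(",")[1]) for r in data]   # (user_id, status)

grouped = {}
for user_id, status in pairs:                                # groupByKey equivalent
    grouped.setdefault(user_id, []).append(status)

grouped_val = list(grouped.values())                         # just the value lists
print(grouped_val)                                           # [['active', 'idle'], ['active']]
```

If that is the shape you are after, the Spark version above should give you the same thing per user.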
Jorge Machado
www.jmachado.me
> On 23/02/2016, at 20:26, Mishra, Abhishek <[email protected]> wrote:
>
> Hello All,
>
>
> I am new to Spark and Python; here is my doubt, please suggest…
>
> I have a csv file which has 2 columns, “user_id” and “status”.
> I have read it into an RDD and removed the header of the csv file. Then I
> split each record on “,” (comma) to generate a pair RDD, and on that RDD I
> call groupByKey. Now, when I try to gather only the values from the grouped
> RDD and create a list, I get exceptions. Here is my code. Please suggest how
> I can get just the values from the grouped RDD and store them; the csv has 2
> columns and I am trying to extract using x[1]. The code in pyspark:
>
> data = sc.textFile('file:///home/cloudera/LDA-Model/Pyspark/test1.csv')
> header = data.first() #extract header
> data = data.filter(lambda x:x !=header) #filter out header
> pairs = data.map(lambda x: (x.split(",")[0], x)) #.collect() # generate pair RDD (key, value)
> grouped = pairs.groupByKey() # grouping values per key
> grouped_val= grouped.map(lambda x : (list(x[1]))).collect()
> print grouped_val
>
>
>
> Thanks in Advance,
> Sincerely,
> Abhishek