PairRDD(K, L) to multiple files by key serializing each value in L before

Daniel Valdivia Tue, 15 Dec 2015 17:06:12 -0800

Hello everyone,

I have a PairRDD of keys and lists of values. Each value in a list is a JSON 
object that I loaded at the beginning of my Spark app. How can I iterate 
over each value in the list of my PairRDD to transform it to a string, and 
then save the whole content of each key to a file, one file per key?


My input files (for example cat-0-500.txt) look like this:

{cat:'red',value:'asd'}
{cat:'green',value:'zxc'}
{cat:'red',value:'jkl'}

The PairRDD looks like

('red', [{cat:'red',value:'asd'},{cat:'red',value:'jkl'}])
('green', [{cat:'green',value:'zxc'}])

So, as you can see, I'd like to serialize each JSON object in the value list 
back to a string so I can easily call saveAsTextFile(). Of course, I'm trying 
to save a separate file for each key.

The way I got here:

import json

# Load the raw file; each line holds one JSON record
rawcatRdd = sc.textFile("hdfs://x.x.x.../unstructured/cat-0-500.txt")
categoriesJson = rawcatRdd.map(lambda x: json.loads(x))
categories = categoriesJson

# Key each record by its category, then group the records per key
catByDate = categories.map(lambda x: (x['cat'], x))
catGroup = catByDate.groupByKey()
catGroupArr = catGroup.mapValues(lambda x: list(x))

Ideally I want to create a cat-red.txt that looks like:

{cat:'red',value:'asd'}
{cat:'red',value:'jkl'}

and the same for the rest of the keys.
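
To make the target concrete, here is a minimal sketch of the serialization and per-key write in plain Python. It assumes the grouped data is small enough to bring back to the driver with collect() and that writing to a local directory is acceptable. The helper names serialize_values and write_group, the temporary output directory, and the valid-JSON sample records are all made up for illustration.

```python
import json
import os
import tempfile


def serialize_values(values):
    """Turn a list of dicts into newline-delimited JSON text."""
    return "\n".join(json.dumps(v, sort_keys=True) for v in values)


def write_group(out_dir, key, values):
    """Write all values of one key to <out_dir>/cat-<key>.txt; return the path."""
    path = os.path.join(out_dir, "cat-%s.txt" % key)
    with open(path, "w") as fh:
        fh.write(serialize_values(values) + "\n")
    return path


# Local stand-in for catGroupArr.collect(). In the pyspark shell the loop
# would presumably iterate over catGroupArr.collect() instead (assumption:
# the grouped data fits in driver memory).
groups = [
    ("red", [{"cat": "red", "value": "asd"}, {"cat": "red", "value": "jkl"}]),
    ("green", [{"cat": "green", "value": "zxc"}]),
]

out_dir = tempfile.mkdtemp()  # hypothetical output location
paths = [write_group(out_dir, key, values) for key, values in groups]
```

Note that json.dumps emits double-quoted keys and strings, so the files would contain standard JSON rather than the shorthand notation used in this message.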

I already looked at this answer 
<http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job>
but I'm slightly lost as to how to process each value in the list (turn it 
into a string) before I save the contents to a file. I also cannot figure out 
how to import MultipleTextOutputFormat in Python.
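
One possible way to handle the "turn into string" step, sketched under assumptions: serialize each record before grouping, so every key carries text rather than a list of dicts. The function names to_keyed_line and join_lines are hypothetical, and the Spark calls appear only in a comment because this sketch was checked on a plain Python list, not on a cluster.

```python
import json


def to_keyed_line(record):
    """Map one parsed record to a (category, JSON text) pair."""
    return (record["cat"], json.dumps(record, sort_keys=True))


def join_lines(a, b):
    """Combine two chunks of text that belong to the same key."""
    return a + "\n" + b


# In pyspark this would read roughly (assumption, not run here):
#   linesByCat = categories.map(to_keyed_line).reduceByKey(join_lines)
# The loop below imitates that on a local list.
records = [
    {"cat": "red", "value": "asd"},
    {"cat": "green", "value": "zxc"},
    {"cat": "red", "value": "jkl"},
]

lines_by_cat = {}
for key, line in map(to_keyed_line, records):
    if key in lines_by_cat:
        lines_by_cat[key] = join_lines(lines_by_cat[key], line)
    else:
        lines_by_cat[key] = line
```

With a real reduceByKey the order in which lines are joined is not guaranteed, so this only fits if line order within a file does not matter.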

I'm trying all this wacky stuff in the pyspark shell

Any advice would be greatly appreciated

Thanks in advance!
