Hello Daniel,
I was wondering if you could write something like:

def create_and_write_file(line):
    # 1. line[0] is the key, line[1] is the list of values for that key
    # 2. open a file whose name is derived from the key
    with open('cat-%s.txt' % line[0], 'w') as f:
        # 3. iterate through the values of this (key, values) pair
        for ele in line[1]:
            # 4. write every ele into the file (json.dumps turns the
            #    dict back into a string)
            f.write(json.dumps(ele) + '\n')
    # 5. the with-block closes the file on exit

catGroupArr.foreach(create_and_write_file)

(Note: foreach rather than map, since map is lazy and the side effect
would never run without an action.)
Do you think this works?
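Spark aside, the write logic itself can be sanity-checked with plain
Python; a minimal sketch, using a list literal as a stand-in for
catGroupArr.collect() and a temp directory instead of your real paths:

```python
import json
import os
import tempfile

# Stand-in for catGroupArr.collect(): (key, [dict, ...]) pairs,
# matching the PairRDD shape from the thread.
grouped = [
    ('red', [{'cat': 'red', 'value': 'asd'}, {'cat': 'red', 'value': 'jkl'}]),
    ('green', [{'cat': 'green', 'value': 'zxc'}]),
]

outdir = tempfile.mkdtemp()
for key, values in grouped:
    # one file per key; each value serialized back to a line of text
    with open(os.path.join(outdir, 'cat-%s.txt' % key), 'w') as f:
        for ele in values:
            f.write(json.dumps(ele) + '\n')
```

This writes cat-red.txt with two lines and cat-green.txt with one, which
is the per-key layout Daniel described.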
Thanks
Abhishek S
On Wed, Dec 16, 2015 at 1:05 AM, Daniel Valdivia <[email protected]>
wrote:
> Hello everyone,
>
> I have a PairRDD with a set of key and list of values, each value in the
> list is a json which I already loaded beginning of my spark app, how can I
> iterate over each value of the list in my pair RDD to transform it to a
> string then save the whole content of the key to a file? one file per key
>
> my input files look like cat-0-500.txt:
>
> {cat:'red',value:'asd'}
> {cat:'green',value:'zxc'}
> {cat:'red',value:'jkl'}
>
> The PairRDD looks like
>
> ('red', [{cat:'red',value:'asd'},{cat:'red',value:'jkl'}])
> ('green', [{cat:'green',value:'zxc'}])
>
> so as you can see I'd like to serialize each json in the value list back
> to a string so I can easily saveAsTextFile(); of course, I'm trying to
> save a separate file for each key
>
> The way I got here:
>
> rawcatRdd = sc.textFile("hdfs://x.x.x.../unstructured/cat-0-500.txt")
> import json
> categoriesJson = rawcatRdd.map(lambda x: json.loads(x))
> categories = categoriesJson
>
> catByDate = categories.map(lambda x: (x['cat'], x))
> catGroup = catByDate.groupByKey()
> catGroupArr = catGroup.mapValues(lambda x : list(x))
>
> Ideally I want to create a cat-red.txt that looks like:
>
> {cat:'red',value:'asd'}
> {cat:'red',value:'jkl'}
>
> and the same for the rest of the keys.
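> In plain-Python terms (Spark aside), what I want for one key is just
> each dict serialized back onto its own line, something like:

```python
import json

# values as they come out of the PairRDD for key 'red'
values = [{'cat': 'red', 'value': 'asd'}, {'cat': 'red', 'value': 'jkl'}]

# one serialized json per line -- the desired body of cat-red.txt
body = '\n'.join(json.dumps(ele) for ele in values)
```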
>
> I already looked at this answer
> <http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job>
> but I'm slightly lost as to how to process each value in the list (turn
> it into a string) before I save the contents to a file; I also cannot
> figure out how to import MultipleTextOutputFormat in Python.
>
> I'm trying all this wacky stuff in the pyspark shell
>
> Any advice would be greatly appreciated
>
> Thanks in advance!
>