I'm a Spark newbie working on my first attempt at writing an ETL program. I could use some feedback to make sure I'm on the right path. I've written a basic proof of concept that runs without errors and seems to work, although I may be missing issues that only show up when this runs on more than a single node.

I am working with data about people (actually healthcare patients). I have an RDD that contains multiple rows per person. My overall goal is to create a single Person object for each person in my data. In this example, I am serializing to JSON, mostly because this is what I know how to do at the moment.

Other than general feedback, is my use of the groupByKey() and mapValues() methods appropriate here? (I've also sketched the aggregateByKey alternative I've been wondering about, after the code below.)

Thanks!


import json

class Person:
    """Accumulates one patient's data across multiple input rows."""

    def __init__(self):
        self.mydata = {}
        self.cpts = []
        self.mydata['cpt'] = self.cpts

    def addRowData(self, dataRow):
        # Collect the CPT code from this row, if present
        cpt = dataRow.CPT_1
        if cpt:
            self.cpts.append(cpt)

    def serializeToJSON(self):
        return json.dumps(self.mydata)

def makeAPerson(rows):
    # Fold all of one person's rows into a single Person, then serialize it
    person = Person()
    for row in rows:
        person.addRowData(row)
    return person.serializeToJSON()

# caper_kv is my existing pair RDD of (person_key, data_row)
peopleRDD = caper_kv.groupByKey().mapValues(makeAPerson)
peopleRDD.saveAsTextFile("hdfs://localhost:9000/sma/processJSON/people")
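
For context on the groupByKey() question: I've read that groupByKey() ships every row for a key across the shuffle before anything is combined, and that aggregateByKey() can build the per-person list incrementally instead. This is the rough, untested sketch of the alternative I've been considering (addRow, mergeLists, and peopleJSON are just names I made up for this sketch; caper_kv and CPT_1 are the same as in my code above):

def addRow(cpts, row):
    # Runs within a partition: append this row's CPT code to the running list
    if row.CPT_1:
        cpts.append(row.CPT_1)
    return cpts

def mergeLists(a, b):
    # Merge the partial CPT lists built on different partitions
    return a + b

peopleJSON = (caper_kv
              .aggregateByKey([], addRow, mergeLists)
              .mapValues(lambda cpts: json.dumps({'cpt': cpts})))

I haven't benchmarked either version, so I don't know whether this actually matters at my data size, which is partly why I'm asking.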

