I'm a Spark newbie working on my first attempt at writing an ETL program. I could use some feedback to make sure I'm on the right path. I've written a basic proof of concept that runs without errors and seems to work, although I may be missing issues that only show up when this runs on more than a single node.

I am working with data about people (actually healthcare patients). I have an RDD that contains multiple rows per person. My overall goal is to create a single Person object for each person in my data. In this example, I am serializing to JSON, mostly because this is what I know how to do at the moment.

Other than general feedback, is my use of the groupByKey() and mapValues() methods appropriate here? (I've also sketched the aggregateByKey alternative I've been wondering about, after the code below.)

Thanks!


import json

class Person:
    """Accumulates one patient's data across multiple input rows."""

    def __init__(self):
        self.mydata = {}
        self.cpts = []
        self.mydata['cpt'] = self.cpts

    def addRowData(self, dataRow):
        # Collect the CPT code from this row, if present
        cpt = dataRow.CPT_1
        if cpt:
            self.cpts.append(cpt)

    def serializeToJSON(self):
        return json.dumps(self.mydata)

def makeAPerson(rows):
    # Fold all of one person's rows into a single Person, then serialize it
    person = Person()
    for row in rows:
        person.addRowData(row)
    return person.serializeToJSON()

# caper_kv is my existing pair RDD of (person_key, data_row)
peopleRDD = caper_kv.groupByKey().mapValues(makeAPerson)
peopleRDD.saveAsTextFile("hdfs://localhost:9000/sma/processJSON/people")
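
For context on the groupByKey() question: I've read that groupByKey() ships every row for a key across the shuffle before anything is combined, and that aggregateByKey() can build the per-person list incrementally instead. This is the rough, untested sketch of the alternative I've been considering (addRow, mergeLists, and peopleJSON are just names I made up for this sketch; caper_kv and CPT_1 are the same as in my code above):

def addRow(cpts, row):
    # Runs within a partition: append this row's CPT code to the running list
    if row.CPT_1:
        cpts.append(row.CPT_1)
    return cpts

def mergeLists(a, b):
    # Merge the partial CPT lists built on different partitions
    return a + b

peopleJSON = (caper_kv
              .aggregateByKey([], addRow, mergeLists)
              .mapValues(lambda cpts: json.dumps({'cpt': cpts})))

I haven't benchmarked either version, so I don't know whether this actually matters at my data size, which is partly why I'm asking.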

