I'm a Spark newbie working on my first attempt at writing an ETL
program. I could use some feedback to make sure I'm on the right path.
I've written a basic proof of concept that runs without errors and seems
to work, although I might be missing issues that only show up when this
runs on more than a single node.
I am working with data about people (actually healthcare patients). I
have an RDD that contains multiple rows per person. My overall goal is
to create a single Person object for each person in my data. In this
example, I am serializing to JSON, mostly because this is what I know
how to do at the moment.
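For context, caper_kv is a pair RDD keyed by a patient identifier, built
roughly like the sketch below (the variable and column names here are
placeholders, not my real schema):

# Hypothetical setup: key each raw row by its patient ID so that all of
# one person's rows end up under the same key before grouping.
caper_kv = raw_rows_rdd.map(lambda row: (row.PERSON_ID, row))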
Other than general feedback, is my use of the groupByKey() and
mapValues() methods appropriate?
Thanks!
import json

class Person:
    def __init__(self):
        # All serializable state lives in mydata; cpts is kept as a
        # separate reference so codes can be appended directly.
        self.mydata = {}
        self.cpts = []
        self.mydata['cpt'] = self.cpts

    def addRowData(self, dataRow):
        # Get the CPT code from this row, skipping empty values
        cpt = dataRow.CPT_1
        if cpt:
            self.cpts.append(cpt)

    def serializeToJSON(self):
        return json.dumps(self.mydata)

def makeAPerson(rows):
    # Fold all of one person's rows into a single Person, then serialize
    person = Person()
    for row in rows:
        person.addRowData(row)
    return person.serializeToJSON()

# caper_kv is a pair RDD keyed by person: group all of a person's rows,
# then collapse each group into one JSON string.
peopleRDD = caper_kv.groupByKey().mapValues(
    lambda personDataRows: makeAPerson(personDataRows))
peopleRDD.saveAsTextFile("hdfs://localhost:9000/sma/processJSON/people")
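As a quick sanity check, the output can be read back and parsed (sc here
is the SparkContext, and the path is the same one written above):

# Read one serialized person back and confirm the JSON parses.
check = sc.textFile("hdfs://localhost:9000/sma/processJSON/people")
print(json.loads(check.first()))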