Thanks, Charles. I just realized a few minutes ago that I neglected to show the step where I generated the key on the person ID. Thanks also for the pointer on the HDFS URL.

The next step is to process data from multiple RDDs. My data originates from 7 tables in a MySQL database. I used Sqoop to create Avro files from these tables, and in turn created RDDs from the Avro files using Spark SQL. Since groupByKey only operates on a single RDD, I'm not quite sure yet how I'm going to process 7 tables in one transformation to get all the data I need into my objects.
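One approach I've been sketching (just a sketch, not working code; table_rdds, build_person, and the person_id field are stand-ins for whatever my real schema turns out to be) is to key each table's RDD on the person ID, tag rows with their source table, union all 7 RDDs, and then groupByKey:

    def key_by_person(rdd, table_name):
        # assumes every row coming out of Spark SQL exposes a person_id column
        return rdd.map(lambda row: (row.person_id, (table_name, row)))

    # table_rdds: a dict of {table_name: rdd}, one entry per source table
    keyed = [key_by_person(rdd, name) for name, rdd in table_rdds.items()]

    all_rows = sc.union(keyed)      # one RDD spanning all 7 tables
    people = all_rows.groupByKey()  # (person_id, rows from every table)

    persons = people.map(lambda kv: build_person(kv[0], kv[1]))

I haven't actually run this yet, so I may be missing something obvious.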

I'm vacillating on whether I should be doing it this way, or whether it would be a lot simpler to query MySQL for all the Person IDs, parallelize them, and have my Person class query the MySQL database directly. Since in theory I only have to do this once, I'm not sure there's much to be gained by moving the data from MySQL to Spark first.
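The direct-query alternative I have in mind is roughly this (again only a sketch; fetch_person and the connection details are placeholders, and I'd open one connection per partition rather than one per ID):

    def load_people(id_iter):
        import MySQLdb  # imported on the worker, not the driver
        conn = MySQLdb.connect(host="dbhost", user="etl", passwd="...", db="mydb")
        try:
            for person_id in id_iter:
                # the 7-table lookup for a single person
                yield fetch_person(conn, person_id)
        finally:
            conn.close()

    person_ids = [...]  # pulled on the driver with a plain SELECT against MySQL
    people = sc.parallelize(person_ids, 64).mapPartitions(load_people)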

I have yet to find any non-trivial examples of ETL logic on the web ... it seems to be mostly word-count MapReduce replacements.

On 02/16/2015 01:32 PM, Charles Feduke wrote:
I cannot comment about the correctness of Python code. I will assume your caper_kv is keyed on something that uniquely identifies all the rows that make up the person's record so your group by key makes sense, as does the map. (I will also assume all of the rows that comprise a single person's record will always fit in memory. If not you will need another approach.)

You should be able to get away with removing the "localhost:9000" from your HDFS URL, i.e., "hdfs:///sma/processJSON/people" and let your HDFS configuration for Spark supply the missing pieces.
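For example, assuming you're writing the output with saveAsTextFile (substitute whichever action you're actually calling):

    people.saveAsTextFile("hdfs:///sma/processJSON/people")  # namenode comes from fs.defaultFS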


