Thanks Charles. I just realized a few minutes ago that I neglected to
show the step where I generated the key on the person ID. Thanks for the
pointer on the HDFS URL.
The next step is to process data from multiple RDDs. My data originates
from 7 tables in a MySQL database. I used Sqoop to create Avro files from
these tables, and in turn created RDDs from the Avro files using Spark
SQL. Since groupByKey only operates on a single RDD, I'm not yet sure how
I'm going to process 7 tables in one transformation to get all the data I
need into my objects.
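Roughly what I have in mind for the multi-RDD route is sketched below.
The names are placeholders: person_rdd and address_rdd stand in for the
RDDs I already built from the Avro files, person_id for whatever the key
column actually is, and build_person for my own assembly function.

def key_by_person(table_name, rdd):
    # Tag each row with its source table, keyed on the person ID column.
    return rdd.map(lambda row: (row.person_id, (table_name, row)))

keyed = [
    key_by_person("person", person_rdd),
    key_by_person("address", address_rdd),
    # ... same for the other five tables
]

# union() glues the keyed RDDs together; groupByKey() then yields one
# (person_id, iterable of (table_name, row)) pair per person, with rows
# from all seven tables in that iterable.
combined = sc.union(keyed)
people = combined.groupByKey().map(lambda kv: build_person(kv[0], list(kv[1])))

cogroup would presumably also work here if keeping each table's rows
separate turns out to be cleaner than tagging them with the table name.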
I'm vacillating on whether I should be doing it this way, or whether it
would be a lot simpler to query MySQL for all the Person IDs,
parallelize them, and have my Person class query the MySQL database
directly. Since in theory I only have to do this once, I'm not sure
there's much to be gained by moving the data from MySQL to Spark first.
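The direct-MySQL alternative would look roughly like this (dbhost, the
sma database, the credentials, person_ids, and build_person_from_mysql
are all placeholders):

import MySQLdb  # assumes the MySQL-python driver is installed on every worker

def build_people(pids):
    # One connection per partition rather than one per person.
    conn = MySQLdb.connect(host="dbhost", user="etl", passwd="secret", db="sma")
    try:
        for pid in pids:
            # Placeholder: run the per-table queries for this person and
            # assemble the Person object.
            yield build_person_from_mysql(conn, pid)
    finally:
        conn.close()

# person_ids is a plain Python list fetched once from MySQL on the driver.
people = sc.parallelize(person_ids, 64).mapPartitions(build_people)

mapPartitions keeps it to one connection per partition, but MySQL still
ends up answering 7 queries per person, so it may not actually be any
lighter on the database than the one-time Sqoop export.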
I have yet to find any non-trivial examples of ETL logic on the web ...
it seems to be mostly word-count MapReduce replacements.
On 02/16/2015 01:32 PM, Charles Feduke wrote:
I cannot comment about the correctness of Python code. I will assume
your caper_kv is keyed on something that uniquely identifies all the
rows that make up the person's record so your group by key makes
sense, as does the map. (I will also assume all of the rows that
comprise a single person's record will always fit in memory. If not,
you will need another approach.)
You should be able to get away with removing the "localhost:9000" from
your HDFS URL, i.e., "hdfs:///sma/processJSON/people" and let your
HDFS configuration for Spark supply the missing pieces.
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org