Thanks, Charles. I just realized a few minutes ago that I neglected to show the step where I generated the key on the person ID. Thanks also for the pointer on the HDFS URL.

The next step is to process data from multiple RDDs. My data originates from 7 tables in a MySQL database. I used Sqoop to create Avro files from these tables, and in turn created RDDs from the Avro files using Spark SQL. Since groupByKey only operates on a single RDD, I'm not quite sure yet how I'm going to process 7 tables in one transformation to get all the data I need into my objects.
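One approach I've been sketching (just a sketch, not working code; table_rdds, build_person, and the person_id field are stand-ins for whatever my real schema turns out to be) is to key each table's RDD on the person ID, tag rows with their source table, union all 7 RDDs, and then groupByKey:

    def key_by_person(rdd, table_name):
        # assumes every row coming out of Spark SQL exposes a person_id column
        return rdd.map(lambda row: (row.person_id, (table_name, row)))

    # table_rdds: a dict of {table_name: rdd}, one entry per source table
    keyed = [key_by_person(rdd, name) for name, rdd in table_rdds.items()]

    all_rows = sc.union(keyed)      # one RDD spanning all 7 tables
    people = all_rows.groupByKey()  # (person_id, rows from every table)

    persons = people.map(lambda kv: build_person(kv[0], kv[1]))

I haven't actually run this yet, so I may be missing something obvious.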

I'm vacillating on whether I should be doing it this way, or whether it would be a lot simpler to query MySQL for all the Person IDs, parallelize them, and have my Person class query the MySQL database directly. Since in theory I only have to do this once, I'm not sure there's much to be gained by moving the data from MySQL to Spark first.
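The direct-query alternative I have in mind is roughly this (again only a sketch; fetch_person and the connection details are placeholders, and I'd open one connection per partition rather than one per ID):

    def load_people(id_iter):
        import MySQLdb  # imported on the worker, not the driver
        conn = MySQLdb.connect(host="dbhost", user="etl", passwd="...", db="mydb")
        try:
            for person_id in id_iter:
                # the 7-table lookup for a single person
                yield fetch_person(conn, person_id)
        finally:
            conn.close()

    person_ids = [...]  # pulled on the driver with a plain SELECT against MySQL
    people = sc.parallelize(person_ids, 64).mapPartitions(load_people)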

I have yet to find any non-trivial examples of ETL logic on the web ... it seems to be mostly word-count MapReduce replacements.

On 02/16/2015 01:32 PM, Charles Feduke wrote:
I cannot comment about the correctness of Python code. I will assume your caper_kv is keyed on something that uniquely identifies all the rows that make up the person's record so your group by key makes sense, as does the map. (I will also assume all of the rows that comprise a single person's record will always fit in memory. If not you will need another approach.)

You should be able to get away with removing the "localhost:9000" from your HDFS URL, i.e., "hdfs:///sma/processJSON/people" and let your HDFS configuration for Spark supply the missing pieces.
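For example, assuming you're writing the output with saveAsTextFile (substitute whichever action you're actually calling):

    people.saveAsTextFile("hdfs:///sma/processJSON/people")  # namenode comes from fs.defaultFS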


