If I understand this correctly, you could join area_code_user and area_code_state, then flatMap to get (user, area_code, state) pairs, and then groupBy/reduceByKey on user.
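To make that concrete, here's a minimal plain-Python emulation of the join-then-flatMap pipeline (no Spark cluster needed). The sample data and variable names are illustrative, not from the thread; in PySpark the same steps would be `join`, `flatMap`, and `reduceByKey` on RDDs.

```python
from collections import defaultdict

# (area_code, [user_ids]) and (area_code, state), as in the question
area_code_user = [(615, [1234567, 7654321]), (423, [1234567])]
area_code_state = [(615, "Tennessee"), (423, "Tennessee")]

# join on area_code: (area_code, ([user_ids], state))
state_by_code = dict(area_code_state)
joined = [(code, (users, state_by_code[code]))
          for code, users in area_code_user if code in state_by_code]

# flatMap: emit one (user_id, state) pair per user in the row
pairs = [(user, state) for _, (users, state) in joined for user in users]

# reduceByKey(lambda a, b: a + b) over singleton state lists
user_to_states = defaultdict(list)
for user, state in pairs:
    user_to_states[user].append(state)

print(dict(user_to_states))
# {1234567: ['Tennessee', 'Tennessee'], 7654321: ['Tennessee']}
```

The key difference from the approach in the quoted message: states and user_ids are never mixed into one untyped list, so no helper has to re-discover which elements are which.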
You can also try some join optimizations, like partitioning on area code or broadcasting the smaller table, depending on the size of area_code_state.

On Jul 22, 2015 10:15 AM, "John Berryman" <jo...@eventbrite.com> wrote:

> Quick example problem that's stumping me:
>
> * Users have 1 or more phone numbers and therefore one or more area codes.
> * There are 100M users.
> * States have one or more area codes.
> * I would like to find the states for the users (as indicated by phone
>   area code).
>
> I was thinking about something like this:
>
> If area_code_user looks like (area_code, [user_id]), e.g. (615, [1234567]),
> and area_code_state looks like (area_code, state), e.g. (615, ["Tennessee"]),
> then we could do
>
>     states_and_users_mixed = area_code_user.join(area_code_state) \
>         .reduceByKey(lambda a, b: a + b) \
>         .values()
>
>     user_state_pairs = states_and_users_mixed.flatMap(
>         emit_cartesian_prod_of_userids_and_states)
>     user_to_states = user_state_pairs.reduceByKey(lambda a, b: a + b)
>
>     user_to_states.first(1)
>
>     >>> (1234567, ["Tennessee", "Tennessee", "California"])
>
> This would work, but user_state_pairs is just a list of user_ids and
> state names mixed together, and emit_cartesian_prod_of_userids_and_states
> has to correctly pair them. This is problematic because 1) it's weird and
> sloppy, and 2) there will be lots of users per state, and having so many
> users in a single row is going to make
> emit_cartesian_prod_of_userids_and_states work extra hard to first locate
> the states and then emit all userid-state pairs.
>
> How should I be doing this?
>
> Thanks,
> -John
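Since area_code_state is tiny compared to 100M users, the broadcast option is worth spelling out. Below is a plain-Python sketch of the map-side join idea (hypothetical sample data); in real PySpark you would wrap the small dict in `sc.broadcast(...)` and reference `.value` inside the map function, which avoids shuffling the big table at all.

```python
# Map-side (broadcast) join: ship the small area_code -> state lookup
# to every task and map the big table locally, with no join shuffle.
area_code_user = [(615, [1234567]), (415, [9999999]), (615, [42])]
small_table = {615: "Tennessee", 415: "California"}  # the "broadcast" side

# Each row of the big table is mapped independently: (user_id, state)
user_state_pairs = [(user, small_table[code])
                    for code, users in area_code_user
                    if code in small_table  # inner-join semantics
                    for user in users]

print(user_state_pairs)
# [(1234567, 'Tennessee'), (9999999, 'California'), (42, 'Tennessee')]
```

With only a few hundred US area codes, the broadcast dict is a few KB, so this trade (a copy per executor in exchange for no shuffle of 100M rows) is almost always a win here.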