If I understand this correctly, you could join area_code_user and
area_code_state on area code, then flatMap the result into
(user, area_code, state) records, and finally group or reduce by user.
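
A rough sketch of that pipeline, written against plain Python lists so it
runs without a cluster. The comments name the RDD call each step stands in
for. The sample data and the helper name explode_users are made up for
illustration, and I'm assuming each area code maps to exactly one state.

```python
from collections import defaultdict

# Hypothetical sample data in the shapes described in the original message.
area_code_user = [(615, [1234567, 42]), (415, [1234567]), (901, [42])]
area_code_state = [(615, "Tennessee"), (415, "California"), (901, "Tennessee")]

def explode_users(record):
    """flatMap function: (area_code, ([user_id], state)) -> (user_id, state) pairs."""
    _area_code, (user_ids, state) = record
    return [(user_id, state) for user_id in user_ids]

# Stand-in for area_code_user.join(area_code_state): inner join on area code.
# Assumes one state per area code, so a dict lookup is enough here.
state_by_code = dict(area_code_state)
joined = [(code, (users, state_by_code[code]))
          for code, users in area_code_user if code in state_by_code]

# Stand-in for .flatMap(explode_users)
user_state_pairs = [pair for record in joined for pair in explode_users(record)]

# Stand-in for .mapValues(lambda s: {s}).reduceByKey(lambda a, b: a | b)
grouped = defaultdict(set)
for user_id, state in user_state_pairs:
    grouped[user_id].add(state)
user_to_states = dict(grouped)
```

Because every joined record already carries its state next to its user ids,
the flatMap function never has to pick states out of a mixed list. Reducing
into sets also drops the duplicate "Tennessee" entries.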

You can also try some join optimizations, such as partitioning both datasets
on area code or broadcasting the smaller table, depending on the size of
area_code_state.
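
For the broadcast option, the idea is a map-side lookup instead of a
shuffle join. A minimal sketch in plain Python follows. The sample data and
the helper name users_with_state are hypothetical. In Spark the dict would
be wrapped with sc.broadcast(...) and read through .value inside the
function.

```python
# Hypothetical sample data; in Spark, area_code_user would be an RDD.
area_code_user = [(615, [1234567, 42]), (415, [1234567]), (999, [7])]
state_by_code = {615: "Tennessee", 415: "California"}  # small lookup table

# In Spark this dict would be shipped to the executors with
#   bc = sc.broadcast(state_by_code)
# and read inside the function as bc.value.
def users_with_state(record, lookup):
    """Emit (user_id, state) pairs; skip area codes missing from the lookup."""
    area_code, user_ids = record
    state = lookup.get(area_code)
    if state is None:
        return []
    return [(user_id, state) for user_id in user_ids]

# Stand-in for area_code_user.flatMap(lambda r: users_with_state(r, bc.value))
user_state_pairs = [pair
                    for record in area_code_user
                    for pair in users_with_state(record, state_by_code)]
```

This only makes sense if area_code_state fits comfortably in memory on each
executor, which seems likely for a table of area codes.
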
On Jul 22, 2015 10:15 AM, "John Berryman" <jo...@eventbrite.com> wrote:

> Quick example problem that's stumping me:
>
> * Users have 1 or more phone numbers and therefore one or more area codes.
> * There are 100M users.
> * States have one or more area codes.
> * I would like to find the states for the users (as indicated by phone area
> code).
>
> I was thinking about something like this:
>
> If area_code_user looks like (area_code,[user_id]) ex: (615,[1234567])
> and area_code_state looks like (area_code,state) ex: (615, ["Tennessee"])
> then we could do
>
> states_and_users_mixed = area_code_user.join(area_code_state) \
>     .reduceByKey(lambda a,b: a+b) \
>     .values()
>
> user_state_pairs = states_and_users_mixed.flatMap(
>         emit_cartesian_prod_of_userids_and_states )
> user_to_states =   user_state_pairs.reduceByKey(lambda a,b: a+b)
>
> user_to_states.first()
>
> >>> (1234567,["Tennessee","Tennessee","California"])
>
> This would work, but the user_state_pairs is just a list of user_ids and
> state names mixed together and emit_cartesian_prod_of_userids_and_states
> has to correctly pair them. This is problematic because 1) it's weird and
> sloppy and 2) there will be lots of users per state and having so many
> users in a single row is going to make
> emit_cartesian_prod_of_userids_and_states work extra hard to first locate
> states and then emit all userid-state pairs.
>
> How should I be doing this?
>
> Thanks,
> -John
>
