Have you thought of using windowing functions
<https://sparkbyexamples.com/spark/spark-sql-window-functions/> to
achieve this?

Effectively all your information is in the orders table.
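For example, something like this sketch, assuming the orders table has
columns user_id, order_id and amount (adjust to your actual schema):

```
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Collect each user's Orders over a window partitioned by user_id,
# then keep a single row per user. Column names are assumptions.
w = Window.partitionBy("user_id")

statements_df = (
    orders_df
    .withColumn("orders",
                F.collect_list(F.struct("order_id", "amount")).over(w))
    .select("user_id", "orders")
    .dropDuplicates(["user_id"])
)
```

Each row of statements_df then carries everything needed to build one
user's statement.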

HTH

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


View my LinkedIn profile:
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 25 Apr 2023 at 00:15, Marco Costantini <
marco.costant...@rocketfncl.com> wrote:

> I have two tables: {users, orders}. In this example, let's say that for
> each User in the users table, there are 100,000 Orders in the orders table.
>
> I have to use pyspark to generate a statement of Orders for each User. So,
> a single user will need his/her own list of Orders. Additionally, I need to
> send this statement to the real-world user via email (for example).
>
> My first intuition was to apply a DataFrame.foreach() on the users
> DataFrame. This way, I can rely on the Spark workers to handle the email
> sending individually. However, I now do not know the best way to get each
> User's Orders.
>
> I will soon try the following (pseudo-code):
>
> ```
> users_df = <my entire users DataFrame>
> orders_df = <my entire orders DataFrame>
>
> # this is named plainly for max understandability in this context
> def foreach_function(row):
>   user_id = row.user_id
>   user_orders_df = orders_df.filter(f'user_id = {user_id}')
>
>   #here, I'd get any User info from 'row'
>   #then, I'd convert all 'user_orders' to JSON
>   #then, I'd prepare the email and send it
>
> users_df.foreach(foreach_function)
> ```
>
> It is my understanding that if I do my user-specific work in the foreach
> function, I will capitalize on Spark's scalability when doing that work.
> However, I am worried about two things:
>
> If I take all Orders up front...
>
> Will that work?
> Will I be taking too much? Will I be pulling Orders onto partitions that
> won't handle them (because they belong to a different User)?
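>
> By "up front" I mean something like this sketch, where the column names
> order_id and amount are placeholders for whatever my orders table holds:
>
> ```
> from pyspark.sql import functions as F
>
> # Collapse Orders to one row per user, carrying that user's Orders
> # as an array of structs, then attach the User info with a join.
> statements_df = (
>     orders_df
>     .groupBy('user_id')
>     .agg(F.collect_list(F.struct('order_id', 'amount')).alias('orders'))
>     .join(users_df, 'user_id')
> )
> ```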
>
> If I create the orders_df (filtered) within the foreach function...
>
> Will it work?
> Will that be too much IO to the DB?
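>
> By "within the foreach function" I mean querying the DB per user from the
> workers, perhaps via foreachPartition so each partition shares one
> connection. The psycopg2 driver and the connection string here are only
> assumptions for the sketch:
>
> ```
> def send_statements(rows):
>     import psycopg2  # assumed DB driver; any DB-API client would do
>     conn = psycopg2.connect('<my connection string>')
>     cur = conn.cursor()
>     for row in rows:
>         # one query per user -- this is the per-user IO I am worried about
>         cur.execute('SELECT order_id, amount FROM orders WHERE user_id = %s',
>                     (row.user_id,))
>         user_orders = cur.fetchall()
>         # here, convert user_orders to JSON, prepare and send the email
>     conn.close()
>
> users_df.foreachPartition(send_statements)
> ```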
>
> The question ultimately is: How can I achieve this goal efficiently?
>
> I have not yet tried anything here. I am doing so as we speak, but am
> suffering from choice-paralysis.
>
> Please and thank you.
>
