Sorry if my terminology is misleading. What I meant by "driver only" is to use a local pandas DataFrame (collect the data to the driver), and keep updating that, instead of dealing with a distributed Spark DataFrame for holding this data.
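In code, a minimal sketch of that pattern (assumptions: Structured Streaming with foreachBatch, a streaming DataFrame `events`, and columns `user_id` / `event_time` -- all of these names are placeholders, not our actual schema):

import pandas as pd
import pyspark.sql.functions as F

# Driver-local state: one row per user, indexed by user id.
state = pd.DataFrame(columns=["user_id", "last_seen"]).set_index("user_id")

def update_state(batch_df, batch_id):
    global state
    # Aggregate the micro-batch down to one row per user, then collect
    # the (small) result to the driver as a pandas DataFrame.
    updates = (batch_df
               .groupBy("user_id")
               .agg(F.max("event_time").alias("last_seen"))
               .toPandas()
               .set_index("user_id"))
    # Upsert: updated users overwrite their old rows, new users are appended,
    # untouched users keep their previous values.
    state = updates.combine_first(state)

query = events.writeStream.foreachBatch(update_state).start()

Since foreachBatch invokes the callback on the driver, mutating the local pandas DataFrame there is safe.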
For example, we have a dataframe with all users and their corresponding latest activity timestamp. After each streaming batch, aggregations are performed and the result is collected to the driver to update a subset of users' latest activity timestamps.

On Sat, 9 Jan 2021, 6:18 pm Artemis User, <arte...@dtechspace.com> wrote:

> Could you please clarify what you mean by 1)? The driver is only
> responsible for submitting the Spark job, not performing it.
>
> -- ND
>
> On 1/9/21 9:35 AM, András Kolbert wrote:
> > Hi,
> > I would like to get your advice on my use case.
> > I have a few Spark streaming applications where I need to keep
> > updating a dataframe after each batch. Each batch probably affects a
> > small fraction of the dataframe (5k out of 200k records).
> >
> > The options I have been considering so far:
> > 1) keep the dataframe on the driver, and update it after each batch
> > 2) keep the dataframe distributed, and use checkpointing to mitigate
> >    lineage growth
> >
> > I solved previous use cases with option 2, but I am not sure it is
> > optimal, as checkpointing is relatively expensive. I also wondered
> > about HBase or some other quick-access in-memory store, but that is
> > currently not in my stack.
> >
> > Curious to hear your thoughts
> >
> > Andras
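P.S. For comparison, option 2 from the original mail (keeping the state distributed and checkpointing to truncate lineage) could look roughly like this, with the same placeholder names as above and an arbitrary checkpoint path and interval:

import pyspark.sql.functions as F

spark.sparkContext.setCheckpointDir("/tmp/state-checkpoints")  # placeholder path

# Distributed state, one row per user.
state_df = spark.createDataFrame([], "user_id string, last_seen timestamp")

def update_state_distributed(batch_df, batch_id):
    global state_df
    updates = (batch_df
               .groupBy("user_id")
               .agg(F.max("event_time").alias("last_seen")))
    # Drop the rows being replaced, then union in the fresh values.
    state_df = (state_df
                .join(updates, "user_id", "left_anti")
                .unionByName(updates))
    # Checkpoint periodically so the query plan does not grow unbounded.
    if batch_id % 10 == 0:
        state_df = state_df.checkpoint()

The checkpoint materializes the state and cuts the lineage, so each batch does not re-evaluate the whole update history; how often to do it is a trade-off between checkpoint I/O cost and plan growth.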