So that's it, no? You can push down the *in* filter in the query with the IDs and only retrieve those rows.
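Something like the rough sketch below (untested; the broker, topic, JDBC URL, credentials, and the user_personal_details / user_id names are placeholders for your own):

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("stream-static-join").getOrCreate()
import spark.implicits._

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder
  .option("subscribe", "user-activity")             // placeholder
  .load()
  .selectExpr("CAST(key AS STRING) AS user_id",
              "CAST(value AS STRING) AS activity")

events.writeStream.foreachBatch { (batch: DataFrame, batchId: Long) =>
  // Collect only the distinct user IDs seen in this micro-batch.
  val ids = batch.select("user_id").distinct().as[String].collect()

  if (ids.nonEmpty) {
    // Push the IN filter down to PostgreSQL so only the matching rows
    // are fetched, instead of reloading the whole table every batch.
    // NOTE: escape/sanitize real IDs; this naive string interpolation
    // is open to SQL injection.
    val inList = ids.map(id => s"'$id'").mkString(", ")
    val details = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/app") // placeholder
      .option("user", "app")                               // placeholder
      .option("password", sys.env("DB_PASSWORD"))
      .option("query",
        s"select * from user_personal_details where user_id in ($inList)")
      .load()

    val joined = batch.join(details, "user_id")
    // write `joined` to the real sink here, e.g. joined.write.parquet(...)
  }
}.start()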
On Thu, 26 Nov 2020 at 16:49, Geervan Hayatnagarkar <pande.a...@gmail.com> wrote:

> Yes, we used collect to get all IDs in foreachBatch.
>
> No, the static table is huge and is updated frequently by other systems.
>
> On Thu, Nov 26, 2020, 9:01 PM fanxin <fanxin...@gmail.com> wrote:
>
>> Hi,
>> If the static table is not particularly big and the modification
>> frequency is low, you can load the whole table as a DataFrame and
>> persist it in memory. You may also need to repartition the DataFrame.
>>
>> On 11/26/2020 21:44, Geervan Hayatnagarkar <pande.a...@gmail.com> wrote:
>>
>> Hi,
>>
>> We intend to do a stream-static join where Kafka is the streaming source
>> and an RDBMS is the static source.
>>
>> E.g. user activity data comes in as a stream from the Kafka source, and
>> we need to pull user personal details from PostgreSQL.
>>
>> Because PostgreSQL is a static source, the entire "User-Personal-Details"
>> table is being reloaded into Spark memory for every micro-batch.
>>
>> Is there a way to optimise this? For example, we should be able to pull
>> the user IDs from every micro-batch and then issue a query like:
>>
>> select * from user-personal-details where user-id in
>> (<list-of-user-ids-from-current-microbatch>)
>>
>> Since we could not find a clean way to do this, we chose to make a JDBC
>> connection in every micro-batch and achieved the above optimisation. But
>> that is still a suboptimal solution, because a new JDBC connection is
>> created for every micro-batch. Is there a way to pool JDBC connections
>> in Structured Streaming?
>>
>> Thanks & regards,
>> Arti Pande
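P.S. On the pooling question at the bottom of the thread: Spark has no built-in JDBC connection pool you can hand to the jdbc source, but if you do the lookup yourself inside foreachBatch you can keep the pool in a per-executor singleton so connections are reused across micro-batches. A rough, untested sketch assuming HikariCP is on the executor classpath (URL, credentials, and table/column names are again placeholders):

import java.sql.Connection
import com.zaxxer.hikari.{HikariConfig, HikariDataSource}
import org.apache.spark.sql.{DataFrame, Row}

// A Scala object is initialized lazily, once per executor JVM, and
// survives across micro-batches, so this pool is created once and
// reused by every task that runs on that executor.
object PgPool {
  lazy val dataSource: HikariDataSource = {
    val cfg = new HikariConfig()
    cfg.setJdbcUrl("jdbc:postgresql://db-host:5432/app") // placeholder
    cfg.setUsername("app")                               // placeholder
    cfg.setPassword(sys.env("DB_PASSWORD"))
    cfg.setMaximumPoolSize(4)
    new HikariDataSource(cfg)
  }
}

// Call this on the batch DataFrame inside foreachBatch; the per-row
// point lookup is only there to keep the pooling part visible.
def lookupDetails(batch: DataFrame): Unit =
  batch.foreachPartition { (rows: Iterator[Row]) =>
    val conn: Connection = PgPool.dataSource.getConnection
    try {
      val stmt = conn.prepareStatement(
        "select * from user_personal_details where user_id = ?")
      rows.foreach { row =>
        stmt.setString(1, row.getAs[String]("user_id"))
        val rs = stmt.executeQuery()
        // ... combine rs with the event and hand off to your sink
        rs.close()
      }
      stmt.close()
    } finally {
      // close() on a pooled connection returns it to the pool rather
      // than tearing it down, so the pool stays warm between batches.
      conn.close()
    }
  }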