Yes, we used collect() to get all the IDs in foreachBatch. No, the static table is huge and is updated frequently by other systems.
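For context, a minimal sketch of what that looks like. The Kafka topic, the table name user_personal_details, the JDBC URL, and the credentials are all placeholders, and it assumes the per-batch ID set is small enough to collect to the driver:

    import org.apache.spark.sql.{DataFrame, SparkSession}

    val spark = SparkSession.builder().appName("stream-static-join").getOrCreate()
    import spark.implicits._

    val activity = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder
      .option("subscribe", "user-activity")             // placeholder topic
      .load()
      .selectExpr("CAST(key AS STRING) AS user_id", "CAST(value AS STRING) AS event")

    // Defined as a method (rather than an inline lambda) to avoid the
    // Scala 2.12 overload ambiguity on foreachBatch.
    def enrich(batch: DataFrame, batchId: Long): Unit = {
      // Distinct IDs in this microbatch; collect() assumes the ID set
      // fits comfortably on the driver.
      val ids = batch.select("user_id").distinct().as[String].collect()

      if (ids.nonEmpty) {
        // Push the IN filter down to PostgreSQL via a subquery in dbtable,
        // so only the matching rows are loaded, not the whole table.
        // (Naive quoting; fine for trusted or numeric IDs only.)
        val inList = ids.map(id => s"'${id.replace("'", "''")}'").mkString(",")
        val details = spark.read
          .format("jdbc")
          .option("url", "jdbc:postgresql://db-host:5432/users") // placeholder
          .option("dbtable",
            s"(SELECT * FROM user_personal_details WHERE user_id IN ($inList)) t")
          .option("user", "spark") // placeholder credentials
          .option("password", "...")
          .load()

        batch.join(details, "user_id")
          .write.mode("append").parquet("/tmp/enriched") // placeholder sink
      }
    }

    activity.writeStream.foreachBatch(enrich _).start().awaitTermination()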
On Thu, Nov 26, 2020, 9:01 PM fanxin <fanxin...@gmail.com> wrote:

> Hi,
> If the static table is not particularly big and is modified infrequently,
> you can load the whole table as a DataFrame and persist it in memory. You
> may also need to repartition the DataFrame.
>
> On 11/26/2020 21:44, Geervan Hayatnagarkar <pande.a...@gmail.com> wrote:
>
> Hi
>
> We intend to do a stream-static join where Kafka is the streaming source
> and an RDBMS is the static source.
>
> e.g. User activity data comes in as a stream from a Kafka source, and we
> need to pull user personal details from PostgreSQL.
>
> Because PostgreSQL is a static source, the entire "User-Personal-Details"
> table is reloaded into Spark memory for every microbatch.
>
> Is there a way to optimise this? For example, we should be able to pull
> the user IDs from every microbatch and then issue a query such as:
>
> select * from user-personal-details where user-id in
> (<list-of-user-ids-from-current-microbatch>)
>
> While we could not find a clean way to do this, we chose to make a JDBC
> connection for every microbatch and achieved the above optimisation. But
> that is still a suboptimal solution, because a JDBC connection is created
> for every microbatch. Is there a way to pool JDBC connections in
> Structured Streaming?
>
> Thanks & regards,
> Arti Pande
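On the pooling question: Structured Streaming itself does not provide a JDBC connection pool, but because foreachBatch runs on the driver, one workaround is to keep a lazily initialized pool in a singleton object and reuse it across microbatches. A sketch using HikariCP (an assumed external dependency; the URL and credentials are placeholders):

    import com.zaxxer.hikari.{HikariConfig, HikariDataSource}
    import java.sql.Connection

    // JVM-wide pool, created once on first use and shared by every
    // microbatch instead of opening a fresh JDBC connection each time.
    object ConnectionPool {
      private lazy val dataSource: HikariDataSource = {
        val config = new HikariConfig()
        config.setJdbcUrl("jdbc:postgresql://db-host:5432/users") // placeholder
        config.setUsername("spark")                               // placeholder
        config.setPassword("...")
        config.setMaximumPoolSize(4)
        new HikariDataSource(config)
      }

      def withConnection[T](f: Connection => T): T = {
        val conn = dataSource.getConnection
        try f(conn)
        finally conn.close() // close() returns the connection to the pool
      }
    }

Each microbatch would then borrow a connection via ConnectionPool.withConnection and run a prepared statement for its ID list, rather than going through spark.read.jdbc, so the same physical connections are reused across batches.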