Hi,
If the static table is not particularly big and its modification frequency is low, you can load the whole table as a DataFrame and persist it in memory. You may also want to repartition the DataFrame so the cached data is spread across executors.
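For example, a minimal sketch in Scala (the table name, column names, connection details and Kafka topic below are placeholders, not from your setup):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("stream-static-join").getOrCreate()

// Load the static table once; persist() keeps it in executor memory so it is
// not re-read from PostgreSQL on every micro-batch.
val users = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/db")
  .option("dbtable", "user_personal_details")
  .option("user", "...")
  .option("password", "...")
  .load()
  .repartition(16)   // spread the cached table evenly across executors
  .persist()

val activity = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "user-activity")
  .load()

// Stream-static join; here the Kafka message key is assumed to carry the user id.
val joined = activity
  .selectExpr("CAST(key AS STRING) AS user_id", "CAST(value AS STRING) AS event")
  .join(users.withColumn("user_id", col("user_id").cast("string")), "user_id")

joined.writeStream.format("console").start().awaitTermination()

Note that the cache is loaded once and will not pick up later changes to the table, which is why this only fits a low modification frequency.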


On 11/26/2020 21:44, Geervan Hayatnagarkar <pande.a...@gmail.com> wrote:
Hi

We intend to do a stream-static join where Kafka is the streaming source and an RDBMS is the static source.

For example, user activity data comes in as a stream from Kafka, and we need to pull user personal details from PostgreSQL.

Because PostgreSQL is a static source, the entire "User-Personal-Details" table is being reloaded into Spark memory for every micro-batch.

Is there a way to optimise this? For example, we should be able to pull the user-ids from every micro-batch and then make a query as below:
 
select * from user_personal_details where user_id in (<list-of-user-ids-from-current-microbatch>)
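Roughly like this (a sketch in Scala using foreachBatch; `activity` stands for the streaming DataFrame read from Kafka, the user_id column and numeric ids are assumptions, and the connection details and output path are placeholders):

import org.apache.spark.sql.DataFrame

// Runs once per micro-batch; pulls only the user details seen in that batch.
def joinWithUserDetails(batch: DataFrame, batchId: Long): Unit = {
  val spark = batch.sparkSession
  // Distinct ids in this micro-batch (assumed small enough to collect).
  val ids = batch.select("user_id").distinct()
    .collect()
    .map(_.get(0).toString)
    .mkString(",")
  if (ids.nonEmpty) {
    // Reading from a subquery pushes the IN filter down to PostgreSQL.
    val details = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/db")
      .option("user", "...")
      .option("password", "...")
      .option("dbtable",
        s"(select * from user_personal_details where user_id in ($ids)) AS t")
      .load()
    batch.join(details, "user_id")
      .write.mode("append").parquet("/tmp/joined-output")   // placeholder sink
  }
}

activity.writeStream
  .foreachBatch((batch: DataFrame, batchId: Long) => joinWithUserDetails(batch, batchId))
  .start()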

Since we could not find a clean way to do this, we chose to open a JDBC connection for every micro-batch and achieved the above optimisation. But that is still a suboptimal solution, because a new JDBC connection is created for every micro-batch. Is there a way to pool JDBC connections in Structured Streaming?
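The kind of pooling we mean would be a pool created once per executor JVM and reused across micro-batches, along these lines (a rough sketch assuming HikariCP is on the classpath; names and connection details are placeholders):

import java.sql.Connection
import com.zaxxer.hikari.{HikariConfig, HikariDataSource}
import org.apache.spark.sql.{DataFrame, Row}

// A Scala object is initialised lazily, once per executor JVM, so the pool
// survives across micro-batches instead of being rebuilt each time.
object PgPool {
  lazy val dataSource: HikariDataSource = {
    val cfg = new HikariConfig()
    cfg.setJdbcUrl("jdbc:postgresql://host:5432/db")   // placeholder
    cfg.setUsername("...")
    cfg.setPassword("...")
    cfg.setMaximumPoolSize(4)
    new HikariDataSource(cfg)
  }
}

// Example use inside foreachBatch: close() returns the connection to the pool.
def writeBatch(batch: DataFrame, batchId: Long): Unit =
  batch.foreachPartition { (rows: Iterator[Row]) =>
    val conn: Connection = PgPool.dataSource.getConnection
    try {
      // issue prepared statements against PostgreSQL here
    } finally conn.close()
  }

Each executor JVM would then reuse its pool across micro-batches, so connections are no longer created per batch.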

Thanks & regards,
Arti Pande 
