Hi,
If the static table is not particularly big and it is modified infrequently, you can load the whole table as a DataFrame once and persist it in memory. You may also want to repartition the DataFrame on the join key.
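
A minimal sketch of that approach in Scala, assuming a PostgreSQL table named user_personal_details, a Kafka topic user-activity, and placeholder connection settings:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions._
  import org.apache.spark.sql.types._

  val spark = SparkSession.builder().appName("stream-static-join").getOrCreate()
  import spark.implicits._

  // Load the static table once and keep it cached across micro-batches.
  // URL, credentials, and table/column names are placeholders.
  val userDetails = spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/appdb")
    .option("dbtable", "user_personal_details")
    .option("user", "spark")
    .option("password", "secret")
    .load()
    .persist()

  // Hypothetical activity schema carried as JSON in the Kafka value.
  val activitySchema = new StructType()
    .add("user_id", LongType)
    .add("action", StringType)

  val activity = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "user-activity")
    .load()
    .select(from_json($"value".cast("string"), activitySchema).as("a"))
    .select("a.*")

  // Stream-static join: each micro-batch is joined against the cached
  // DataFrame instead of re-reading the table from PostgreSQL.
  val enriched = activity.join(userDetails, Seq("user_id"))
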
On 11/26/2020 21:44, Geervan Hayatnagarkar <pande.a...@gmail.com> wrote:
Hi,

We intend to do a stream-static join where Kafka is the streaming source and an RDBMS is the static source. For example, user activity data comes in as a stream from Kafka, and we need to pull user personal details from PostgreSQL.

Because PostgreSQL is a static source, the entire "User-Personal-Details" table is being reloaded into Spark memory for every micro-batch. Is there a way to optimise this? For example, we should be able to pull the user-ids from every micro-batch and then issue a query such as:

select * from user-personal-details where user-id in (<list-of-user-ids-from-current-microbatch>)

While we could not find a clean way to do this, we chose to open a JDBC connection for every micro-batch and achieved the above optimisation. But that is still a suboptimal solution, because a JDBC connection is created for every micro-batch. Is there a way to pool JDBC connections in Structured Streaming?

Thanks & regards,
Arti Pande
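
For reference, one way to express the per-micro-batch filtered lookup described above is with foreachBatch, pushing the id list down to PostgreSQL as a subquery. This reuses the spark session and activity stream from the earlier sketch; the table and column names are again assumptions:

  import org.apache.spark.sql.DataFrame

  val query = activity.writeStream
    .option("checkpointLocation", "/tmp/enriched-checkpoint")
    .foreachBatch { (batch: DataFrame, batchId: Long) =>
      // Collect the distinct user ids seen in this micro-batch
      // (assumed small enough to fit in the driver).
      val ids = batch.select("user_id").distinct()
        .collect().map(_.getLong(0)).mkString(",")

      if (ids.nonEmpty) {
        // Push the IN filter down to PostgreSQL as a subquery.
        val details = spark.read
          .format("jdbc")
          .option("url", "jdbc:postgresql://db-host:5432/appdb")
          .option("dbtable",
            s"(SELECT * FROM user_personal_details WHERE user_id IN ($ids)) AS t")
          .option("user", "spark")
          .option("password", "secret")
          .load()

        batch.join(details, Seq("user_id"))
          .write.mode("append").parquet("/tmp/enriched")
      }
    }
    .start()

As for pooling: the built-in JDBC source opens its own connections per task and, to my knowledge, does not pool them across micro-batches. A common workaround is to do the lookup yourself inside mapPartitions with a per-executor singleton connection pool (e.g. HikariCP), though that trades away the conveniences of the DataFrame reader.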