Hi,
If the static table is not particularly big and its modification frequency is low, you can load the whole table as a DataFrame and persist it in memory. You may also want to repartition the DataFrame so the cached data is spread across executors.
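For example, a minimal sketch in Scala (the table name, column names, connection details and Kafka topic below are placeholders, not from your setup):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("stream-static-join").getOrCreate()

// Load the static table once; persist() keeps it in executor memory so it is
// not re-read from PostgreSQL on every micro-batch.
val users = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/db")
  .option("dbtable", "user_personal_details")
  .option("user", "...")
  .option("password", "...")
  .load()
  .repartition(16)   // spread the cached table evenly across executors
  .persist()

val activity = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "user-activity")
  .load()

// Stream-static join; here the Kafka message key is assumed to carry the user id.
val joined = activity
  .selectExpr("CAST(key AS STRING) AS user_id", "CAST(value AS STRING) AS event")
  .join(users.withColumn("user_id", col("user_id").cast("string")), "user_id")

joined.writeStream.format("console").start().awaitTermination()

Note that the cache is loaded once and will not pick up later changes to the table, which is why this only fits a low modification frequency.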


On 11/26/2020 21:44, Geervan Hayatnagarkar <pande.a...@gmail.com> wrote:
Hi

We intend to do a stream-static join where Kafka is the streaming source and an RDBMS is the static source.

For example, user activity data comes in as a stream from Kafka, and we need to pull user personal details from PostgreSQL.

Because PostgreSQL is a static source, the entire "User-Personal-Details" table is being reloaded into Spark memory for every micro-batch.

Is there a way to optimise this? For example, we should be able to pull the user-ids from every micro-batch and then make a query as below:
 
select * from user_personal_details where user_id in (<list-of-user-ids-from-current-microbatch>)
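Roughly like this (a sketch in Scala using foreachBatch; `activity` stands for the streaming DataFrame read from Kafka, the user_id column and numeric ids are assumptions, and the connection details and output path are placeholders):

import org.apache.spark.sql.DataFrame

// Runs once per micro-batch; pulls only the user details seen in that batch.
def joinWithUserDetails(batch: DataFrame, batchId: Long): Unit = {
  val spark = batch.sparkSession
  // Distinct ids in this micro-batch (assumed small enough to collect).
  val ids = batch.select("user_id").distinct()
    .collect()
    .map(_.get(0).toString)
    .mkString(",")
  if (ids.nonEmpty) {
    // Reading from a subquery pushes the IN filter down to PostgreSQL.
    val details = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/db")
      .option("user", "...")
      .option("password", "...")
      .option("dbtable",
        s"(select * from user_personal_details where user_id in ($ids)) AS t")
      .load()
    batch.join(details, "user_id")
      .write.mode("append").parquet("/tmp/joined-output")   // placeholder sink
  }
}

activity.writeStream
  .foreachBatch((batch: DataFrame, batchId: Long) => joinWithUserDetails(batch, batchId))
  .start()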

Since we could not find a clean way to do this, we chose to open a JDBC connection for every micro-batch and achieved the above optimisation. But that is still a suboptimal solution, because a new JDBC connection is created for every micro-batch. Is there a way to pool JDBC connections in Structured Streaming?
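The kind of pooling we mean would be a pool created once per executor JVM and reused across micro-batches, along these lines (a rough sketch assuming HikariCP is on the classpath; names and connection details are placeholders):

import java.sql.Connection
import com.zaxxer.hikari.{HikariConfig, HikariDataSource}
import org.apache.spark.sql.{DataFrame, Row}

// A Scala object is initialised lazily, once per executor JVM, so the pool
// survives across micro-batches instead of being rebuilt each time.
object PgPool {
  lazy val dataSource: HikariDataSource = {
    val cfg = new HikariConfig()
    cfg.setJdbcUrl("jdbc:postgresql://host:5432/db")   // placeholder
    cfg.setUsername("...")
    cfg.setPassword("...")
    cfg.setMaximumPoolSize(4)
    new HikariDataSource(cfg)
  }
}

// Example use inside foreachBatch: close() returns the connection to the pool.
def writeBatch(batch: DataFrame, batchId: Long): Unit =
  batch.foreachPartition { (rows: Iterator[Row]) =>
    val conn: Connection = PgPool.dataSource.getConnection
    try {
      // issue prepared statements against PostgreSQL here
    } finally conn.close()
  }

Each executor JVM would then reuse its pool across micro-batches, so connections are no longer created per batch.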

Thanks & regards,
Arti Pande 
