Hi

We intend to do a stream-static join where Kafka is the streaming source and
an RDBMS is the static source.

e.g. user activity data is coming in as a stream from a Kafka source, and we
need to pull the user's personal details from PostgreSQL.

Because PostgreSQL is a static source, the entire "User-Personal-Details"
table is reloaded into Spark memory for every micro-batch.

Is there a way to optimise this? For example, could we pull the user ids
from every micro-batch and then issue a query like the one below?

select * from user_personal_details where user_id in
(<list-of-user-ids-from-current-microbatch>)
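For illustration, building such a query from the ids collected in a
micro-batch could look like the plain-Python sketch below (table and column
names are hypothetical stand-ins, and the ids are assumed to be integers):

```python
def build_user_query(user_ids):
    # Inline the ids directly since they are assumed to be integers;
    # for string ids, use driver-side parameter binding instead to
    # avoid SQL injection.
    id_list = ",".join(str(int(uid)) for uid in user_ids)
    return (
        "select * from user_personal_details "
        f"where user_id in ({id_list})"
    )
```

Inside foreachBatch, the id list would come from collecting the distinct
user-id column of the current micro-batch DataFrame.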

Since we could not find a clean way to do this, we settled for opening a
JDBC connection in every micro-batch, which achieves the optimisation
above. But that is still a suboptimal solution, because a new JDBC
connection is created for every micro-batch. Is there a way to pool JDBC
connections in Structured Streaming?
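What we have in mind is something like a per-process pool that survives
across micro-batches, so that foreachBatch borrows a live connection
instead of opening one each time. A minimal sketch of that idea, with
make_connection as a stand-in for the real JDBC/psycopg2 connect call:

```python
# Module-level state lives for the lifetime of the executor process,
# so it is shared across micro-batches handled by that process.
_pool = []          # idle connections kept alive between batches
_POOL_SIZE = 4      # illustrative cap on idle connections

def make_connection():
    # Placeholder for a real connect call, e.g. psycopg2.connect(...)
    return object()

def borrow_connection():
    # Reuse an idle connection if one exists; open a new one otherwise.
    if _pool:
        return _pool.pop()
    return make_connection()

def return_connection(conn):
    # Keep the connection for the next micro-batch instead of closing it,
    # up to the pool cap.
    if len(_pool) < _POOL_SIZE:
        _pool.append(conn)
```

The same pattern in Scala would typically use a singleton object holding a
real pool such as HikariCP, initialised lazily on each executor.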

Thanks & regards,
Arti Pande
