Hi!

I guess you could use *foreachBatch* and do something like

foreachBatch { (batchDF, batchId) =>
  ids = getIds
  spark.read.jdbc(query where id in $ids)
  join
  write
}

The only thing is that in order to get the ids you would have to do a collect,
no? Or how are you retrieving them right now?
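
Something roughly like this, maybe (just a sketch, not tested; the topic name,
connection details and the user_personal_details / user_id names are
placeholders for whatever you actually have):

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("activity-enrichment").getOrCreate()

// user activity events from Kafka (assuming the key carries the user id)
val activity: DataFrame = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "user-activity")
  .load()
  .selectExpr("CAST(key AS STRING) AS user_id", "CAST(value AS STRING) AS payload")

activity.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // collect() pulls only the distinct ids of this microbatch to the driver
    val ids = batch.select("user_id").distinct().collect().map(_.getString(0))

    if (ids.nonEmpty) {
      // push the IN filter down to Postgres instead of loading the whole table
      // (real code should sanitize the ids or stage them in a temp table)
      val inList = ids.map(id => s"'$id'").mkString(", ")
      val users = spark.read
        .format("jdbc")
        .option("url", "jdbc:postgresql://db-host:5432/appdb")
        .option("dbtable",
          s"(SELECT * FROM user_personal_details WHERE user_id IN ($inList)) AS u")
        .option("user", "app_user")
        .option("password", "secret")
        .load()

      batch.join(users, Seq("user_id"))
        .write
        .mode("append")
        .parquet("/tmp/enriched-activity")   // or whatever sink you need
    }
  }
  .start()

The subquery passed as dbtable is what pushes the IN filter down to Postgres,
so only the rows for that microbatch's users are read back.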


On Thu, 26 Nov 2020 at 14:38, Geervan Hayatnagarkar <pande.a...@gmail.com>
wrote:

> Hi
>
> We intend to do a stream-static join where kafka is a streaming source and
> RDBMS is a static source.
>
> e.g. User activity data is coming in as a stream from Kafka source and we
> need to pull User personal details from PostgreSQL.
>
> Because PostgreSQL is a static source, the entire "User-Personal-Details"
> table is being reloaded into Spark memory for every microbatch.
>
> Is there a way to optimise this? For example, we should be able to pull
> user-ids from every microbatch and then make a query as below?
>
> select * from user-personal-details where user-id in
> (<list-of-user-ids-from-current-microbatch>)
>
> While we could not find a clean way to do this, we chose to make a JDBC
> connection for every microbatch and achieved the above optimisation. But
> that is still a suboptimal solution because a JDBC connection is being
> created for every microbatch. Is there a way to pool JDBC connections in
> Structured Streaming?
>
> Thanks & regards,
> Arti Pande
>
