Yes, we used collect() to get all IDs in foreachBatch

No, the static table is huge and is updated frequently by other systems.

On Thu, Nov 26, 2020, 9:01 PM fanxin <fanxin...@gmail.com> wrote:

> Hi,
> If the static table is not particularly big and it is modified
> infrequently, you can load the whole table as a DataFrame and persist it
> in memory. You may also need to repartition the DataFrame.
>
>
> On 11/26/2020 21:44, Geervan Hayatnagarkar <pande.a...@gmail.com> wrote:
>
> Hi
>
> We intend to do a stream-static join where Kafka is the streaming source
> and an RDBMS is the static source.
>
> e.g. User activity data is coming in as a stream from Kafka source and we
> need to pull User personal details from PostgreSQL.
>
> Because PostgreSQL is a static source, the entire "User-Personal-Details"
> table is reloaded into Spark memory for every micro-batch.
>
> Is there a way to optimise this? For example, we should be able to pull
> user IDs from every micro-batch and then make a query such as the one below:
>
> select * from user_personal_details where user_id in
> (<list-of-user-ids-from-current-microbatch>)
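A sketch of how that per-batch pushdown could look, assuming integer user IDs and a PostgreSQL table named user_personal_details (both names are placeholders, not from the original thread): build the IN-list as a JDBC subquery string and hand it to spark.read.jdbc from inside foreachBatch, so only the matching rows leave the database.

```python
# Sketch only: builds the per-batch pushdown query described above.
# Table/column names are placeholders. Inside foreachBatch you would do
# something like:
#   ids = [r["user_id"] for r in batch_df.select("user_id").distinct().collect()]
#   details = spark.read.jdbc(url, table=in_clause_subquery(ids), properties=props)

def in_clause_subquery(user_ids):
    """Return a JDBC 'table' subquery fetching only this batch's users.

    Casting to int guards against SQL injection for integer IDs; string
    keys would need proper quoting or a parameterized query instead.
    """
    if not user_ids:
        # Empty batch: fetch nothing rather than the whole table.
        return "(SELECT * FROM user_personal_details WHERE 1=0) AS batch_users"
    id_list = ", ".join(str(int(i)) for i in user_ids)
    return (
        "(SELECT * FROM user_personal_details "
        f"WHERE user_id IN ({id_list})) AS batch_users"
    )
```

Spark's JDBC reader accepts a parenthesized subquery with an alias wherever a table name is expected, so the IN filter runs on the PostgreSQL side.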
>
> While we could not find a clean way to do this, we chose to make a JDBC
> connection for every micro-batch and achieved the above optimisation. But
> that is still a suboptimal solution because a JDBC connection is created
> for every micro-batch. Is there a way to pool JDBC connections in
> Structured Streaming?
>
> Thanks & regards,
> Arti Pande
>
>
