So that's it, no? You can push the IN filter down into the query with the
IDs and retrieve only those rows.
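
For example (an untested sketch, assuming a SparkSession in scope as spark,
a streaming DataFrame streamingDf, and that user_id is a string column;
connection details and names are placeholders):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

streamingDf.writeStream.foreachBatch { (batch: DataFrame, batchId: Long) =>
  // Distinct user ids present in this micro-batch.
  val ids = batch.select("user_id").distinct().collect().map(_.getString(0))

  val users = spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/mydb")
    .option("dbtable", "user_personal_details")
    .option("user", "spark").option("password", "secret")
    .load()
    // Spark's JDBC source can push an In filter down to PostgreSQL,
    // so only the matching rows are fetched.
    .filter(col("user_id").isin(ids: _*))

  batch.join(users, "user_id") // enrich, then write out as needed
  ()
}.start()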

On Thu, 26 Nov 2020 at 16:49, Geervan Hayatnagarkar <pande.a...@gmail.com>
wrote:

> Yes, we used collect() to get all the IDs in foreachBatch.
>
> No, the static table is huge and is updated frequently by other systems.
>
> On Thu, Nov 26, 2020, 9:01 PM fanxin <fanxin...@gmail.com> wrote:
>
>> Hi,
>> If the static table is not particularly big and its modification
>> frequency is low, you can load the whole table as a DataFrame and persist
>> it in memory. You may also need to repartition the DataFrame.
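>>
>> Something along these lines, roughly (untested; connection details are
>> placeholders, and kafkaStream stands in for your streaming DataFrame):
>>
>> val users = spark.read.format("jdbc")
>>   .option("url", "jdbc:postgresql://db-host:5432/mydb")
>>   .option("dbtable", "user_personal_details")
>>   .option("user", "spark").option("password", "secret")
>>   .load()
>>   .repartition(16) // spread the rows across the cluster
>>   .persist()       // keep the table in memory across micro-batches
>>
>> val enriched = kafkaStream.join(users, "user_id")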
>>
>>
>> On 11/26/2020 21:44, Geervan Hayatnagarkar <pande.a...@gmail.com> wrote:
>>
>> Hi
>>
>> We intend to do a stream-static join where Kafka is the streaming source
>> and an RDBMS is the static source.
>>
>> e.g. user activity data comes in as a stream from the Kafka source, and
>> we need to pull user personal details from PostgreSQL.
>>
>> Because PostgreSQL is a static source, the entire "User-Personal-Details"
>> table is reloaded into Spark memory for every micro-batch.
>>
>> Is there a way to optimise this? For example, we should be able to pull
>> the user IDs from every micro-batch and then issue a query like this:
>>
>> select * from user_personal_details where user_id in
>> (<list-of-user-ids-from-current-microbatch>)
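>>
>> In foreachBatch that might look roughly like this (untested sketch;
>> connection details are placeholders, user_id is assumed to be a string,
>> and an empty batch would need a guard before building the IN list):
>>
>> import org.apache.spark.sql.DataFrame
>>
>> streamingDf.writeStream.foreachBatch { (batch: DataFrame, _: Long) =>
>>   val idList = batch.select("user_id").distinct()
>>     .collect().map(r => s"'${r.getString(0)}'").mkString(", ")
>>
>>   val users = spark.read.format("jdbc")
>>     .option("url", "jdbc:postgresql://db-host:5432/mydb")
>>     .option("query",
>>       s"select * from user_personal_details where user_id in ($idList)")
>>     .option("user", "spark").option("password", "secret")
>>     .load()
>>
>>   batch.join(users, "user_id") // enrich, then write out as needed
>>   ()
>> }.start()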
>>
>> While we could not find a clean way to do this, we settled on opening a
>> JDBC connection in every micro-batch, which achieved the optimisation
>> above. But that is still a suboptimal solution, because a new JDBC
>> connection is created for every micro-batch. Is there a way to pool JDBC
>> connections in Structured Streaming?
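>>
>> One pattern that might help here is a singleton pool per executor JVM,
>> e.g. with HikariCP (a rough, untested sketch; config values are
>> placeholders):
>>
>> import com.zaxxer.hikari.{HikariConfig, HikariDataSource}
>> import java.sql.Connection
>>
>> // A Scala object is initialised once per JVM, so every task on the
>> // same executor reuses the same pool across micro-batches.
>> object PgPool {
>>   private lazy val ds: HikariDataSource = {
>>     val cfg = new HikariConfig()
>>     cfg.setJdbcUrl("jdbc:postgresql://db-host:5432/mydb")
>>     cfg.setUsername("spark")
>>     cfg.setPassword("secret")
>>     cfg.setMaximumPoolSize(4)
>>     new HikariDataSource(cfg)
>>   }
>>   def connection: Connection = ds.getConnection
>> }
>>
>> // Inside foreachBatch / foreachPartition:
>> //   val conn = PgPool.connection
>> //   try { /* query user_personal_details */ }
>> //   finally { conn.close() } // close() returns the connection to the pool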
>>
>> Thanks & regards,
>> Arti Pande
>>
>>
