Hi, you can use Debezium to capture row-level changes in PostgreSQL in real time, stream them into Kafka, and then ETL and write the data to HBase with Flink or Spark Streaming. You can then join against the data in HBase directly. For a particularly big table, scan performance in HBase is much better than in PostgreSQL.
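A minimal sketch of the Spark side of such a pipeline, assuming a Debezium topic named dbserver1.public.user_personal_details, the default Debezium JSON envelope (schema/payload wrapper), and hypothetical columns and broker address; the HBase write itself would need a separate client or connector:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{LongType, StringType, StructType}

object CdcConsumer {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("debezium-cdc").getOrCreate()

    // Row image of the captured table (hypothetical columns).
    val row = new StructType()
      .add("user_id", LongType)
      .add("name", StringType)
      .add("email", StringType)
    // Debezium's JSON envelope (assuming the default converter with the
    // schema/payload wrapper): before/after images plus the operation code.
    val envelope = new StructType()
      .add("payload", new StructType()
        .add("before", row)
        .add("after", row)
        .add("op", StringType)
        .add("ts_ms", LongType))

    val changes = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")              // assumed address
      .option("subscribe", "dbserver1.public.user_personal_details") // assumed topic
      .load()
      .select(from_json(col("value").cast("string"), envelope).as("msg"))
      .select("msg.payload.after.*", "msg.payload.op")

    // Placeholder sink; a real pipeline would upsert into HBase here,
    // e.g. via foreachBatch and an HBase client or connector.
    changes.writeStream.format("console").start().awaitTermination()
  }
}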
From: German Schiavon <gschiavonsp...@gmail.com>
Date: Friday, November 27, 2020 at 12:09 AM
To: Geervan Hayatnagarkar <pande.a...@gmail.com>
Cc: fanxin <fanxin...@gmail.com>, User <user@spark.apache.org>
Subject: Re: Stream-static join : Refreshing subset of static data / Connection pooling

So that's it, no? You can push down the IN filter in the query with the IDs and only retrieve those rows.

On Thu, 26 Nov 2020 at 16:49, Geervan Hayatnagarkar <pande.a...@gmail.com> wrote:

Yes, we used collect to get all IDs in foreachBatch.

No, the static table is huge and is updated frequently by other systems.

On Thu, Nov 26, 2020, 9:01 PM fanxin <fanxin...@gmail.com> wrote:

Hi,

If the static table is not particularly big and the modification frequency is low, you can load the whole table as a DataFrame and persist it in memory. You may also need to repartition the DataFrame.

On 11/26/2020 21:44, Geervan Hayatnagarkar <pande.a...@gmail.com> wrote:

Hi,

We intend to do a stream-static join where Kafka is the streaming source and an RDBMS is the static source. E.g. user activity data is coming in as a stream from a Kafka source and we need to pull user personal details from PostgreSQL.

Because PostgreSQL is a static source, the entire "User-Personal-Details" table is being reloaded into Spark memory for every micro-batch. Is there a way to optimise this? For example, we should be able to pull user IDs from every micro-batch and then make a query as below:

select * from user-personal-details where user-id in (<list-of-user-ids-from-current-microbatch>)

While we could not find a clean way to do this, we chose to make a JDBC connection for every micro-batch and achieved the above optimisation. But that is still a suboptimal solution because a JDBC connection is being created for every micro-batch. Is there a way to pool JDBC connections in Structured Streaming?

Thanks & regards,
Arti Pande
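A sketch of the pushdown the thread converges on: collect the micro-batch's distinct user IDs on the driver, then let Spark's JDBC source run a query that retrieves only those rows. activityStream, the table/column names, and the connection details are illustrative; the "query" option exists in Spark 2.4+:

import org.apache.spark.sql.DataFrame

// `activityStream` is the streaming DataFrame read from Kafka (assumed to
// carry a numeric user_id column).
activityStream.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // Collect this batch's distinct IDs on the driver; assumes the set is
    // small. Numeric IDs only -- string IDs would need quoting/escaping.
    val ids = batch.select("user_id").distinct()
      .collect().map(_.get(0)).mkString(", ")

    if (ids.nonEmpty) {
      val users = batch.sparkSession.read
        .format("jdbc")
        .option("url", "jdbc:postgresql://db-host:5432/app") // assumed URL
        .option("user", "spark")
        .option("password", sys.env.getOrElse("PG_PASSWORD", ""))
        // Only the rows for this micro-batch's users are fetched.
        .option("query",
          s"select * from user_personal_details where user_id in ($ids)")
        .load()

      val enriched = batch.join(users, "user_id")
      enriched.write.mode("append").parquet("/tmp/enriched") // placeholder sink
    }
  }
  .start()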
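On the pooling question: Structured Streaming has no built-in JDBC connection pool. A common workaround, sketched here with HikariCP (an assumed dependency), is a JVM-level lazy singleton so the pool outlives individual micro-batches and connections are reused rather than reopened:

import java.sql.Connection
import com.zaxxer.hikari.{HikariConfig, HikariDataSource}

// One pool per JVM (driver or executor); a `lazy val` in an `object` gives
// thread-safe, once-only initialisation, so micro-batches share connections.
object PgPool {
  lazy val dataSource: HikariDataSource = {
    val cfg = new HikariConfig()
    cfg.setJdbcUrl("jdbc:postgresql://db-host:5432/app") // assumed URL
    cfg.setUsername("spark")
    cfg.setPassword(sys.env.getOrElse("PG_PASSWORD", ""))
    cfg.setMaximumPoolSize(8)
    new HikariDataSource(cfg)
  }

  def withConnection[A](f: Connection => A): A = {
    val conn = dataSource.getConnection()
    try f(conn)
    finally conn.close() // close() returns the connection to the pool
  }
}

Inside foreachBatch or foreachPartition, PgPool.withConnection { conn => ... } then reuses pooled connections across micro-batches. Note that spark.read.format("jdbc") manages its own connections internally, so a pool like this pays off when you issue lookups yourself (e.g. per-partition point queries) rather than through the DataFrame JDBC source.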