Re: Query about `JdbcIO.PoolableDataSourceProvider`

Yi Hu Wed, 08 May 2024 08:56:04 -0700

Hi Vardhan,

I checked the source code and history of PoolableDataSourceProvider. Here are 
my findings:


- PoolableDataSourceProvider is already a static singleton [1], which means 
there is one DataSource (and its connection pool) for each 
DataSourceConfiguration, per worker. More specifically, multiple threads within 
a worker should share it if they connect to the same database. 
PoolableDataSourceProvider should also support connecting to different 
databases, because the underlying singleton Map is keyed by 
DataSourceConfiguration.

- However, I noticed there is another open issue [2] claiming "the current 
implementation default parameters cannot cover all cases". I am wondering 
whether this is what leads to the "overwhelm the source db" behavior you 
observe?

[1] https://github.com/apache/beam/pull/8635

[2] https://github.com/apache/beam/issues/19393

In any case, you can pass your own function to withDataSourceProviderFn (as 
mentioned in [2]) that implements a custom connection pool.
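
To make the per-configuration singleton idea concrete, here is a minimal 
JDK-only sketch of the pattern: a static map keyed by configuration, so every 
thread in the same JVM gets the same pool for the same database and a 
different pool for a different database. This is not Beam's actual code. 
`PerConfigPools`, `DbConfig`, `Pool` and `POOLS_CREATED` are hypothetical names 
used for illustration only; `DbConfig` stands in for DataSourceConfiguration 
and `Pool` stands in for a pooled javax.sql.DataSource.

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

final class PerConfigPools {

    /** Hypothetical stand-in for DataSourceConfiguration; compared by value. */
    static final class DbConfig {
        final String jdbcUrl;
        final String user;

        DbConfig(String jdbcUrl, String user) {
            this.jdbcUrl = jdbcUrl;
            this.user = user;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) {
                return true;
            }
            if (!(o instanceof DbConfig)) {
                return false;
            }
            DbConfig other = (DbConfig) o;
            return jdbcUrl.equals(other.jdbcUrl) && user.equals(other.user);
        }

        @Override
        public int hashCode() {
            return Objects.hash(jdbcUrl, user);
        }
    }

    /** Hypothetical stand-in for a pooled javax.sql.DataSource. */
    static final class Pool {
        final DbConfig config;

        Pool(DbConfig config) {
            this.config = config;
        }
    }

    /** Counts how many pools were actually built (for illustration only). */
    static final AtomicInteger POOLS_CREATED = new AtomicInteger();

    /** One entry per distinct configuration, shared by all threads in this JVM. */
    private static final Map<DbConfig, Pool> INSTANCES = new ConcurrentHashMap<>();

    /** Returns the pool for this configuration, building it at most once. */
    static Pool poolFor(DbConfig config) {
        return INSTANCES.computeIfAbsent(config, c -> {
            POOLS_CREATED.incrementAndGet();
            return new Pool(c);
        });
    }

    private PerConfigPools() {}
}
```

In a real pipeline, the body of the lambda is where a custom pool (with its own 
size limits) would be built, and a function wrapping `poolFor` would be what 
you hand to withDataSourceProviderFn. Because the map is static, the sharing is 
per worker JVM, not across the whole pipeline.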

Best,
Yi

On 2024/05/04 12:18:47 Vardhan Thigle via user wrote:
> Hi Beam Experts,
> 
> I had a small query about `JdbcIO.PoolableDataSourceProvider`
> 
> As per the main documentation of JdbcIO
> <https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/jdbc/JdbcIO.ReadWithPartitions.html#withDataSourceConfiguration-org.apache.beam.sdk.io.jdbc.JdbcIO.DataSourceConfiguration->,
> (IIUC) `JdbcIO.PoolableDataSourceProvider` creates one DataSource per
> execution thread by default, which can overwhelm the source db.
> 
> Whereas, as per the Javadoc of JdbcIO.PoolableDataSourceProvider
> <https://beam.apache.org/releases/javadoc/2.29.0/org/apache/beam/sdk/io/jdbc/JdbcIO.PoolableDataSourceProvider.html>:
> 
> 
> At most a single DataSource instance will be constructed during pipeline
> execution for each unique JdbcIO.DataSourceConfiguration
> <https://beam.apache.org/releases/javadoc/2.29.0/org/apache/beam/sdk/io/jdbc/JdbcIO.DataSourceConfiguration.html>
> within
> the pipeline.
> 
> If I want a singleton poolable connection for a given source database and
> my pipeline is dealing with multiple source databases, do I need to wrap
> the `JdbcIO.PoolableDataSourceProvider` in another concurrent hash map?
> (From the implementation it looks like that's what it does already, so it's
> not needed.) I am a bit confused by the difference between the 2 docs above
> (it's quite possible that I am interpreting them wrong).
> 
> Would it be more recommended to roll out a custom class, as suggested in the
> main documentation of JdbcIO
> <https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/jdbc/JdbcIO.ReadWithPartitions.html#withDataSourceConfiguration-org.apache.beam.sdk.io.jdbc.JdbcIO.DataSourceConfiguration->,
> in cases like:
> 1. configuring the pool config
> 2. using an alternative source such as Hikari, which, if I understand
>    correctly, is not possible with JdbcIO.PoolableDataSourceProvider
>    <https://beam.apache.org/releases/javadoc/2.29.0/org/apache/beam/sdk/io/jdbc/JdbcIO.PoolableDataSourceProvider.html>?
> 
> 
> 
> 
> Regards and Thanks,
> Vardhan Thigle,
> +919535346204
>
