Re: JdbcIO parallel read on spark

2021-05-27 Thread Thomas Fredriksen(External)
Quick update. After some testing, we have noticed that the splittable JdbcIO PoC works well when the number of splits does not exceed the number of Spark tasks. When the number of splits does exceed the task count, the pipeline freezes after each worker has processed a single split.

Re: JdbcIO parallel read on spark

2021-05-25 Thread Thomas Fredriksen(External)
There is no forking after the "Generate Queries" transform. We noticed that the "Generate Queries" transform is in a different stage than the reading itself. This is likely due to the Reparallelize transform, and we also see this with JdbcIO.readAll. After reading up on Splittable DoFns, we deci

Re: JdbcIO parallel read on spark

2021-05-25 Thread Alexey Romanenko
Hi, did you check the Spark DAG to see whether it forks into branches after the "Generate Queries" transform? -- Alexey > On 24 May 2021, at 20:32, Thomas Fredriksen(External) > wrote: > > Hi there, > > We are struggling to get the JdbcIO-connector to read a large table on spark. > > In short - we wish to

JdbcIO parallel read on spark

2021-05-24 Thread Thomas Fredriksen(External)
Hi there, We are struggling to get the JdbcIO connector to read a large table on Spark. In short, we wish to read a large table (several billion rows), transform it, then write the transformed data to a new table. We are aware that `JdbcIO.read()` does not parallelize. In order to solve this, we at
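
The thread does not show how the "Generate Queries" step is implemented, so the following is only a minimal sketch of one common way to produce per-range queries that could then be fed to a parallel read such as JdbcIO.readAll. The class name `RangeQueries`, the table `big_table`, and the numeric key column `id` are all hypothetical, and the sketch assumes a roughly evenly distributed numeric key whose bounds are far from the limits of `long`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Beam: splits a numeric key range into
// contiguous, non-overlapping sub-ranges and renders one SQL query per sub-range.
public class RangeQueries {

    // Splits the inclusive range [minId, maxId] into at most `splits` queries.
    // Assumes maxId - minId + 1 and maxId + step do not overflow a long.
    static List<String> generate(String table, String column,
                                 long minId, long maxId, int splits) {
        if (splits < 1 || maxId < minId) {
            throw new IllegalArgumentException("need splits >= 1 and maxId >= minId");
        }
        long total = maxId - minId + 1;
        long step = (total + splits - 1) / splits; // ceiling division
        List<String> queries = new ArrayList<>();
        for (long lo = minId; lo <= maxId; lo += step) {
            long hi = Math.min(lo + step - 1, maxId);
            queries.add(String.format(Locale.ROOT,
                "SELECT * FROM %s WHERE %s BETWEEN %d AND %d",
                table, column, lo, hi));
        }
        return queries;
    }

    public static void main(String[] args) {
        // Print the four range queries for ids 1..1000.
        for (String q : generate("big_table", "id", 1, 1000, 4)) {
            System.out.println(q);
        }
    }
}
```

Each generated string covers a disjoint slice of the key space, so the slices can be read independently. Whether they are actually read in parallel depends on how the runner distributes the elements, which is the issue discussed in the replies above.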