Quick update.
After some testing, we have noticed that the splittable JdbcIO-poc works
well when the number of splits does not exceed the number of Spark tasks.
When the number of splits does exceed the task count, the pipeline
freezes after each worker has processed a single split.
There is no forking after the "Generate Queries" transform.
We noticed that the "Generate Queries" transform is in a different stage
than the reading itself. This is likely due to the Reparallelize-transform,
and we also see this with JdbcIO.readAll.
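
For readers following the thread, here is a minimal sketch of what a "Generate Queries" step might emit, assuming the table is split on a numeric key column with known bounds. The class `RangeQueries`, the method `generate`, and the table and column names are hypothetical; this is not the POC's actual code, and it uses no Beam API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; not part of Beam or of the POC discussed in this thread. */
class RangeQueries {

    /**
     * Splits the half-open key range [lower, upper) into at most `splits`
     * contiguous, non-overlapping sub-ranges and returns one query per sub-range.
     * Assumes upper - lower does not overflow a long.
     */
    static List<String> generate(String table, String column,
                                 long lower, long upper, int splits) {
        if (splits <= 0 || upper < lower) {
            throw new IllegalArgumentException("need splits > 0 and lower <= upper");
        }
        long span = upper - lower;
        long base = span / splits;      // minimum width of each sub-range
        long remainder = span % splits; // the first `remainder` sub-ranges get one extra key

        List<String> queries = new ArrayList<>();
        long start = lower;
        for (int i = 0; i < splits; i++) {
            long width = base + (i < remainder ? 1 : 0);
            if (width == 0) {
                break; // fewer keys than splits: the remaining sub-ranges would be empty
            }
            long end = start + width;
            queries.add("SELECT * FROM " + table
                + " WHERE " + column + " >= " + start
                + " AND " + column + " < " + end);
            start = end;
        }
        return queries;
    }
}
```

With these assumptions, `RangeQueries.generate("events", "id", 0, 100, 4)` returns four queries covering [0, 25), [25, 50), [50, 75) and [75, 100). The number of splits is a plain parameter here, so it can be set at or below the Spark task count while the freeze described above is being investigated.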
After reading up on Splittable DoFns, we deci
Hi,
Did you check the Spark DAG to see whether it forks into branches after the
"Generate Queries" transform?
—
Alexey
> On 24 May 2021, at 20:32, Thomas Fredriksen (External) wrote:
>
> Hi there,
>
> We are struggling to get the JdbcIO-connector to read a large table on Spark.
>
> In short - we wish to
Hi there,
We are struggling to get the JdbcIO-connector to read a large table on
Spark.
In short - we wish to read a large table (several billion rows), transform
it, and then write the transformed data to a new table.
We are aware that `JdbcIO.read()` does not parallelize. In order to solve
this, we at