Hi Beam Experts,

I have a small query about `JdbcIO#readWithPartitions`.

Context

JdbcIO#readWithPartitions seems to always default to 200 partitions
(DEFAULT_NUM_PARTITIONS). This is set by default when the object is
constructed here
<https://github.com/apache/beam/blob/master/sdks/java/io/jdbc/src/main/java/org/apache/beam/sdk/io/jdbc/JdbcIO.java#L362>.
There seems to be no way to override this with a null value. Hence the code
<https://github.com/apache/beam/blob/b50ad0fe8fc168eaded62efb08f19cf2aea341e2/sdks/java/io/jdbc/src/main/java/org/apache/beam/sdk/io/jdbc/JdbcIO.java#L1398>
that checks for a null value and tries to auto-infer the number of
partitions never runs.

I am trying to use this transform to read a tall table of unknown size, and
the pipeline always defaults to 200 partitions if the value is not set. The
default of 200 falls short in my case: workers go out of memory in the
reshuffle stage, while running with a higher number of partitions, such as
4K, helps in my test setup. Since the table size is not known at the time of
implementing the pipeline, the auto-inference might help set maxPartitions
to a reasonable value as per the heuristic decided by the Beam code.
Request for help

Could you please clarify a few doubts around this?

   1. Is this behavior intentional?
   2. Could you please explain the rationale behind the heuristic at L1398
   <https://github.com/apache/beam/blob/b50ad0fe8fc168eaded62efb08f19cf2aea341e2/sdks/java/io/jdbc/src/main/java/org/apache/beam/sdk/io/jdbc/JdbcIO.java#L1398>
   and the choice of DEFAULT_NUM_PARTITIONS=200? (A short sketch of the
   behaviour I was expecting follows this list.)
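
To make question 2 concrete, this is the behaviour I was expecting when
numPartitions is not set. It is only a sketch of my expectation, not the
actual Beam implementation; spec, inferNumPartitions and estimatedRowCount
are hypothetical names:

    // Hypothetical sketch of the expected control flow: an unset
    // numPartitions triggers the size-based heuristic instead of silently
    // falling back to the fixed default of 200.
    Integer userNumPartitions = spec.getNumPartitions(); // null when not set by the user
    int numPartitions;
    if (userNumPartitions == null) {
      // Hypothetical helper: derive a partition count from the estimated
      // table size, capped by whatever maxPartitions heuristic Beam decides.
      numPartitions = inferNumPartitions(estimatedRowCount);
    } else {
      numPartitions = userNumPartitions;
    }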


I have also raised this as issues/31467 in case it needs any changes in the
implementation.


Regards and Thanks,
Vardhan Thigle,
+919535346204
