Right, so in that case we really have one knob to turn: the number of
records per split (the bundle size). We would still want to choose some kind of
reasonable upper bound for the number of records being read in each bundle.
In the case where the collection/partition being read from has, say, 2 billion
documents, we need to decide how to split it.
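For a rough sense of scale (the numbers below are made up, not from this thread), picking a maximum records-per-bundle and deriving the split count from it could look like:

public class BundleSizing {
  public static void main(String[] args) {
    long totalRecords = 2_000_000_000L;     // the ~2 billion documents mentioned above
    long maxRecordsPerBundle = 10_000_000L; // assumed upper bound per bundle
    // Ceiling division: how many splits we would ask the source for.
    long numSplits = (totalRecords + maxRecordsPerBundle - 1) / maxRecordsPerBundle;
    System.out.println("numSplits = " + numSplits); // prints 200 with these numbers
  }
}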
In the context of MongoDB there are already configuration pieces for this:
https://github.com/apache/beam/blob/7136380c4a79f8dea9b42a42ee7569b665edf431/sdks/java/io/mongodb/src/main/java/org/apache/beam/sdk/io/mongodb/MongoDbIO.java#L230
bucketAuto or numSplits.
The exact logic of how it would split is written there.
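For reference, a minimal read sketch wiring up those two options (the URI, database, and collection names below are placeholders I made up):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.mongodb.MongoDbIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;
import org.bson.Document;

public class MongoReadSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    PCollection<Document> docs =
        p.apply(
            MongoDbIO.read()
                .withUri("mongodb://localhost:27017") // assumed connection string
                .withDatabase("mydb")                 // assumed database name
                .withCollection("mycollection")       // assumed collection name
                .withNumSplits(100)                   // hint how many splits to produce
                .withBucketAuto(true));               // or let $bucketAuto compute balanced buckets

    p.run().waitUntilFinish();
  }
}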
I think I'm still struggling a bit...
Let's stick with a bounded example for now. I would be reading from a
single Mongo cluster/database/collection/partition that has billions of
documents in it. I read through the MongoDbIO code a bit and it seems to
(rough sketch after this list):
1. Get the min ID
2. Get the max ID
3. Split the ID range between them into some number of sub-ranges, one per split.
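To make that concrete, here is a rough sketch of range-based splitting given a min ID, a max ID, and a target number of splits. This is my own illustration, not the actual MongoDbIO code, and it pretends IDs are plain longs:

import java.util.ArrayList;
import java.util.List;

public class IdRangeSplitter {
  /** Splits the id range [minId, maxId] into numSplits contiguous sub-ranges. */
  public static List<long[]> split(long minId, long maxId, int numSplits) {
    List<long[]> ranges = new ArrayList<>();
    long span = (maxId - minId) / numSplits;
    long start = minId;
    for (int i = 0; i < numSplits; i++) {
      // The last range is clamped to maxId so integer division doesn't drop the tail.
      long end = (i == numSplits - 1) ? maxId : start + span;
      ranges.add(new long[] {start, end});
      start = end;
    }
    return ranges;
  }

  public static void main(String[] args) {
    // 0..2 billion split into 4 sub-ranges: each range becomes an independent query filter.
    for (long[] r : split(0L, 2_000_000_000L, 4)) {
      System.out.println(r[0] + " .. " + r[1]);
    }
  }
}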
hey Jonathan,
parallelism for read and write is directly related to the number of keys
you are processing in the current stage.
As an example, imagine you have KafkaIO with 1 partition, and after
reading from KafkaIO you have a step mapping each record to a JDBC entity and then you
have a step writing to the database.
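A minimal sketch of that pipeline shape, assuming string keys/values on the topic, a Postgres target, and a made-up users table; the point is that with a single Kafka partition the read stage is processing only one key/partition, which is what bounds its parallelism:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.KV;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KafkaToJdbcSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadKafka",
            KafkaIO.<String, String>read()
                .withBootstrapServers("broker:9092")   // assumed broker
                .withTopic("events")                   // assumed topic with 1 partition
                .withKeyDeserializer(StringDeserializer.class)
                .withValueDeserializer(StringDeserializer.class)
                .withoutMetadata())
        .apply("ToEntity",
            MapElements.via(
                new SimpleFunction<KV<String, String>, String>() {
                  @Override
                  public String apply(KV<String, String> record) {
                    // Stand-in for mapping the record to a JDBC entity.
                    return record.getValue();
                  }
                }))
        .apply("WriteJdbc",
            JdbcIO.<String>write()
                .withDataSourceConfiguration(
                    JdbcIO.DataSourceConfiguration.create(
                        "org.postgresql.Driver", "jdbc:postgresql://db:5432/app")) // assumed DB
                .withStatement("INSERT INTO users (payload) VALUES (?)")            // assumed table
                .withPreparedStatementSetter(
                    (element, statement) -> statement.setString(1, element)));

    p.run().waitUntilFinish();
  }
}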