Re: Limiting the Number of Parallel Reads/Writes

2025-04-10 Thread Jonathan Hope
Right, so in that case we really have one knob to turn: the number of records (bundle size). We would still want to choose some reasonable upper bound on the number of records being read. In the case where the collection/partition being read from has 2 billion things, say we decide to split
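To make that knob concrete, here is a back-of-the-envelope sketch: the 2 billion figure comes from the message above, while the per-split upper bound is an assumed value, not one from the thread.

```java
public class SplitArithmetic {
  public static void main(String[] args) {
    long totalRecords = 2_000_000_000L;   // collection/partition size from the example above
    long recordsPerSplit = 1_000_000L;    // assumed upper bound of records per read/split
    // Ceiling division: how many splits are needed to stay under that bound.
    long numSplits = (totalRecords + recordsPerSplit - 1) / recordsPerSplit;
    System.out.println(numSplits);        // prints 2000
  }
}
```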

Re: Limiting the Number of Parallel Reads/Writes

2025-04-09 Thread Radek Stankiewicz via user
In the context of MongoDB there are already configuration pieces, bucketAuto and numSplits: https://github.com/apache/beam/blob/7136380c4a79f8dea9b42a42ee7569b665edf431/sdks/java/io/mongodb/src/main/java/org/apache/beam/sdk/io/mongodb/MongoDbIO.java#L230 The exact logic for how it would split is writt
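For reference, a minimal sketch of how those two knobs appear on the read transform; the URI, database, and collection are placeholders and the split count is an assumed value.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.mongodb.MongoDbIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;
import org.bson.Document;

public class MongoReadWithSplits {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    PCollection<Document> docs =
        p.apply(
            MongoDbIO.read()
                .withUri("mongodb://localhost:27017")  // placeholder
                .withDatabase("mydb")                  // placeholder
                .withCollection("mycollection")        // placeholder
                // Hint for how many splits (parallel readers) the source should produce.
                .withNumSplits(16)
                // Let MongoDB's $bucketAuto aggregation pick the split boundaries
                // instead of the id-range-based splitting.
                .withBucketAuto(true));

    p.run().waitUntilFinish();
  }
}
```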

Re: Limiting the Number of Parallel Reads/Writes

2025-04-09 Thread Jonathan Hope
I think I'm still struggling a bit... Let's stick with a bounded example for now. I would be reading from a single Mongo cluster/database/collection/partition that has billions of things in it. I read through the MongoDbIO code a bit and it seems to: 1. Get the min ID, 2. Get the max ID, 3.
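This is not the actual MongoDbIO code (that lives at the link above), but a minimal illustration of the min-ID/max-ID range-splitting idea those steps describe, using numeric ids for simplicity.

```java
import java.util.ArrayList;
import java.util.List;

public class IdRangeSplitter {
  // Split the inclusive [minId, maxId] range into roughly equal sub-ranges,
  // one per desired split; each sub-range would back one parallel reader.
  static List<long[]> split(long minId, long maxId, int numSplits) {
    List<long[]> ranges = new ArrayList<>();
    long span = maxId - minId + 1;
    long step = Math.max(1, span / numSplits);
    for (long start = minId; start <= maxId; start += step) {
      long end = Math.min(maxId, start + step - 1);
      ranges.add(new long[] {start, end});
    }
    return ranges;
  }

  public static void main(String[] args) {
    for (long[] r : split(0, 1_999_999_999L, 8)) {
      System.out.println(r[0] + " .. " + r[1]);
    }
  }
}
```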

Re: Limiting the Number of Parallel Reads/Writes

2025-04-09 Thread Radek Stankiewicz via user
Hey Jonathan, parallelism for read and write is directly related to the number of keys you are processing in the current stage. As an example, imagine you have KafkaIO with 1 partition, and after reading from KafkaIO you have a mapping step to a JDBC entity, and then you have a step writing to the
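A minimal sketch of that point in code: after a GroupByKey, the downstream fused stage can process at most one bundle per key at a time, so capping the number of shard keys caps how many workers write concurrently. The shard count and the placeholder write body here are assumptions for illustration, not details from the thread.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class CappedParallelWrites {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    final int maxParallelWriters = 4;  // assumed cap on concurrent writers

    PCollection<String> rows = p.apply(Create.of("a", "b", "c", "d", "e"));

    rows
        // Shard every element onto one of `maxParallelWriters` keys.
        .apply("ShardOntoNKeys",
            WithKeys.of((String row) -> Math.floorMod(row.hashCode(), maxParallelWriters))
                .withKeyType(TypeDescriptors.integers()))
        // After this GroupByKey there are at most `maxParallelWriters` keys,
        // which bounds the parallelism of the fused write step below.
        .apply("OneGroupPerShard", GroupByKey.create())
        .apply("WritePerShard", ParDo.of(new DoFn<KV<Integer, Iterable<String>>, Void>() {
          @ProcessElement
          public void process(@Element KV<Integer, Iterable<String>> shard) {
            for (String row : shard.getValue()) {
              // Placeholder: issue the JDBC write for `row` here.
            }
          }
        }));

    p.run().waitUntilFinish();
  }
}
```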