Hi everybody, I'm facing a problem that I'm not sure how to design with Apache Beam. I'm batch processing about 300k entries per day. After some aggregations, about 60k entries remain.
The issue is that entries from a batch may be related to entries that were already processed at some point in the past. If they are, I need to fetch the already processed record and merge it with the new one. To make matters worse, the "window" of that relationship might be several years, so I can't just side-load the last few days' worth of data and catch all the relationships. I would have to load all already processed entries on each batch run, which does not seem like a good idea ;-)

I also don't think that issuing 60k queries, one per new entry, to fetch the related entry from the past is a good idea. I could try to "window" it though: group the new entries into batches of, let's say, 100 and fire one query that fetches the 100 old entries for the current 100 new ones. That would at least reduce the number of queries by a factor of 100, from 60k to about 600.

Are there any other good ways to solve problems like this? I would imagine these situations are quite common, so maybe there are some best practices. It's basically enriching already processed entries with information from new entries. It would be great if someone could point me in the right direction or give me some more keywords that I can google.

Thanks and regards,
Jo
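
For reference, the batched lookup described above could be sketched roughly as follows. This is plain Python rather than a Beam pipeline, with an in-memory SQLite table standing in for the store of already processed entries. The table name `processed`, the `id` and `count` fields, and the `merge` rule are all made up for illustration and are not taken from the actual data. In a Beam pipeline the chunking step would presumably be handled by a batching transform such as `GroupIntoBatches` or `BatchElements`; that part is not shown here.

```python
import sqlite3
from itertools import islice


def chunked(items, size):
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def merge(old, new):
    # Hypothetical merge rule: add the new count to the stored one.
    return {"id": new["id"], "count": old["count"] + new["count"]}


def enrich(conn, new_entries, batch_size=100):
    """Merge new entries with stored ones, using one query per batch.

    Returns the merged entries and the number of queries issued.
    """
    results = []
    queries = 0
    for batch in chunked(new_entries, batch_size):
        ids = [entry["id"] for entry in batch]
        placeholders = ",".join("?" * len(ids))
        rows = conn.execute(
            f"SELECT id, count FROM processed WHERE id IN ({placeholders})",
            ids,
        ).fetchall()
        queries += 1
        old_by_id = {row[0]: {"id": row[0], "count": row[1]} for row in rows}
        for entry in batch:
            old = old_by_id.get(entry["id"])
            # Entries without a stored counterpart pass through unchanged.
            results.append(merge(old, entry) if old is not None else entry)
    return results, queries


# Stand-in for the historical store: even ids 0..498, each with count 10.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed (id INTEGER PRIMARY KEY, count INTEGER)")
conn.executemany(
    "INSERT INTO processed (id, count) VALUES (?, ?)",
    [(i, 10) for i in range(0, 500, 2)],
)

# 250 new entries are looked up with 3 queries instead of 250.
new_entries = [{"id": i, "count": 1} for i in range(250)]
merged, query_count = enrich(conn, new_entries, batch_size=100)
print(query_count, merged[0], merged[1])
```

The batch size trades the number of round trips against the size of each query. Most databases limit the number of bound parameters per statement, so the value would need to stay below that limit for whatever store is actually used.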
