Hi Arvid

Thanks for your input. My first implementation used iterations, but I had challenges matching the returned rows up with the original input, so the current implementation uses Async I/O, where I attach the found rows to the input. That makes it easier downstream. I just have to test whether I can keep up with the load, and if not I will increase the batch size.
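For reference, a minimal sketch of that pattern (not the actual job code): the async lookup pairs every returned row with the query input that triggered it, so downstream operators can match them without extra bookkeeping. The searchAsync method is a placeholder for whatever asynchronous Elasticsearch client call the job really makes.

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

public class AttachRowsToInput extends RichAsyncFunction<String, Tuple2<String, String>> {

    @Override
    public void asyncInvoke(String queryInput, ResultFuture<Tuple2<String, String>> resultFuture) {
        // Hypothetical async client call returning the matching rows for this query.
        CompletableFuture<List<String>> rows = searchAsync(queryInput);

        rows.whenComplete((hits, error) -> {
            if (error != null) {
                resultFuture.completeExceptionally(error);
            } else {
                // Attach the original input to every returned row before emitting,
                // so downstream operators know which query produced each row.
                resultFuture.complete(
                        hits.stream()
                            .map(hit -> Tuple2.of(queryInput, hit))
                            .collect(Collectors.toList()));
            }
        });
    }

    private CompletableFuture<List<String>> searchAsync(String query) {
        // Placeholder: issue the (possibly paginated) Elasticsearch search here.
        return CompletableFuture.completedFuture(List.of());
    }
}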
Once again, thanks for your input.

Med venlig hilsen / Best regards
Lasse Nedergaard

> On 7 Jul 2021, at 10:33, Arvid Heise <ar...@apache.org> wrote:
>
> Hi Lasse,
>
> That's a tough question. The real Kappa way would be to load the full
> database as a second input into the job and use joins. But I'm assuming that
> you can't or don't want to do that.
>
> 1. Can work if you use a windowing operator before and only trigger one or
> a few async I/O calls per window batch.
>
> 2. Tbh, iterate streams are currently barely usable because of checkpointing
> limitations.
>
> 3. A process function can work if you have a large batch, similar to 1. Here,
> you'd use a sync communication pattern, in contrast to 1.
>
> 4. I don't think this is necessary.
>
> I think the secret to success is to reduce the number of queries as much as
> possible. "Sometimes" already sounds as if it's not happening too frequently,
> so I'm assuming that with a bit of batching (through windows), both 1. and 3.
> are valid options, depending on whether you want an async or sync
> communication pattern.
>
> Best,
>
> Arvid
>
>> On Mon, Jul 5, 2021 at 1:46 PM Lasse Nedergaard
>> <lassenedergaardfl...@gmail.com> wrote:
>> Hi
>>
>> I'm looking for some advice on the "right" way to load historical data into
>> a stream.
>>
>> The case is as follows.
>> I have a stream, and sometimes I need to match the current live stream data
>> up against data stored in a database, let's say Elasticsearch. I generate a
>> side output with the query information and now want to get the rows from
>> Elasticsearch. The number of rows can be high, so I want to read in a
>> paginated way and forward each response downstream as it is received. This
>> also means that I have to execute n queries against Elasticsearch, I have to
>> do it in order, and I don't know how many queries there will be (the search
>> response tells me if there is more data).
>>
>> 1. Use Async I/O
>> This works nicely, but if I read the data in a paginated way I have to
>> buffer all the data before I can return the result, and that doesn't scale.
>>
>> 2. Iterate stream
>> The requirement is more recursive than iterative, and iterations have some
>> limitations regarding checkpoints.
>>
>> 3. Process function
>> Not intended for external I/O operations, as they take time to execute.
>>
>> 4. Elasticsearch source together with Kafka
>> Store the side output in Kafka and create an Elasticsearch/Kafka source
>> function. Complicated.
>>
>> There could be other ways of doing it, and I'm open to good ideas and
>> suggestions on how to handle this challenge.
>>
>> Thanks in advance
>>
>> Med venlig hilsen / Best regards
>> Lasse Nedergaard
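P.S. In case it helps anyone else reading the thread, here is a hedged sketch of Arvid's option 1 as I understand it: collect the side-output queries in a small processing-time window so only one lookup fires per batch. The five-second window, the String element type, and the BatchedEnrichment name are assumptions for illustration, not taken from the real job; note that windowAll runs with parallelism 1.

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.windowing.ProcessAllWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import java.util.ArrayList;
import java.util.List;

public class BatchedEnrichment {

    // Collect query inputs for 5 seconds and emit them as one batch, so the
    // async lookup stage downstream needs only one Elasticsearch round trip
    // per batch instead of one per element.
    public static DataStream<List<String>> batchQueries(DataStream<String> queries) {
        return queries
            .windowAll(TumblingProcessingTimeWindows.of(Time.seconds(5)))
            .process(new ProcessAllWindowFunction<String, List<String>, TimeWindow>() {
                @Override
                public void process(Context ctx, Iterable<String> values,
                                    Collector<List<String>> out) {
                    List<String> batch = new ArrayList<>();
                    values.forEach(batch::add);
                    out.collect(batch); // one element per window = one lookup per batch
                }
            });
    }
}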