Hi Arvid

Thanks for your input. My first implementation used iterations, but I had challenges matching the returned rows up with the original input, so the current implementation uses Async I/O, where I attach the found rows to the input. That makes it easier downstream. I just have to test whether I can keep up with the load, and if not I will increase the batch size.
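For reference, a minimal sketch of that pattern (not the actual job code): the async lookup pairs every returned row with the query input that triggered it, so downstream operators can match them without extra bookkeeping. The searchAsync method is a placeholder for whatever asynchronous Elasticsearch client call the job really makes.

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

public class AttachRowsToInput extends RichAsyncFunction<String, Tuple2<String, String>> {

    @Override
    public void asyncInvoke(String queryInput, ResultFuture<Tuple2<String, String>> resultFuture) {
        // Hypothetical async client call returning the matching rows for this query.
        CompletableFuture<List<String>> rows = searchAsync(queryInput);

        rows.whenComplete((hits, error) -> {
            if (error != null) {
                resultFuture.completeExceptionally(error);
            } else {
                // Attach the original input to every returned row before emitting,
                // so downstream operators know which query produced each row.
                resultFuture.complete(
                        hits.stream()
                            .map(hit -> Tuple2.of(queryInput, hit))
                            .collect(Collectors.toList()));
            }
        });
    }

    private CompletableFuture<List<String>> searchAsync(String query) {
        // Placeholder: issue the (possibly paginated) Elasticsearch search here.
        return CompletableFuture.completedFuture(List.of());
    }
}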
Once again, thanks for your input.

Med venlig hilsen / Best regards
Lasse Nedergaard

> On 7 Jul 2021, at 10:33, Arvid Heise <ar...@apache.org> wrote:
>
> Hi Lasse,
>
> That's a tough question. The real Kappa way would be to load the full
> database as a second input into the job and use joins. But I'm assuming that
> you can't or don't want to do that.
>
> 1. Can work if you use a windowing operator before and only trigger one or
> a few async I/O calls per window batch.
>
> 2. Tbh, iterate streams are currently barely usable because of checkpointing
> limitations.
>
> 3. A process function can work if you have a large batch, similar to 1. Here,
> you'd use a sync communication pattern, in contrast to 1.
>
> 4. I don't think this is necessary.
>
> I think the secret to success is to reduce the number of queries as much as
> possible. "Sometimes" already sounds as if it's not happening too frequently,
> so I'm assuming that with a bit of batching (through windows), both 1. and 3.
> are valid options, depending on whether you want an async or sync
> communication pattern.
>
> Best,
>
> Arvid
>
>> On Mon, Jul 5, 2021 at 1:46 PM Lasse Nedergaard
>> <lassenedergaardfl...@gmail.com> wrote:
>> Hi
>>
>> I'm looking for some advice on the "right" way to load historical data into
>> a stream.
>>
>> The case is as follows.
>> I have a stream, and sometimes I need to match the current live stream data
>> up against data stored in a database, let's say Elasticsearch. I generate a
>> side output with the query information and now want to get the rows from
>> Elasticsearch. The number of rows can be high, so I want to read in a
>> paginated way and forward each response downstream as it is received. This
>> also means that I have to execute n queries against Elasticsearch, I have to
>> do it in order, and I don't know how many queries there will be (the search
>> response tells me if there is more data).
>>
>> 1. Use Async I/O
>> This works nicely, but if I read the data in a paginated way I have to
>> buffer all the data before I can return the result, and that doesn't scale.
>>
>> 2. Iterate stream
>> The requirement is more recursive than iterative, and iterations have some
>> limitations regarding checkpoints.
>>
>> 3. Process function
>> Not intended for external I/O operations, as they take time to execute.
>>
>> 4. Elasticsearch source together with Kafka
>> Store the side output in Kafka and create an Elasticsearch/Kafka source
>> function. Complicated.
>>
>> There could be other ways of doing it, and I'm open to good ideas and
>> suggestions on how to handle this challenge.
>>
>> Thanks in advance
>>
>> Med venlig hilsen / Best regards
>> Lasse Nedergaard
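P.S. In case it helps anyone else reading the thread, here is a hedged sketch of Arvid's option 1 as I understand it: collect the side-output queries in a small processing-time window so only one lookup fires per batch. The five-second window, the String element type, and the BatchedEnrichment name are assumptions for illustration, not taken from the real job; note that windowAll runs with parallelism 1.

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.windowing.ProcessAllWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import java.util.ArrayList;
import java.util.List;

public class BatchedEnrichment {

    // Collect query inputs for 5 seconds and emit them as one batch, so the
    // async lookup stage downstream needs only one Elasticsearch round trip
    // per batch instead of one per element.
    public static DataStream<List<String>> batchQueries(DataStream<String> queries) {
        return queries
            .windowAll(TumblingProcessingTimeWindows.of(Time.seconds(5)))
            .process(new ProcessAllWindowFunction<String, List<String>, TimeWindow>() {
                @Override
                public void process(Context ctx, Iterable<String> values,
                                    Collector<List<String>> out) {
                    List<String> batch = new ArrayList<>();
                    values.forEach(batch::add);
                    out.collect(batch); // one element per window = one lookup per batch
                }
            });
    }
}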