Hi Lasse,

That's a tough question. The real Kappa way would be to load the full database as a second input into the job and use joins. But I'm assuming that you can't or don't want to do that.
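For reference, the "database as a second input" idea boils down to a stream-table join: materialize the database changelog as keyed state and enrich each live event from it. This is a minimal plain-Python sketch of that pattern (not actual Flink code); the interleaved `("db" | "event", key, value)` tuples and the `stream_table_join` helper are purely illustrative.

```python
# Sketch of the stream-table join pattern: one dict of keyed state stands
# in for Flink's keyed state, and a single interleaved list stands in for
# the two connected input streams.

def stream_table_join(inputs):
    """Join live events against the latest database state, keyed by id."""
    table = {}   # keyed state: id -> latest database row
    out = []
    for kind, key, value in inputs:
        if kind == "db":       # changelog side: update the materialized table
            table[key] = value
        else:                  # event side: enrich with the current table state
            out.append((key, value, table.get(key)))
    return out

inputs = [
    ("db", "u1", {"plan": "pro"}),
    ("event", "u1", "click"),
    ("event", "u2", "view"),   # no db row seen yet -> joins against None
]
print(stream_table_join(inputs))
# [('u1', 'click', {'plan': 'pro'}), ('u2', 'view', None)]
```

The upside is that there are no per-event queries at all; the downside is that the full table has to fit into job state, which is exactly why this may not be an option here.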
1. Can work if you use a windowing operator before it and only trigger one or a few async I/O calls per window batch.
2. Tbh, iterate streams are currently barely usable because of checkpointing limitations.
3. A process function can work if you have a large batch, similar to 1. Here, you'd use a sync communication pattern, in contrast to 1.
4. I don't think this is necessary.

I think the secret to success is to reduce the number of queries as much as possible. "Sometimes" already sounds as if it's not happening too frequently, so I'm assuming that with a bit of batching (through windows), both 1. and 3. are valid options, depending on whether you want an async or sync communication pattern.

Best,

Arvid

On Mon, Jul 5, 2021 at 1:46 PM Lasse Nedergaard <lassenedergaardfl...@gmail.com> wrote:

> Hi
>
> I'm looking for some advice on the "right" way to load historical data
> into a stream.
>
> The case is as follows.
> I have a stream, and sometimes I need to match the current live stream
> data against data stored in a database, let's say Elasticsearch. I
> generate a side output with the query information and now want to get the
> rows from Elasticsearch. The number of rows can be high, so I want to read
> in a paginated way and forward each response downstream as it is received.
> This also means that I have to execute n queries against Elasticsearch, I
> have to do it in order, and I don't know how many. (The search response
> tells me if there is more data.)
>
> 1. Use Async IO
> This works nicely, but if I read the data in a paginated way I have to
> buffer all the data before I can return the result, and it doesn't scale.
>
> 2. Iterate stream
> The requirement is more recursive than iterative and has some limitations
> regarding checkpoints.
>
> 3. Process function
> It is not intended for external IO operations, as they take time to
> execute.
>
> 4. Elasticsearch source together with Kafka
> Store the side output in Kafka and create an Elasticsearch/Kafka source
> function.
> Complicated.
>
> There could be other ways of doing it, and I'm open to good ideas and
> suggestions on how to handle this challenge.
>
> Thanks in advance
>
> Med venlig hilsen / Best regards
> Lasse Nedergaard
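The batching-plus-pagination pattern from points 1 and 3 can be sketched like this: collect the query keys of one window, deduplicate them, then page through each query sequentially and forward every page downstream as it arrives instead of buffering the whole result set. This is a plain-Python sketch of the control flow, not Flink code; the `search` function is a hypothetical stand-in for a paginated Elasticsearch query (real clients use scroll or `search_after`).

```python
# Sketch of the windowed batching + paginated fetch pattern.
# `search` is a fake, in-memory stand-in for a paginated Elasticsearch
# query; it returns (hits, has_more) just like the "search response tells
# me if there is more data" behavior described above.

def search(query, page, page_size=2):
    """Hypothetical paginated search: returns (hits, has_more)."""
    data = {"q1": ["a", "b", "c", "d", "e"], "q2": ["x"]}
    rows = data.get(query, [])
    hits = rows[page * page_size:(page + 1) * page_size]
    has_more = (page + 1) * page_size < len(rows)
    return hits, has_more

def fetch_paginated(queries, emit):
    """Run the window's deduplicated queries in order; emit each page
    downstream as soon as it is received, without buffering all pages."""
    for query in dict.fromkeys(queries):   # dedupe, keep order (the batching step)
        page, has_more = 0, True
        while has_more:                    # sequential: page n decides if n+1 exists
            hits, has_more = search(query, page)
            for hit in hits:
                emit(hit)                  # forward immediately
            page += 1

out = []
fetch_paginated(["q1", "q2", "q1"], out.append)   # duplicate "q1" is collapsed
print(out)
# ['a', 'b', 'c', 'd', 'e', 'x']
```

In a real job, `fetch_paginated` would live inside the (async or process) function that fires once per window, and `emit` would be the collector; the key property is that memory usage stays bounded by one page, not by the full result.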