Hi Averell,

> I feel that it's over-complicated
I think a Table API or SQL[1] job can also achieve what you want, probably
more simply and with less code.
The workflow looks like:
1. Union all the two source tables. You may need to unify the schemas of the
two tables first, since UNION ALL can only combine tables with the same
schema.
2. Perform a window group by, i.e., group by a tumbling window + key.
3. Write a user-defined aggregate function[2] that merges the data.
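The three steps above can be sketched as a Flink SQL query. This is only a
sketch: the table names `s1` and `s2`, the key column `k`, the value column
`v`, the event-time attribute `ts`, and the aggregate function name
`merge_udaf` are all placeholders you would adapt to your own schema.

```sql
-- Sketch only; s1, s2, k, v, ts, and merge_udaf are placeholder names.
SELECT
  k,
  TUMBLE_START(ts, INTERVAL '1' MINUTE) AS window_start,
  merge_udaf(v) AS merged            -- your registered UDAF merges the pair
FROM (
  SELECT k, v, ts FROM s1            -- step 1: union the two sources
  UNION ALL
  SELECT k, v, ts FROM s2
)
GROUP BY k, TUMBLE(ts, INTERVAL '1' MINUTE)  -- step 2: tumbling window + key
```

Inside the UDAF you can also detect the case where only one of the two
records arrived in the window, and emit/log it accordingly.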

> my cached data would build up and make my cluster out-of-memory.
You can use the `RocksDBStateBackend`[3]. The amount of state that you can
keep is only limited by the amount of disk space available.
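As a sketch, one way to switch to RocksDB is via flink-conf.yaml (the
checkpoint path below is a placeholder):

```yaml
# flink-conf.yaml -- the checkpoint directory is a placeholder path
state.backend: rocksdb
state.checkpoints.dir: hdfs:///flink/checkpoints
```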

> Would back-pressure kick in for this case?
It seems there is no direct way to align different sources yet.
However, the community has already discussed this and is working on a
solution[4].

Best, Hequn

[1] https://ci.apache.org/projects/flink/flink-docs-master/dev/table/
[2]
https://ci.apache.org/projects/flink/flink-docs-master/dev/table/udfs.html#aggregation-functions
[3]
https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/state_backends.html#the-rocksdbstatebackend
[4]
https://docs.google.com/document/d/13HxfsucoN7aUls1FX202KFAVZ4IMLXEZWyRfZNd6WFM/edit#heading=h.lfi2u4fv9cxa

On Mon, Apr 15, 2019 at 8:08 PM Averell <lvhu...@gmail.com> wrote:

> Hello,
>
> I have two data streams, and want to join them using a tumbling window.
> Each
> of the streams would have at most one record per window. There is also a
> requirement to log/save the records that don't have a companion from the
> other stream.
> What would be the best option for my case? Would that be possible to use
> Flink's Join?
>
> I tried to use CoProcessFunction: truncating the timestamp of each record
> to
> the beginning of the tumbling window, and then "keyBy" the two streams
> using
> (key, truncated-timestamp). When I receive a record from one stream, if
> that's the first record of the pair, then I save it to a MapState. If it is
> the 2nd record, then I merge with the 1st one then fire.
> This implementation works, but
> (a) I feel that it's over-complicated, and
> (b) I have a concern that when one stream is slower than the other, my
> cached data would build up and make my cluster out-of-memory. Would
> back-pressure kick in for this case?
>
> Thanks and best regards,
> Averell
>
>
>
> --
> Sent from:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
>
