Re: Splitting stream

Arvid Heise Mon, 10 May 2021 09:33:03 -0700

Hi Nikola,

side outputs are definitely at least as efficient as using two filters,
but they are also harder to implement and maintain. Do you actually have a
use case where every bit of performance counts?


If so, please also check enableObjectReuse [1] and look into serialization
[2].

Also, if you can implement your use case with Table API/SQL (with UDFs), it
will be much faster than the alternatives.
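
To make the filter versus side output comparison above concrete, here is a
minimal sketch in plain Java. It is only a model, not Flink code: the class
and method names (SplitSketch, splitWithTwoFilters, splitInOnePass) are made
up for illustration, and it ignores everything a real job adds, such as
serialization and operator chaining. It only shows that two filters look at
every record twice, while a single routing step looks at every record once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Plain-Java model only; none of these names are Flink API.
class SplitSketch {

    // Wraps a predicate and counts how often it is evaluated.
    static final class CountingPredicate<T> implements Predicate<T> {
        private final Predicate<T> delegate;
        int calls = 0;

        CountingPredicate(Predicate<T> delegate) {
            this.delegate = delegate;
        }

        @Override
        public boolean test(T value) {
            calls++;
            return delegate.test(value);
        }
    }

    // Holds the two branches produced by a split.
    static final class Branches<T> {
        final List<T> matched = new ArrayList<>();
        final List<T> rest = new ArrayList<>();
    }

    // Two-filter style: each branch scans all records and evaluates the predicate itself.
    static <T> Branches<T> splitWithTwoFilters(List<T> records, Predicate<T> exist) {
        Branches<T> out = new Branches<>();
        for (T r : records) {
            if (exist.test(r)) {
                out.matched.add(r);
            }
        }
        for (T r : records) {
            if (!exist.test(r)) {
                out.rest.add(r);
            }
        }
        return out;
    }

    // Side-output style: one pass, each record is routed to exactly one branch.
    static <T> Branches<T> splitInOnePass(List<T> records, Predicate<T> exist) {
        Branches<T> out = new Branches<>();
        for (T r : records) {
            if (exist.test(r)) {
                out.matched.add(r);
            } else {
                out.rest.add(r);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> records = Arrays.asList(1, 2, 3, 4, 5);
        CountingPredicate<Integer> twice = new CountingPredicate<Integer>(n -> n % 2 == 0);
        CountingPredicate<Integer> once = new CountingPredicate<Integer>(n -> n % 2 == 0);
        splitWithTwoFilters(records, twice);
        splitInOnePass(records, once);
        System.out.println("two filters: " + twice.calls + " predicate calls");
        System.out.println("one pass:    " + once.calls + " predicate calls");
    }
}
```

Both variants produce the same two branches; the difference is only in how
often the predicate runs. Unless that predicate is expensive, the simpler
filter version is usually the easier one to maintain.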

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/dev/execution/execution_configuration/
[2]
https://flink.apache.org/news/2020/04/15/flink-serialization-tuning-vol-1.html

On Mon, May 10, 2021 at 4:52 PM Taher Koitawala <taher...@gmail.com> wrote:

> I think what you're looking for is a side output. Change the logic to a
> process function. Records for which the condition is true go to the
> collector; the rest go to a side output. That gives you two streams.
>
> On Mon, May 10, 2021, 8:14 PM Nikola Hrusov <n.hru...@gmail.com> wrote:
>
>> Hi Arvid,
>>
>> In my case it's the latter, thus I have also thought about using the
>> filter (map is not useful in my case).
>>
>> What I am not sure about is which one is better to use.
>> In what case would you split a stream with side output and in what case
>> with filter?
>> Would there be any performance gain/pain based on which is used?
>>
>> Regards,
>> Nikola
>>
>>
>> On Mon, May 10, 2021 at 6:00 PM Arvid Heise <ar...@apache.org> wrote:
>>
>>> Hi Nikola,
>>>
>>> if you just want to apply a different user function to the records
>>> depending on the property "exist" the simplest way is to use
>>>
>>> source -> map(if exist do this else that) -> sink
>>>
>>> If it turns out that you want to apply a different subgraph, you can do
>>>
>>> source -> filter(if exist) -> do this -> union -> sink
>>> source -> filter(if not exist) -> do that -^
>>>
>>> On Mon, May 10, 2021 at 3:07 PM Nikola Hrusov <n.hru...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am trying to find some information on what is the best way to split a
>>>> stream of the same data.
>>>>
>>>> For the given scenario: I have an object which has a property "exist"
>>>>
>>>> I want to split the stream based on this property, do something, and
>>>> afterwards join it again into a single stream.
>>>>
>>>> Initial (A) -> Split stream based on exist (B) or not (C) -> union both
>>>> streams (D)
>>>>
>>>> I could find some similar topics on StackOverflow:
>>>> -
>>>> https://stackoverflow.com/questions/53588554/apache-flink-using-filter-or-split-to-split-a-stream
>>>> -
>>>> https://stackoverflow.com/questions/61752728/how-to-get-output-of-the-values-that-are-not-matched-in-filter-function-in-apach
>>>>
>>>> but none of them really gives a definitive answer.
>>>>
>>>> What I am thinking about is using 1) filter or 2) side output.
>>>>
>>>> I know that one of the use cases of side output is that it can have
>>>> different data types. That is not my case as it will be the same object
>>>> going through the whole pipeline.
>>>>
>>>> So both options look more or less the same to me; however, I do not know
>>>> the Flink internals as well as I would like at this point.
>>>>
>>>> Can some of you guys shed some light and perhaps tell me if I am
>>>> mistaken in my thoughts?
>>>>
>>>> Thanks.
>>>>
>>>> Regards,
>>>> Nikola
>>>>
>>>
