Re: Using a ProcessFunction as a "Source"

Chesnay Schepler Sat, 25 Aug 2018 02:46:01 -0700

The SourceFunction interface is rather flexible so you can do prettymuch whatever you want. Exact implementation depends on whether controlmessages are pulled or pushed to the source; in the first case you'dsimply block within "run()" on the external call, in the latter you'dhave it block on a queue of some sort that is fed by another threadwaiting for messages.


AFAIK you should never use the collector outside of "processElement".


On 25.08.2018 05:15, vino yang wrote:

Hi Addison,
I have a lot of things I don't understand. Is your sourceself-generated message? Why can't source receive input? If the sourceis unacceptable then why is it called source? Isn't kafka-connectorthe input as source?
If you mean that under normal circumstances it can't receive anotherinput about control messages, there are some ways to solve it.
1) Access external systems in your source to get or subscribe tocontrol messages, such as Zookeeper.2) If your source is followed by a map or process operator, then theycan be chained together as a "big" source, then you can pass yourcontrol message via Flink's new feature "broadcast state". See thisblog post for details.[1]3) Mix control messages with normal messages in the same message flow.After the control message is parsed, the corresponding action istaken. Of course, this kind of program is not very recommended.
[1]:https://data-artisans.com/blog/a-practical-guide-to-broadcast-state-in-apache-flink
Thanks, vino.
Addison Higham <addis...@gmail.com <mailto:addis...@gmail.com>>于2018年8月25日周六上午12:46写道：
    HI,

    I am writing a parallel source function that ideally needs to
    receive some messages as control information (specifically, a
    state message on where to start reading from a kinesis stream). As
    far as I can tell, there isn't a way to make a sourceFunction
    receive input (which makes sense) so I am thinking it makes sense
    to use a processFunction that will occasionally receive control
    messages and mostly just output a lot of messages.

    This works from an API perspective, with a few different options,
    I could either:

    A) have the processElement function block on calling the loop that
    will produce messages or
    B) have the processEelement function return (by pass the collector
    and starting the reading on a different thread), but continue to
    produce messages downstream

    This obviously does raise some concerns though:

    1. Does this break any assumptions flink has of message lifecycle?
    Option A of blocking on processElement for very long periods seems
    straight forward but less than ideal, not to mention not being
    able to handle any other control messages.

    However, I am not sure if a processFunction sending messages after
    the processElement function has returned would break some
    expectations flink has of operator lifeycles. Messages are also
    emitted by timers, etc, but this would be completely outside any
    of those function calls as it is started on another thread. This
    is obviously how most SourceFunctions work, but it isn't clear if
    the same technique holds for ProcessFunctions

    2. Would this have a negative impact on backpressure downstream?
    Since I am still going to be using the same collector instance, it
    seems like it should ultimately work, but I wonder if there are
    other details I am not aware of.

    3. Is this just a terrible idea in general? It seems like I could
    maybe do this by implementing directly on top of an Operator, but
    I am not as familiar with that API

    Thanks in advance for any thoughts!

    Addison

Re: Using a ProcessFunction as a "Source"

Reply via email to