Re: Flink Job and Watermarking

Chesnay Schepler Fri, 08 Feb 2019 00:59:45 -0800

Have you considered using the metric system to access the current watermarks for each operator? (see https://ci.apache.org/projects/flink/flink-docs-master/monitoring/metrics.html#io)
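
To make the idea concrete, here is a minimal sketch of the check an external process could run once it has collected a watermark value for every subtask of the operator (for example from the currentInputWatermark metric listed in the IO section linked above). How the values are fetched is left out; the class and method names are hypothetical and not part of Flink.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalLong;

// Hypothetical helper, not a Flink class: decides whether time T has been
// reached by every subtask, given per-subtask watermark readings.
final class OperatorWatermark {

    // Returns the lowest watermark across subtasks, or empty if fewer than
    // `parallelism` subtasks have reported a value yet.
    static OptionalLong minWatermark(Map<Integer, Long> perSubtask, int parallelism) {
        if (perSubtask.size() < parallelism) {
            return OptionalLong.empty();
        }
        return perSubtask.values().stream().mapToLong(Long::longValue).min();
    }

    // True only when every subtask has reported a watermark >= t.
    static boolean hasReached(Map<Integer, Long> perSubtask, int parallelism, long t) {
        OptionalLong min = minWatermark(perSubtask, parallelism);
        return min.isPresent() && min.getAsLong() >= t;
    }

    public static void main(String[] args) {
        // Illustrative readings for an operator with parallelism 3.
        Map<Integer, Long> readings = new HashMap<>();
        readings.put(0, 1_000L);
        readings.put(1, 1_500L);
        readings.put(2, 900L);
        System.out.println("reached T=1000: " + hasReached(readings, 3, 1_000L));
    }
}
```

The operator as a whole has only reached T when the slowest subtask has, so the check uses the minimum rather than any single reading.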


On 08.02.2019 03:19, Kaustubh Rudrawar wrote:

Hi,
I'm writing a job that wants to make an HTTP request once a watermark has reached all tasks of an operator. It would be great if this could be determined from outside the Flink job, but I don't think it's possible to access watermark information for the job as a whole. Below is a workaround I've come up with:
 1. Read messages from Kafka using the provided KafkaSource. Event
    time will be defined as a timestamp within the message.
 2. Key the stream based on an id from the message.
 3. DedupOperator that dedupes messages. This operator will run with a
    parallelism of N.
 4. An operator that persists the messages to S3. It doesn't need to
    output anything - it should ideally be a Sink (if it were a sink
    we could use the StreamingFileSink).
 5. Implement an operator that will make an HTTP request once
    processWatermark is called for time T. A parallelism of 1 will be
    used for this operator as it will do very little work. Because it
    has a parallelism of 1, the operator in step 4 must not send
    records to it, or it could become a throughput bottleneck.
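
The core of step 5 can be sketched in plain Java as below. This is only a model of the "fire once at time T" logic, not a Flink operator: in the real job it would sit inside the operator's processWatermark, and the HTTP call would replace the callback. The class name and callback are hypothetical, and the `fired` flag would have to be kept in checkpointed state to survive a restart.

```java
import java.util.function.LongConsumer;

// Hypothetical model of step 5: run an action exactly once, the first time
// a watermark at or past the target time T is observed.
final class WatermarkGate {
    private final long targetTimestamp;
    private final LongConsumer action;
    private boolean fired; // assumption: would live in operator state in a real job

    WatermarkGate(long targetTimestamp, LongConsumer action) {
        this.targetTimestamp = targetTimestamp;
        this.action = action;
    }

    // Call this for each watermark the operator receives.
    // Returns true only on the call that triggers the action.
    boolean onWatermark(long watermarkTimestamp) {
        if (fired || watermarkTimestamp < targetTimestamp) {
            return false;
        }
        fired = true;
        action.accept(watermarkTimestamp);
        return true;
    }

    public static void main(String[] args) {
        // The lambda stands in for the HTTP request.
        WatermarkGate gate = new WatermarkGate(1_000L,
                wm -> System.out.println("would send HTTP request at watermark " + wm));
        gate.onWatermark(500L);   // below T: nothing happens
        gate.onWatermark(1_200L); // first watermark >= T: action runs
        gate.onWatermark(1_300L); // already fired: nothing happens
    }
}
```

Since an operator's watermark only advances to the minimum across its input channels, a parallelism-1 operator downstream of step 4 should see time T only after all N upstream subtasks have passed it.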
Does this implementation seem like a valid workaround? Any other alternatives I should consider?
Thanks for your help,
Kaustubh
