Understanding the semantics of SourceContext.collect

Yuval Itzchakov Wed, 04 Aug 2021 11:10:02 -0700

Hi,

I have a question regarding the semantics of event processing from a source
downstream that I want to clarify.


I have a custom source which offloads data from our data warehouse. In my
custom source, I have some state which keeps track of the latest timestamps
that were read. When unloading, I push the data from the warehouse to S3
and then scan the S3 bucket prefix in to know which files to read. These S3
file descriptors are only later read in an AbstractStreamOperator that is
the direct child of the source, and there is a rehash boundary between the
two.

The ASO receives these S3 file descriptors one by one, reads them and does
some additional parsing and sends them downstream to various operators.

My question is: is there a guarantee, that once an element is being pushed
downstream from the Source function to the ASO using SourceContext.collect,
it will only return once the element has been processed throughout the
entire execution graph?

The reason I'm asking this is because I'm doing some refactoring work on
the ASO and I'm wondering whether these file prefixes that are received via
the processElement function in the ASO should be stored in a state
somewhere up until I finish their processing. If SourceContext.collect does
provide the guarantee that once pushed, it will only return when the
element has passed through the entire DAG (even though some of the data
needs to go over the wire) then I don't need any additional storage for
that state and can rely on that fact that it has synchronously been pushed
end to end.

-- 
Best Regards,
Yuval Itzchakov.

Understanding the semantics of SourceContext.collect

Reply via email to