Hi,

We're looking at using Arrow as part of our solution to ship tabular data 
between different streaming systems, potentially implemented using different 
technologies, like Spark, Beam, Flink, etc. Some of these systems contain 
"watermarks" as a key concept. Briefly, a watermark is a promise that a certain 
data source will not produce any more events/rows with a timestamp at or before 
a given time. For example, if I produce a batch of rows every 5 minutes, after 
I've finished sending the 12:00 data, I would send a watermark update of 
12:04:59, thus letting downstream consumers know that no future row from me 
will have a timestamp before 12:05.
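
To make the semantics concrete, here is a minimal Python sketch of the example 
above. The names (Watermark, watermark_after_batch, violates) and the date are 
made up for illustration and are not part of Arrow or any of the systems 
mentioned. It assumes one-second timestamp resolution and that the watermark 
time itself is covered by the promise.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Watermark:
    """Hypothetical message: no future row will have a timestamp <= `time`."""
    time: datetime


def watermark_after_batch(batch_start, period, resolution=timedelta(seconds=1)):
    """Watermark to emit once the batch covering
    [batch_start, batch_start + period) has been fully sent.
    `resolution` is the assumed timestamp granularity."""
    return Watermark(batch_start + period - resolution)


def violates(row_timestamp, watermark):
    """True if a row sent after `watermark` would break the producer's promise."""
    return row_timestamp <= watermark.time


# Arbitrary date; only the time of day matters for the example.
noon = datetime(2021, 1, 1, 12, 0, 0)
wm = watermark_after_batch(noon, timedelta(minutes=5))

print(wm.time.time())                              # 12:04:59
print(violates(noon + timedelta(minutes=4), wm))   # True: 12:04:00 is too early
print(violates(noon + timedelta(minutes=5), wm))   # False: 12:05:00 is allowed
```

A downstream consumer could use a check like violates() to decide when a 
window is complete, or to flag rows that arrive after the promise was made.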

We would like to be able to propagate watermarks with our data, and I wondered 
whether this list has any ideas on how to do this today, or whether it is part 
of the roadmap for the Arrow compute API or similar. We'd like to be able to do 
this over Arrow Flight, but potentially also for other methods of shipping 
Arrow data, like pub/sub feeds, file dumps, etc.

Thanks
Matt Rudary
Two Sigma
