It's great to see interest in this. Where you planning to use the new
Sink interface that we recently introduced? [1]
Best,
Aljoscha
[1] https://s.apache.org/FLIP-143
On 2021/01/05 12:21, Poerschke, Mats wrote:
Hi all,
we want to contribute a sink connector for Apache Pinot. The following briefly
describes the planned control flow. Please feel free to comment on any of its
aspects.
Background
Apache Pinot is a large-scale real-time data ingestion engine working on data
segments internally. The controller exposes an external API which allows
posting new segments via REST call. A thereby posted segment must contain an id
(called segment name).
Control Flow
The Flink sink will collect data tuples locally. When creating a checkpoint,
all those tuples are grouped into one segment which is then pushed to the Pinot
controller. We will assign each pushed segment a unique incrementing identifier.
After receiving a success response from the Pinot controller, the latest
segment name is persisted within the Flink checkpoint.
In case we have to recover from a failure, the latest successfully pushed
segment name can be reconstructed from the Flink checkpoint. At this point the
system might be in an inconsistent state. The Pinot controller might have
already stored a newer segment (which’s name was, due to the failure, not
persisted in the flink checkpoint).
This inconsistency is resolved with the next successful checkpoint creation.
The there pushed segment will get the same segment name assigned as the
inconsistent segment. Thus, Pinot replaces the old with the new segment which
prevents introducing duplicate entries.
Best regards
Mats Pörschke