Apache Pinot Sink

Poerschke, Mats Tue, 05 Jan 2021 04:21:22 -0800

Hi all,

we want to contribute a sink connector for Apache Pinot. The following briefly 
describes the planned control flow. Please feel free to comment on any of its 
aspects.


Background
Apache Pinot is a large-scale real-time data ingestion engine working on data 
segments internally. The controller exposes an external API which allows 
posting new segments via REST call. A thereby posted segment must contain an id 
(called segment name).

Control Flow
The Flink sink will collect data tuples locally. When creating a checkpoint, 
all those tuples are grouped into one segment which is then pushed to the Pinot 
controller. We will assign each pushed segment a unique incrementing identifier.
After receiving a success response from the Pinot controller, the latest 
segment name is persisted within the Flink checkpoint.
In case we have to recover from a failure, the latest successfully pushed 
segment name can be reconstructed from the Flink checkpoint. At this point the 
system might be in an inconsistent state. The Pinot controller might have 
already stored a newer segment (which’s name was, due to the failure, not 
persisted in the flink checkpoint).
This inconsistency is resolved with the next successful checkpoint creation. 
The there pushed segment will get the same segment name assigned as the 
inconsistent segment. Thus, Pinot replaces the old with the new segment which 
prevents introducing duplicate entries.


Best regards
Mats Pörschke

Apache Pinot Sink

Reply via email to