Hi all, we want to contribute a sink connector for Apache Pinot. The following briefly describes the planned control flow. Please feel free to comment on any of its aspects.
Background Apache Pinot is a large-scale real-time data ingestion engine working on data segments internally. The controller exposes an external API which allows posting new segments via REST call. A thereby posted segment must contain an id (called segment name). Control Flow The Flink sink will collect data tuples locally. When creating a checkpoint, all those tuples are grouped into one segment which is then pushed to the Pinot controller. We will assign each pushed segment a unique incrementing identifier. After receiving a success response from the Pinot controller, the latest segment name is persisted within the Flink checkpoint. In case we have to recover from a failure, the latest successfully pushed segment name can be reconstructed from the Flink checkpoint. At this point the system might be in an inconsistent state. The Pinot controller might have already stored a newer segment (which’s name was, due to the failure, not persisted in the flink checkpoint). This inconsistency is resolved with the next successful checkpoint creation. The there pushed segment will get the same segment name assigned as the inconsistent segment. Thus, Pinot replaces the old with the new segment which prevents introducing duplicate entries. Best regards Mats Pörschke