Hi Mats,

Glad to see this interest! We at Uber are also working on a Pinot sink (for BATCH execution), with some help from the Pinot community on abstracting Pinot interfaces for segment writes and catalog retrieval. Perhaps we can collaborate on this proposal/POC.
Cheers,
Yupeng

On Wed, Jan 6, 2021 at 6:12 AM Aljoscha Krettek <aljos...@apache.org> wrote:

> That's good to hear. I wasn't sure because the explanation focused a lot on
> checkpoints and their details, while with the new Sink interface implementers
> don't need to be concerned with those. And in fact, when the Sink is used in
> BATCH execution mode there will be no checkpoints.
>
> Other than that, the implementation sketch makes sense to me. I think to make
> further assessments you will probably have to work on a proof-of-concept.
>
> Best,
> Aljoscha
>
> On 2021/01/06 11:18, Poerschke, Mats wrote:
> > Yes, we will use the latest sink interface.
> >
> > Best,
> > Mats
> >
> >> On 6. Jan 2021, at 11:05, Aljoscha Krettek <aljos...@apache.org> wrote:
> >>
> >> It's great to see interest in this. Were you planning to use the new Sink
> >> interface that we recently introduced? [1]
> >>
> >> Best,
> >> Aljoscha
> >>
> >> [1] https://s.apache.org/FLIP-143
> >>
> >> On 2021/01/05 12:21, Poerschke, Mats wrote:
> >>> Hi all,
> >>>
> >>> we want to contribute a sink connector for Apache Pinot. The following
> >>> briefly describes the planned control flow. Please feel free to comment
> >>> on any of its aspects.
> >>>
> >>> Background
> >>> Apache Pinot is a large-scale real-time distributed OLAP datastore that
> >>> internally organizes data into segments. The controller exposes an
> >>> external API which allows posting new segments via REST call. A segment
> >>> posted this way must carry an id (called the segment name).
> >>>
> >>> Control Flow
> >>> The Flink sink will collect data tuples locally. When creating a
> >>> checkpoint, all those tuples are grouped into one segment, which is then
> >>> pushed to the Pinot controller. We will assign each pushed segment a
> >>> unique, incrementing identifier. After receiving a success response from
> >>> the Pinot controller, the latest segment name is persisted within the
> >>> Flink checkpoint.
> >>> In case we have to recover from a failure, the latest successfully pushed
> >>> segment name can be reconstructed from the Flink checkpoint. At this
> >>> point the system might be in an inconsistent state: the Pinot controller
> >>> might already have stored a newer segment whose name was, due to the
> >>> failure, not persisted in the Flink checkpoint.
> >>> This inconsistency is resolved with the next successful checkpoint. The
> >>> segment pushed at that point is assigned the same segment name as the
> >>> inconsistent one, so Pinot replaces the old segment with the new one,
> >>> which prevents duplicate entries.
> >>>
> >>> Best regards
> >>> Mats Pörschke
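To give a concrete feel for the idempotent segment-naming idea in the proposal, a minimal standalone sketch could look like the following. All names here (PinotSegmentWriterSketch, PinotControllerClient, uploadSegment, ...) are hypothetical placeholders; an actual connector would build on the unified Sink API from FLIP-143 and the Pinot controller's segment upload REST endpoint.

import java.util.ArrayList;
import java.util.List;

// Standalone sketch of the segment-naming scheme described in the thread.
// The interface below is a hypothetical stand-in, not a real Pinot client API.
public class PinotSegmentWriterSketch {

    // Hypothetical REST client talking to the Pinot controller.
    interface PinotControllerClient {
        // Uploads a segment; Pinot replaces an existing segment with the same name.
        void uploadSegment(String segmentName, List<String> rows) throws Exception;
    }

    private final PinotControllerClient controller;
    private final String tableName;
    private final List<String> buffer = new ArrayList<>();

    // Monotonically increasing counter; the last successfully pushed value is
    // what would be stored as writer state in the Flink checkpoint.
    private long segmentCounter;

    PinotSegmentWriterSketch(PinotControllerClient controller,
                             String tableName,
                             long restoredSegmentCounter) {
        this.controller = controller;
        this.tableName = tableName;
        // On recovery we restart from the last checkpointed counter. If a newer
        // segment was pushed after that checkpoint, the next push below reuses
        // its name and overwrites it, so no duplicates are introduced.
        this.segmentCounter = restoredSegmentCounter;
    }

    // Collect rows locally between checkpoints.
    void write(String row) {
        buffer.add(row);
    }

    // Called on checkpoint: group all buffered rows into one segment, push it
    // under a deterministic name, and only then advance the counter that gets
    // persisted in the checkpoint.
    long snapshotAndPush() throws Exception {
        String segmentName = tableName + "_segment_" + (segmentCounter + 1);
        controller.uploadSegment(segmentName, new ArrayList<>(buffer));
        buffer.clear();
        segmentCounter++; // persisted in the Flink checkpoint after success
        return segmentCounter;
    }
}

Because the segment name is derived deterministically from checkpointed state, a re-push after recovery overwrites rather than duplicates the segment, which is what gives the de-duplication behavior described in the proposal.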