+1 As Yupeng mentioned, we at Uber are also looking into the Pinot Sink. It would be great to collaborate on this proposal.
Thanks, Sanath On Wed, Jan 6, 2021 at 9:23 AM Yupeng Fu <yup...@uber.com.invalid> wrote: > Hi Mats, > > Glad to see this interest! We at Uber are also working on a Pinot sink (for > BATCH execution), with some help from the Pinot community on abstracting > Pinot interfaces for segment writes and catalog retrieval. Perhaps we can > collaborate on this proposal/POC. > > Cheers, > > Yupeng > > > > On Wed, Jan 6, 2021 at 6:12 AM Aljoscha Krettek <aljos...@apache.org> > wrote: > > > That's good to hear. I wasn't sure because the explanation focused a lot > > on checkpoints and the details of it while with the new Sink interface > > implementers don't need to be concerned with those. And in fact, when > > the Sink is used in BATCH execution mode there will be no checkpoints. > > > > Other than that, the implementation sketch makes sense to me. I think to > > make further assessments you will probably have to work on a > > proof-of-concept. > > > > Best, > > Aljoscha > > > > On 2021/01/06 11:18, Poerschke, Mats wrote: > > >Yes, we will use the latest sink interface. > > > > > >Best, > > >Mats > > > > > >> On 6. Jan 2021, at 11:05, Aljoscha Krettek <aljos...@apache.org> > wrote: > > >> > > >> It's great to see interest in this. Where you planning to use the new > > Sink interface that we recently introduced? [1] > > >> > > >> Best, > > >> Aljoscha > > >> > > >> [1] https://s.apache.org/FLIP-143 > > >> > > >> On 2021/01/05 12:21, Poerschke, Mats wrote: > > >>> Hi all, > > >>> > > >>> we want to contribute a sink connector for Apache Pinot. The > following > > briefly describes the planned control flow. Please feel free to comment > on > > any of its aspects. > > >>> > > >>> Background > > >>> Apache Pinot is a large-scale real-time data ingestion engine working > > on data segments internally. The controller exposes an external API which > > allows posting new segments via REST call. A thereby posted segment must > > contain an id (called segment name). > > >>> > > >>> Control Flow > > >>> The Flink sink will collect data tuples locally. When creating a > > checkpoint, all those tuples are grouped into one segment which is then > > pushed to the Pinot controller. We will assign each pushed segment a > unique > > incrementing identifier. > > >>> After receiving a success response from the Pinot controller, the > > latest segment name is persisted within the Flink checkpoint. > > >>> In case we have to recover from a failure, the latest successfully > > pushed segment name can be reconstructed from the Flink checkpoint. At > this > > point the system might be in an inconsistent state. The Pinot controller > > might have already stored a newer segment (which’s name was, due to the > > failure, not persisted in the flink checkpoint). > > >>> This inconsistency is resolved with the next successful checkpoint > > creation. The there pushed segment will get the same segment name > assigned > > as the inconsistent segment. Thus, Pinot replaces the old with the new > > segment which prevents introducing duplicate entries. > > >>> > > >>> > > >>> Best regards > > >>> Mats Pörschke > > >>> > > > > > >