Hi Mats,

Glad to see this interest! We at Uber are also working on a Pinot sink (for BATCH execution), with some help from the Pinot community on abstracting Pinot interfaces for segment writes and catalog retrieval. Perhaps we can collaborate on this proposal/POC.
Cheers,
Yupeng

On Wed, Jan 6, 2021 at 6:12 AM Aljoscha Krettek <aljos...@apache.org> wrote:

> That's good to hear. I wasn't sure because the explanation focused a lot on
> checkpoints and their details, while with the new Sink interface implementers
> don't need to be concerned with those. And in fact, when the Sink is used in
> BATCH execution mode there will be no checkpoints.
>
> Other than that, the implementation sketch makes sense to me. I think to make
> further assessments you will probably have to work on a proof-of-concept.
>
> Best,
> Aljoscha
>
> On 2021/01/06 11:18, Poerschke, Mats wrote:
> > Yes, we will use the latest sink interface.
> >
> > Best,
> > Mats
> >
> >> On 6. Jan 2021, at 11:05, Aljoscha Krettek <aljos...@apache.org> wrote:
> >>
> >> It's great to see interest in this. Were you planning to use the new Sink
> >> interface that we recently introduced? [1]
> >>
> >> Best,
> >> Aljoscha
> >>
> >> [1] https://s.apache.org/FLIP-143
> >>
> >> On 2021/01/05 12:21, Poerschke, Mats wrote:
> >>> Hi all,
> >>>
> >>> we want to contribute a sink connector for Apache Pinot. The following
> >>> briefly describes the planned control flow. Please feel free to comment
> >>> on any of its aspects.
> >>>
> >>> Background
> >>> Apache Pinot is a large-scale real-time distributed OLAP datastore that
> >>> internally organizes data into segments. The controller exposes an
> >>> external API which allows posting new segments via REST call. A segment
> >>> posted this way must carry an id (called the segment name).
> >>>
> >>> Control Flow
> >>> The Flink sink will collect data tuples locally. When creating a
> >>> checkpoint, all those tuples are grouped into one segment, which is then
> >>> pushed to the Pinot controller. We will assign each pushed segment a
> >>> unique, incrementing identifier. After receiving a success response from
> >>> the Pinot controller, the latest segment name is persisted within the
> >>> Flink checkpoint.
> >>> In case we have to recover from a failure, the latest successfully pushed
> >>> segment name can be reconstructed from the Flink checkpoint. At this
> >>> point the system might be in an inconsistent state: the Pinot controller
> >>> might already have stored a newer segment whose name was, due to the
> >>> failure, not persisted in the Flink checkpoint.
> >>> This inconsistency is resolved with the next successful checkpoint. The
> >>> segment pushed at that point is assigned the same segment name as the
> >>> inconsistent one, so Pinot replaces the old segment with the new one,
> >>> which prevents duplicate entries.
> >>>
> >>> Best regards
> >>> Mats Pörschke
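To give a concrete feel for the idempotent segment-naming idea in the proposal, a minimal standalone sketch could look like the following. All names here (PinotSegmentWriterSketch, PinotControllerClient, uploadSegment, ...) are hypothetical placeholders; an actual connector would build on the unified Sink API from FLIP-143 and the Pinot controller's segment upload REST endpoint.

import java.util.ArrayList;
import java.util.List;

// Standalone sketch of the segment-naming scheme described in the thread.
// The interface below is a hypothetical stand-in, not a real Pinot client API.
public class PinotSegmentWriterSketch {

    // Hypothetical REST client talking to the Pinot controller.
    interface PinotControllerClient {
        // Uploads a segment; Pinot replaces an existing segment with the same name.
        void uploadSegment(String segmentName, List<String> rows) throws Exception;
    }

    private final PinotControllerClient controller;
    private final String tableName;
    private final List<String> buffer = new ArrayList<>();

    // Monotonically increasing counter; the last successfully pushed value is
    // what would be stored as writer state in the Flink checkpoint.
    private long segmentCounter;

    PinotSegmentWriterSketch(PinotControllerClient controller,
                             String tableName,
                             long restoredSegmentCounter) {
        this.controller = controller;
        this.tableName = tableName;
        // On recovery we restart from the last checkpointed counter. If a newer
        // segment was pushed after that checkpoint, the next push below reuses
        // its name and overwrites it, so no duplicates are introduced.
        this.segmentCounter = restoredSegmentCounter;
    }

    // Collect rows locally between checkpoints.
    void write(String row) {
        buffer.add(row);
    }

    // Called on checkpoint: group all buffered rows into one segment, push it
    // under a deterministic name, and only then advance the counter that gets
    // persisted in the checkpoint.
    long snapshotAndPush() throws Exception {
        String segmentName = tableName + "_segment_" + (segmentCounter + 1);
        controller.uploadSegment(segmentName, new ArrayList<>(buffer));
        buffer.clear();
        segmentCounter++; // persisted in the Flink checkpoint after success
        return segmentCounter;
    }
}

Because the segment name is derived deterministically from checkpointed state, a re-push after recovery overwrites rather than duplicates the segment, which is what gives the de-duplication behavior described in the proposal.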