@Till Rohrmann <trohrm...@apache.org>
You are completely right that the Atlas hook itself should not live inside
Flink. The hooks for the other projects are all implemented as part of Atlas,
and the Atlas community is ready to maintain this one as well once we have a
working version. The discussion is more about the changes we need in Flink
(if any) to make it possible to implement the Atlas hook outside of Flink.

So in theory I agree that the hook should receive job submissions and just
extract the metadata required by Atlas.

There are 2 questions here (and my initial email gives one possible
solution):

1. What is the component that receives the submission and runs the
extraction logic? If you want to keep this process out of Flink you could put
something in front of the job submission REST endpoint, but I think that
would be overkill. So I assumed that we will have to add some way of
executing code on job submission, hence my proposal of a job submission
hook (see the first sketch after this list).

2. What information do we need to extract the Atlas metadata? On job
submission we usually only have JobGraph instances, which are pretty hard to
handle compared to, for instance, a StreamGraph when it comes to getting the
source/sink UDFs. I am wondering if we need to make this easier somehow (see
the second sketch after this list).
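
For the first question, a rough sketch of what such a submission hook
interface could look like (the name and placement are illustrative, this is
not an existing Flink API):

import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.jobgraph.JobGraph;

/** Hypothetical hook, invoked by the cluster once a job has been accepted. */
public interface JobSubmissionHook {
    void onSuccessfulSubmission(JobGraph jobGraph, Configuration configuration);
}

For the second question, this is roughly how sources and sinks can be read
off a StreamGraph today (StreamGraph is an internal API, so the exact method
names may differ between Flink versions):

import org.apache.flink.streaming.api.graph.StreamGraph;
import org.apache.flink.streaming.api.graph.StreamNode;

public final class SourceSinkPrinter {

    /** Prints the operator names of all sources and sinks of the logical graph. */
    public static void print(StreamGraph streamGraph) {
        for (Integer id : streamGraph.getSourceIDs()) {
            StreamNode source = streamGraph.getStreamNode(id);
            System.out.println("source: " + source.getOperatorName());
        }
        for (Integer id : streamGraph.getSinkIDs()) {
            StreamNode sink = streamGraph.getStreamNode(id);
            System.out.println("sink: " + sink.getOperatorName());
        }
    }
}

Getting the equivalent information out of a JobGraph means digging the
operators out of the serialized vertex configurations, which is exactly the
kind of thing I would like to avoid.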

Gyula

On Wed, Feb 5, 2020 at 6:03 PM Taher Koitawala <taher...@gmail.com> wrote:

> As far as I know, Atlas entries can be created with a REST call. Can we not
> create an abstracted Flink operator that makes the REST call on job
> execution/submission?
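>
> For reference, a minimal sketch of such a REST call in Java (host,
> credentials and the entity type name are placeholders, and the Atlas v2
> entity endpoint is assumed):
>
> import java.io.OutputStream;
> import java.net.HttpURLConnection;
> import java.net.URL;
> import java.nio.charset.StandardCharsets;
> import java.util.Base64;
>
> public final class AtlasRestCall {
>     public static void main(String[] args) throws Exception {
>         // A minimal entity payload; the real attributes depend on the type definition.
>         String entityJson = "{\"entity\": {\"typeName\": \"flink_application\","
>                 + " \"attributes\": {\"qualifiedName\": \"my-job@my-cluster\", \"name\": \"my-job\"}}}";
>         URL url = new URL("http://atlas-host:21000/api/atlas/v2/entity");
>         HttpURLConnection conn = (HttpURLConnection) url.openConnection();
>         conn.setRequestMethod("POST");
>         conn.setRequestProperty("Content-Type", "application/json");
>         conn.setRequestProperty("Authorization", "Basic " + Base64.getEncoder()
>                 .encodeToString("admin:admin".getBytes(StandardCharsets.UTF_8)));
>         conn.setDoOutput(true);
>         try (OutputStream out = conn.getOutputStream()) {
>             out.write(entityJson.getBytes(StandardCharsets.UTF_8));
>         }
>         System.out.println("Atlas responded with HTTP " + conn.getResponseCode());
>     }
> }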
>
> Regards,
> Taher Koitawala
>
> On Wed, Feb 5, 2020, 10:16 PM Flavio Pompermaier <pomperma...@okkam.it> wrote:
>
> > Hi Gyula,
> > thanks for taking care of integrating Flink with Atlas (and, in the end,
> > the Egeria initiative), which is IMHO the most important part of the whole
> > Hadoop ecosystem and which, unfortunately, has been quite overlooked. I can
> > confirm that the integration with Atlas/Egeria is absolutely of great
> > interest.
> >
> > On Wed, Feb 5, 2020 at 17:12 Till Rohrmann <trohrm...@apache.org> wrote:
> >
> > > Hi Gyula,
> > >
> > > thanks for starting this discussion. Before diving into the details of how
> > > to implement this feature, I wanted to ask whether it is strictly required
> > > that the Atlas integration lives within Flink or not. Could it also work if
> > > you have a tool which receives job submissions, extracts the required
> > > information, forwards the job submission to Flink, monitors the execution
> > > result and finally publishes some information to Atlas (modulo some other
> > > steps which are missing in my description)? Having a different layer
> > > responsible for this would keep the complexity out of Flink.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Wed, Feb 5, 2020 at 12:48 PM Gyula Fóra <gyf...@apache.org> wrote:
> > >
> > > > Hi all!
> > > >
> > > > We have started some preliminary work on the Flink - Atlas integration
> > > > at Cloudera. It seems that the integration will require some new hook
> > > > interfaces at the JobGraph generation and submission phases, so I
> > > > figured I would open a discussion thread with my initial ideas to get
> > > > some early feedback.
> > > >
> > > > *Minimal background*
> > > > Very simply put, Apache Atlas is a data governance framework that
> > > > stores metadata for our data and processing logic to track ownership,
> > > > lineage etc. It is already integrated with systems like HDFS, Kafka,
> > > > Hive and many others.
> > > >
> > > > Adding Flink integration would mean that we can track the input/output
> > > > data of our Flink jobs, their owners, and how different Flink jobs are
> > > > connected to each other through the data they produce (lineage). This
> > > > seems to be a very big deal for a lot of companies :)
> > > >
> > > > *Flink - Atlas integration in a nutshell*
> > > > In order to integrate with Atlas we basically need two things:
> > > >  - Flink entity definitions
> > > >  - Flink Atlas hook
> > > >
> > > > The entity definition is the easy part. It is a JSON file that describes
> > > > the objects (entities) that we want to store for any given Flink job. As
> > > > a starter we could have a single FlinkApplication entity that has a set
> > > > of inputs and outputs. These inputs/outputs are other Atlas entities
> > > > that are already defined, such as a Kafka topic or an HBase table.
> > > >
> > > > The Flink Atlas hook will be the logic that creates the entity instance
> > > > and uploads it to Atlas when we start a new Flink job. This is the part
> > > > where we implement the core logic.
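> > > >
> > > > As a sketch, creating such an entity instance with the Atlas client model
> > > > classes could look roughly like this (the flink_application type and all
> > > > attribute names are assumptions that would have to match the registered
> > > > entity definition):
> > > >
> > > > import java.util.Collections;
> > > >
> > > > import org.apache.atlas.model.instance.AtlasEntity;
> > > > import org.apache.atlas.model.instance.AtlasObjectId;
> > > >
> > > > public final class FlinkApplicationEntityBuilder {
> > > >
> > > >     public static AtlasEntity build(String jobName) {
> > > >         // "flink_application" and its attributes are hypothetical and must
> > > >         // match whatever entity definition we register in Atlas.
> > > >         AtlasEntity app = new AtlasEntity("flink_application");
> > > >         app.setAttribute("qualifiedName", jobName + "@my-cluster");
> > > >         app.setAttribute("name", jobName);
> > > >         // Inputs/outputs reference entities Atlas already knows about,
> > > >         // e.g. a Kafka topic and an HBase table.
> > > >         app.setAttribute("inputs", Collections.singletonList(
> > > >                 new AtlasObjectId("kafka_topic", "qualifiedName", "orders@my-cluster")));
> > > >         app.setAttribute("outputs", Collections.singletonList(
> > > >                 new AtlasObjectId("hbase_table", "qualifiedName", "results@my-cluster")));
> > > >         return app;
> > > >     }
> > > > }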
> > > >
> > > > *Job submission hook*
> > > > In order to implement the Atlas hook we need a place where we can
> > > > inspect the pipeline and create and send the metadata when the job
> > > > starts. When we create the FlinkApplication entity we need to be able
> > > > to easily determine the sources and sinks (and their properties) of the
> > > > pipeline.
> > > >
> > > > Unfortunately there is no JobSubmission hook in Flink that could
> > > > execute this logic, and even if there were one there would be a mismatch
> > > > in the abstraction levels needed to implement the integration.
> > > > We could imagine a JobSubmission hook executed in the JobManager runner
> > > > like this:
> > > >
> > > > void onSuccessfulSubmission(JobGraph jobGraph, Configuration configuration);
> > > >
> > > > This is nice but the JobGraph makes it super difficult to extract the
> > > > sources and UDFs to create the metadata entity. The Atlas entity,
> > > > however, could easily be created from the StreamGraph object (used to
> > > > represent the logical flow) before the JobGraph is generated. To get
> > > > around this limitation we could add a JobGraphGeneratorHook interface:
> > > >
> > > > void preProcess(StreamGraph streamGraph);
> > > > void postProcess(JobGraph jobGraph);
> > > >
> > > > We could then generate the Atlas entity in the preProcess step and add
> > > > a job submission hook in the postProcess step that will simply send the
> > > > already baked entity.
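> > > >
> > > > A sketch of how the two hooks could fit together (JobGraphGeneratorHook is
> > > > the interface proposed above, not an existing Flink API, and the Atlas
> > > > publishing is stubbed out):
> > > >
> > > > import org.apache.flink.runtime.jobgraph.JobGraph;
> > > > import org.apache.flink.streaming.api.graph.StreamGraph;
> > > >
> > > > interface JobGraphGeneratorHook {
> > > >     void preProcess(StreamGraph streamGraph);
> > > >     void postProcess(JobGraph jobGraph);
> > > > }
> > > >
> > > > class AtlasJobGraphGeneratorHook implements JobGraphGeneratorHook {
> > > >
> > > >     private String pendingEntityJson;
> > > >
> > > >     @Override
> > > >     public void preProcess(StreamGraph streamGraph) {
> > > >         // Sources/sinks are still easy to reach here, so bake the entity now.
> > > >         pendingEntityJson = "{\"typeName\": \"flink_application\","
> > > >                 + " \"attributes\": {\"name\": \"" + streamGraph.getJobName() + "\"}}";
> > > >     }
> > > >
> > > >     @Override
> > > >     public void postProcess(JobGraph jobGraph) {
> > > >         // The entity was baked in preProcess; once the JobGraph exists we only
> > > >         // need to tag it with the job id and send it off (stubbed here).
> > > >         System.out.println("Would publish to Atlas for job "
> > > >                 + jobGraph.getJobID() + ": " + pendingEntityJson);
> > > >     }
> > > > }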
> > > >
> > > > *This kinda works but...*
> > > > The approach outlined above seems to work and we have built a POC using
> > > > it. Unfortunately it is far from nice, as it exposes non-public APIs
> > > > such as the StreamGraph. Also, it feels a bit weird to have two hooks
> > > > instead of one.
> > > >
> > > > It would be much nicer if we could somehow go back from the JobGraph to
> > > > the StreamGraph or at least have an easy way to access the source/sink
> > > > UDFs.
> > > >
> > > > What do you think?
> > > >
> > > > Cheers,
> > > > Gyula
> > > >
> > >
> >
>
