[jira] [Commented] (FLINK-31275) Flink supports reporting and storage of source/sink tables relationship

Maciej Obuchowski (Jira) Thu, 09 Nov 2023 06:48:05 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-31275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17784486#comment-17784486
 ]


Maciej Obuchowski commented on FLINK-31275:
-------------------------------------------

> So, I think our current point of divergence is which level of abstraction the 
> user needs to perceive.

I would differentiate between the "end user", who just writes job code, whether 
in DataStream or SQL, the listener developer, and the connector developer. 
Ideally, for me, the abstraction level for an end user who works only on 
job-level code is that they would not need to do anything besides configuring 
the listener and enjoying the lineage graph in their preferred lineage backend.

> In the current FLIP, for DataStream jobs, listener developers need to 
> identify whether the `LineageVertex` is a `KafkaSourceLineageVertex` or a 
> `JdbcLineageVertex`. You mean we need to define another layer, such as the 
> `DataSetConfig` interface, and then the listener developer can identify 
> whether it is a `KafkaDataSetConfig` or a `JdbcDataSetConfig`, right?

For the listener developer, I would argue that for transmitting basic lineage - 
data source, dataset names, possibly schema and column-level lineage - the 
developer should be able to get this data using a basic interface built into 
this FLIP. So, basic support would mean just recognizing `DataSetConfig` (or 
having this data in the basic `LineageVertex`), without any classes that 
strongly tie the listener to particular connectors. This is especially 
important for authors of generic (not only in-house) listeners, like 
OpenLineageListener or perhaps DatahubListener, that would like to support 
lineage returned from custom connectors.

The connector developer should implement this basic interface, and then all 
listener implementations would be able to understand the gathered lineage 
without any knowledge of that connector.

Basically, instead of an N x M problem, where there are N connectors and M 
listeners and every listener has to have specific code for each connector, we 
should have a single intermediate interface, which would save everyone's time.
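
To illustrate the single intermediate interface, here is a minimal sketch. It 
is not the FLIP-314 API: `DataSetConfig` and `LineageVertex` only borrow the 
names used in this discussion, their methods are assumptions, and 
`SimpleDataSet` and `GenericLineageListener` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical shapes for illustration only; not the actual FLIP-314 interfaces.

/** Connector-agnostic description of a dataset (assumed methods). */
interface DataSetConfig {
    String namespace(); // where the dataset lives, e.g. "kafka://broker:9092"
    String name();      // dataset name, e.g. a topic or a table
}

/** A lineage vertex that exposes its datasets through the common interface. */
interface LineageVertex {
    List<DataSetConfig> datasets();
}

/** Plain immutable implementation that any connector could return. */
final class SimpleDataSet implements DataSetConfig {
    private final String namespace;
    private final String name;

    SimpleDataSet(String namespace, String name) {
        this.namespace = namespace;
        this.name = name;
    }

    @Override
    public String namespace() {
        return namespace;
    }

    @Override
    public String name() {
        return name;
    }
}

/** Generic listener: it never checks for connector-specific vertex types. */
final class GenericLineageListener {
    private final List<String> reported = new ArrayList<>();

    /** Records "namespace/name" for every dataset of the vertex. */
    void onVertex(LineageVertex vertex) {
        for (DataSetConfig dataSet : vertex.datasets()) {
            reported.add(dataSet.namespace() + "/" + dataSet.name());
        }
    }

    List<String> reported() {
        return reported;
    }
}
```

With this shape, each of the N connectors implements the interface once and 
each of the M listeners consumes it once, so the work is N + M rather than 
N x M.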

Then, it would be best if there was a standard way for connectors to extend the 
returned data structure. This could be inheritance, as the FLIP suggests, but I 
think a better, though maybe less type-safe, way would be to provide something 
like `Map<String, Facet>`, where `Facet` is just a self-contained, atomic piece 
of extension metadata - things like information about the output storage 
system, connector name and version, or perhaps some metrics about job 
execution - it's up to the connector developer. I believe it's better because 
not knowing a particular `LineageVertex` subtype doesn't prevent you from 
getting lineage.
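
A rough sketch of the `Map<String, Facet>` idea, under stated assumptions: 
only the `Map<String, Facet>` shape comes from the comment above, while 
`FacetedDataSet`, `ConnectorInfoFacet`, and all method names are hypothetical 
and not part of FLIP-314.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical types for illustration only; not part of FLIP-314.

/** Marker for one self-contained, atomic piece of extension metadata. */
interface Facet {}

/** Example facet a connector might attach: its own name and version. */
final class ConnectorInfoFacet implements Facet {
    final String connectorName;
    final String version;

    ConnectorInfoFacet(String connectorName, String version) {
        this.connectorName = connectorName;
        this.version = version;
    }
}

/** Dataset with a fixed basic part (name) and an open-ended facet map. */
final class FacetedDataSet {
    private final String name;
    private final Map<String, Facet> facets = new LinkedHashMap<>();

    FacetedDataSet(String name) {
        this.name = name;
    }

    /** Attaches a facet under the given key and returns this for chaining. */
    FacetedDataSet withFacet(String key, Facet facet) {
        facets.put(key, facet);
        return this;
    }

    String name() {
        return name;
    }

    /** Read-only view, so a listener can forward facets it does not know. */
    Map<String, Facet> facets() {
        return Collections.unmodifiableMap(facets);
    }

    /** Typed lookup: empty if the key is missing or the type differs. */
    <T extends Facet> Optional<T> facet(String key, Class<T> type) {
        Facet facet = facets.get(key);
        return type.isInstance(facet) ? Optional.of(type.cast(facet)) : Optional.empty();
    }
}
```

A listener that does not recognize a facet key can ignore it or pass it 
through, and still reads `name()`; the cost is that the typed lookup is checked 
at runtime rather than by the compiler.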

So yes, a good comparison is the proposed `TableLineageVertex` - I would just 
extend this concept to DataStream jobs and provide (optionally?) more metadata, 
with a slightly different interface for extension.

 

I want to add that despite some disagreements on this interface, I respect the 
work you've done on this topic [~zjureel] and I believe even without 
acknowledging my points, the interface is a big step forward for better 
observability of Flink jobs.

> Flink supports reporting and storage of source/sink tables relationship
> -----------------------------------------------------------------------
>
>                 Key: FLINK-31275
>                 URL: https://issues.apache.org/jira/browse/FLINK-31275
>             Project: Flink
>          Issue Type: Improvement
>          Components: Table SQL / Planner
>    Affects Versions: 1.18.0
>            Reporter: Fang Yong
>            Assignee: Fang Yong
>            Priority: Major
>
> FLIP-314 has been accepted 
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-314%3A+Support+Customized+Job+Lineage+Listener



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
