Re: Discussion about a Flink DataSource repository

Fabian Hueske Wed, 04 May 2016 02:32:17 -0700

Hi Flavio,

I thought a bit about your proposal. I am not sure if it is actually
necessary to integrate a central source repository into Flink. It should be
possible to offer this as an external service which is based on the
recently added TableSource interface. TableSources could be extended to be
able to serialize and deserialize their configuration to/from JSON. When
the external repository service starts, it can read the JSON fields and
instantiate and register TableSource objects. The repository could also
hold metadata about the sources and serve a (web) UI to list the available
sources. When a Flink program wants to access a data source that is
registered in the repository, it could look up the respective TableSource
object from the repository.
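
A rough sketch of such a repository, written in plain Python rather than
against the Flink API. Every name in it (CsvSource, SOURCE_TYPES,
SourceRepository, the "type"/"config" JSON layout, the example path and
metadata) is a hypothetical placeholder chosen for illustration, not part of
the TableSource interface:

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class CsvSource:
    """Hypothetical stand-in for a TableSource; not a Flink class."""
    path: str
    field_names: list

    def to_json(self) -> str:
        # Store a type name next to the configuration so the repository
        # knows which class to instantiate when it reads the entry back.
        return json.dumps({"type": "csv", "config": asdict(self)})


# Maps the "type" field of a JSON entry to the class that can rebuild it.
SOURCE_TYPES = {"csv": CsvSource}


class SourceRepository:
    """Holds source objects and their descriptive metadata, keyed by name."""

    def __init__(self):
        self._sources = {}
        self._metadata = {}

    def register_from_json(self, name, source_json, metadata=None):
        # Deserialize the configuration and instantiate the source object.
        entry = json.loads(source_json)
        source_cls = SOURCE_TYPES[entry["type"]]
        self._sources[name] = source_cls(**entry["config"])
        self._metadata[name] = metadata or {}

    def lookup(self, name):
        # What a program would call to obtain a registered source.
        return self._sources[name]

    def list_sources(self):
        # What a (web) UI would render: source names with their metadata.
        return {name: dict(meta) for name, meta in self._metadata.items()}


# Made-up example entry.
repo = SourceRepository()
stored = CsvSource("/data/customers.csv", ["name", "lastname"]).to_json()
repo.register_from_json("customers", stored, {"description": "example entry"})
source = repo.lookup("customers")
```

Keeping the type name inside the JSON entry is only one possible design; it
lets the service rebuild sources of different kinds from the same store.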


Given that an integration of metadata with Flink user functions (point 2
in your proposal) is a very special requirement, I am not sure how much
"native" support should be added to Flink. Would it be possible to add a
lineage tag to each record and ship the metadata of all sources as a
broadcast set to each operator? Then user functions could look up the
metadata from the broadcast set.
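
The lineage-tag idea can be sketched as follows, again in plain Python and
not against the Flink API. The dictionary stands in for the broadcast set,
and the source ids, certification levels, records, and function names are
made-up assumptions for illustration:

```python
# Metadata of all sources, keyed by source id. In a Flink job this would be
# shipped to each operator as a broadcast set; here it is a plain dict.
SOURCE_METADATA = {
    "crm": {"certificationLevel": 3},
    "web_form": {"certificationLevel": 1},
}


def tag(records, source_id):
    """Attach a lineage tag (the id of the originating source) to each record."""
    return [{"lineage": source_id, **record} for record in records]


def resolve(left, right, attribute, metadata):
    """Return the attribute value from the record whose source is more certified.

    On a tie, the value of the left record is kept.
    """
    def level(record):
        # Look up the source metadata through the record's lineage tag.
        return metadata[record["lineage"]]["certificationLevel"]

    winner = max(left, right, key=level)
    return winner[attribute]


# Two records for the same entity with conflicting values for "name".
crm = tag([{"id": 1, "name": "Anna"}], "crm")
web = tag([{"id": 1, "name": "Ana"}], "web_form")
name = resolve(crm[0], web[0], "name", SOURCE_METADATA)
```

Here the value from "crm" is retained because its certification level is
higher; the user function only needs the tag and the shared metadata.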

Best, Fabian

2016-04-29 12:49 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:

> Hi to all,
>
> as discussed briefly with Fabian, for our products in Okkam we need a
> central repository of DataSources processed by Flink.
> With respect to existing external catalogs, such as Hive or Confluent's
> SchemaRegistry, whose objective is to provide necessary metadata to
> read/write the registered tables, we would also need a way to access
> other general metadata (e.g. name, description, creator, creation date,
> lastUpdate date, processedRecords, certificationLevel of provided data,
> provenance, language, etc).
>
> This integration has 2 main goals:
>
>    1. In a UI: to enable the user to choose (or even create) a datasource
>    to process with some task (e.g. quality assessment) and then see its
>    metadata (name, description, creator user, etc)
>    2. During a Flink job: when 2 datasources get joined and we have
>    multiple values for an attribute (e.g. name or lastname) we can access the
>    datasource metadata to decide which value to retain (e.g. the one coming
>    from the most authoritative/certified source for that attribute)
>
> We also think that this could be of interest for projects like Apache
> Zeppelin or Nifi enabling them to suggest to the user the sources to start
> from.
>
> Do you think it makes sense to think about designing such a module for
> Flink?
>
> Best,
> Flavio
>
