Hi all,

as discussed briefly with Fabian, for our products in Okkam we need a
central repository of DataSources processed by Flink.
In contrast to existing external catalogs, such as Hive or Confluent's
Schema Registry, whose objective is to provide the metadata necessary to
read/write the registered tables, we would also need a way to access
other, more general metadata (e.g. name, description, creator, creation
date, lastUpdate date, processedRecords, certificationLevel of the
provided data, provenance, language, etc.).
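
To make the idea a bit more concrete, here is a rough sketch of the kind
of metadata bean such a repository could expose for a registered
datasource; all names and types below are just placeholders for the
discussion, nothing here is an existing Flink API:

public class DataSourceMetadata {
    private String name;
    private String description;
    private String creator;
    private long creationDate;      // epoch millis
    private long lastUpdate;        // epoch millis
    private long processedRecords;
    private int certificationLevel; // e.g. 0 = unverified .. 5 = fully certified
    private String provenance;
    private String language;        // e.g. an ISO 639-1 code

    // getters/setters omitted for brevity
}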

This integration has 2 main goals:

   1. In a UI: to enable the user to choose (or even create) a datasource
   to process with some task (e.g. quality assessment) and then see its
   metadata (name, description, creator, etc.)
   2. During a Flink job: when two datasources get joined and we have
   multiple values for an attribute (e.g. name or lastname), we can access
   the datasource metadata to decide which value to retain (e.g. the one
   coming from the most authoritative/certified source for that attribute);
   see the sketch after this list.
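
As a rough illustration of point 2 (again, nothing below is an existing
Flink or catalog API, all names are made up for the discussion), the
resolution logic inside a join could look roughly like this:

// Hypothetical client of the proposed metadata repository.
interface MetadataCatalog {
    /** Certification level of a source for a given attribute; higher = more trusted. */
    int certificationLevel(String sourceName, String attribute);
}

class ConflictResolver {
    private final MetadataCatalog catalog;

    ConflictResolver(MetadataCatalog catalog) {
        this.catalog = catalog;
    }

    /** When a join yields two values for the same attribute, keep the one
     *  coming from the most authoritative/certified source. */
    String resolve(String attribute, String valueA, String sourceA,
                   String valueB, String sourceB) {
        return catalog.certificationLevel(sourceA, attribute)
                >= catalog.certificationLevel(sourceB, attribute)
                ? valueA : valueB;
    }
}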

We also think that this could be of interest for projects like Apache
Zeppelin or NiFi, enabling them to suggest to the user which sources to
start from.

Do you think it makes sense to think about designing such a module for
Flink?

Best,
Flavio
