Hi Fabian, thanks for your detailed answer, as usual ;) I think an external service is fine; actually, I wasn't aware of the TableSource interface. As you said, a utility to serialize and deserialize TableSources would be very helpful and would make this much easier. However, registering metadata for a table is a very common task. Wouldn't it be useful for other Flink-related projects (I was thinking of NiFi, for example) to define a common minimal set of (optional) metadata to display in a UI for a TableSource (like name, description, creationDate, creator, field aliases)?
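Just to make the idea a bit more concrete, this is roughly the kind of contract I have in mind. It is only a sketch: none of these names exist in Flink today, they are made up purely for illustration.

```java
// Purely a sketch: a hypothetical, optional interface that a TableSource
// implementation could also expose, so that external tools (a repository UI,
// NiFi, Zeppelin, ...) can show a common minimal set of metadata.
import java.util.Date;
import java.util.Map;

public interface DescribableTableSource {

    /** Human-readable name to show in a catalog UI. */
    String getName();

    /** Free-text description of what the source contains. */
    String getDescription();

    /** Who registered/created the source. */
    String getCreator();

    /** When the source was registered. */
    Date getCreationDate();

    /** Mapping from physical field names to display aliases (may be empty). */
    Map<String, String> getFieldAliases();
}
```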
About point 2, I think that dataset broadcasting or closure variables are useful when you write a program, but not when you try to "compose" one out of reusable UDFs (using a script, as in Pig). Of course, the worst-case scenario for us (and what we do right now) is to connect to our repository from within rich operators, but I thought it could be easy to define a link from operators to the TableEnvironment, from there to the TableSource (using the lineage tag/source-id you mentioned) and, finally, to its metadata. I don't know whether this is specific only to us; I just wanted to share our needs and see if the Table API development could benefit from them. (I've put a rough sketch of how a UDF could look up such metadata from a broadcast set at the bottom of this mail.)

Best,
Flavio

On Wed, May 4, 2016 at 10:35 AM, Fabian Hueske <fhue...@gmail.com> wrote:

> Hi Flavio,
>
> I thought a bit about your proposal. I am not sure if it is actually
> necessary to integrate a central source repository into Flink. It should be
> possible to offer this as an external service which is based on the
> recently added TableSource interface. TableSources could be extended to be
> able to serialize and deserialize their configuration to/from JSON. When
> the external repository service starts, it can read the JSON and
> instantiate and register the TableSource objects. The repository could also
> hold metadata about the sources and serve a (web) UI to list available
> sources. When a Flink program wants to access a data source which is
> registered in the repository, it could look up the respective TableSource
> object from the repository.
>
> Given that an integration of metadata with Flink user functions (point 2
> in your proposal) is a very special requirement, I am not sure how much
> "native" support should be added to Flink. Would it be possible to add a
> lineage tag to each record and ship the metadata of all sources as a
> broadcast set to each operator? Then user functions could look up the
> metadata from the broadcast set.
>
> Best, Fabian
>
> 2016-04-29 12:49 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:
>
>> Hi to all,
>>
>> as discussed briefly with Fabian, for our products at Okkam we need a
>> central repository of DataSources processed by Flink.
>> With respect to existing external catalogs, such as Hive or Confluent's
>> SchemaRegistry, whose objective is to provide the metadata necessary to
>> read/write the registered tables, we would also need a way to access
>> other general metadata (e.g. name, description, creator, creation date,
>> lastUpdate date, processedRecords, certificationLevel of the provided data,
>> provenance, language, etc.).
>>
>> This integration has 2 main goals:
>>
>> 1. In a UI: to enable the user to choose (or even create) a
>> data source to process with some task (e.g. quality assessment) and then see
>> its metadata (name, description, creator user, etc.)
>> 2. During a Flink job: when 2 data sources get joined and we have
>> multiple values for an attribute (e.g. name or lastname), we can access the
>> data source metadata to decide which value to retain (e.g. the one coming
>> from the most authoritative/certified source for that attribute)
>>
>> We also think that this could be of interest for projects like Apache
>> Zeppelin or NiFi, enabling them to suggest to the user the sources to start
>> from.
>>
>> Do you think it makes sense to think about designing such a module for
>> Flink?
>>
>> Best,
>> Flavio
>>
>
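P.S. To make the broadcast-set idea from your mail a bit more concrete, this is roughly how I imagine a UDF could look up source metadata by lineage tag. It is only a sketch under my own assumptions: SourceMetadata, the "sourceMetadata" broadcast set name and the record layout are all made up for illustration; in reality the metadata would come from the repository, not from fromElements.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;

public class MetadataLookupSketch {

    // Hypothetical POJO holding the per-source metadata served by the repository.
    public static class SourceMetadata {
        public String sourceId;            // the lineage tag of the source
        public String name;
        public String certificationLevel;
    }

    // Records are assumed to carry a lineage tag in f0 and a value in f1.
    public static class EnrichWithMetadata
            extends RichMapFunction<Tuple2<String, String>, String> {

        private final Map<String, SourceMetadata> metadataById = new HashMap<>();

        @Override
        public void open(Configuration parameters) throws Exception {
            // Fetch the broadcast metadata set and index it by lineage tag.
            List<SourceMetadata> metadata =
                getRuntimeContext().getBroadcastVariable("sourceMetadata");
            for (SourceMetadata m : metadata) {
                metadataById.put(m.sourceId, m);
            }
        }

        @Override
        public String map(Tuple2<String, String> record) {
            // E.g. prefer the value coming from the most certified source.
            SourceMetadata m = metadataById.get(record.f0);
            return record.f1 + " (from " + (m != null ? m.name : "unknown") + ")";
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // In reality both sets would come from the repository / TableSources.
        DataSet<Tuple2<String, String>> records =
            env.fromElements(Tuple2.of("src-1", "John"), Tuple2.of("src-2", "Jon"));
        DataSet<SourceMetadata> metadata = env.fromElements(new SourceMetadata());

        records.map(new EnrichWithMetadata())
               .withBroadcastSet(metadata, "sourceMetadata")
               .print();
    }
}
```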