[ https://issues.apache.org/jira/browse/FLINK-8240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16295353#comment-16295353 ]
Timo Walther commented on FLINK-8240:
-------------------------------------

Hi everyone, I think we don't need a design document for this, but it would be great to hear some opinions. I introduced descriptors that allow describing connectors, encodings, and time attributes. My current API design looks like this:

{code}
tableEnv
  .from(
    FileSystem()
      .path("/path/to/csv"))
  .withEncoding(
    CSV()
      .field("myfield", Types.STRING)
      .field("myfield2", Types.INT)
      .quoteCharacter(';')
      .fieldDelimiter("#")
      .lineDelimiter("\r\n")
      .commentPrefix("%%")
      .ignoreFirstLine()
      .ignoreParseErrors())
  .withRowtime(
    Rowtime()
      .onField("rowtime")
      .withTimestampFromDataStream()
      .withWatermarkFromDataStream())
  .withProctime(
    Proctime()
      .onField("myproctime"))
  .toTableSource()
{code}

These descriptors are converted into pure key-value properties, such as:

{code}
"connector.filesystem.path" -> "/myfile"
"encoding.csv.fields.0.name" -> "field1",
"encoding.csv.fields.0.type" -> "STRING",
"encoding.csv.fields.1.name" -> "field2",
"encoding.csv.fields.1.type" -> "TIMESTAMP",
"encoding.csv.fields.2.name" -> "field3",
"encoding.csv.fields.2.type" -> "ANY(java.lang.Class)",
"encoding.csv.fields.3.name" -> "field4",
"encoding.csv.fields.3.type" -> "ROW(test INT, row VARCHAR)",
"encoding.csv.line-delimiter" -> "^"
{code}

The properties are fully expressed as strings, which also allows storing them in configuration files; that might be interesting for FLINK-7594. The question is how we want to translate these properties into actual table sources. Or more precisely: how do we want to supply the converters? Should they be part of the {{TableSource}} interface? Or should table sources be annotated with some factory class? Right now we have similar functionality for external catalogs, but it is too specific and does not consider encodings or time attributes. Furthermore, it would be better to use Java {{ServiceLoader}}s instead of classpath scanning; this approach is also used for Flink's file systems. So my idea would be to have a class {{TableFactory}} that declares a connector (e.g. "kafka_0.10") and the supported encodings (e.g. "csv", "avro"), similar to FLINK-7643. All built-in table sources would need to provide such a factory; a rough sketch follows below. What do you think? [~fhueske] [~jark] [~wheat9] [~ykt836]
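To make this a bit more concrete, here is a minimal sketch of what such a factory and its {{ServiceLoader}}-based discovery could look like. All names, signatures, and the lookup logic are just assumptions for illustration, not a proposal for the final API:

{code}
// Rough sketch only -- all names and signatures are assumptions, not the final API.
import java.util.ServiceLoader
import scala.collection.JavaConverters._
import org.apache.flink.table.sources.TableSource

trait TableFactory {

  // Connector identifier this factory handles, e.g. "kafka_0.10" or "filesystem".
  def connectorType: String

  // Encodings supported by this connector, e.g. "csv", "avro".
  def supportedEncodings: java.util.List[String]

  // Instantiates a TableSource from the normalized key-value properties.
  def createTableSource(properties: java.util.Map[String, String]): TableSource[_]
}

object TableFactoryService {

  // Factories register themselves via META-INF/services entries and are discovered
  // with a plain Java ServiceLoader instead of classpath scanning.
  def findFactory(connector: String): TableFactory =
    ServiceLoader.load(classOf[TableFactory]).asScala
      .find(_.connectorType == connector)
      .getOrElse(throw new IllegalArgumentException(
        s"No TableFactory found for connector '$connector'"))
}
{code}

A Kafka 0.10 table source, for example, would then ship a factory declaring the connector "kafka_0.10" and its supported encodings, and the string properties above would be everything needed to instantiate it.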
> Create unified interfaces to configure and instantiate TableSources
> --------------------------------------------------------------------
>
>                 Key: FLINK-8240
>                 URL: https://issues.apache.org/jira/browse/FLINK-8240
>             Project: Flink
>          Issue Type: New Feature
>          Components: Table API & SQL
>            Reporter: Timo Walther
>            Assignee: Timo Walther
>
> At the moment every table source has different ways for configuration and
> instantiation. Some table sources are tailored to a specific encoding (e.g.,
> {{KafkaAvroTableSource}}, {{KafkaJsonTableSource}}) or only support one
> encoding for reading (e.g., {{CsvTableSource}}). Each of them might implement
> a builder or support table source converters for external catalogs.
> The table sources should have a unified interface for discovery, defining
> common properties, and instantiation. The {{TableSourceConverters}} provide
> similar functionality but use an external catalog. We might generalize this
> interface.
> In general a table source declaration depends on the following parts:
> {code}
> - Source
>   - Type (e.g. Kafka, Custom)
>   - Properties (e.g. topic, connection info)
> - Encoding
>   - Type (e.g. Avro, JSON, CSV)
>   - Schema (e.g. Avro class, JSON field names/types)
> - Rowtime descriptor/Proctime
>   - Watermark strategy and Watermark properties
>   - Time attribute info
> - Bucketization
> {code}
> This issue needs a design document before implementation. Any discussion is
> very welcome.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)