[jira] [Commented] (KAFKA-4353) Add semantic types to Kafka Connect

Ewen Cheslack-Postava (JIRA) Mon, 07 Nov 2016 15:46:35 -0800

    [ 
https://issues.apache.org/jira/browse/KAFKA-4353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15645840#comment-15645840
 ]


Ewen Cheslack-Postava commented on KAFKA-4353:
----------------------------------------------

[~rhauch] Some of these make sense to me, others don't as much. UUID is an 
example that I think most programming languages have as a built-in now, so 
probably makes more sense as a native type (although interestingly, I would 
have represented it as bytes, not in string form). JSON might be a good example 
of the opposite, where if you're really intent on not passing it through 
Connect (and it'd be painful for every Converter to have to also support JSON), 
then I agree just naming the type should be enough.
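
To make the bytes-vs-string point concrete, here is a minimal sketch that uses only the JDK (no Connect APIs). The class name UuidEncodings and the sample UUID are made up for illustration; it just compares the size of the two encodings and checks that the byte form round-trips.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical helper, not part of Kafka Connect.
class UuidEncodings {
    // Pack the two 64-bit halves of the UUID into 16 bytes (big-endian).
    static byte[] toBytes(UUID uuid) {
        return ByteBuffer.allocate(16)
                .putLong(uuid.getMostSignificantBits())
                .putLong(uuid.getLeastSignificantBits())
                .array();
    }

    // Rebuild the UUID from the 16-byte form produced by toBytes.
    static UUID fromBytes(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        long most = buf.getLong();
        long least = buf.getLong();
        return new UUID(most, least);
    }

    public static void main(String[] args) {
        UUID id = UUID.fromString("123e4567-e89b-12d3-a456-426614174000");
        byte[] asBytes = toBytes(id);
        byte[] asString = id.toString().getBytes(StandardCharsets.UTF_8);
        System.out.println("bytes form:  " + asBytes.length + " bytes");
        System.out.println("string form: " + asString.length + " bytes");
        System.out.println("round trip ok: " + id.equals(fromBytes(asBytes)));
    }
}
```

The canonical string form is 36 characters against 16 raw bytes, which is the main reason I'd lean toward bytes for a native type.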

There's a bit more to my concern around a large # of logical types than just 
Converters having to support them. The good thing w/ Converters is that there 
are bound to be relatively few of them, so while adding more types is annoying, 
it's not the end of the world. But if there are 40 specialized types, do we 
actually think connectors are commonly going to be able to do something useful 
with them? I just worry about having 15 different types for time since most 
systems in practice only have a couple (the fact that you're looking at CDC is 
probably why you're seeing a lot more, but there it doesn't look to me like 
there's actually a lot of overlap).

I think this is just a matter of impedance mismatch between different systems 
and how far we think it makes sense to bend over backwards to preserve as much 
info as possible vs where reasonable compromises can be made that make the 
story for Converter/Connector developers sane (and, frankly, for users, since 
once the data exits Connect, they presumably need to understand all the types 
that can be emitted as well).

I think the idea of semantic types makes sense -- we wanted to be able to name 
types for exactly this reason (beyond even these close-to-primitive types). You 
can of course do this already with your own names; I think you're just trying 
to get coordination between source and sink connectors (and maybe other 
applications if they maintain & know to look at the schema name) since you'd 
prefer not to do this with debezium-specific names? Will all of the ones you 
listed actually make sense for applications? Take MicroTime vs NanoTime as an 
example -- they end up eating up the same storage anyway, so would it make 
sense to just do it all as NanoTime (whereas MilliTimestamp and MicroTimestamp 
cover different possible ranges of time)?
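
The arithmetic behind that can be sketched as follows. This is plain JDK code, not a Connect API; the class and method names (TimeRanges, fitsInInt32, yearsCovered) are invented for the example, and the year figure is a rough estimate that ignores leap years.

```java
// Hypothetical helper for checking the storage and range claims; not Kafka Connect code.
class TimeRanges {
    static final long MILLIS_PER_DAY = 24L * 60 * 60 * 1000;
    static final long MICROS_PER_DAY = MILLIS_PER_DAY * 1000;
    static final long NANOS_PER_DAY = MICROS_PER_DAY * 1000;

    // True if the value can be stored in a signed 32-bit integer (INT32).
    static boolean fitsInInt32(long value) {
        return value >= Integer.MIN_VALUE && value <= Integer.MAX_VALUE;
    }

    // Rough number of 365-day years after the epoch that a signed 64-bit
    // count of the given unit can represent.
    static long yearsCovered(long unitsPerDay) {
        return Long.MAX_VALUE / unitsPerDay / 365;
    }

    public static void main(String[] args) {
        // Time of day: only the millisecond count fits in INT32.
        System.out.println("MilliTime fits INT32: " + fitsInInt32(MILLIS_PER_DAY));
        System.out.println("MicroTime fits INT32: " + fitsInInt32(MICROS_PER_DAY));
        System.out.println("NanoTime max value:   " + NANOS_PER_DAY + " (fits INT64)");

        // Timestamps: finer units trade range for precision.
        System.out.println("MilliTimestamp years: ~" + yearsCovered(MILLIS_PER_DAY));
        System.out.println("MicroTimestamp years: ~" + yearsCovered(MICROS_PER_DAY));
    }
}
```

So MicroTime and NanoTime both need INT64 and a day of nanoseconds fits comfortably, while for timestamps the choice of unit changes the representable range by a factor of 1000.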

It might also make sense to try to get some feedback from the community as to 
which of these they'd use (and which might be missing, including logical 
types). It's a lot more compelling to hear that a dozen connectors are 
providing UUID as just a string because they don't have a named type.

> Add semantic types to Kafka Connect
> -----------------------------------
>
>                 Key: KAFKA-4353
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4353
>             Project: Kafka
>          Issue Type: Improvement
>          Components: KafkaConnect
>    Affects Versions: 0.10.0.1
>            Reporter: Randall Hauch
>            Assignee: Ewen Cheslack-Postava
>
> Kafka Connect's schema system defines several _core types_ that consist of:
> * STRUCT
> * ARRAY
> * MAP
> plus these _primitive types_:
> * INT8
> * INT16
> * INT32
> * INT64
> * FLOAT32
> * FLOAT64
> * BOOLEAN
> * STRING
> * BYTES
> The {{Schema}} for these core types defines several attributes, but they do 
> not have a name.
> Kafka Connect also defines several _logical types_ that are specializations 
> of the primitive types and _do_ have schema names _and_ are automatically 
> mapped to/from Java objects:
> || Schema Name || Primitive Type || Java value class || Description ||
> | o.k.c.d.Decimal | {{BYTES}} | {{java.math.BigDecimal}} | An 
> arbitrary-precision signed decimal number. |
> | o.k.c.d.Date | {{INT32}} | {{java.util.Date}} | A date representing a 
> calendar day with no time of day or timezone. The {{java.util.Date}} value's 
> hours, minutes, seconds, milliseconds are set to 0. The underlying 
> representation is an integer representing the number of standardized days 
> (based on a number of milliseconds with 24 hours/day, 60 minutes/hour, 60 
> seconds/minute, 1000 milliseconds/second) since Unix epoch. |
> | o.k.c.d.Time | {{INT32}} | {{java.util.Date}} | A time representing a 
> specific point in a day, not tied to any specific date. Only the 
> {{java.util.Date}} value's hours, minutes, seconds, and milliseconds can be 
> non-zero. This effectively makes it a point in time during the first day 
> after the Unix epoch. The underlying representation is an integer 
> representing the number of milliseconds after midnight. |
> | o.k.c.d.Timestamp | {{INT64}} | {{java.util.Date}} | A timestamp 
> representing an absolute time, without timezone information. The underlying 
> representation is a long representing the number of milliseconds since Unix 
> epoch. |
> where "o.k.c.d" is short for {{org.apache.kafka.connect.data}}. [~ewencp] has 
> stated in the past that adding more logical types is challenging and 
> generally undesirable, since everyone who uses Kafka Connect values has to 
> deal with all new logical types.
> This proposal adds standard _semantic_ types that are somewhere between the 
> core types and logical types. Basically, they are just predefined schemas 
> that have names and are based on other primitive types. However, there is no 
> mapping to another form other than the primitive.
> The purpose of semantic types is to provide hints as to how the values _can_ 
> be treated. Of course, clients are free to ignore the hints of some or all of 
> the built-in semantic types, and in these cases would treat the values as the 
> primitive value with no extra semantics. This behavior makes it much easier 
> to add new semantic types over time without risking incompatibilities.
> Really, any source connector can define custom semantic types, but there is 
> tremendous value in having a library of standard, well-known semantic types, 
> including:
> || Schema Name || Primitive Type || Description ||
> | o.k.c.d.Uuid | {{STRING}} | A UUID in string form.|
> | o.k.c.d.Json | {{STRING}} | A JSON document, array, or scalar in string 
> form.|
> | o.k.c.d.Xml | {{STRING}} | An XML document in string form.|
> | o.k.c.d.BitSet | {{STRING}} | A string of zero or more {{0}} or {{1}} 
> characters.|
> | o.k.c.d.ZonedTime | {{STRING}} | An ISO-8601 formatted representation of a 
> time (with fractional seconds) with timezone or offset from UTC.|
> | o.k.c.d.ZonedTimestamp | {{STRING}} | An ISO-8601 formatted representation 
> of a timestamp with timezone or offset from UTC.|
> | o.k.c.d.EpochDays | {{INT64}} | A date with no time or timezone 
> information, represented as the number of days since (or before) epoch, or 
> January 1, 1970, at 00:00:00UTC.|
> | o.k.c.d.Year | {{INT32}} | The year number.|
> | o.k.c.d.MilliTime | {{INT32}} | Number of milliseconds past midnight.|
> | o.k.c.d.MicroTime | {{INT64}} | Number of microseconds past midnight.|
> | o.k.c.d.NanoTime | {{INT64}} | Number of nanoseconds past midnight.|
> | o.k.c.d.MilliTimestamp | {{INT64}} | Number of milliseconds past epoch.|
> | o.k.c.d.MicroTimestamp | {{INT64}} | Number of microseconds past epoch.|
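
The "hint" behavior described in the proposal above (use the schema name when you recognize it, otherwise treat the value as its primitive) might look roughly like this on the consuming side. This is only a sketch under assumptions: it uses plain JDK types rather than the Connect Schema API, the class SemanticHints and its interpret method are hypothetical, and the two schema names are the ones proposed in the issue, not names that exist in Kafka Connect today.

```java
import java.time.Instant;
import java.util.UUID;

// Hypothetical consumer-side helper; not part of Kafka Connect.
class SemanticHints {
    // Names taken from the proposal ("o.k.c.d" expanded); assumed, not an existing API.
    static final String UUID_NAME = "org.apache.kafka.connect.data.Uuid";
    static final String MILLI_TIMESTAMP_NAME = "org.apache.kafka.connect.data.MilliTimestamp";

    // Use the schema name as an optional hint for interpreting a primitive value.
    static Object interpret(String schemaName, Object primitiveValue) {
        if (UUID_NAME.equals(schemaName) && primitiveValue instanceof String) {
            return UUID.fromString((String) primitiveValue);
        }
        if (MILLI_TIMESTAMP_NAME.equals(schemaName) && primitiveValue instanceof Long) {
            return Instant.ofEpochMilli((Long) primitiveValue);
        }
        // Unknown or missing name: ignore the hint and keep the primitive as-is.
        return primitiveValue;
    }

    public static void main(String[] args) {
        System.out.println(interpret(UUID_NAME, "123e4567-e89b-12d3-a456-426614174000").getClass());
        System.out.println(interpret(MILLI_TIMESTAMP_NAME, 0L));
        System.out.println(interpret("com.example.SomethingElse", "left as a string"));
    }
}
```

A client that has never heard of a given name falls through to the last branch, which is why adding names this way should not break existing consumers.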



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
