Randall Hauch created KAFKA-4353:
------------------------------------

             Summary: Add semantic types to Kafka Connect
                 Key: KAFKA-4353
                 URL: https://issues.apache.org/jira/browse/KAFKA-4353
             Project: Kafka
          Issue Type: Improvement
          Components: KafkaConnect
    Affects Versions: 0.10.0.1
            Reporter: Randall Hauch
            Assignee: Ewen Cheslack-Postava


Kafka Connect's schema system defines several _core types_ that consist of:

* STRUCT
* ARRAY
* MAP

plus these _primitive types_:

* INT8
* INT16
* INT32
* INT64
* FLOAT32
* FLOAT64
* BOOLEAN
* STRING
* BYTES

The {{Schema}} for each of these core types defines several attributes, but it 
does not have a name.

Kafka Connect also defines several _logical types_ that are specializations of 
the primitive types and _do_ have schema names _and_ are automatically mapped 
to/from Java objects:

|| Schema Name || Primitive Type || Java value class || Description ||
| o.k.c.d.Decimal | {{BYTES}} | {{java.math.BigDecimal}} | An 
arbitrary-precision signed decimal number. |
| o.k.c.d.Date | {{INT32}} | {{java.util.Date}} | A date representing a 
calendar day with no time of day or timezone. The {{java.util.Date}} value's 
hours, minutes, seconds, milliseconds are set to 0. The underlying 
representation is an integer representing the number of standardized days 
(based on 24 hours/day, 60 minutes/hour, 60 seconds/minute, and 1000 
milliseconds/second) since the Unix epoch. |
| o.k.c.d.Time | {{INT32}} | {{java.util.Date}} | A time representing a 
specific point in a day, not tied to any specific date. Only the 
{{java.util.Date}} value's hours, minutes, seconds, and milliseconds can be 
non-zero. This effectively makes it a point in time during the first day after 
the Unix epoch. The underlying representation is an integer representing the 
number of milliseconds after midnight. |
| o.k.c.d.Timestamp | {{INT64}} | {{java.util.Date}} | A timestamp representing 
an absolute time, without timezone information. The underlying representation 
is a long representing the number of milliseconds since Unix epoch. |

where "o.k.c.d" is short for {{org.apache.kafka.connect.data}}. [~ewencp] has 
stated in the past that adding more logical types is challenging and generally 
undesirable, since everyone who uses Kafka Connect values has to deal with 
every new logical type.
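
To make the underlying representations in the logical-type table concrete, here is a minimal sketch of the three temporal encodings. It uses {{java.time}} rather than the {{java.util.Date}} values Kafka Connect actually exposes, and the class and method names are hypothetical, chosen only for illustration.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.LocalTime;

// Illustrative only: these helpers are not part of Kafka Connect.
final class LogicalEncodings {
    // Date: whole days since 1970-01-01, carried as an INT32.
    static int dateToInt32(LocalDate date) {
        return Math.toIntExact(date.toEpochDay());
    }

    // Time: milliseconds after midnight, carried as an INT32.
    static int timeToInt32(LocalTime time) {
        return Math.toIntExact(time.toNanoOfDay() / 1_000_000L);
    }

    // Timestamp: milliseconds since the Unix epoch, carried as an INT64.
    static long timestampToInt64(Instant instant) {
        return instant.toEpochMilli();
    }

    public static void main(String[] args) {
        System.out.println(dateToInt32(LocalDate.of(1970, 1, 2)));
        System.out.println(timeToInt32(LocalTime.of(0, 0, 1)));
        System.out.println(timestampToInt64(Instant.parse("1970-01-01T00:00:01Z")));
    }
}
```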

This proposal adds standard _semantic_ types that sit somewhere between the 
core types and logical types. Basically, they are just predefined schemas that 
have names and are based on primitive types. Unlike logical types, however, 
their values are not mapped to any form other than the primitive.

The purpose of semantic types is to provide hints as to how the values _can_ be 
treated. Of course, clients are free to ignore the hints of some or all of the 
built-in semantic types, and in these cases would treat the values as the 
primitive value with no extra semantics. This behavior makes it much easier to 
add new semantic types over time without risking incompatibilities.
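
The "hint" behavior described above might look like the following sketch on the consuming side. {{NamedSchema}} and {{Type}} are hypothetical stand-ins for a Connect schema (a primitive type plus an optional name) so that the example stays self-contained, and the {{Uuid}} schema name is the one proposed in this issue, not an existing constant.

```java
import java.util.UUID;

final class SemanticHints {
    // Hypothetical stand-in for a Connect schema: a primitive type and an optional name.
    enum Type { STRING, INT32, INT64 }

    static final class NamedSchema {
        final String name; // null for unnamed core types
        final Type type;

        NamedSchema(String name, Type type) {
            this.name = name;
            this.type = type;
        }
    }

    // Schema name taken from the proposal; assumed here for illustration.
    static final String UUID_NAME = "org.apache.kafka.connect.data.Uuid";

    // A client that understands the Uuid hint parses the string; for any other
    // (or missing) schema name it simply keeps the primitive value.
    static Object interpret(NamedSchema schema, Object value) {
        if (schema.type == Type.STRING && UUID_NAME.equals(schema.name)) {
            return UUID.fromString((String) value);
        }
        return value;
    }

    public static void main(String[] args) {
        NamedSchema uuid = new NamedSchema(UUID_NAME, Type.STRING);
        NamedSchema xml = new NamedSchema("org.apache.kafka.connect.data.Xml", Type.STRING);
        System.out.println(interpret(uuid, "123e4567-e89b-12d3-a456-426614174000").getClass());
        System.out.println(interpret(xml, "<a/>").getClass());
    }
}
```

Because the fallback branch returns the value untouched, a client written before a new semantic type existed keeps working when that type appears.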

Really, any source connector can define custom semantic types, but there is 
tremendous value in having a library of standard, well-known semantic types, 
including:

|| Schema Name || Primitive Type || Description ||
| o.k.c.d.Uuid | {{STRING}} | A UUID in string form.|
| o.k.c.d.Json | {{STRING}} | A JSON document, array, or scalar in string form.|
| o.k.c.d.Xml | {{STRING}} | An XML document in string form.|
| o.k.c.d.BitSet | {{STRING}} | A string of zero or more {{0}} or {{1}} 
characters.|
| o.k.c.d.ZonedTime | {{STRING}} | An ISO-8601 formatted representation of a 
time (with fractional seconds) with timezone or offset from UTC.|
| o.k.c.d.ZonedTimestamp | {{STRING}} | An ISO-8601 formatted representation of 
a timestamp with timezone or offset from UTC.|
| o.k.c.d.EpochDays | {{INT64}} | A date with no time or timezone information, 
represented as the number of days since (or before) the epoch (January 1, 1970, 
at 00:00:00 UTC).|
| o.k.c.d.Year | {{INT32}} | The year number.|
| o.k.c.d.MilliTime | {{INT32}} | Number of milliseconds past midnight.|
| o.k.c.d.MicroTime | {{INT64}} | Number of microseconds past midnight.|
| o.k.c.d.NanoTime | {{INT64}} | Number of nanoseconds past midnight.|
| o.k.c.d.MilliTimestamp | {{INT64}} | Number of milliseconds past epoch.|
| o.k.c.d.MicroTimestamp | {{INT64}} | Number of microseconds past epoch.|





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
