wuchong commented on a change in pull request #13909:
URL: https://github.com/apache/flink/pull/13909#discussion_r518120299
########## File path: docs/dev/table/connectors/formats/single-field.md ##########
@@ -0,0 +1,151 @@
+---
+title: "Single Field Format"
+nav-title: SingleField
+nav-parent_id: sql-formats
+nav-pos: 7
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+<span class="label label-info">Format: Serialization Schema</span>
+<span class="label label-info">Format: Deserialization Schema</span>
+
+* This will be replaced by the TOC
+{:toc}
+
+The SingleField format allows you to read and write data that contains only a single field, where that field is not wrapped in a JSON object or an Avro record.
+
+Currently, the SingleField format supports `String`, `byte[]`, and primitive types.
+
+Note: this format encodes `null` values as a `null` `byte[]`. This can be a limitation when used with `upsert-kafka`, because `upsert-kafka` treats a `null` value as a tombstone message (a DELETE on the key). Therefore, we recommend avoiding the `upsert-kafka` connector together with the `single-field` format if the field can have a `null` value. See the defensive-write sketch at the end of the example section below.
+
+Example
+----------------
+
+For example, you may have the following raw log data in Kafka and want to read and analyse such data using Flink SQL.
+
+```
+47.29.201.179 - - [28/Feb/2019:13:17:10 +0000] "GET /?p=1 HTTP/2.0" 200 5316 "https://domain.com/?p=1" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36" "2.75"
+```
+
+The following statement creates a table that reads from (and writes to) the underlying Kafka topic as an anonymous string value using the `single-field` format:
+
+<div class="codetabs" markdown="1">
+<div data-lang="SQL" markdown="1">
+{% highlight sql %}
+CREATE TABLE nginx_log (
+  log STRING
+) WITH (
+  'connector' = 'kafka',
+  'topic' = 'nginx_log',
+  'properties.bootstrap.servers' = 'localhost:9092',
+  'properties.group.id' = 'testGroup',
+  'format' = 'single-field'
+)
+{% endhighlight %}
+</div>
+</div>
+
+You can then read the raw data out as a plain string and split it into multiple fields with a user-defined function for further analysis, e.g. `my_split` in the example below.
+
+<div class="codetabs" markdown="1">
+<div data-lang="SQL" markdown="1">
+{% highlight sql %}
+SELECT t.hostname, t.datetime, t.url, t.browser, ...
+FROM (
+  SELECT my_split(log) as t FROM nginx_log
+);
+{% endhighlight %}
+</div>
+</div>
+
+Conversely, you can also write a single field of STRING type into the Kafka topic as an anonymous string value.
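+
+For instance, a minimal write-side sketch (reusing the `nginx_log` table defined above; the literal row is purely illustrative) is a plain `INSERT`, which serializes the single STRING column as the raw UTF-8 bytes of the message value:
+
+<div class="codetabs" markdown="1">
+<div data-lang="SQL" markdown="1">
+{% highlight sql %}
+-- The single STRING column becomes the anonymous value of the Kafka record.
+INSERT INTO nginx_log
+VALUES ('47.29.201.179 - - [28/Feb/2019:13:17:10 +0000] "GET /?p=1 HTTP/2.0" 200 5316');
+{% endhighlight %}
+</div>
+</div>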
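+
+As noted above, a `null` field is encoded as a `null` byte array, which `upsert-kafka` would interpret as a tombstone. If `null` values are possible, one defensive-write sketch (the source table `raw_events` and its `message` column are hypothetical) replaces them before insertion:
+
+<div class="codetabs" markdown="1">
+<div data-lang="SQL" markdown="1">
+{% highlight sql %}
+-- COALESCE substitutes an empty string for NULL, so no tombstone is emitted.
+INSERT INTO nginx_log
+SELECT COALESCE(message, '') FROM raw_events;
+{% endhighlight %}
+</div>
+</div>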
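+
+The format is not limited to strings. As a sketch of a primitive-type field (the `sensor_readings` table and topic are illustrative), each Kafka message value is interpreted as the binary encoding of a single 64-bit signed integer, per the Data Type Mapping section below:
+
+<div class="codetabs" markdown="1">
+<div data-lang="SQL" markdown="1">
+{% highlight sql %}
+-- Each message value holds exactly one 64-bit signed integer.
+CREATE TABLE sensor_readings (
+  reading BIGINT
+) WITH (
+  'connector' = 'kafka',
+  'topic' = 'sensor_readings',
+  'properties.bootstrap.servers' = 'localhost:9092',
+  'format' = 'single-field'
+)
+{% endhighlight %}
+</div>
+</div>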
+
+Format Options
+----------------
+
+<table class="table table-bordered">
+    <thead>
+      <tr>
+        <th class="text-left" style="width: 25%">Option</th>
+        <th class="text-center" style="width: 8%">Required</th>
+        <th class="text-center" style="width: 7%">Default</th>
+        <th class="text-center" style="width: 10%">Type</th>
+        <th class="text-center" style="width: 50%">Description</th>
+      </tr>
+    </thead>
+    <tbody>
+    <tr>
+      <td><h5>format</h5></td>
+      <td>required</td>
+      <td style="word-wrap: break-word;">(none)</td>
+      <td>String</td>
+      <td>Specify what format to use; here it should be 'single-field'.</td>
+    </tr>
+    </tbody>
+</table>
+
+Data Type Mapping
+----------------
+
+The table below details the SQL types the format supports and how each value is serialized and deserialized.
+
+<table class="table table-bordered">
+    <thead>
+      <tr>
+        <th class="text-left">Flink SQL type</th>
+        <th class="text-left">Value</th>
+      </tr>
+    </thead>
+    <tbody>
+    <tr>
+      <td><code>CHAR / VARCHAR / STRING</code></td>
+      <td>A UTF-8 encoded text string.</td>
+    </tr>
+    <tr>
+      <td><code>BOOLEAN</code></td>
+      <td>A single byte indicating the boolean value: 0 means false, 1 means true.</td>
+    </tr>
+    <tr>
+      <td><code>TINYINT</code></td>
+      <td>An 8-bit signed number.</td>
+    </tr>
+    <tr>
+      <td><code>SMALLINT</code></td>
+      <td>A 16-bit signed number.</td>
+    </tr>
+    <tr>
+      <td><code>INT</code></td>
+      <td>A 32-bit signed integer.</td>
+    </tr>
+    <tr>
+      <td><code>BIGINT</code></td>
+      <td>A 64-bit signed integer.</td>
+    </tr>
+    <tr>
+      <td><code>FLOAT</code></td>
+      <td>A 32-bit floating point number.</td>
+    </tr>
+    <tr>

Review comment:
   I'm not sure about the `RAW` type, because it is currently hard to declare a RAW type in DDL. As an alternative, users can declare the field as `BYTES` and use a UDF to deserialize the bytes.
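   A minimal sketch of that alternative (the `events` table, its `payload` column, and the `parse_payload` UDF are hypothetical):

   ```sql
   -- Declare the raw value as BYTES and decode it with a user-defined function.
   CREATE TABLE events (
     payload BYTES
   ) WITH (
     'connector' = 'kafka',
     'topic' = 'events',
     'properties.bootstrap.servers' = 'localhost:9092',
     'format' = 'single-field'
   );

   SELECT parse_payload(payload) FROM events;
   ```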