wuchong commented on a change in pull request #13909:
URL: https://github.com/apache/flink/pull/13909#discussion_r518120299
########## File path: docs/dev/table/connectors/formats/single-field.md ##########
@@ -0,0 +1,151 @@
+---
+title: "Single Field Format"
+nav-title: SingleField
+nav-parent_id: sql-formats
+nav-pos: 7
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+<span class="label label-info">Format: Serialization Schema</span>
+<span class="label label-info">Format: Deserialization Schema</span>
+
+* This will be replaced by the TOC
+{:toc}
+
+The SingleField format allows you to read and write data that contains only a single field, where that field is not wrapped in a JSON object or an Avro record.
+
+Currently, the SingleField format supports `String`, `byte[]`, and primitive types.
+
+Note: this format encodes `null` values as a `null` `byte[]`. This can be a limitation when used with `upsert-kafka`, because `upsert-kafka` treats a `null` value as a tombstone message (a DELETE on the key). Therefore, we recommend avoiding the `upsert-kafka` connector together with the `single-field` format if the field can have a `null` value. See the defensive-write sketch at the end of the example section below.
+
+Example
+----------------
+
+For example, you may have the following raw log data in Kafka and want to read and analyse such data using Flink SQL.
+
+```
+47.29.201.179 - - [28/Feb/2019:13:17:10 +0000] "GET /?p=1 HTTP/2.0" 200 5316 "https://domain.com/?p=1" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36" "2.75"
+```
+
+The following statement creates a table that reads from (and writes to) the underlying Kafka topic as an anonymous string value using the `single-field` format:
+
+<div class="codetabs" markdown="1">
+<div data-lang="SQL" markdown="1">
+{% highlight sql %}
+CREATE TABLE nginx_log (
+  log STRING
+) WITH (
+  'connector' = 'kafka',
+  'topic' = 'nginx_log',
+  'properties.bootstrap.servers' = 'localhost:9092',
+  'properties.group.id' = 'testGroup',
+  'format' = 'single-field'
+)
+{% endhighlight %}
+</div>
+</div>
+
+You can then read the raw data out as a plain string and split it into multiple fields with a user-defined function for further analysis, e.g. `my_split` in the example below.
+
+<div class="codetabs" markdown="1">
+<div data-lang="SQL" markdown="1">
+{% highlight sql %}
+SELECT t.hostname, t.datetime, t.url, t.browser, ...
+FROM (
+  SELECT my_split(log) as t FROM nginx_log
+);
+{% endhighlight %}
+</div>
+</div>
+
+Conversely, you can also write a single field of STRING type into the Kafka topic as an anonymous string value.
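+
+For instance, a minimal write-side sketch (reusing the `nginx_log` table defined above; the literal row is purely illustrative) is a plain `INSERT`, which serializes the single STRING column as the raw UTF-8 bytes of the message value:
+
+<div class="codetabs" markdown="1">
+<div data-lang="SQL" markdown="1">
+{% highlight sql %}
+-- The single STRING column becomes the anonymous value of the Kafka record.
+INSERT INTO nginx_log
+VALUES ('47.29.201.179 - - [28/Feb/2019:13:17:10 +0000] "GET /?p=1 HTTP/2.0" 200 5316');
+{% endhighlight %}
+</div>
+</div>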
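+
+As noted above, a `null` field is encoded as a `null` byte array, which `upsert-kafka` would interpret as a tombstone. If `null` values are possible, one defensive-write sketch (the source table `raw_events` and its `message` column are hypothetical) replaces them before insertion:
+
+<div class="codetabs" markdown="1">
+<div data-lang="SQL" markdown="1">
+{% highlight sql %}
+-- COALESCE substitutes an empty string for NULL, so no tombstone is emitted.
+INSERT INTO nginx_log
+SELECT COALESCE(message, '') FROM raw_events;
+{% endhighlight %}
+</div>
+</div>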
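+
+The format is not limited to strings. As a sketch of a primitive-type field (the `sensor_readings` table and topic are illustrative), each Kafka message value is interpreted as the binary encoding of a single 64-bit signed integer, per the Data Type Mapping section below:
+
+<div class="codetabs" markdown="1">
+<div data-lang="SQL" markdown="1">
+{% highlight sql %}
+-- Each message value holds exactly one 64-bit signed integer.
+CREATE TABLE sensor_readings (
+  reading BIGINT
+) WITH (
+  'connector' = 'kafka',
+  'topic' = 'sensor_readings',
+  'properties.bootstrap.servers' = 'localhost:9092',
+  'format' = 'single-field'
+)
+{% endhighlight %}
+</div>
+</div>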
+
+Format Options
+----------------
+
+<table class="table table-bordered">
+    <thead>
+      <tr>
+        <th class="text-left" style="width: 25%">Option</th>
+        <th class="text-center" style="width: 8%">Required</th>
+        <th class="text-center" style="width: 7%">Default</th>
+        <th class="text-center" style="width: 10%">Type</th>
+        <th class="text-center" style="width: 50%">Description</th>
+      </tr>
+    </thead>
+    <tbody>
+    <tr>
+      <td><h5>format</h5></td>
+      <td>required</td>
+      <td style="word-wrap: break-word;">(none)</td>
+      <td>String</td>
+      <td>Specify what format to use; here it should be 'single-field'.</td>
+    </tr>
+    </tbody>
+</table>
+
+Data Type Mapping
+----------------
+
+The table below details the SQL types the format supports and how each value is serialized and deserialized.
+
+<table class="table table-bordered">
+    <thead>
+      <tr>
+        <th class="text-left">Flink SQL type</th>
+        <th class="text-left">Value</th>
+      </tr>
+    </thead>
+    <tbody>
+    <tr>
+      <td><code>CHAR / VARCHAR / STRING</code></td>
+      <td>A UTF-8 encoded text string.</td>
+    </tr>
+    <tr>
+      <td><code>BOOLEAN</code></td>
+      <td>A single byte indicating the boolean value: 0 means false, 1 means true.</td>
+    </tr>
+    <tr>
+      <td><code>TINYINT</code></td>
+      <td>An 8-bit signed number.</td>
+    </tr>
+    <tr>
+      <td><code>SMALLINT</code></td>
+      <td>A 16-bit signed number.</td>
+    </tr>
+    <tr>
+      <td><code>INT</code></td>
+      <td>A 32-bit signed integer.</td>
+    </tr>
+    <tr>
+      <td><code>BIGINT</code></td>
+      <td>A 64-bit signed integer.</td>
+    </tr>
+    <tr>
+      <td><code>FLOAT</code></td>
+      <td>A 32-bit floating point number.</td>
+    </tr>
+    <tr>

Review comment:
   I'm not sure about the `RAW` type, because it is currently hard to declare a RAW type in DDL. As an alternative, users can declare the field as `BYTES` and use a UDF to deserialize the bytes.
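   A minimal sketch of that alternative (the `events` table, its `payload` column, and the `parse_payload` UDF are hypothetical):

   ```sql
   -- Declare the raw value as BYTES and decode it with a user-defined function.
   CREATE TABLE events (
     payload BYTES
   ) WITH (
     'connector' = 'kafka',
     'topic' = 'events',
     'properties.bootstrap.servers' = 'localhost:9092',
     'format' = 'single-field'
   );

   SELECT parse_payload(payload) FROM events;
   ```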