eldenmoon commented on code in PR #28951: URL: https://github.com/apache/doris/pull/28951#discussion_r1451960906
########## docs/zh-CN/docs/sql-manual/sql-reference/Data-Types/VARIANT.md: ########## @@ -0,0 +1,347 @@ +--- +{ + "title": "VARIANT", + "language": "zh-CN" +} +--- + +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +--> + +## VARIANT + +### Description + +VARIANT类型 +在 Doris 2.1 中引入一种新的数据类型 VARIANT,它可以存储半结构化 JSON 数据。它允许存储包含不同数据类型(如整数、字符串、布尔值等)的复杂数据结构,而无需在表结构中提前定义具体的列。VARIANT 类型特别适用于处理复杂的嵌套结构,而这些结构可能随时会发生变化。在写入过程中,该类型可以自动根据列的结构、类型推断列信息,并将其合并到现有表的 Schema 中。通过将 JSON 键及其对应的值存储为列和动态子列。 + +### Note + +相比 JSON 类型有有以下优势: + +1. 存储方式不同, JSON 类型是以二进制 JSONB 格式进行存储,整行 JSON 以行存的形式存储到 segment 文件中。而 VARIANT 类型在写入的时候进行类型推断,将写入的 JSON 列存化。比JSON类型有更高的压缩比, 存储空间更好。 +2. 查询方式同步,查询不需要进行解析。VARIANT 充分利用 Doris 中列式存储、向量化引擎、优化器等组件给用户带来极高的查询性能。 +下面是基于 clickbench 数据测试的结果: + +| | 存储空间 | +|--------------|------------| +| 预定义静态列 | 24.329 GB | +| variant 类型 | 24.296 GB | +| json 类型 | 46.730 GB | + + + +**节省约 50%存储容量** + +| 查询次数 | 预定义静态列 | variant 类型 | json 类型 | +|----------------|--------------|--------------|-----------------| +| 第一次查询 (cold) | 233.79s | 248.66s | **大部分查询超时** | +| 第二次查询 (hot) | 86.02s | 94.82s | 789.24s | +| 第三次查询 (hot) | 83.03s | 92.29s | 743.69s | + +**查询提速 8+倍, 查询性能与静态列相当** + +### Example + +``` +用一个从建表、导数据、查询全周期的例子说明VARIANT的功能和用法。 +``` + +**建表语法** + +建表,建表语法关键字 variant + +``` sql +-- 无索引 +CREATE TABLE IF NOT EXISTS ${table_name} ( + k BIGINT, + v VARIANT +) +table_properties; + +-- 在v列创建索引,可选指定分词方式,默认不分词 +CREATE TABLE IF NOT EXISTS ${table_name} ( + k BIGINT, + v VARIANT, + INDEX idx_var(v) USING INVERTED [PROPERTIES("parser" = "english|unicode|chinese")] [COMMENT 'your comment'] +) +table_properties; +``` + +**查询语法** + +``` sql +-- 1. 查询方式1, 查询语法稍有不同,v 后使用:分割,.访问子列,例如 +SELECT v:`properties`.`title` from ${table_name} +-- 2. 查询方式2, 使用 v['a']['b'] 形式例如 +SELECT v["properties"]["title"] from ${table_name} +``` + +### 基于 github events 数据集示例 + +这里用 github events 数据展示 variant 的建表、导入、查询。 +下面是格式化后的一行数据 + +``` json +{ + "id": "14186154924", + "type": "PushEvent", + "actor": { + "id": 282080, + "login": "brianchandotcom", + "display_login": "brianchandotcom", + "gravatar_id": "", + "url": "https://api.github.com/users/brianchandotcom", + "avatar_url": "https://avatars.githubusercontent.com/u/282080?" + }, + "repo": { + "id": 1920851, + "name": "brianchandotcom/liferay-portal", + "url": "https://api.github.com/repos/brianchandotcom/liferay-portal" + }, + "payload": { + "push_id": 6027092734, + "size": 4, + "distinct_size": 4, + "ref": "refs/heads/master", + "head": "91edd3c8c98c214155191feb852831ec535580ba", + "before": "abb58cc0db673a0bd5190000d2ff9c53bb51d04d", + "commits": [""] + }, + "public": true, + "created_at": "2020-11-13T18:00:00Z" +} +``` + +**建表** + +- 创建了三个 VARIANT 类型的列, `actor`,`repo` 和 `payload` +- 创建表的同时创建了 `payload` 列的倒排索引 `idx_payload` +- USING INVERTED 指定索引类型是倒排索引,用于加速子列的条件过滤 +- `PROPERTIES("parser" = "english")` 指定采用 english 分词 + +``` sql +CREATE DATABASE test_variant; +USE test_variant; +CREATE TABLE IF NOT EXISTS github_events ( + id BIGINT NOT NULL, + type VARCHAR(30) NULL, + actor VARIANT NULL, + repo VARIANT NULL, + payload VARIANT NULL, + public BOOLEAN NULL, + created_at DATETIME NULL, + INDEX idx_payload (`payload`) USING INVERTED PROPERTIES("parser" = "english") COMMENT 'inverted index for payload' +) +DUPLICATE KEY(`id`) +DISTRIBUTED BY HASH(id) BUCKETS 10 +properties("replication_num" = "1"); +``` + +**需要注意的是:** + +::: tip + +1. 在 VARIANT 列上创建索引,比如 payload 的子列很多时,可能会造成索引列过多,影响写入性能 +2. 同一个 VARIANT 列的分词属性是相同的,如果您有不同的分词需求,那么可以创建多个 VARIANT 然后分别指定索引属性 + +::: + +**使用 streamload 导入** + +导入gh_2022-11-07-3.json,这是 github events 一个小时的数据 + +``` shell +wget http://doris-build-hk-1308700295.cos.ap-hongkong.myqcloud.com/regression/variant/gh_2022-11-07-3.json + +curl --location-trusted -u root: -T gh_2022-11-07-3.json -H "read_json_by_line:true" -H "format:json" http://127.0.0.1:18148/api/test_variant/github_events/_strea +m_load + +{ + "TxnId": 2, + "Label": "086fd46a-20e6-4487-becc-9b6ca80281bf", + "Comment": "", + "TwoPhaseCommit": "false", + "Status": "Success", + "Message": "OK", + "NumberTotalRows": 139325, + "NumberLoadedRows": 139325, + "NumberFilteredRows": 0, + "NumberUnselectedRows": 0, + "LoadBytes": 633782875, + "LoadTimeMs": 7870, + "BeginTxnTimeMs": 19, + "StreamLoadPutTimeMs": 162, + "ReadDataTimeMs": 2416, + "WriteDataTimeMs": 7634, + "CommitAndPublishTimeMs": 55 +} +``` + +确认导入成功 + +``` sql +-- 查看行数 +mysql> select count() from github_events; ++----------+ +| count(*) | ++----------+ +| 139325 | ++----------+ +1 row in set (0.25 sec) + +-- 随机看一条数据 +mysql> select * from github_events limit 1; ++-------------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------- +---------------------+ +| id | type | actor | repo | payload | public | created_at | ++-------------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------- +---------------------+ +| 25061821748 | PushEvent | {"gravatar_id":"","display_login":"jfrog-pipelie-intg","url":"https://api.github.com/users/jfrog-pipelie-intg","id":98024358,"login":"jfrog-pipelie-intg","avatar_url":"https://avatars.githubusercontent.com/u/98024358?"} | {"url":"https://api.github.com/repos/jfrog-pipelie-intg/jfinte2e_1667789956723_16","id":562683829,"name":"jfrog-pipelie-intg/jfinte2e_1667789956723_16"} | {"commits":[{"sha":"334433de436baa198024ef9f55f0647721bcd750","author":{"email":"98024358+jfrog-pipelie-i...@users.noreply.github.com","name":"jfrog-pipelie-intg"},"message":"commit message 10238493157623136117","distinct":true,"url":"https://api.github.com/repos/jfrog-pipelie-intg/jfinte2e_1667789956723_16/commits/334433de436baa198024ef9f55f0647721bcd750"}],"before":"f84a26792f44d54305ddd41b7e3a79d25b1a9568","head":"334433de436baa198024ef9f55f0647721bcd750","size":1,"push_id":11572649828,"ref":"refs/heads/test-notification-sent-branch-10238493157623136113","distinct_size":1} | 1 | 2022-11-07 11:00:00 | ++-------------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------- +---------------------+ +1 row in set (0.23 sec) +``` + +desc 查看 schema 信息,子列会在存储层自动扩展、并进行类型推导 + +``` sql +mysql> set describe_extend_variant_column = true; +Query OK, 0 rows affected (0.01 sec) + +mysql> desc github_events; ++------------------------------------------------------------+------------+------+-------+---------+-------+ +| Field | Type | Null | Key | Default | Extra | ++------------------------------------------------------------+------------+------+-------+---------+-------+ +| id | BIGINT | No | true | NULL | | +| type | VARCHAR(*) | Yes | false | NULL | NONE | +| actor | VARIANT | Yes | false | NULL | NONE | +| actor.avatar_url | TEXT | Yes | false | NULL | NONE | +| actor.display_login | TEXT | Yes | false | NULL | NONE | +| actor.id | INT | Yes | false | NULL | NONE | +| actor.login | TEXT | Yes | false | NULL | NONE | +| actor.url | TEXT | Yes | false | NULL | NONE | +| created_at | DATETIME | Yes | false | NULL | NONE | +| payload | VARIANT | Yes | false | NULL | NONE | +| payload.action | TEXT | Yes | false | NULL | NONE | +| payload.before | TEXT | Yes | false | NULL | NONE | +| payload.comment.author_association | TEXT | Yes | false | NULL | NONE | +| payload.comment.body | TEXT | Yes | false | NULL | NONE | +.... ++------------------------------------------------------------+------------+------+-------+---------+-------+ +406 rows in set (0.07 sec) +``` + +desc 可以指定 partition 查看某个 partition 的 schema, 语法如下 + +``` +DESCRIBE ${table_name} PARTITION ($partition_name); +``` + +**查询** + +::: warning + +**注意** +如使用过滤和聚合等功能来查询子列, 需要对子列进行额外的 cast 操作(因为存储类型不一定是固定的,需要有一个 SQL 统一的类型)。 +例如 SELECT * FROM tbl where CAST(var["titile"] as text) MATCH "hello world" +以下简化的示例说明了如何使用 VARIANT 进行查询 + +::: + +下面是典型的三个查询场景: + +1. 从 github_events 表中获取 top 5 star 数的代码库 + +``` sql +mysql> SELECT + -> cast(repo["name"] as text), count() AS stars + -> FROM github_events + -> WHERE type = 'WatchEvent' + -> GROUP BY cast(repo["name"] as text) + -> ORDER BY stars DESC LIMIT 5; ++--------------------------+-------+ +| CAST(`repo` AS TEXT) | stars | Review Comment: done ########## docs/zh-CN/docs/sql-manual/sql-reference/Data-Types/VARIANT.md: ########## @@ -0,0 +1,347 @@ +--- +{ + "title": "VARIANT", + "language": "zh-CN" +} +--- + +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +--> + +## VARIANT + +### Description + +VARIANT类型 +在 Doris 2.1 中引入一种新的数据类型 VARIANT,它可以存储半结构化 JSON 数据。它允许存储包含不同数据类型(如整数、字符串、布尔值等)的复杂数据结构,而无需在表结构中提前定义具体的列。VARIANT 类型特别适用于处理复杂的嵌套结构,而这些结构可能随时会发生变化。在写入过程中,该类型可以自动根据列的结构、类型推断列信息,并将其合并到现有表的 Schema 中。通过将 JSON 键及其对应的值存储为列和动态子列。 + +### Note + +相比 JSON 类型有有以下优势: + +1. 存储方式不同, JSON 类型是以二进制 JSONB 格式进行存储,整行 JSON 以行存的形式存储到 segment 文件中。而 VARIANT 类型在写入的时候进行类型推断,将写入的 JSON 列存化。比JSON类型有更高的压缩比, 存储空间更好。 +2. 查询方式同步,查询不需要进行解析。VARIANT 充分利用 Doris 中列式存储、向量化引擎、优化器等组件给用户带来极高的查询性能。 +下面是基于 clickbench 数据测试的结果: + +| | 存储空间 | +|--------------|------------| +| 预定义静态列 | 24.329 GB | +| variant 类型 | 24.296 GB | +| json 类型 | 46.730 GB | + + + +**节省约 50%存储容量** + +| 查询次数 | 预定义静态列 | variant 类型 | json 类型 | +|----------------|--------------|--------------|-----------------| +| 第一次查询 (cold) | 233.79s | 248.66s | **大部分查询超时** | +| 第二次查询 (hot) | 86.02s | 94.82s | 789.24s | +| 第三次查询 (hot) | 83.03s | 92.29s | 743.69s | + +**查询提速 8+倍, 查询性能与静态列相当** + +### Example + +``` +用一个从建表、导数据、查询全周期的例子说明VARIANT的功能和用法。 +``` + +**建表语法** + +建表,建表语法关键字 variant + +``` sql +-- 无索引 +CREATE TABLE IF NOT EXISTS ${table_name} ( + k BIGINT, + v VARIANT +) +table_properties; + +-- 在v列创建索引,可选指定分词方式,默认不分词 +CREATE TABLE IF NOT EXISTS ${table_name} ( + k BIGINT, + v VARIANT, + INDEX idx_var(v) USING INVERTED [PROPERTIES("parser" = "english|unicode|chinese")] [COMMENT 'your comment'] +) +table_properties; +``` + +**查询语法** + +``` sql +-- 1. 查询方式1, 查询语法稍有不同,v 后使用:分割,.访问子列,例如 +SELECT v:`properties`.`title` from ${table_name} +-- 2. 查询方式2, 使用 v['a']['b'] 形式例如 +SELECT v["properties"]["title"] from ${table_name} +``` + +### 基于 github events 数据集示例 + +这里用 github events 数据展示 variant 的建表、导入、查询。 +下面是格式化后的一行数据 + +``` json +{ + "id": "14186154924", + "type": "PushEvent", + "actor": { + "id": 282080, + "login": "brianchandotcom", + "display_login": "brianchandotcom", + "gravatar_id": "", + "url": "https://api.github.com/users/brianchandotcom", + "avatar_url": "https://avatars.githubusercontent.com/u/282080?" + }, + "repo": { + "id": 1920851, + "name": "brianchandotcom/liferay-portal", + "url": "https://api.github.com/repos/brianchandotcom/liferay-portal" + }, + "payload": { + "push_id": 6027092734, + "size": 4, + "distinct_size": 4, + "ref": "refs/heads/master", + "head": "91edd3c8c98c214155191feb852831ec535580ba", + "before": "abb58cc0db673a0bd5190000d2ff9c53bb51d04d", + "commits": [""] + }, + "public": true, + "created_at": "2020-11-13T18:00:00Z" +} +``` + +**建表** + +- 创建了三个 VARIANT 类型的列, `actor`,`repo` 和 `payload` +- 创建表的同时创建了 `payload` 列的倒排索引 `idx_payload` +- USING INVERTED 指定索引类型是倒排索引,用于加速子列的条件过滤 +- `PROPERTIES("parser" = "english")` 指定采用 english 分词 + +``` sql +CREATE DATABASE test_variant; +USE test_variant; +CREATE TABLE IF NOT EXISTS github_events ( + id BIGINT NOT NULL, + type VARCHAR(30) NULL, + actor VARIANT NULL, + repo VARIANT NULL, + payload VARIANT NULL, + public BOOLEAN NULL, + created_at DATETIME NULL, + INDEX idx_payload (`payload`) USING INVERTED PROPERTIES("parser" = "english") COMMENT 'inverted index for payload' +) +DUPLICATE KEY(`id`) +DISTRIBUTED BY HASH(id) BUCKETS 10 +properties("replication_num" = "1"); +``` + +**需要注意的是:** + +::: tip + +1. 在 VARIANT 列上创建索引,比如 payload 的子列很多时,可能会造成索引列过多,影响写入性能 +2. 同一个 VARIANT 列的分词属性是相同的,如果您有不同的分词需求,那么可以创建多个 VARIANT 然后分别指定索引属性 + +::: + +**使用 streamload 导入** + +导入gh_2022-11-07-3.json,这是 github events 一个小时的数据 + +``` shell +wget http://doris-build-hk-1308700295.cos.ap-hongkong.myqcloud.com/regression/variant/gh_2022-11-07-3.json + +curl --location-trusted -u root: -T gh_2022-11-07-3.json -H "read_json_by_line:true" -H "format:json" http://127.0.0.1:18148/api/test_variant/github_events/_strea +m_load + +{ + "TxnId": 2, + "Label": "086fd46a-20e6-4487-becc-9b6ca80281bf", + "Comment": "", + "TwoPhaseCommit": "false", + "Status": "Success", + "Message": "OK", + "NumberTotalRows": 139325, + "NumberLoadedRows": 139325, + "NumberFilteredRows": 0, + "NumberUnselectedRows": 0, + "LoadBytes": 633782875, + "LoadTimeMs": 7870, + "BeginTxnTimeMs": 19, + "StreamLoadPutTimeMs": 162, + "ReadDataTimeMs": 2416, + "WriteDataTimeMs": 7634, + "CommitAndPublishTimeMs": 55 +} +``` + +确认导入成功 + +``` sql +-- 查看行数 +mysql> select count() from github_events; ++----------+ +| count(*) | ++----------+ +| 139325 | ++----------+ +1 row in set (0.25 sec) + +-- 随机看一条数据 +mysql> select * from github_events limit 1; ++-------------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------- +---------------------+ +| id | type | actor | repo | payload | public | created_at | ++-------------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------- +---------------------+ +| 25061821748 | PushEvent | {"gravatar_id":"","display_login":"jfrog-pipelie-intg","url":"https://api.github.com/users/jfrog-pipelie-intg","id":98024358,"login":"jfrog-pipelie-intg","avatar_url":"https://avatars.githubusercontent.com/u/98024358?"} | {"url":"https://api.github.com/repos/jfrog-pipelie-intg/jfinte2e_1667789956723_16","id":562683829,"name":"jfrog-pipelie-intg/jfinte2e_1667789956723_16"} | {"commits":[{"sha":"334433de436baa198024ef9f55f0647721bcd750","author":{"email":"98024358+jfrog-pipelie-i...@users.noreply.github.com","name":"jfrog-pipelie-intg"},"message":"commit message 10238493157623136117","distinct":true,"url":"https://api.github.com/repos/jfrog-pipelie-intg/jfinte2e_1667789956723_16/commits/334433de436baa198024ef9f55f0647721bcd750"}],"before":"f84a26792f44d54305ddd41b7e3a79d25b1a9568","head":"334433de436baa198024ef9f55f0647721bcd750","size":1,"push_id":11572649828,"ref":"refs/heads/test-notification-sent-branch-10238493157623136113","distinct_size":1} | 1 | 2022-11-07 11:00:00 | ++-------------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------- +---------------------+ +1 row in set (0.23 sec) +``` + +desc 查看 schema 信息,子列会在存储层自动扩展、并进行类型推导 + +``` sql +mysql> set describe_extend_variant_column = true; +Query OK, 0 rows affected (0.01 sec) + +mysql> desc github_events; ++------------------------------------------------------------+------------+------+-------+---------+-------+ +| Field | Type | Null | Key | Default | Extra | ++------------------------------------------------------------+------------+------+-------+---------+-------+ +| id | BIGINT | No | true | NULL | | +| type | VARCHAR(*) | Yes | false | NULL | NONE | +| actor | VARIANT | Yes | false | NULL | NONE | +| actor.avatar_url | TEXT | Yes | false | NULL | NONE | +| actor.display_login | TEXT | Yes | false | NULL | NONE | +| actor.id | INT | Yes | false | NULL | NONE | +| actor.login | TEXT | Yes | false | NULL | NONE | +| actor.url | TEXT | Yes | false | NULL | NONE | +| created_at | DATETIME | Yes | false | NULL | NONE | +| payload | VARIANT | Yes | false | NULL | NONE | +| payload.action | TEXT | Yes | false | NULL | NONE | +| payload.before | TEXT | Yes | false | NULL | NONE | +| payload.comment.author_association | TEXT | Yes | false | NULL | NONE | +| payload.comment.body | TEXT | Yes | false | NULL | NONE | +.... ++------------------------------------------------------------+------------+------+-------+---------+-------+ +406 rows in set (0.07 sec) +``` + +desc 可以指定 partition 查看某个 partition 的 schema, 语法如下 + +``` +DESCRIBE ${table_name} PARTITION ($partition_name); +``` + +**查询** + +::: warning + +**注意** +如使用过滤和聚合等功能来查询子列, 需要对子列进行额外的 cast 操作(因为存储类型不一定是固定的,需要有一个 SQL 统一的类型)。 +例如 SELECT * FROM tbl where CAST(var["titile"] as text) MATCH "hello world" +以下简化的示例说明了如何使用 VARIANT 进行查询 + +::: + +下面是典型的三个查询场景: + +1. 从 github_events 表中获取 top 5 star 数的代码库 + +``` sql +mysql> SELECT + -> cast(repo["name"] as text), count() AS stars Review Comment: done ########## docs/zh-CN/docs/sql-manual/sql-reference/Data-Types/VARIANT.md: ########## @@ -0,0 +1,347 @@ +--- +{ + "title": "VARIANT", + "language": "zh-CN" +} +--- + +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +--> + +## VARIANT + +### Description + +VARIANT类型 +在 Doris 2.1 中引入一种新的数据类型 VARIANT,它可以存储半结构化 JSON 数据。它允许存储包含不同数据类型(如整数、字符串、布尔值等)的复杂数据结构,而无需在表结构中提前定义具体的列。VARIANT 类型特别适用于处理复杂的嵌套结构,而这些结构可能随时会发生变化。在写入过程中,该类型可以自动根据列的结构、类型推断列信息,并将其合并到现有表的 Schema 中。通过将 JSON 键及其对应的值存储为列和动态子列。 + +### Note + +相比 JSON 类型有有以下优势: + +1. 存储方式不同, JSON 类型是以二进制 JSONB 格式进行存储,整行 JSON 以行存的形式存储到 segment 文件中。而 VARIANT 类型在写入的时候进行类型推断,将写入的 JSON 列存化。比JSON类型有更高的压缩比, 存储空间更好。 +2. 查询方式同步,查询不需要进行解析。VARIANT 充分利用 Doris 中列式存储、向量化引擎、优化器等组件给用户带来极高的查询性能。 +下面是基于 clickbench 数据测试的结果: + +| | 存储空间 | +|--------------|------------| +| 预定义静态列 | 24.329 GB | +| variant 类型 | 24.296 GB | +| json 类型 | 46.730 GB | + + + +**节省约 50%存储容量** + +| 查询次数 | 预定义静态列 | variant 类型 | json 类型 | +|----------------|--------------|--------------|-----------------| +| 第一次查询 (cold) | 233.79s | 248.66s | **大部分查询超时** | +| 第二次查询 (hot) | 86.02s | 94.82s | 789.24s | +| 第三次查询 (hot) | 83.03s | 92.29s | 743.69s | + +**查询提速 8+倍, 查询性能与静态列相当** Review Comment: done ########## docs/zh-CN/docs/sql-manual/sql-reference/Data-Types/VARIANT.md: ########## @@ -0,0 +1,347 @@ +--- +{ + "title": "VARIANT", + "language": "zh-CN" +} +--- + +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +--> + +## VARIANT + +### Description + +VARIANT类型 +在 Doris 2.1 中引入一种新的数据类型 VARIANT,它可以存储半结构化 JSON 数据。它允许存储包含不同数据类型(如整数、字符串、布尔值等)的复杂数据结构,而无需在表结构中提前定义具体的列。VARIANT 类型特别适用于处理复杂的嵌套结构,而这些结构可能随时会发生变化。在写入过程中,该类型可以自动根据列的结构、类型推断列信息,并将其合并到现有表的 Schema 中。通过将 JSON 键及其对应的值存储为列和动态子列。 + +### Note + +相比 JSON 类型有有以下优势: + +1. 存储方式不同, JSON 类型是以二进制 JSONB 格式进行存储,整行 JSON 以行存的形式存储到 segment 文件中。而 VARIANT 类型在写入的时候进行类型推断,将写入的 JSON 列存化。比JSON类型有更高的压缩比, 存储空间更好。 +2. 查询方式同步,查询不需要进行解析。VARIANT 充分利用 Doris 中列式存储、向量化引擎、优化器等组件给用户带来极高的查询性能。 +下面是基于 clickbench 数据测试的结果: + +| | 存储空间 | +|--------------|------------| +| 预定义静态列 | 24.329 GB | +| variant 类型 | 24.296 GB | +| json 类型 | 46.730 GB | + + + +**节省约 50%存储容量** + +| 查询次数 | 预定义静态列 | variant 类型 | json 类型 | +|----------------|--------------|--------------|-----------------| +| 第一次查询 (cold) | 233.79s | 248.66s | **大部分查询超时** | +| 第二次查询 (hot) | 86.02s | 94.82s | 789.24s | +| 第三次查询 (hot) | 83.03s | 92.29s | 743.69s | + +**查询提速 8+倍, 查询性能与静态列相当** + +### Example + +``` +用一个从建表、导数据、查询全周期的例子说明VARIANT的功能和用法。 +``` + +**建表语法** + +建表,建表语法关键字 variant + +``` sql +-- 无索引 +CREATE TABLE IF NOT EXISTS ${table_name} ( + k BIGINT, + v VARIANT +) +table_properties; + +-- 在v列创建索引,可选指定分词方式,默认不分词 +CREATE TABLE IF NOT EXISTS ${table_name} ( + k BIGINT, + v VARIANT, + INDEX idx_var(v) USING INVERTED [PROPERTIES("parser" = "english|unicode|chinese")] [COMMENT 'your comment'] +) +table_properties; +``` + +**查询语法** + +``` sql +-- 1. 查询方式1, 查询语法稍有不同,v 后使用:分割,.访问子列,例如 +SELECT v:`properties`.`title` from ${table_name} +-- 2. 查询方式2, 使用 v['a']['b'] 形式例如 +SELECT v["properties"]["title"] from ${table_name} +``` + +### 基于 github events 数据集示例 + +这里用 github events 数据展示 variant 的建表、导入、查询。 +下面是格式化后的一行数据 + +``` json +{ + "id": "14186154924", + "type": "PushEvent", + "actor": { + "id": 282080, + "login": "brianchandotcom", + "display_login": "brianchandotcom", + "gravatar_id": "", + "url": "https://api.github.com/users/brianchandotcom", + "avatar_url": "https://avatars.githubusercontent.com/u/282080?" + }, + "repo": { + "id": 1920851, + "name": "brianchandotcom/liferay-portal", + "url": "https://api.github.com/repos/brianchandotcom/liferay-portal" + }, + "payload": { + "push_id": 6027092734, + "size": 4, + "distinct_size": 4, + "ref": "refs/heads/master", + "head": "91edd3c8c98c214155191feb852831ec535580ba", + "before": "abb58cc0db673a0bd5190000d2ff9c53bb51d04d", + "commits": [""] + }, + "public": true, + "created_at": "2020-11-13T18:00:00Z" +} +``` + +**建表** + +- 创建了三个 VARIANT 类型的列, `actor`,`repo` 和 `payload` +- 创建表的同时创建了 `payload` 列的倒排索引 `idx_payload` +- USING INVERTED 指定索引类型是倒排索引,用于加速子列的条件过滤 +- `PROPERTIES("parser" = "english")` 指定采用 english 分词 + +``` sql +CREATE DATABASE test_variant; +USE test_variant; +CREATE TABLE IF NOT EXISTS github_events ( + id BIGINT NOT NULL, + type VARCHAR(30) NULL, + actor VARIANT NULL, + repo VARIANT NULL, + payload VARIANT NULL, + public BOOLEAN NULL, + created_at DATETIME NULL, + INDEX idx_payload (`payload`) USING INVERTED PROPERTIES("parser" = "english") COMMENT 'inverted index for payload' +) +DUPLICATE KEY(`id`) +DISTRIBUTED BY HASH(id) BUCKETS 10 +properties("replication_num" = "1"); +``` + +**需要注意的是:** + +::: tip + +1. 在 VARIANT 列上创建索引,比如 payload 的子列很多时,可能会造成索引列过多,影响写入性能 +2. 同一个 VARIANT 列的分词属性是相同的,如果您有不同的分词需求,那么可以创建多个 VARIANT 然后分别指定索引属性 + +::: + +**使用 streamload 导入** + +导入gh_2022-11-07-3.json,这是 github events 一个小时的数据 + +``` shell +wget http://doris-build-hk-1308700295.cos.ap-hongkong.myqcloud.com/regression/variant/gh_2022-11-07-3.json + +curl --location-trusted -u root: -T gh_2022-11-07-3.json -H "read_json_by_line:true" -H "format:json" http://127.0.0.1:18148/api/test_variant/github_events/_strea +m_load + +{ + "TxnId": 2, + "Label": "086fd46a-20e6-4487-becc-9b6ca80281bf", + "Comment": "", + "TwoPhaseCommit": "false", + "Status": "Success", + "Message": "OK", + "NumberTotalRows": 139325, + "NumberLoadedRows": 139325, + "NumberFilteredRows": 0, + "NumberUnselectedRows": 0, + "LoadBytes": 633782875, + "LoadTimeMs": 7870, + "BeginTxnTimeMs": 19, + "StreamLoadPutTimeMs": 162, + "ReadDataTimeMs": 2416, + "WriteDataTimeMs": 7634, + "CommitAndPublishTimeMs": 55 +} +``` + +确认导入成功 + +``` sql +-- 查看行数 +mysql> select count() from github_events; ++----------+ +| count(*) | ++----------+ +| 139325 | ++----------+ +1 row in set (0.25 sec) + +-- 随机看一条数据 +mysql> select * from github_events limit 1; ++-------------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------- +---------------------+ +| id | type | actor | repo | payload | public | created_at | ++-------------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------- +---------------------+ +| 25061821748 | PushEvent | {"gravatar_id":"","display_login":"jfrog-pipelie-intg","url":"https://api.github.com/users/jfrog-pipelie-intg","id":98024358,"login":"jfrog-pipelie-intg","avatar_url":"https://avatars.githubusercontent.com/u/98024358?"} | {"url":"https://api.github.com/repos/jfrog-pipelie-intg/jfinte2e_1667789956723_16","id":562683829,"name":"jfrog-pipelie-intg/jfinte2e_1667789956723_16"} | {"commits":[{"sha":"334433de436baa198024ef9f55f0647721bcd750","author":{"email":"98024358+jfrog-pipelie-i...@users.noreply.github.com","name":"jfrog-pipelie-intg"},"message":"commit message 10238493157623136117","distinct":true,"url":"https://api.github.com/repos/jfrog-pipelie-intg/jfinte2e_1667789956723_16/commits/334433de436baa198024ef9f55f0647721bcd750"}],"before":"f84a26792f44d54305ddd41b7e3a79d25b1a9568","head":"334433de436baa198024ef9f55f0647721bcd750","size":1,"push_id":11572649828,"ref":"refs/heads/test-notification-sent-branch-10238493157623136113","distinct_size":1} | 1 | 2022-11-07 11:00:00 | ++-------------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------- +---------------------+ +1 row in set (0.23 sec) +``` + +desc 查看 schema 信息,子列会在存储层自动扩展、并进行类型推导 + +``` sql +mysql> set describe_extend_variant_column = true; Review Comment: done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org For additional commands, e-mail: commits-h...@doris.apache.org