This is an automated email from the ASF dual-hosted git repository.

jiafengzheng pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push: new 78acfb0cb8b fix 78acfb0cb8b is described below commit 78acfb0cb8bc87aac2b95c6073c8dcbacaa8b997 Author: jiafeng.zhang <zhang...@gmail.com> AuthorDate: Mon Aug 29 13:36:04 2022 +0800 fix --- .../import/import-scenes/external-storage-load.md | 2 +- .../import/import-scenes/external-table-load.md | 14 +- .../data-operate/import/import-scenes/jdbc-load.md | 8 +- .../import/import-way/routine-load-manual.md | 445 +++++++++++++-------- .../import/import-scenes/external-storage-load.md | 10 +- .../import/import-scenes/external-table-load.md | 12 +- .../data-operate/import/import-scenes/jdbc-load.md | 8 +- .../import/import-way/routine-load-manual.md | 23 +- .../current/get-starting/get-starting.md | 10 +- 9 files changed, 319 insertions(+), 213 deletions(-) diff --git a/docs/data-operate/import/import-scenes/external-storage-load.md b/docs/data-operate/import/import-scenes/external-storage-load.md index f1e01469416..4909e0cd03e 100644 --- a/docs/data-operate/import/import-scenes/external-storage-load.md +++ b/docs/data-operate/import/import-scenes/external-storage-load.md @@ -114,7 +114,7 @@ Starting from version 0.14, Doris supports the direct import of data from online This document mainly introduces how to import data stored in AWS S3. It also supports the import of other object storage systems that support the S3 protocol. ### Applicable scenarios -* Source data in S3 protocol accessible storage systems, such as S3, BOS. +* Source data in S3 protocol accessible storage systems, such as S3. * Data volumes range from tens to hundreds of GB. ### Preparing diff --git a/docs/data-operate/import/import-scenes/external-table-load.md b/docs/data-operate/import/import-scenes/external-table-load.md index 7fae1d101c3..8fa7ef3530a 100644 --- a/docs/data-operate/import/import-scenes/external-table-load.md +++ b/docs/data-operate/import/import-scenes/external-table-load.md @@ -26,17 +26,21 @@ under the License. # Synchronize data through external table -Doris can create external tables that are accessed through the ODBC protocol. After the creation is complete, you can directly query the data of the external table through the SELECT statement, or import the data of the external table through `INSERT INTO SELECT`. +Doris can create external tables. Once created, you can query the data of the external table directly with the SELECT statement or import the data of the external table with the `INSERT INTO SELECT` method. -This document describes how to create external tables accessed through the ODBC protocol, and how to import data from these external tables. Currently supported data sources include: +Doris external tables currently support the following data sources. - MySQL - Oracle - PostgreSQL -- SQL Server -- Hive (supported in version 1.0) +- SQLServer +- Hive +- Iceberg +- ElasticSearch -## create external table +This document describes how to create external tables accessible through the ODBC protocol and how to import data from these external tables. + +## Create external table For a detailed introduction to creating ODBC external tables, please refer to the [CREATE ODBC TABLE](../../../sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-EXTERNAL-TABLE.md) syntax help manual. 
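For orientation, here is a minimal sketch of the flow described above: an ODBC external table backed by a MySQL source, followed by an `INSERT INTO SELECT` import into a Doris table. The table names, driver name, and connection properties are placeholders rather than values from this commit, and the driver name must match an ODBC driver actually configured on the BE nodes.

```sql
-- Hypothetical ODBC external table pointing at a MySQL source table;
-- host/port/user/password/driver values are placeholders.
CREATE EXTERNAL TABLE ext_mysql_orders (
    order_id BIGINT,
    user_id  BIGINT,
    amount   DECIMAL(10, 2)
) ENGINE = ODBC
PROPERTIES (
    "host" = "192.168.0.1",
    "port" = "3306",
    "user" = "mysql_user",
    "password" = "mysql_passwd",
    "database" = "source_db",
    "table" = "orders",
    "driver" = "MySQL ODBC 8.0 Unicode Driver",
    "odbc_type" = "mysql"
);

-- Import the external table's data into an existing Doris table.
INSERT INTO doris_orders
SELECT order_id, user_id, amount FROM ext_mysql_orders;
```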
diff --git a/docs/data-operate/import/import-scenes/jdbc-load.md b/docs/data-operate/import/import-scenes/jdbc-load.md index ea3d35e3bdf..5dc6a2fd5c6 100644 --- a/docs/data-operate/import/import-scenes/jdbc-load.md +++ b/docs/data-operate/import/import-scenes/jdbc-load.md @@ -1,6 +1,6 @@ --- { - "title": "Synchronize data using JDBC", + "title": "Synchronize data using Insert method", "language": "en" } --- @@ -24,9 +24,9 @@ specific language governing permissions and limitations under the License. --> -# Synchronize data using JDBC +# Synchronize data using Insert method -Users can use INSERT statements to import data through the JDBC protocol. +Users can use INSERT statement to import data through MySQL protocol. The INSERT statement is used in a similar way to the INSERT statement used in databases such as MySQL. The INSERT statement supports the following two syntaxes: @@ -37,7 +37,7 @@ The INSERT statement is used in a similar way to the INSERT statement used in da Here we only introduce the second way. For a detailed description of the INSERT command, see the [INSERT](../../../sql-manual/sql-reference/Data-Manipulation-Statements/Manipulation/INSERT.md) command documentation. -## single write +## Single write Single write means that the user directly executes an INSERT command. An example is as follows: diff --git a/docs/data-operate/import/import-way/routine-load-manual.md b/docs/data-operate/import/import-way/routine-load-manual.md index 3d7ce77fd55..651df49ee6c 100644 --- a/docs/data-operate/import/import-way/routine-load-manual.md +++ b/docs/data-operate/import/import-way/routine-load-manual.md @@ -87,250 +87,345 @@ Currently we only support routine load from the Kafka system. This section detai ### Create a routine load task -The detailed syntax for creating a routine load task can be connected to Doris and execute `HELP ROUTINE LOAD;` to see the syntax help. Here is a detailed description of the precautions when creating a job. - -* columns_mapping - - `columns_mapping` is mainly used to specify the column structure of the table structure and message, as well as the conversion of some columns. If not specified, Doris will default to the columns in the message and the columns of the table structure in a one-to-one correspondence. Although under normal circumstances, if the source data is exactly one-to-one, normal data load can be performed without specifying. However, we still strongly recommend that users ** explicitly specify col [...] - - In `columns_mapping` we can also use some built-in functions for column conversion. But you need to pay attention to the actual column type corresponding to the function parameters. for example: - - Suppose the user needs to load a table containing only a column of `k1` with a column type of `int`. And you need to convert the null value in the source file to 0. This feature can be implemented with the `ifnull` function. The correct way to use is as follows: - - `COLUMNS (xx, k1=ifnull(xx, "3"))` - - Note that we use `"3"` instead of `3`, although `k1` is of type `int`. Because the column type in the source data is `varchar` for the load task, the `xx` virtual column is also of type `varchar`. So we need to use `"3"` to match the match, otherwise the `ifnull` function can't find the function signature with the parameter `(varchar, int)`, and an error will occur. - - As another example, suppose the user needs to load a table containing only a column of `k1` with a column type of `int`. 
And you need to process the corresponding column in the source file: convert the negative number to a positive number and the positive number to 100. This function can be implemented with the `case when` function. The correct wording should be as follows: - - `COLUMNS (xx, k1 = case when xx < 0 then cast(-xx as varchar) else cast((xx + '100') as varchar) end)` - - Note that we need to convert all the parameters in `case when` to varchar in order to get the desired result. - -* where_predicates - - The type of the column in `where_predicates` is already the actual column type, so there is no need to cast to the varchar type as `columns_mapping`. Write according to the actual column type. - -* desired\_concurrent\_number - - `desired_concurrent_number` is used to specify the degree of concurrency expected for a routine job. That is, a job, at most how many tasks are executing at the same time. For Kafka load, the current actual concurrency is calculated as follows: - - ``` - Min(partition num, desired_concurrent_number, Config.max_routine_load_task_concurrent_num) - ``` - - Where `Config.max_routine_load_task_concurrent_num` is a default maximum concurrency limit for the system. This is a FE configuration that can be adjusted by changing the configuration. The default is 5. - - Where partition num refers to the number of partitions for the Kafka topic subscribed to. - -* max\_batch\_interval/max\_batch\_rows/max\_batch\_size - - These three parameters are used to control the execution time of a single task. If any of the thresholds is reached, the task ends. Where `max_batch_rows` is used to record the number of rows of data read from Kafka. `max_batch_size` is used to record the amount of data read from Kafka in bytes. The current consumption rate for a task is approximately 5-10MB/s. - - So assume a row of data 500B, the user wants to be a task every 100MB or 10 seconds. The expected processing time for 100MB is 10-20 seconds, and the corresponding number of rows is about 200000 rows. Then a reasonable configuration is: - - ``` - "max_batch_interval" = "10", - "max_batch_rows" = "200000", - "max_batch_size" = "104857600" - ``` +The detailed syntax for creating a routine import task can be found in [CREATE ROUTINE LOAD](... /... /... /sql-manual/sql-reference/Data-Manipulation-Statements/Load/CREATE-ROUTINE-LOAD.md) after connecting to Doris command manual, or execute `HELP ROUTINE LOAD;` for syntax help. + +Here we illustrate how to create Routine Load tasks with a few examples. + +1. Create a Kafka example import task named test1 for example_tbl of example_db. Specify the column separator and group.id and client.id, and automatically consume all partitions by default and subscribe from the location where data is available (OFFSET_BEGINNING). + +```sql +CREATE ROUTINE LOAD example_db.test1 ON example_tbl + COLUMNS TERMINATED BY ",", + COLUMNS(k1, k2, k3, v1, v2, v3 = k1 * 100) + PROPERTIES + ( + "desired_concurrent_number"="3", + "max_batch_interval" = "20", + "max_batch_rows" = "300000", + "max_batch_size" = "209715200", + "strict_mode" = "false" + ) + FROM KAFKA + ( + "kafka_broker_list" = "broker1:9092,broker2:9092,broker3:9092", + "kafka_topic" = "my_topic", + "property.group.id" = "xxx", + "property.client.id" = "xxx", + "property.kafka_default_offsets" = "OFFSET_BEGINNING" + ); +``` - The parameters in the above example are also the default parameters for these configurations. +2. Create a Kafka example import task named test1 for example_tbl of example_db in **strict mode**. 
+ +```sql +CREATE ROUTINE LOAD example_db.test1 ON example_tbl + COLUMNS(k1, k2, k3, v1, v2, v3 = k1 * 100), + WHERE k1 > 100 and k2 like "%doris%" + PROPERTIES + ( + "desired_concurrent_number"="3", + "max_batch_interval" = "20", + "max_batch_rows" = "300000", + "max_batch_size" = "209715200", + "strict_mode" = "true" + ) + FROM KAFKA + ( + "kafka_broker_list" = "broker1:9092,broker2:9092,broker3:9092", + "kafka_topic" = "my_topic", + "kafka_partitions" = "0,1,2,3", + "kafka_offsets" = "101,0,0,200" + ); +``` -* max\_error\_number +>Notes: +> +>"strict_mode" = "true" - `max_error_number` is used to control the error rate. When the error rate is too high, the job will automatically pause. Because the entire job is stream-oriented, and because of the borderless nature of the data stream, we can't calculate the error rate with an error ratio like other load tasks. So here is a new way of calculating to calculate the proportion of errors in the data stream. +3. Example of importing data in Json format - We have set up a sampling window. The size of the window is `max_batch_rows * 10`. Within a sampling window, if the number of error lines exceeds `max_error_number`, the job is suspended. If it is not exceeded, the next window restarts counting the number of error lines. + Routine Load only supports the following two types of json formats - We assume that `max_batch_rows` is 200000 and the window size is 2000000. Let `max_error_number` be 20000, that is, the user expects an error behavior of 20000 for every 2000000 lines. That is, the error rate is 1%. But because not every batch of tasks consumes 200000 rows, the actual range of the window is [2000000, 2200000], which is 10% statistical error. + The first one has only one record and is a json object. - The error line does not include rows that are filtered out by the where condition. But include rows that do not have a partition in the corresponding Doris table. +```json +{"category":"a9jadhx","author":"test","price":895} +``` -* data\_source\_properties +The second one is a json array, which can contain multiple records + +```json +[ + { + "category":"11", + "author":"4avc", + "price":895, + "timestamp":1589191587 + }, + { + "category":"22", + "author":"2avc", + "price":895, + "timestamp":1589191487 + }, + { + "category":"33", + "author":"3avc", + "price":342, + "timestamp":1589191387 + } +] +``` - The specific Kafka partition can be specified in `data_source_properties`. If not specified, all partitions of the subscribed topic are consumed by default. +Create the Doris data table to be imported + +```sql +CREATE TABLE `example_tbl` ( + `category` varchar(24) NULL COMMENT "", + `author` varchar(24) NULL COMMENT "", + `timestamp` bigint(20) NULL COMMENT "", + `dt` int(11) NULL COMMENT "", + `price` double REPLACE +) ENGINE=OLAP +AGGREGATE KEY(`category`,`author`,`timestamp`,`dt`) +COMMENT "OLAP" +PARTITION BY RANGE(`dt`) +( + PARTITION p0 VALUES [("-2147483648"), ("20200509")), + PARTITION p20200509 VALUES [("20200509"), ("20200510")), + PARTITION p20200510 VALUES [("20200510"), ("20200511")), + PARTITION p20200511 VALUES [("20200511"), ("20200512")) +) +DISTRIBUTED BY HASH(`category`,`author`,`timestamp`) BUCKETS 4 +PROPERTIES ( + "replication_num" = "1" +); +``` - Note that when partition is explicitly specified, the load job will no longer dynamically detect changes to Kafka partition. If not specified, the partitions that need to be consumed are dynamically adjusted based on changes in the kafka partition. 
+Import json data in simple mode + +```sql +CREATE ROUTINE LOAD example_db.test_json_label_1 ON table1 +COLUMNS(category,price,author) +PROPERTIES +( + "desired_concurrent_number"="3", + "max_batch_interval" = "20", + "max_batch_rows" = "300000", + "max_batch_size" = "209715200", + "strict_mode" = "false", + "format" = "json" +) +FROM KAFKA +( + "kafka_broker_list" = "broker1:9092,broker2:9092,broker3:9092", + "kafka_topic" = "my_topic", + "kafka_partitions" = "0,1,2", + "kafka_offsets" = "0,0,0" + ); +``` -* strict\_mode +Accurate import of data in json format + +```sql +CREATE ROUTINE LOAD example_db.test1 ON example_tbl +COLUMNS(category, author, price, timestamp, dt=from_unixtime(timestamp, '%Y%m%d')) +PROPERTIES +( + "desired_concurrent_number"="3", + "max_batch_interval" = "20", + "max_batch_rows" = "300000", + "max_batch_size" = "209715200", + "strict_mode" = "false", + "format" = "json", + "jsonpaths" = "[\"$.category\",\"$.author\",\"$.price\",\"$.timestamp\"]", + "strip_outer_array" = "true" +) +FROM KAFKA +( + "kafka_broker_list" = "broker1:9092,broker2:9092,broker3:9092", + "kafka_topic" = "my_topic", + "kafka_partitions" = "0,1,2", + "kafka_offsets" = "0,0,0" +); +``` - Routine load load can turn on strict mode mode. The way to open it is to add ```"strict_mode" = "true"``` to job\_properties. The default strict mode is off. +>**Notes:** +> +>The partition field `dt` in the table is not in our data, but is converted in our Routine load statement by `dt=from_unixtime(timestamp, '%Y%m%d')` - The strict mode mode means strict filtering of column type conversions during the load process. The strict filtering strategy is as follows: +**strict mode import relationship with source data** - 1. For column type conversion, if strict mode is true, the wrong data will be filtered. The error data here refers to the fact that the original data is not null, and the result is a null value after participating in the column type conversion. +Here is an example with a column type of TinyInt - 2. When a loaded column is generated by a function transformation, strict mode has no effect on it. +> Notes: When a column in the table allows importing null values - 3. For a column type loaded with a range limit, if the original data can pass the type conversion normally, but cannot pass the range limit, strict mode will not affect it. For example, if the type is decimal(1,0) and the original data is 10, it is eligible for type conversion but not for column declarations. This data strict has no effect on it. +> -* merge\_type - The type of data merging supports three types: APPEND, DELETE, and MERGE. APPEND is the default value, which means that all this batch of data needs to be appended to the existing data. DELETE means to delete all rows with the same key as this batch of data. 
MERGE semantics Need to be used in conjunction with the delete condition, which means that the data that meets the delete condition is processed according to DELETE semantics and the rest is processed according to APPEND semantics +| source data | source data example | string to int | strict_mode | result | +| ----------- | ------------------- | ------------- | ------------- | ---------------------- | +| Null value | \N | N/A | true or false | NULL | +| not null | aaa or 2000 | NULL | true | invalid data(filtered) | +| not null | aaa | NULL | false | NULL | +| not null | 1 | 1 | true or false | correct data | -**strict mode and load relationship of source data** +Here is an example with the column type Decimal(1,0) -Here is an example of a column type of TinyInt. +> Notes: +> +> When the columns in the table allow importing null values -> Note: When a column in a table allows a null value to be loaded +| source data | source data example | string to int | strict_mode | result | +| ----------- | ------------------- | ------------- | ------------- | ---------------------- | +| Null value | \N | N/A | true or false | NULL | +| not null | aaa | NULL | true | invalid data(filtered) | +| not null | aaa | NULL | false | NULL | +| not null | 1 or 10 | 1 | true or false | correct data | -|source data | source data example | string to int | strict_mode | result| -|------------|---------------------|-----------------|--------------------|---------| -|null | \N | N/A | true or false | NULL| -|not null | aaa or 2000 | NULL | true | invalid data(filtered)| -|not null | aaa | NULL | false | NULL| -|not null | 1 | 1 | true or false | correct data| +> Notes: +> +> Although 10 is an out-of-range value, it is not affected by strict mode because its type meets the decimal requirement. 10 will eventually be filtered in other ETL processing processes. But it will not be filtered by strict mode. -Here the column type is Decimal(1,0) +**Accessing an SSL-certified Kafka cluster** -> Note: When a column in a table allows a null value to be loaded +Accessing an SSL-certified Kafka cluster requires the user to provide the certificate file (ca.pem) used to authenticate the Kafka Broker's public key. If the Kafka cluster also has client authentication enabled, the client's public key (client.pem), the key file (client.key), and the key password are also required. The required files need to be uploaded to Doris first via the `CREAE FILE` command, **and the catalog name is `kafka`**. See `HELP CREATE FILE;` for help with the `CREATE FIL [...] -|source data | source data example | string to int | strict_mode | result| -|------------|---------------------|-----------------|--------------------|--------| -|null | \N | N/A | true or false | NULL| -|not null | aaa | NULL | true | invalid data(filtered)| -|not null | aaa | NULL | false | NULL| -|not null | 1 or 10 | 1 | true or false | correct data| +1. uploading a file -> Note: 10 Although it is a value that is out of range, because its type meets the requirements of decimal, strict mode has no effect on it. 10 will eventually be filtered in other ETL processing flows. But it will not be filtered by strict mode. 
+```sql +CREATE FILE "ca.pem" PROPERTIES("url" = "https://example_url/kafka-key/ca.pem", "catalog" = "kafka"); +CREATE FILE "client.key" PROPERTIES("url" = "https://example_urlkafka-key/client.key", "catalog" = "kafka"); +CREATE FILE "client.pem" PROPERTIES("url" = "https://example_url/kafka-key/client.pem", "catalog" = "kafka"); +``` -**Accessing SSL-certified Kafka clusters** +2. Create routine import jobs + +```sql +CREATE ROUTINE LOAD db1.job1 on tbl1 +PROPERTIES +( + "desired_concurrent_number"="1" +) +FROM KAFKA +( + "kafka_broker_list"= "broker1:9091,broker2:9091", + "kafka_topic" = "my_topic", + "property.security.protocol" = "ssl", + "property.ssl.ca.location" = "FILE:ca.pem", + "property.ssl.certificate.location" = "FILE:client.pem", + "property.ssl.key.location" = "FILE:client.key", + "property.ssl.key.password" = "abcdefg" +); +``` -Accessing the SSL-certified Kafka cluster requires the user to provide a certificate file (ca.pem) for authenticating the Kafka Broker public key. If the Kafka cluster has both client authentication enabled, you will also need to provide the client's public key (client.pem), key file (client.key), and key password. The files needed here need to be uploaded to Doris via the `CREAE FILE` command, ** and the catalog name is `kafka`**. See `HELP CREATE FILE;` for specific help on the `CREATE [...] +>Doris accesses Kafka clusters through Kafka's C++ API `librdkafka`. The parameters supported by `librdkafka` can be found in +> +>[https://github.com/edenhill/librdkafka/blob/master/CONFIGURATION.md](https://github.com/edenhill/librdkafka/blob/master/ CONFIGURATION.md) +> +> -1. Upload file +### Viewing Job Status - ``` - CREATE FILE "ca.pem" PROPERTIES("url" = "https://example_url/kafka-key/ca.pem", "catalog" = "kafka"); - CREATE FILE "client.key" PROPERTIES("url" = "https://example_urlkafka-key/client.key", "catalog" = "kafka"); - CREATE FILE "client.pem" PROPERTIES("url" = "https://example_url/kafka-key/client.pem", "catalog" = "kafka"); - ``` +Specific commands and examples to view the status of **jobs** can be viewed with the `HELP SHOW ROUTINE LOAD;` command. -2. Create a routine load job +Specific commands and examples to view the status of **tasks** running can be viewed with the `HELP SHOW ROUTINE LOAD TASK;` command. - ``` - CREATE ROUTINE LOAD db1.job1 on tbl1 - PROPERTIES - ( - "desired_concurrent_number"="1" - ) - FROM KAFKA - ( - "kafka_broker_list"= "broker1:9091,broker2:9091", - "kafka_topic" = "my_topic", - "property.security.protocol" = "ssl", - "property.ssl.ca.location" = "FILE:ca.pem", - "property.ssl.certificate.location" = "FILE:client.pem", - "property.ssl.key.location" = "FILE:client.key", - "property.ssl.key.password" = "abcdefg" - ); - ``` +Only currently running tasks can be viewed; closed and unstarted tasks cannot be viewed. -> Doris accesses Kafka clusters via Kafka's C++ API `librdkafka`. The parameters supported by `librdkafka` can be found. -> -> <https://github.com/edenhill/librdkafka/blob/master/CONFIGURATION.md> +### Modify job properties -### Viewing the status of the load job +Users can modify jobs that have already been created. The details can be viewed with the `HELP ALTER ROUTINE LOAD;` command or see [ALTER ROUTINE LOAD](... /... /... /sql-manual/sql-reference/Data-Manipulation-Statements/Load/ALTER-ROUTINE-LOAD.md). -Specific commands and examples for viewing the status of the **job** can be viewed with the `HELP SHOW ROUTINE LOAD;` command. 
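As a small illustration of the status commands mentioned above (the job name reuses the `example_db.test1` job from the earlier examples and is otherwise arbitrary):

```sql
-- State, progress and error statistics of a specific routine load job.
SHOW ROUTINE LOAD FOR example_db.test1;

-- Tasks currently running for that job; finished or unscheduled tasks are not listed.
SHOW ROUTINE LOAD TASK WHERE JobName = "test1";
```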
+### Job Control -Specific commands and examples for viewing the **Task** status can be viewed with the `HELP SHOW ROUTINE LOAD TASK;` command. +The user can control the stop, pause and restart of jobs with the `STOP/PAUSE/RESUME` commands. Help and examples can be viewed with the `HELP STOP ROUTINE LOAD;` `HELP PAUSE ROUTINE LOAD;` and `HELP RESUME ROUTINE LOAD;` commands. -You can only view tasks that are currently running, and tasks that have ended and are not started cannot be viewed. +## Other notes -### Alter job +1. The relationship between a routine import job and an ALTER TABLE operation -Users can modify jobs that have been created. Specific instructions can be viewed through the `HELP ALTER ROUTINE LOAD;` command. Or refer to [ALTER ROUTINE LOAD](../../../sql-manual/sql-reference/Data-Manipulation-Statements/Load/ALTER-ROUTINE-LOAD.md). + - Example import does not block SCHEMA CHANGE and ROLLUP operations. However, note that if the column mapping relationships do not match after the SCHEMA CHANGE completes, it can cause a spike in error data for the job and eventually cause the job to pause. It is recommended that you reduce this problem by explicitly specifying column mapping relationships in routine import jobs and by adding Nullable columns or columns with Default values. + - Deleting a Partition of a table may cause the imported data to fail to find the corresponding Partition and the job to enter a pause. 2. -### Job Control +2. Relationship between routine import jobs and other import jobs (LOAD, DELETE, INSERT) -The user can control the stop, pause and restart of the job by the three commands `STOP/PAUSE/RESUME`. You can view help and examples with the three commands `HELP STOP ROUTINE LOAD;`, `HELP PAUSE ROUTINE LOAD;` and `HELP RESUME ROUTINE LOAD;`. + - There is no conflict between the routine import and other LOAD operations and INSERT operations. + - When the DELETE operation is executed, the corresponding table partition cannot have any ongoing import jobs. Therefore, before executing DELETE operation, you may need to suspend the routine import job and wait until all the issued tasks are completed before executing DELETE. 3. -## other instructions +3. The relationship between routine import and DROP DATABASE/TABLE operations -1. The relationship between a routine load job and an ALTER TABLE operation + When the database or table corresponding to the routine import is deleted, the job will automatically CANCEL. - * Routine load does not block SCHEMA CHANGE and ROLLUP operations. Note, however, that if the column mappings are not matched after SCHEMA CHANGE is completed, the job's erroneous data will spike and eventually cause the job to pause. It is recommended to reduce this type of problem by explicitly specifying column mappings in routine load jobs and by adding Nullable columns or columns with Default values. - * Deleting a Partition of a table may cause the loaded data to fail to find the corresponding Partition and the job will be paused. +4. The relationship between kafka type routine import jobs and kafka topic -2. Relationship between routine load jobs and other load jobs (LOAD, DELETE, INSERT) + When the `kafka_topic` declared by the user in the create routine import statement does not exist in the kafka cluster. - * Routine load does not conflict with other LOAD jobs and INSERT operations. - * When performing a DELETE operation, the corresponding table partition cannot have any load tasks being executed. 
Therefore, before performing the DELETE operation, you may need to pause the routine load job and wait for the delivered task to complete before you can execute DELETE. + - If the broker of the user's kafka cluster has `auto.create.topics.enable = true` set, then `kafka_topic` will be created automatically first, and the number of partitions created automatically is determined by the configuration of the broker in the **user's kafka cluster** with `num. partitions`. The routine job will keep reading data from the topic as normal. + - If the broker in the user's kafka cluster has `auto.create.topics.enable = false` set, the topic will not be created automatically and the routine job will be suspended with a status of `PAUSED` before any data is read. -3. Relationship between routine load jobs and DROP DATABASE/TABLE operations + So, if you want the kafka topic to be automatically created by the routine when it does not exist, just set `auto.create.topics.enable = true` for the broker in the **user's kafka cluster**. - When the corresponding database or table is deleted, the job will automatically CANCEL. +5. Problems that may arise in network isolated environments In some environments there are isolation measures for network segments and domain name resolution, so care needs to be taken -4. The relationship between the kafka type routine load job and kafka topic + 1. the Broker list specified in the Create Routine load task must be accessible by the Doris service + 2. If `advertised.listeners` is configured in Kafka, the addresses in `advertised.listeners` must be accessible to the Doris service - When the user creates a routine load declaration, the `kafka_topic` does not exist in the kafka cluster. +6. Specify the Partition and Offset for consumption - * If the broker of the user kafka cluster has `auto.create.topics.enable = true` set, `kafka_topic` will be automatically created first, and the number of partitions created automatically will be in the kafka cluster** of the user side. The broker is configured with `num.partitions`. The routine job will continue to read the data of the topic continuously. - * If the broker of the user kafka cluster has `auto.create.topics.enable = false` set, topic will not be created automatically, and the routine will be paused before any data is read, with the status `PAUSED`. + Doris supports specifying a Partition and Offset to start consumption. The new version also supports the ability to specify time points for consumption. The configuration of the corresponding parameters is explained here. - So, if the user wants to be automatically created by the routine when the kafka topic does not exist, just set the broker in the kafka cluster** of the user's side to set auto.create.topics.enable = true` . - -5. Problems that may occur in the some environment - In some environments, there are isolation measures for network segment and domain name resolution. So should pay attention to: - 1. The broker list specified in the routine load task must be accessible on the doris environment. - 2. If `advertised.listeners` is configured in kafka, The addresses in `advertised.listeners` need to be accessible on the doris environment. + There are three relevant parameters. -6. About specified Partition and Offset + - `kafka_partitions`: Specify the list of partitions to be consumed, e.g., "0, 1, 2, 3". + - `kafka_offsets`: specifies the starting offset of each partition, which must correspond to the number of `kafka_partitions` list. 
For example: "1000, 1000, 2000, 2000" + - `property.kafka_default_offset`: specifies the default starting offset of the partitions. - Doris supports specifying Partition and Offset to start consumption. The new version also supports the consumption function at a specified time point. The configuration relationship of the corresponding parameters is explained here. - - There are three relevant parameters: - - * `kafka_partitions`: Specify the list of partitions to be consumed, such as: "0, 1, 2, 3". - * `kafka_offsets`: Specify the starting offset of each partition, which must correspond to the number of `kafka_partitions` lists. Such as: "1000, 1000, 2000, 2000" - * `property.kafka_default_offset`: Specify the default starting offset of the partition. + When creating an import job, these three parameters can have the following combinations. - When creating an routine load job, these three parameters can have the following combinations: - - | Combinations | `kafka_partitions` | `kafka_offsets` | `property.kafka_default_offset` | Behavior | - |---|---|---|---|---| - |1| No | No | No | The system will automatically find all the partitions corresponding to the topic and start consumption from OFFSET_END | - |2| No | No | Yes | The system will automatically find all the partitions corresponding to the topic and start consumption from the position specified by the default offset | - |3| Yes | No | No | The system will start consumption from the OFFSET_END of the specified partition | - |4| Yes | Yes | No | The system will start consumption from the specified offset of the specified partition | - |5| Yes | No | Yes | The system will start consumption from the specified partition and the location specified by the default offset | + | combinations | `kafka_partitions` | `kafka_offsets` | `property.kafka_default_offset` | behavior | + | ------------ | ------------------ | --------------- | ------------------------------- | ------------------------------------------------------------ | + | 1 | No | No | No | The system will automatically find all partitions corresponding to the topic and start consuming them from OFFSET_END | + | 2 | No | No | Yes | The system will automatically find all the partitions corresponding to the topic and start consuming them from the default offset location. | + | 3 | Yes | No | No | The system will start consuming from the OFFSET_END of the specified partition. | + | 4 | Yes | Yes | No | The system will start consuming at the specified offset of the specified partition. | + | 5 | Yes | No | Yes | The system will start consuming at the default offset of the specified partition | - 7. The difference between STOP and PAUSE +7. The difference between STOP and PAUSE + + FE will automatically clean up the ROUTINE LOAD in STOP status periodically, while the PAUSE status can be restored to enable again. - the FE will automatically clean up stopped ROUTINE LOAD,while paused ROUTINE LOAD can be resumed +## Related Parameters -## Related parameters +Some system configuration parameters can affect the use of routine import. -Some system configuration parameters can affect the use of routine loads. +1. max_routine_load_task_concurrent_num -1. max\_routine\_load\_task\_concurrent\_num + FE configuration item, defaults to 5 and can be modified at runtime. This parameter limits the maximum number of concurrent subtasks for a routine import job. It is recommended to keep the default value. Setting it too large may result in too many concurrent tasks and consume cluster resources. 
- The FE configuration item, which defaults to 5, can be modified at runtime. This parameter limits the maximum number of subtask concurrency for a routine load job. It is recommended to maintain the default value. If the setting is too large, it may cause too many concurrent tasks and occupy cluster resources. +2. max_routine_load_task_num_per_be -2. max\_routine_load\_task\_num\_per\_be + FE configuration item, default is 5, can be modified at runtime. This parameter limits the maximum number of concurrently executed subtasks per BE node. It is recommended to keep the default value. If set too large, it may lead to too many concurrent tasks and consume cluster resources. - The FE configuration item, which defaults to 5, can be modified at runtime. This parameter limits the number of subtasks that can be executed concurrently by each BE node. It is recommended to maintain the default value. If the setting is too large, it may cause too many concurrent tasks and occupy cluster resources. +3. max_routine_load_job_num -3. max\_routine\_load\_job\_num + FE configuration item, default is 100, can be modified at runtime. This parameter limits the total number of routine import jobs, including the states NEED_SCHEDULED, RUNNING, PAUSE. After this, no new jobs can be submitted. - The FE configuration item, which defaults to 100, can be modified at runtime. This parameter limits the total number of routine load jobs, including NEED_SCHEDULED, RUNNING, PAUSE. After the overtime, you cannot submit a new assignment. +4. max_consumer_num_per_group -4. max\_consumer\_num\_per\_group + BE configuration item, default is 3. This parameter indicates the maximum number of consumers that can be generated for data consumption in a subtask. For a Kafka data source, a consumer may consume one or more kafka partitions. If there are only 2 partitions, only 2 consumers are generated, each consuming 1 partition. 5. push_write_mby - BE configuration item, the default is 3. This parameter indicates that up to several consumers are generated in a subtask for data consumption. For a Kafka data source, a consumer may consume one or more kafka partitions. Suppose a task needs to consume 6 kafka partitions, it will generate 3 consumers, and each consumer consumes 2 partitions. If there are only 2 partitions, only 2 consumers will be generated, and each consumer will consume 1 partition. +5. push_write_mbytes_per_sec -5. push\_write\_mbytes\_per\_sec + BE configuration item. The default is 10, i.e. 10MB/s. This parameter is generic for importing and is not limited to routine import jobs. This parameter limits the speed at which imported data can be written to disk. For high performance storage devices such as SSDs, this speed limit can be increased as appropriate. 6. - BE configuration item. The default is 10, which is 10MB/s. This parameter is to load common parameters, not limited to routine load jobs. This parameter limits the speed at which loaded data is written to disk. For high-performance storage devices such as SSDs, this speed limit can be appropriately increased. +6. max_tolerable_backend_down_num -6. max\_tolerable\_backend\_down\_num - FE configuration item, the default is 0. Under certain conditions, Doris can reschedule PAUSED tasks, that becomes RUNNING?This parameter is 0, which means that rescheduling is allowed only when all BE nodes are in alive state. + FE configuration item, the default value is 0. Doris can PAUSED job rescheduling to RUNNING if certain conditions are met. 
0 means rescheduling is allowed only if all BE nodes are ALIVE. -7. period\_of\_auto\_resume\_min - FE configuration item, the default is 5 mins. Doris reschedules will only try at most 3 times in the 5 minute period. If all 3 times fail, the current task will be locked, and auto-scheduling will not be performed. However, manual intervention can be performed. +7. period_of_auto_resume_min -## More Help + FE configuration item, the default is 5 minutes, Doris rescheduling will only be attempted up to 3 times within the 5 minute period. If all 3 attempts fail, the current task is locked and no further scheduling is performed. However, manual recovery can be done through human intervention. -For more detailed syntax used by **Routine load**, you can enter `HELP ROUTINE LOAD` on the Mysql client command line for more help. +## More help +For more detailed syntax on the use of **Routine Load**, you can type `HELP ROUTINE LOAD` at the Mysql client command line for more help. diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/import-scenes/external-storage-load.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/import-scenes/external-storage-load.md index 8496f1d56e8..c6051ecd504 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/import-scenes/external-storage-load.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/import-scenes/external-storage-load.md @@ -120,21 +120,21 @@ Hdfs load 创建导入语句,导入方式和[Broker Load](../../../data-operat ### 适用场景 -* 源数据在 支持S3协议的存储系统中,如 S3,BOS 等。 +* 源数据在 支持S3协议的存储系统中,如 S3 等。 * 数据量在 几十到百GB 级别。 ### 准备工作 1. 准本AK 和 SK - 首先需要找到或者重新生成 AWS `Access keys`,可以在AWS console 的 `My Security Credentials` 找到生成方式, 如下图所示: + 首先需要找到或者重新生成 AWS `Access keys`,可以在 AWS console 的 `My Security Credentials` 找到生成方式, 如下图所示: [AK_SK](/images/aws_ak_sk.png) 选择 `Create New Access Key` 注意保存生成 AK和SK. 2. 准备 REGION 和 ENDPOINT - REGION 可以在创建桶的时候选择也可以在桶列表中查看到。ENDPOINT 可以通过如下页面通过REGION查到 [AWS 文档](https://docs.aws.amazon.com/general/latest/gr/s3.html#s3_region) + REGION 可以在创建桶的时候选择也可以在桶列表中查看到。ENDPOINT 可以通过如下页面通过 REGION 查到 [AWS 文档](https://docs.aws.amazon.com/general/latest/gr/s3.html#s3_region) 其他云存储系统可以相应的文档找到与S3兼容的相关信息 ### 开始导入 -导入方式和[Broker Load](../../../data-operate/import/import-way/broker-load-manual.md) 基本相同,只需要将 `WITH BROKER broker_name ()` 语句替换成如下部分 +导入方式和 [Broker Load](../../../data-operate/import/import-way/broker-load-manual.md) 基本相同,只需要将 `WITH BROKER broker_name ()` 语句替换成如下部分 ``` WITH S3 ( @@ -168,7 +168,7 @@ Hdfs load 创建导入语句,导入方式和[Broker Load](../../../data-operat ### 常见问题 -S3 SDK 默认使用 virtual-hosted style 方式。但某些对象存储系统可能没开启或没支持 virtual-hosted style 方式的访问,此时我们可以添加 `use_path_style` 参数来强制使用 path style 方式: +S3 SDK 默认使用 `virtual-hosted style` 方式。但某些对象存储系统可能没开启或没支持 `virtual-hosted style` 方式的访问,此时我们可以添加 `use_path_style` 参数来强制使用 `path style` 方式: ``` WITH S3 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/import-scenes/external-table-load.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/import-scenes/external-table-load.md index 19d85e419ea..76192769e86 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/import-scenes/external-table-load.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/import-scenes/external-table-load.md @@ -27,15 +27,19 @@ under the License. 
# 通过外部表同步数据 -Doris 可以创建通过 ODBC 协议访问的外部表。创建完成后,可以通过 SELECT 语句直接查询外部表的数据,也可以通过 `INSERT INTO SELECT` 的方式导入外部表的数据。 +Doris 可以创建外部表。创建完成后,可以通过 SELECT 语句直接查询外部表的数据,也可以通过 `INSERT INTO SELECT` 的方式导入外部表的数据。 -本文档主要介绍如何创建通过 ODBC 协议访问的外部表,以及如何导入这些外部表的数据。目前支持的数据源包括: +Doris 外部表目前支持的数据源包括: - MySQL - Oracle - PostgreSQL - SQLServer -- Hive(1.0版本支持) +- Hive +- Iceberg +- ElasticSearch + +本文档主要介绍如何创建通过 ODBC 协议访问的外部表,以及如何导入这些外部表的数据。 ## 创建外部表 @@ -124,5 +128,3 @@ PROPERTIES ( ## 更多帮助 关于 CREATE EXTERNAL TABLE 的更多详细语法和最佳实践,请参阅 [CREATE EXTERNAL TABLE](../../../sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-EXTERNAL-TABLE.md) 命令手册。 - -Doris ODBC 更多使用示例请参考 [文章列表](https://doris.apache.org/zh-CN/article/article-list.html) 。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/import-scenes/jdbc-load.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/import-scenes/jdbc-load.md index 1ab77c49c1b..c0012300fda 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/import-scenes/jdbc-load.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/import-scenes/jdbc-load.md @@ -1,6 +1,6 @@ --- { - "title": "使用JDBC同步数据", + "title": "使用 Insert 方式同步数据", "language": "zh-CN" } @@ -24,9 +24,9 @@ KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. --> -# 使用JDBC同步数据 +# 使用 Insert 方式同步数据 -用户可以通过 JDBC 协议,使用 INSERT 语句进行数据导入。 +用户可以通过 MySQL 协议,使用 INSERT 语句进行数据导入。 INSERT 语句的使用方式和 MySQL 等数据库中 INSERT 语句的使用方式类似。 INSERT 语句支持以下两种语法: @@ -79,7 +79,7 @@ public class DorisJDBCDemo { private static final String JDBC_DRIVER = "com.mysql.jdbc.Driver"; private static final String DB_URL_PATTERN = "jdbc:mysql://%s:%d/%s?rewriteBatchedStatements=true"; private static final String HOST = "127.0.0.1"; // Leader Node host - private static final int PORT = 9030; // query port of Leader Node + private static final int PORT = 9030; // query_port of Leader Node private static final String DB = "demo"; private static final String TBL = "test_1"; private static final String USER = "admin"; diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/import-way/routine-load-manual.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/import-way/routine-load-manual.md index 91d2be35798..a87e1392e19 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/import-way/routine-load-manual.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/import-way/routine-load-manual.md @@ -146,8 +146,9 @@ CREATE ROUTINE LOAD example_db.test1 ON example_tbl ```json [ - {"category":"11", - "title":"SayingsoftheCentury", + { + "category":"11", + "author":"4avc", "price":895, "timestamp":1589191587 }, @@ -160,14 +161,14 @@ CREATE ROUTINE LOAD example_db.test1 ON example_tbl { "category":"33", "author":"3avc", - "title":"SayingsoftheCentury", + "price":342, "timestamp":1589191387 } ] ``` - + 创建待导入的Doris数据表 - + ```sql CREATE TABLE `example_tbl` ( `category` varchar(24) NULL COMMENT "", @@ -190,9 +191,9 @@ CREATE ROUTINE LOAD example_db.test1 ON example_tbl "replication_num" = "1" ); ``` - + 以简单模式导入json数据 - + ```sql CREATE ROUTINE LOAD example_db.test_json_label_1 ON table1 COLUMNS(category,price,author) @@ -213,9 +214,9 @@ CREATE ROUTINE LOAD example_db.test1 ON example_tbl "kafka_offsets" = "0,0,0" ); ``` - + 精准导入json格式数据 - + ```sql CREATE ROUTINE LOAD example_db.test1 ON example_tbl 
COLUMNS(category, author, price, timestamp, dt=from_unixtime(timestamp, '%Y%m%d')) @@ -239,6 +240,10 @@ CREATE ROUTINE LOAD example_db.test1 ON example_tbl ); ``` +**注意:** 表里的分区字段 `dt` 在我们的数据里并没有,而是在我们Routine load 语句里通过 `dt=from_unixtime(timestamp, '%Y%m%d')` 转换出来的 + + + **strict mode 与 source data 的导入关系** 这里以列类型为 TinyInt 来举例 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/get-starting/get-starting.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/get-starting/get-starting.md index ae8c2af1c76..e07ddf3dd2d 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/get-starting/get-starting.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/get-starting/get-starting.md @@ -139,7 +139,7 @@ show frontends\G; 然后你可以看到类似下面的结果: ```sql -mysql> show frontends\G; +mysql> show frontends\G *************************** 1. row *************************** Name: 172.21.32.5_9010_1660549353220 IP: 172.21.32.5 @@ -179,7 +179,7 @@ Doris FE 的停止可以通过下面的命令完成 cd apache-doris-x.x.x/be ``` -修改 BE 配置文件 `conf/be.conf` ,这里我们主要修改两个参数:`priority_networks` 及 `storage_root` ,如果你需要更多优化配置,请参考 [BE 参数配置](../admin-manual/config/be-config)说明,进行调整。 +修改 FE 配置文件 `conf/be.conf` ,这里我们主要修改两个参数:`priority_networks'` 及 `storage_root` ,如果你需要更多优化配置,请参考 [BE 参数配置](../admin-manual/config/be-config)说明,进行调整。 1. 添加 priority_networks 参数 @@ -195,7 +195,7 @@ priority_networks=172.23.16.0/24 ``` -storage_root=/path/your/doris-meta +storage_root=/path/your/data_dir ``` >注意: @@ -227,13 +227,13 @@ ALTER SYSTEM ADD BACKEND "be_host_ip:heartbeat_service_port"; 你可以在 MySQL 命令行下执行下面的命令查看 BE 的运行状态。 ```sql -SHOW BACKENDS\G; +SHOW BACKENDS\G ``` 示例: ```sql -mysql> SHOW BACKENDS\G; +mysql> SHOW BACKENDS\G *************************** 1. row *************************** BackendId: 10003 Cluster: default_cluster --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org For additional commands, e-mail: commits-h...@doris.apache.org