Hisoka-X commented on code in PR #8048: URL: https://github.com/apache/seatunnel/pull/8048#discussion_r1842007810
########## README.md: ########## @@ -144,6 +144,7 @@ Yes, SeaTunnel is available under the Apache 2.0 License, allowing commercial us Our [Official Documentation](https://seatunnel.apache.org/docs) includes detailed guides and tutorials to help you get started. -### 7. Is there a community or support channel? +### 6. Is there a community or support channel? Join our Slack community for support and discussions: [SeaTunnel Slack](https://s.apache.org/seatunnel-slack). +more information, please refer to [FAQ](https://seatunnel.apache.org/docs/faq). Review Comment: ```suggestion Join our Slack community for support and discussions: [SeaTunnel Slack](https://s.apache.org/seatunnel-slack). More information, please refer to [FAQ](https://seatunnel.apache.org/docs/faq). ``` ########## docs/en/faq.md: ########## @@ -1,332 +1,123 @@ -# FAQs +# FAQ -## Why should I install a computing engine like Spark or Flink? +## What data sources and destinations does SeaTunnel support? +SeaTunnel supports various data sources and destinations. You can find a detailed list on the following list: +- Supported data sources (Source): [Source List](https://seatunnel.apache.org/docs/connector-v2/source) +- Supported data destinations (Sink): [Sink List](https://seatunnel.apache.org/docs/connector-v2/sink) -SeaTunnel now uses computing engines such as Spark and Flink to complete resource scheduling and node communication, so we can focus on the ease of use of data synchronization and the development of high-performance components. But this is only temporary. +## Does SeaTunnel support batch and streaming processing? +SeaTunnel supports both batch and streaming processing modes. You can select the appropriate mode based on your specific business scenarios and needs. Batch processing is suitable for scheduled data integration tasks, while streaming processing is ideal for real-time integration and Change Data Capture (CDC). 
-## I have a question, and I cannot solve it by myself +## Is it necessary to install engines like Spark or Flink when using SeaTunnel? +Spark and Flink are not mandatory. SeaTunnel supports Zeta, Spark, and Flink as integration engines, allowing you to choose one based on your needs. The community highly recommends Zeta, a new generation high-performance integration engine specifically designed for integration scenarios. Zeta is affectionately called "Ultraman Zeta" by community users! The community offers extensive support for Zeta, making it the most feature-rich option. -I have encountered a problem when using SeaTunnel and I cannot solve it by myself. What should I do? First, search in [Issue List](https://github.com/apache/seatunnel/issues) or [Mailing List](https://lists.apache.org/list.html?d...@seatunnel.apache.org) to see if someone has already asked the same question and got an answer. If you cannot find an answer to your question, you can contact community members for help in [These Ways](https://github.com/apache/seatunnel#contact-us). +## What data transformation functions does SeaTunnel provide? +SeaTunnel supports multiple data transformation functions, including field mapping, data filtering, data format conversion, and more. You can implement data transformations through the `transform` module in the configuration file. For more details, refer to the SeaTunnel [Transform Documentation](https://seatunnel.apache.org/docs/transform-v2). -## How do I declare a variable? +## Can SeaTunnel support custom data cleansing rules? +Yes, SeaTunnel supports custom data cleansing rules. You can configure custom rules in the `transform` module, such as cleaning up dirty data, removing invalid records, or converting fields. -Do you want to know how to declare a variable in SeaTunnel's configuration, and then dynamically replace the value of the variable at runtime? +## Does SeaTunnel support real-time incremental integration? 
+SeaTunnel supports incremental data integration. For example, the CDC connector allows real-time capture of data changes, which is ideal for scenarios requiring real-time data integration. -Since `v1.2.4`, SeaTunnel supports variable substitution in the configuration. This feature is often used for timing or non-timing offline processing to replace variables such as time and date. The usage is as follows: +## What CDC data sources are currently supported by SeaTunnel? +SeaTunnel currently supports MongoDB CDC, MySQL CDC, OpenGauss CDC, Oracle CDC, PostgreSQL CDC, SQL Server CDC, TiDB CDC, and more. For more details, refer to the [Source List](https://seatunnel.apache.org/docs/connector-v2/source). -Configure the variable name in the configuration. Here is an example of sql transform (actually, anywhere in the configuration file the value in `'key = value'` can use the variable substitution): +## How do I enable permissions required for SeaTunnel CDC integration? +Please refer to the official SeaTunnel documentation for the necessary steps to enable permissions for each connector’s CDC functionality. -``` -... -transform { - sql { - query = "select * from user_view where city ='"${city}"' and dt = '"${date}"'" - } -} -... -``` - -Taking Spark Local mode as an example, the startup command is as follows: - -```bash -./bin/start-seatunnel-spark.sh \ --c ./config/your_app.conf \ --e client \ --m local[2] \ --i city=shanghai \ --i date=20190319 -``` - -You can use the parameter `-i` or `--variable` followed by `key=value` to specify the value of the variable, where the key needs to be same as the variable name in the configuration. - -## How do I write a configuration item in multi-line text in the configuration file? +## Does SeaTunnel support CDC from MySQL replicas? How are logs pulled? +Yes, SeaTunnel supports CDC from MySQL replicas by subscribing to binlog logs, which are then parsed on the SeaTunnel server. 
-When a configured text is very long and you want to wrap it, you can use three double quotes to indicate its start and end: +## Does SeaTunnel support CDC integration for tables without primary keys? +No, SeaTunnel does not support CDC integration for tables without primary keys. This is because, in cases where two identical records exist in the upstream and one is deleted or modified, the downstream cannot determine which record to delete or modify, leading to potential issues. Having primary keys is essential for ensuring data uniqueness, similar to identifying the real Monkey King in the classic "Journey to the West." -``` -var = """ - whatever you want -""" -``` - -## How do I implement variable substitution for multi-line text? - -It is a little troublesome to do variable substitution in multi-line text, because the variable cannot be included in three double quotation marks: - -``` -var = """ -your string 1 -"""${you_var}""" your string 2""" -``` +## How does SeaTunnel handle changes in data sources (source) or data destinations (sink)? +When the structure of a data source or destination changes, SeaTunnel provides various mechanisms to adapt, such as automatically detecting and updating the schema or configuring data mapping rules. You can adjust the `schema_save_mode` or `data_save_mode` parameters to control how these changes are handled based on your needs. -Refer to: [lightbend/config#456](https://github.com/lightbend/config/issues/456). +For more details, refer to the answers on `schema_save_mode` and `data_save_mode` below. -## Is SeaTunnel supported in Azkaban, Oozie, DolphinScheduler? +## Does SeaTunnel support automatic table creation? +Before starting an integration task, you can select different handling schemes for existing table structures on the target side, controlled via the `schema_save_mode` parameter. 
Available options include: +- **`RECREATE_SCHEMA`**: Creates the table if it does not exist; if the table exists, it is deleted and recreated. +- **`CREATE_SCHEMA_WHEN_NOT_EXIST`**: Creates the table if it does not exist; skips creation if the table already exists. +- **`ERROR_WHEN_SCHEMA_NOT_EXIST`**: Throws an error if the table does not exist. +- **`IGNORE`**: Ignores table handling. + Many connectors currently support automatic table creation. Refer to the specific connector documentation, such as [Jdbc sink](https://seatunnel.apache.org/docs/2.3.8/connector-v2/sink/Jdbc#schema_save_mode-enum), for more information. Review Comment: ```suggestion Many connectors currently support automatic table creation. Refer to the specific connector documentation, such as [Jdbc sink](https://seatunnel.apache.org/docs/connector-v2/sink/Jdbc#schema_save_mode-enum), for more information. ``` ########## docs/en/faq.md: ########## @@ -1,332 +1,123 @@ -# FAQs +# FAQ -## Why should I install a computing engine like Spark or Flink? +## What data sources and destinations does SeaTunnel support? +SeaTunnel supports various data sources and destinations. You can find a detailed list on the following list: +- Supported data sources (Source): [Source List](https://seatunnel.apache.org/docs/connector-v2/source) +- Supported data destinations (Sink): [Sink List](https://seatunnel.apache.org/docs/connector-v2/sink) -SeaTunnel now uses computing engines such as Spark and Flink to complete resource scheduling and node communication, so we can focus on the ease of use of data synchronization and the development of high-performance components. But this is only temporary. +## Does SeaTunnel support batch and streaming processing? +SeaTunnel supports both batch and streaming processing modes. You can select the appropriate mode based on your specific business scenarios and needs. 
+## Does SeaTunnel support CDC from MySQL replicas? How are logs pulled? +Yes, SeaTunnel supports CDC from MySQL replicas by subscribing to binlog logs, which are then parsed on the SeaTunnel server. -When a configured text is very long and you want to wrap it, you can use three double quotes to indicate its start and end: +## Does SeaTunnel support CDC integration for tables without primary keys? +No, SeaTunnel does not support CDC integration for tables without primary keys. This is because, in cases where two identical records exist in the upstream and one is deleted or modified, the downstream cannot determine which record to delete or modify, leading to potential issues. Having primary keys is essential for ensuring data uniqueness, similar to identifying the real Monkey King in the classic "Journey to the West." Review Comment: ```suggestion SeaTunnel does not support CDC integration for tables without primary keys. The reason is that if two identical records exist in the upstream and one is deleted or modified, the downstream cannot determine which record to delete or modify, leading to potential issues. Primary keys are essential to ensure data uniqueness. ``` ########## docs/en/faq.md: ########## @@ -1,332 +1,123 @@ -# FAQs +# FAQ -## Why should I install a computing engine like Spark or Flink? +## What data sources and destinations does SeaTunnel support? +SeaTunnel supports various data sources and destinations. You can find a detailed list on the following list: +- Supported data sources (Source): [Source List](https://seatunnel.apache.org/docs/connector-v2/source) +- Supported data destinations (Sink): [Sink List](https://seatunnel.apache.org/docs/connector-v2/sink) -SeaTunnel now uses computing engines such as Spark and Flink to complete resource scheduling and node communication, so we can focus on the ease of use of data synchronization and the development of high-performance components. But this is only temporary. 
- -## How do I write a configuration item in multi-line text in the configuration file? +## Does SeaTunnel support CDC from MySQL replicas? How are logs pulled? +Yes, SeaTunnel supports CDC from MySQL replicas by subscribing to binlog logs, which are then parsed on the SeaTunnel server. -When a configured text is very long and you want to wrap it, you can use three double quotes to indicate its start and end: +## Does SeaTunnel support CDC integration for tables without primary keys? +No, SeaTunnel does not support CDC integration for tables without primary keys. This is because, in cases where two identical records exist in the upstream and one is deleted or modified, the downstream cannot determine which record to delete or modify, leading to potential issues. Having primary keys is essential for ensuring data uniqueness, similar to identifying the real Monkey King in the classic "Journey to the West." -``` -var = """ - whatever you want -""" -``` - -## How do I implement variable substitution for multi-line text? - -It is a little troublesome to do variable substitution in multi-line text, because the variable cannot be included in three double quotation marks: - -``` -var = """ -your string 1 -"""${you_var}""" your string 2""" -``` +## How does SeaTunnel handle changes in data sources (source) or data destinations (sink)? +When the structure of a data source or destination changes, SeaTunnel provides various mechanisms to adapt, such as automatically detecting and updating the schema or configuring data mapping rules. You can adjust the `schema_save_mode` or `data_save_mode` parameters to control how these changes are handled based on your needs. -Refer to: [lightbend/config#456](https://github.com/lightbend/config/issues/456). +For more details, refer to the answers on `schema_save_mode` and `data_save_mode` below. -## Is SeaTunnel supported in Azkaban, Oozie, DolphinScheduler? +## Does SeaTunnel support automatic table creation? 
+Before starting an integration task, you can select different handling schemes for existing table structures on the target side, controlled via the `schema_save_mode` parameter. Available options include: +- **`RECREATE_SCHEMA`**: Creates the table if it does not exist; if the table exists, it is deleted and recreated. +- **`CREATE_SCHEMA_WHEN_NOT_EXIST`**: Creates the table if it does not exist; skips creation if the table already exists. +- **`ERROR_WHEN_SCHEMA_NOT_EXIST`**: Throws an error if the table does not exist. +- **`IGNORE`**: Ignores table handling. + Many connectors currently support automatic table creation. Refer to the specific connector documentation, such as [Jdbc sink](https://seatunnel.apache.org/docs/2.3.8/connector-v2/sink/Jdbc#schema_save_mode-enum), for more information. -Of course! See the screenshot below: +## Does SeaTunnel support handling existing data before starting a data integration task? +Yes, you can specify different processing schemes for existing data on the target side before starting an integration task, controlled via the `data_save_mode` parameter. Available options include: +- **`DROP_DATA`**: Retains the database structure but deletes the data. +- **`APPEND_DATA`**: Retains both the database structure and data. +- **`CUSTOM_PROCESSING`**: User-defined processing. +- **`ERROR_WHEN_DATA_EXISTS`**: Throws an error if data already exists. + Many connectors support handling existing data; please refer to the respective connector documentation, such as [Jdbc sink](https://seatunnel.apache.org/docs/connector-v2/sink/Jdbc#data_save_mode-enum). - +## Does SeaTunnel support exactly-once consistency? +SeaTunnel supports exactly-once consistency for some data sources, such as MySQL and PostgreSQL, ensuring data consistency during integration. Note that exactly-once consistency depends on the capabilities of the underlying database. - +## Can SeaTunnel execute scheduled tasks? 
+You can use Linux cron jobs to achieve periodic data integration, or leverage scheduling tools like DolphinScheduler to manage complex scheduled tasks. -## Does SeaTunnel have a case for configuring multiple sources, such as configuring elasticsearch and hdfs in source at the same time? +## I encountered an issue with SeaTunnel that I cannot resolve. What should I do? +If you encounter issues with SeaTunnel, here are a few ways to get help: +1. Search the [Issue List](https://github.com/apache/seatunnel/issues) or [Mailing List](https://lists.apache.org/list.html?d...@seatunnel.apache.org) to see if someone else has faced a similar issue. +2. If you cannot find an answer, reach out to the community through [these methods](https://github.com/apache/seatunnel#contact-us). -``` -env { - ... -} +## How do I declare variables? +Would you like to declare a variable in SeaTunnel's configuration and dynamically replace it at runtime? This feature is commonly used in both scheduled and ad-hoc offline processing to replace time, date, or other variables. Here's an example: -source { - hdfs { ... } - elasticsearch { ... } - jdbc {...} -} +Define the variable in the configuration. For example, in an SQL transformation (the value in any "key = value" pair in the configuration file can be replaced with variables): +```plaintext +... transform { - ... -} - -sink { - elasticsearch { ... } -} -``` - -## Are there any HBase plugins? - -There is a HBase input plugin. You can download it from here: https://github.com/garyelephant/waterdrop-input-hbase . - -## How can I use SeaTunnel to write data to Hive? - -``` -env { - spark.sql.catalogImplementation = "hive" - spark.hadoop.hive.exec.dynamic.partition = "true" - spark.hadoop.hive.exec.dynamic.partition.mode = "nonstrict" -} - -source { - sql = "insert into ..." -} - -sink { - // The data has been written to hive through the sql source. This is just a placeholder, it does not actually work. 
- stdout { - limit = 1 - } -} -``` - -In addition, SeaTunnel has implemented a `Hive` output plugin after version `1.5.7` in `1.x` branch; in `2.x` branch. The Hive plugin for the Spark engine has been supported from version `2.0.5`: https://github.com/apache/seatunnel/issues/910. - -## How does SeaTunnel write multiple instances of ClickHouse to achieve load balancing? - -1. Write distributed tables directly (not recommended) - -2. Add a proxy or domain name (DNS) in front of multiple instances of ClickHouse: - - ``` - { - output { - clickhouse { - host = "ck-proxy.xx.xx:8123" - # Local table - table = "table_name" - } - } - } - ``` -3. Configure multiple instances in the configuration: - - ``` - { - output { - clickhouse { - host = "ck1:8123,ck2:8123,ck3:8123" - # Local table - table = "table_name" - } - } - } - ``` -4. Use cluster mode: - - ``` - { - output { - clickhouse { - # Configure only one host - host = "ck1:8123" - cluster = "clickhouse_cluster_name" - # Local table - table = "table_name" - } - } - } - ``` - -## How can I solve OOM when SeaTunnel consumes Kafka? - -In most cases, OOM is caused by not having a rate limit for consumption. The solution is as follows: - -For the current limit of Spark consumption of Kafka: - -1. Suppose the number of partitions of Kafka `Topic 1` you consume with KafkaStream = N. - -2. Assuming that the production speed of the message producer (Producer) of `Topic 1` is K messages/second, the speed of write messages to the partition must be uniform. - -3. Suppose that, after testing, it is found that the processing capacity of Spark Executor per core per second is M. - -The following conclusions can be drawn: - -1. If you want to make Spark's consumption of `Topic 1` keep up with its production speed, then you need `spark.executor.cores` * `spark.executor.instances` >= K / M - -2. 
When a data delay occurs, if you want the consumption speed not to be too fast, resulting in spark executor OOM, then you need to configure `spark.streaming.kafka.maxRatePerPartition` <= (`spark.executor.cores` * `spark.executor.instances`) * M / N - -3. In general, both M and N are determined, and the conclusion can be drawn from 2: The size of `spark.streaming.kafka.maxRatePerPartition` is positively correlated with the size of `spark.executor.cores` * `spark.executor.instances`, and it can be increased while increasing the resource `maxRatePerPartition` to speed up consumption. - - - -## How can I solve the Error `Exception in thread "main" java.lang.NoSuchFieldError: INSTANCE`? - -The reason is that the version of httpclient.jar that comes with the CDH version of Spark is lower, and The httpclient version that ClickHouse JDBC is based on is 4.5.2, and the package versions conflict. The solution is to replace the jar package that comes with CDH with the httpclient-4.5.2 version. - -## The default JDK of my Spark cluster is JDK7. After I install JDK8, how can I specify that SeaTunnel starts with JDK8? - -In SeaTunnel's config file, specify the following configuration: - -```shell -spark { - ... - spark.executorEnv.JAVA_HOME="/your/java_8_home/directory" - spark.yarn.appMasterEnv.JAVA_HOME="/your/java_8_home/directory" - ... + Sql { + query = "select * from user_view where city ='${city}' and dt = '${date}'" + } } +... ``` -## What should I do if OOM always appears when running SeaTunnel in Spark local[*] mode? - -If you run in local mode, you need to modify the `start-seatunnel.sh` startup script. After `spark-submit`, add a parameter `--driver-memory 4g` . Under normal circumstances, local mode is not used in the production environment. Therefore, this parameter generally does not need to be set during On YARN. See: [Application Properties](https://spark.apache.org/docs/latest/configuration.html#application-properties) for details. 
- -## Where can I place self-written plugins or third-party jdbc.jars to be loaded by SeaTunnel? - -Place the Jar package under the specified structure of the plugins directory: +To start SeaTunnel in Zeta Local mode with variables: ```bash -cd SeaTunnel -mkdir -p plugins/my_plugins/lib -cp third-part.jar plugins/my_plugins/lib +$SEATUNNEL_HOME/bin/seatunnel.sh \ +-c $SEATUNNEL_HOME/config/your_app.conf \ +-m local[2] \ +-i city=Singapore \ +-i date=20231110 ``` -`my_plugins` can be any string. - -## How do I configure logging-related parameters in SeaTunnel-V1(Spark)? - -There are three ways to configure logging-related parameters (such as Log Level): - -- [Not recommended] Change the default `$SPARK_HOME/conf/log4j.properties`. - - This will affect all programs submitted via `$SPARK_HOME/bin/spark-submit`. -- [Not recommended] Modify logging related parameters directly in the Spark code of SeaTunnel. - - This is equivalent to hardcoding, and each change needs to be recompiled. -- [Recommended] Use the following methods to change the logging configuration in the SeaTunnel configuration file (The change only takes effect if SeaTunnel >= 1.5.5 ): - - ``` - env { - spark.driver.extraJavaOptions = "-Dlog4j.configuration=file:<file path>/log4j.properties" - spark.executor.extraJavaOptions = "-Dlog4j.configuration=file:<file path>/log4j.properties" - } - source { - ... - } - transform { - ... - } - sink { - ... - } - ``` - -The contents of the log4j configuration file for reference are as follows: - -``` -$ cat log4j.properties -log4j.rootLogger=ERROR, console +Use the `-i` or `--variable` parameter with `key=value` to specify the variable's value, where `key` matches the variable name in the configuration. 
For details, see: [SeaTunnel Variable Configuration](https://seatunnel.apache.org/docs/concept/config) -# set the log level for these components -log4j.logger.org=ERROR -log4j.logger.org.apache.spark=ERROR -log4j.logger.org.spark-project=ERROR -log4j.logger.org.apache.hadoop=ERROR -log4j.logger.io.netty=ERROR -log4j.logger.org.apache.zookeeper=ERROR +## How can I write multi-line text in the configuration file? +If the text is long and needs to be wrapped, you can use triple quotes to indicate the beginning and end: -# add a ConsoleAppender to the logger stdout to write to the console -log4j.appender.console=org.apache.log4j.ConsoleAppender -log4j.appender.console.layout=org.apache.log4j.PatternLayout -# use a simple message format -log4j.appender.console.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n +```plaintext +var = """ +Apache SeaTunnel is a +next-generation high-performance, +distributed, massive data integration tool. +""" ``` -## How do I configure logging related parameters in SeaTunnel-V2(Spark, Flink)? - -Currently, they cannot be set directly. you need to modify the SeaTunnel startup script. The relevant parameters are specified in the task submission command. For specific parameters, please refer to the official documents: - -- Spark official documentation: http://spark.apache.org/docs/latest/configuration.html#configuring-logging -- Flink official documentation: https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/logging.html - -Reference: - -https://stackoverflow.com/questions/27781187/how-to-stop-info-messages-displaying-on-spark-console - -http://spark.apache.org/docs/latest/configuration.html#configuring-logging - -https://medium.com/@iacomini.riccardo/spark-logging-configuration-in-yarn-faf5ba5fdb01 - -## How do I configure logging related parameters of SeaTunnel-E2E Test? - -The log4j configuration file of `seatunnel-e2e` existed in `seatunnel-e2e/seatunnel-e2e-common/src/test/resources/log4j2.properties`. 
You can modify logging related parameters directly in the configuration file. - -For example, if you want to output more detailed logs of E2E Test, just downgrade `rootLogger.level` in the configuration file. - -## Error when writing to ClickHouse: ClassCastException - -In SeaTunnel, the data type will not be actively converted. After the Input reads the data, the corresponding -Schema. When writing ClickHouse, the field type needs to be strictly matched, and the mismatch needs to be resolved. - -Data conversion can be achieved through the following two plugins: +## How do I perform variable substitution in multi-line text? +Performing variable substitution in multi-line text can be tricky because variables cannot be enclosed within triple quotes: -1. Filter Convert plugin -2. Filter Sql plugin - -Detailed data type conversion reference: [ClickHouse Data Type Check List](https://interestinglab.github.io/seatunnel-docs/#/en/configuration/output-plugins/Clickhouse?id=clickhouse-data-type-check-list) - -Refer to issue:[#488](https://github.com/apache/seatunnel/issues/488) [#382](https://github.com/apache/seatunnel/issues/382). - -## How does SeaTunnel access kerberos-authenticated HDFS, YARN, Hive and other resources? - -Please refer to: [#590](https://github.com/apache/seatunnel/issues/590). - -## How do I troubleshoot NoClassDefFoundError, ClassNotFoundException and other issues? - -There is a high probability that there are multiple different versions of the corresponding Jar package class loaded in the Java classpath, because of the conflict of the load order, not because the Jar is really missing. Modify this SeaTunnel startup command, adding the following parameters to the spark-submit submission section, and debug in detail through the output log. - -``` -spark-submit --verbose - ... - --conf 'spark.driver.extraJavaOptions=-verbose:class' - --conf 'spark.executor.extraJavaOptions=-verbose:class' - ... 
+```plaintext +var = """ +your string 1 +"""${your_var}""" your string 2""" ``` -## I want to learn the source code of SeaTunnel. Where should I start? - -SeaTunnel has a completely abstract and structured code implementation, and many people have chosen SeaTunnel As a way to learn Spark. You can learn the source code from the main program entry: SeaTunnel.java - -## When SeaTunnel developers develop their own plugins, do they need to understand the SeaTunnel code? Should these plugins be integrated into the SeaTunnel project? - -The plugin developed by the developer has nothing to do with the SeaTunnel project and does not need to include your plugin code. +For more details, see: [lightbend/config#456](https://github.com/lightbend/config/issues/456). -The plugin can be completely independent from SeaTunnel project, so you can write it using Java, Scala, Maven, sbt, Gradle, or whatever you want. This is also the way we recommend developers to develop plugins. +## How do I configure logging parameters for SeaTunnel E2E Tests? +The log4j configuration file for `seatunnel-e2e` is located at `seatunnel-e2e/seatunnel-e2e-common/src/test/resources/log4j2.properties`. You can directly modify logging-related parameters in this configuration file. For example, to produce more detailed E2E Test logs, lower the `rootLogger.level` in the configuration file. Review Comment: ```suggestion ``` This question looks useless for users. ########## docs/en/faq.md: ########## @@ -1,332 +1,123 @@ -# FAQs +# FAQ -## Why should I install a computing engine like Spark or Flink? +## What data sources and destinations does SeaTunnel support? +SeaTunnel supports various data sources and destinations. 
You can find a detailed list on the following list: +- Supported data sources (Source): [Source List](https://seatunnel.apache.org/docs/connector-v2/source) +- Supported data destinations (Sink): [Sink List](https://seatunnel.apache.org/docs/connector-v2/sink) -SeaTunnel now uses computing engines such as Spark and Flink to complete resource scheduling and node communication, so we can focus on the ease of use of data synchronization and the development of high-performance components. But this is only temporary. +## Does SeaTunnel support batch and streaming processing? +SeaTunnel supports both batch and streaming processing modes. You can select the appropriate mode based on your specific business scenarios and needs. Batch processing is suitable for scheduled data integration tasks, while streaming processing is ideal for real-time integration and Change Data Capture (CDC). -## I have a question, and I cannot solve it by myself +## Is it necessary to install engines like Spark or Flink when using SeaTunnel? +Spark and Flink are not mandatory. SeaTunnel supports Zeta, Spark, and Flink as integration engines, allowing you to choose one based on your needs. The community highly recommends Zeta, a new generation high-performance integration engine specifically designed for integration scenarios. Zeta is affectionately called "Ultraman Zeta" by community users! The community offers extensive support for Zeta, making it the most feature-rich option. -I have encountered a problem when using SeaTunnel and I cannot solve it by myself. What should I do? First, search in [Issue List](https://github.com/apache/seatunnel/issues) or [Mailing List](https://lists.apache.org/list.html?d...@seatunnel.apache.org) to see if someone has already asked the same question and got an answer. If you cannot find an answer to your question, you can contact community members for help in [These Ways](https://github.com/apache/seatunnel#contact-us). 
+## What data transformation functions does SeaTunnel provide? +SeaTunnel supports multiple data transformation functions, including field mapping, data filtering, data format conversion, and more. You can implement data transformations through the `transform` module in the configuration file. For more details, refer to the SeaTunnel [Transform Documentation](https://seatunnel.apache.org/docs/transform-v2). -## How do I declare a variable? +## Can SeaTunnel support custom data cleansing rules? +Yes, SeaTunnel supports custom data cleansing rules. You can configure custom rules in the `transform` module, such as cleaning up dirty data, removing invalid records, or converting fields. -Do you want to know how to declare a variable in SeaTunnel's configuration, and then dynamically replace the value of the variable at runtime? +## Does SeaTunnel support real-time incremental integration? +SeaTunnel supports incremental data integration. For example, the CDC connector allows real-time capture of data changes, which is ideal for scenarios requiring real-time data integration. -Since `v1.2.4`, SeaTunnel supports variable substitution in the configuration. This feature is often used for timing or non-timing offline processing to replace variables such as time and date. The usage is as follows: +## What CDC data sources are currently supported by SeaTunnel? +SeaTunnel currently supports MongoDB CDC, MySQL CDC, OpenGauss CDC, Oracle CDC, PostgreSQL CDC, SQL Server CDC, TiDB CDC, and more. For more details, refer to the [Source List](https://seatunnel.apache.org/docs/connector-v2/source). -Configure the variable name in the configuration. Here is an example of sql transform (actually, anywhere in the configuration file the value in `'key = value'` can use the variable substitution): +## How do I enable permissions required for SeaTunnel CDC integration? 
+Please refer to the official SeaTunnel documentation for the necessary steps to enable permissions for each connector’s CDC functionality. -``` -... -transform { - sql { - query = "select * from user_view where city ='"${city}"' and dt = '"${date}"'" - } -} -... -``` - -Taking Spark Local mode as an example, the startup command is as follows: - -```bash -./bin/start-seatunnel-spark.sh \ --c ./config/your_app.conf \ --e client \ --m local[2] \ --i city=shanghai \ --i date=20190319 -``` - -You can use the parameter `-i` or `--variable` followed by `key=value` to specify the value of the variable, where the key needs to be same as the variable name in the configuration. - -## How do I write a configuration item in multi-line text in the configuration file? +## Does SeaTunnel support CDC from MySQL replicas? How are logs pulled? +Yes, SeaTunnel supports CDC from MySQL replicas by subscribing to binlog logs, which are then parsed on the SeaTunnel server. -When a configured text is very long and you want to wrap it, you can use three double quotes to indicate its start and end: +## Does SeaTunnel support CDC integration for tables without primary keys? +No, SeaTunnel does not support CDC integration for tables without primary keys. This is because, in cases where two identical records exist in the upstream and one is deleted or modified, the downstream cannot determine which record to delete or modify, leading to potential issues. Having primary keys is essential for ensuring data uniqueness, similar to identifying the real Monkey King in the classic "Journey to the West." -``` -var = """ - whatever you want -""" -``` - -## How do I implement variable substitution for multi-line text? 
- -It is a little troublesome to do variable substitution in multi-line text, because the variable cannot be included in three double quotation marks: - -``` -var = """ -your string 1 -"""${you_var}""" your string 2""" -``` +## How does SeaTunnel handle changes in data sources (source) or data destinations (sink)? Review Comment: What does `changes` mean here?