This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push: new f48cfad Updating configs and fixing main readme to add steps for updating configs (#4255) f48cfad is described below commit f48cfad7bb9685b336fdc3e7305d7f46ddc29059 Author: Sivabalan Narayanan <sivab...@uber.com> AuthorDate: Wed Dec 8 16:21:46 2021 -0500 Updating configs and fixing main readme to add steps for updating configs (#4255) --- README.md | 3 + hudi-utils/README.md | 16 +- hudi-utils/pom.xml | 2 +- website/docs/configurations.md | 1825 +++++++++++++++++++++++++--------------- 4 files changed, 1144 insertions(+), 702 deletions(-) diff --git a/README.md b/README.md index 047650c..3b6cbd5 100644 --- a/README.md +++ b/README.md @@ -149,6 +149,9 @@ You can update multiple docs versions at the same time because each directory in Example: When you change any file in `versioned_docs/version-0.7.0/`, it will only affect the docs for version `0.7.0`. +## Configs +Configs can be automatically updated by following these steps documented at ../hudi-utils/README.md + ## Maintainer Apache Hudi Community diff --git a/hudi-utils/README.md b/hudi-utils/README.md index 21fa4cf..64e29f1 100644 --- a/hudi-utils/README.md +++ b/hudi-utils/README.md @@ -15,9 +15,21 @@ limitations under the License. --> +Execute these from hudi-utils dir <br/> +Ensure you have hudi artifacts from latest master installed <br/> +If not, execute `mvn install -DskipTests` in your hudi repo <br/> + +```shell mvn clean mvn install +``` +Set the appropriate SNAPSHOT version and execute the below commands +```shell +VERSION=0.11.0 + +java -cp target/hudi-utils-1.0-SNAPSHOT-jar-with-dependencies.jar:$HOME/.m2/repository/org/apache/hudi/hudi-utilities-bundle_2.11/$VERSION-SNAPSHOT/hudi-utilities-bundle_2.11-$VERSION-SNAPSHOT.jar:$HOME/.m2/repository/org/apache/hudi/hudi-spark-bundle_2.11/$VERSION-SNAPSHOT/hudi-spark-bundle_2.11-$VERSION-SNAPSHOT.jar:$HOME/.m2/repository/org/apache/hudi/hudi-flink-bundle_2.11/$VERSION-SNAPSHOT/hudi-flink-bundle_2.11-$VERSION-SNAPSHOT.jar:$HOME/.m2/repository/org/apache/hudi/hudi-kafka-c [...] -java -cp target/hudi-utils-1.0-SNAPSHOT-jar-with-dependencies.jar:$HOME/.m2/repository/org/apache/hudi/hudi-utilities-bundle_2.11/0.10.0-SNAPSHOT/hudi-utilities-bundle_2.11-0.10.0-SNAPSHOT.jar:$HOME/.m2/repository/org/apache/hudi/hudi-spark-bundle_2.11/0.10.0-SNAPSHOT/hudi-spark-bundle_2.11-0.10.0-SNAPSHOT.jar:$HOME/.m2/repository/org/apache/hudi/hudi-flink-bundle_2.11/0.10.0-SNAPSHOT/hudi-flink-bundle_2.11-0.10.0-SNAPSHOT.jar org.apache.hudi.utils.HoodieConfigDocGenerator +cp /tmp/configurations.md ../website/docs/configurations.md +``` -cp /tmp/configurations.md $HUDI-DIR/website/docs/configurations.md +Once complete, please put up a patch with latest configurations. 
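For reference, the regeneration workflow above can be scripted end to end roughly as follows. This is only a sketch: `HUDI_REPO` and `SITE_REPO` are placeholder paths (not defined in this repo), and the generator classpath is abbreviated to the bundle jars spelled out in the command above.

```shell
# 1. Install the latest master Hudi artifacts into ~/.m2 (prerequisite)
cd "$HUDI_REPO" && mvn install -DskipTests

# 2. Build the doc generator utility in this repo
cd "$SITE_REPO/hudi-utils" && mvn clean install

# 3. Run the generator against the installed SNAPSHOT bundles
VERSION=0.11.0
java -cp "target/hudi-utils-1.0-SNAPSHOT-jar-with-dependencies.jar:<hudi bundle jars from ~/.m2 as listed above>" \
  org.apache.hudi.utils.HoodieConfigDocGenerator

# 4. Copy the generated page into the website sources, then open a patch
cp /tmp/configurations.md "$SITE_REPO/website/docs/configurations.md"
```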
\ No newline at end of file diff --git a/hudi-utils/pom.xml b/hudi-utils/pom.xml index 97b5e4e..01c319a 100644 --- a/hudi-utils/pom.xml +++ b/hudi-utils/pom.xml @@ -27,7 +27,7 @@ <properties> <jdk.version>1.8</jdk.version> - <hudi.version>0.10.0-SNAPSHOT</hudi.version> + <hudi.version>0.11.0-SNAPSHOT</hudi.version> <scala.binary.version>2.11</scala.binary.version> <hudi.spark.module>hudi-spark2</hudi.spark.module> <scala.binary.version>2.11</scala.binary.version> diff --git a/website/docs/configurations.md b/website/docs/configurations.md index d4eac08..9cad2d9 100644 --- a/website/docs/configurations.md +++ b/website/docs/configurations.md @@ -4,7 +4,7 @@ keywords: [ configurations, default, flink options, spark, configs, parameters ] permalink: /docs/configurations.html summary: This page covers the different ways of configuring your job to write/read Hudi tables. At a high level, you can control behaviour at few levels. toc: true -last_modified_at: 2021-08-30T20:08:15.950513 +last_modified_at: 2021-12-08T09:59:32.441 --- This page covers the different ways of configuring your job to write/read Hudi tables. At a high level, you can control behaviour at few levels. @@ -14,7 +14,8 @@ This page covers the different ways of configuring your job to write/read Hudi t - [**Write Client Configs**](#WRITE_CLIENT): Internally, the Hudi datasource uses a RDD based HoodieWriteClient API to actually perform writes to storage. These configs provide deep control over lower level aspects like file sizing, compression, parallelism, compaction, write schema, cleaning etc. Although Hudi provides sane defaults, from time-time these configs may need to be tweaked to optimize for specific workloads. - [**Metrics Configs**](#METRICS): These set of configs are used to enable monitoring and reporting of keyHudi stats and metrics. - [**Record Payload Config**](#RECORD_PAYLOAD): This is the lowest level of customization offered by Hudi. Record payloads define how to produce new values to upsert based on incoming new record and stored old record. Hudi provides default implementations such as OverwriteWithLatestAvroPayload which simply update table with the latest/last-written record. This can be overridden to a custom class extending HoodieRecordPayload class, on both datasource and WriteClient levels. -- [**Environment Config**](#ENVIRONMENT_CONFIG): Instead of directly passing configurations to Hudi jobs, since 0.10.0, Hudi also supports configurations via a configuration file `hudi-default.conf` in which each line consists of a key and a value separated by whitespace or = sign. +- [**Kafka Connect Configs**](#KAFKA_CONNECT): These set of configs are used for Kafka Connect Sink Connector for writing Hudi Tables +- [**Amazon Web Services Configs**](#AWS): Please fill in the description for Config Group Name: Amazon Web Services Configs ## Spark Datasource Configs {#SPARK_DATASOURCE} These configs control the Hudi Spark Datasource, providing ability to define keys/partitioning, pick out the write operation, specify how to merge records or choosing query type to read. @@ -81,6 +82,14 @@ Options useful for reading tables via `read.format.option(...)` --- +> #### hoodie.enable.data.skipping +> enable data skipping to boost query after doing z-order optimize for current table<br></br> +> **Default Value**: true (Optional)<br></br> +> `Config Param: ENABLE_DATA_SKIPPING`<br></br> +> `Since Version: 0.10.0`<br></br> + +--- + > #### as.of.instant > The query instant for time travel. 
Without specified this option, we query > the latest snapshot.<br></br> > **Default Value**: N/A (Required)<br></br> @@ -168,14 +177,14 @@ the dot notation eg: `a.b.c`<br></br> --- > #### hoodie.datasource.hive_sync.partition_extractor_class -> <br></br> +> Class which implements PartitionValueExtractor to extract the partition values, default 'SlashEncodedDayPartitionValueExtractor'.<br></br> > **Default Value**: > org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor > (Optional)<br></br> > `Config Param: HIVE_PARTITION_EXTRACTOR_CLASS`<br></br> --- > #### hoodie.datasource.hive_sync.serde_properties -> <br></br> +> Serde properties to hive table.<br></br> > **Default Value**: N/A (Required)<br></br> > `Config Param: HIVE_TABLE_SERDE_PROPERTIES`<br></br> @@ -247,7 +256,7 @@ the dot notation eg: `a.b.c`<br></br> > #### hoodie.datasource.write.partitionpath.field > Partition path field. Value to be used at the partitionPath component of > HoodieKey. Actual value ontained by invoking .toString()<br></br> -> **Default Value**: partitionpath (Optional)<br></br> +> **Default Value**: N/A (Required)<br></br> > `Config Param: PARTITIONPATH_FIELD`<br></br> --- @@ -260,7 +269,7 @@ the dot notation eg: `a.b.c`<br></br> --- > #### hoodie.datasource.hive_sync.partition_fields -> field in the table to use for determining hive partition columns.<br></br> +> Field in the table to use for determining hive partition columns.<br></br> > **Default Value**: (Optional)<br></br> > `Config Param: HIVE_PARTITION_FIELDS`<br></br> @@ -401,7 +410,7 @@ the dot notation eg: `a.b.c`<br></br> --- > #### hoodie.datasource.hive_sync.use_pre_apache_input_format -> <br></br> +> Flag to choose InputFormat under com.uber.hoodie package instead of org.apache.hudi package. Use this when you are in the process of migrating from com.uber.hoodie to org.apache.hudi. Stop using this after you migrated the table definition to org.apache.hudi input format<br></br> > **Default Value**: false (Optional)<br></br> > `Config Param: HIVE_USE_PRE_APACHE_INPUT_FORMAT`<br></br> @@ -547,8 +556,8 @@ Default is 'num_commits'<br></br> --- > #### index.state.ttl -> Index state ttl in days, default 1.5 day<br></br> -> **Default Value**: 1.5 (Optional)<br></br> +> Index state ttl in days, default stores the index permanently<br></br> +> **Default Value**: 0.0 (Optional)<br></br> > `Config Param: INDEX_STATE_TTL`<br></br> --- @@ -577,7 +586,7 @@ Disabled by default for backward compatibility.<br></br> > #### metadata.compaction.delta_commits > Max delta commits for metadata table to trigger compaction, default > 24<br></br> -> **Default Value**: 24 (Optional)<br></br> +> **Default Value**: 10 (Optional)<br></br> > `Config Param: METADATA_COMPACTION_DELTA_COMMITS`<br></br> --- @@ -596,6 +605,13 @@ Disabled by default for backward compatibility.<br></br> --- +> #### write.parquet.block.size +> Parquet RowGroup size. 
It's recommended to make this large enough that scan costs can be amortized by packing enough column values into a single row group.<br></br> +> **Default Value**: 120 (Optional)<br></br> +> `Config Param: WRITE_PARQUET_BLOCK_SIZE`<br></br> + +--- + > #### hive_sync.table > Table name for hive sync, default 'unknown'<br></br> > **Default Value**: unknown (Optional)<br></br> @@ -612,8 +628,8 @@ This will render any value set for the option in-effective<br></br> --- > #### compaction.tasks -> Parallelism of tasks that do actual compaction, default is 10<br></br> -> **Default Value**: 10 (Optional)<br></br> +> Parallelism of tasks that do actual compaction, default is 4<br></br> +> **Default Value**: 4 (Optional)<br></br> > `Config Param: COMPACTION_TASKS`<br></br> --- @@ -641,6 +657,13 @@ By default false (the names of partition folders are only partition values)<br>< --- +> #### compaction.timeout.seconds +> Max timeout time in seconds for online compaction to rollback, default 20 minutes<br></br> +> **Default Value**: 1200 (Optional)<br></br> +> `Config Param: COMPACTION_TIMEOUT_SECONDS`<br></br> + +--- + > #### hive_sync.username > Username for hive sync, default 'hive'<br></br> > **Default Value**: hive (Optional)<br></br> @@ -712,6 +735,20 @@ By default 3<br></br> --- +> #### write.parquet.max.file.size +> Target size for parquet files produced by Hudi write phases. For DFS, this needs to be aligned with the underlying filesystem block size for optimal performance.<br></br> +> **Default Value**: 120 (Optional)<br></br> +> `Config Param: WRITE_PARQUET_MAX_FILE_SIZE`<br></br> + +--- + +> #### read.end-commit +> End commit instant for reading, the commit time format should be 'yyyyMMddHHmmss'<br></br> +> **Default Value**: N/A (Required)<br></br> +> `Config Param: READ_END_COMMIT`<br></br> + +--- + > #### write.log.max.size > Maximum size allowed in MB for a log file before it is rolled over to the > next version, default 1GB<br></br> > **Default Value**: 1024 (Optional)<br></br> @@ -741,6 +778,15 @@ By default 2000 and it will be doubled by every retry<br></br> --- +> #### write.partition.format +> Partition path format, only valid when 'write.datetime.partitioning' is true, default is: +1) 'yyyyMMddHH' for timestamp(3) WITHOUT TIME ZONE, LONG, FLOAT, DOUBLE, DECIMAL; +2) 'yyyyMMdd' for DAY and INT.<br></br> +> **Default Value**: N/A (Required)<br></br> +> `Config Param: PARTITION_FORMAT`<br></br> + +--- + > #### hive_sync.db > Database name for hive sync, default 'default'<br></br> > **Default Value**: default (Optional)<br></br> @@ -783,9 +829,26 @@ By default 2000 and it will be doubled by every retry<br></br> --- +> #### read.start-commit +> Start commit instant for reading, the commit time format should be 'yyyyMMddHHmmss', by default reading from the latest instant for streaming read<br></br> +> **Default Value**: N/A (Required)<br></br> +> `Config Param: READ_START_COMMIT`<br></br> + +--- + +> #### write.precombine +> Flag to indicate whether to drop duplicates before insert/upsert. 
+By default these cases will accept duplicates, to gain extra performance: +1) insert operation; +2) upsert for MOR table, the MOR table deduplicate on reading<br></br> +> **Default Value**: false (Optional)<br></br> +> `Config Param: PRE_COMBINE`<br></br> + +--- + > #### write.batch.size -> Batch buffer size in MB to flush data into the underneath filesystem, default 64MB<br></br> -> **Default Value**: 64.0 (Optional)<br></br> +> Batch buffer size in MB to flush data into the underneath filesystem, default 256MB<br></br> +> **Default Value**: 256.0 (Optional)<br></br> > `Config Param: WRITE_BATCH_SIZE`<br></br> --- @@ -806,24 +869,9 @@ By default 2000 and it will be doubled by every retry<br></br> > #### index.global.enabled > Whether to update index for the old partition path -if same key record with different partition path came in, default false<br></br> -> **Default Value**: false (Optional)<br></br> -> `Config Param: INDEX_GLOBAL_ENABLED`<br></br> - ---- - -> #### write.insert.drop.duplicates -> Flag to indicate whether to drop duplicates upon insert. -By default insert will accept duplicates, to gain extra performance<br></br> -> **Default Value**: false (Optional)<br></br> -> `Config Param: INSERT_DROP_DUPS`<br></br> - ---- - -> #### write.insert.deduplicate -> Whether to deduplicate for INSERT operation, if disabled, writes the base files directly, default true<br></br> +if same key record with different partition path came in, default true<br></br> > **Default Value**: true (Optional)<br></br> -> `Config Param: INSERT_DEDUP`<br></br> +> `Config Param: INDEX_GLOBAL_ENABLED`<br></br> --- @@ -834,13 +882,6 @@ By default insert will accept duplicates, to gain extra performance<br></br> --- -> #### read.streaming.start-commit -> Start commit instant for streaming read, the commit time format should be 'yyyyMMddHHmmss', by default reading from the latest instant<br></br> -> **Default Value**: N/A (Required)<br></br> -> `Config Param: READ_STREAMING_START_COMMIT`<br></br> - ---- - > #### hoodie.table.name > Table name to register to Hive metastore<br></br> > **Default Value**: N/A (Required)<br></br> @@ -864,6 +905,16 @@ otherwise a Hoodie table expects to be initialized successfully<br></br> --- +> #### read.streaming.skip_compaction +> Whether to skip compaction instants for streaming read, +there are two cases that this option can be used to avoid reading duplicates: +1) you are definitely sure that the consumer reads faster than any compaction instants, usually with delta time compaction strategy that is long enough, for e.g, one week; +2) changelog mode is enabled, this option is a solution to keep data integrity<br></br> +> **Default Value**: false (Optional)<br></br> +> `Config Param: READ_STREAMING_SKIP_COMPACT`<br></br> + +--- + > #### hoodie.datasource.write.partitionpath.urlencode > Whether to encode the partition path url, default false<br></br> > **Default Value**: false (Optional)<br></br> @@ -917,7 +968,7 @@ Actual value obtained by invoking .toString(), default ''<br></br> --- > #### write.bucket_assign.tasks -> Parallelism of tasks that do bucket assign, default is 4<br></br> +> Parallelism of tasks that do bucket assign, default is the parallelism of the execution environment<br></br> > **Default Value**: N/A (Required)<br></br> > `Config Param: BUCKET_ASSIGN_TASKS`<br></br> @@ -937,9 +988,16 @@ Actual value obtained by invoking .toString(), default ''<br></br> --- +> #### write.insert.cluster +> Whether to merge small files for insert mode, if true, the write 
throughput will decrease because the read/write of existing small file, only valid for COW table, default false<br></br> +> **Default Value**: false (Optional)<br></br> +> `Config Param: INSERT_CLUSTER`<br></br> + +--- + > #### partition.default_name > The default partition name in case the dynamic partition column value is > null/empty string<br></br> -> **Default Value**: __DEFAULT_PARTITION__ (Optional)<br></br> +> **Default Value**: default (Optional)<br></br> > `Config Param: PARTITION_DEFAULT_NAME`<br></br> --- @@ -952,8 +1010,8 @@ Actual value obtained by invoking .toString(), default ''<br></br> --- > #### compaction.target_io -> Target IO per compaction (both read and write), default 5 GB<br></br> -> **Default Value**: 5120 (Optional)<br></br> +> Target IO per compaction (both read and write), default 500 GB<br></br> +> **Default Value**: 512000 (Optional)<br></br> > `Config Param: COMPACTION_TARGET_IO`<br></br> --- @@ -1029,7 +1087,7 @@ determined by Object.compareTo(..)<br></br> --- > #### write.index_bootstrap.tasks -> Parallelism of tasks that do index bootstrap, default is 4<br></br> +> Parallelism of tasks that do index bootstrap, default is the parallelism of the execution environment<br></br> > **Default Value**: N/A (Required)<br></br> > `Config Param: INDEX_BOOTSTRAP_TASKS`<br></br> @@ -1051,6 +1109,13 @@ Actual value will be obtained by invoking .toString() on the field value. Nested --- +> #### write.parquet.page.size +> Parquet page size. Page is the unit of read within a parquet file. Within a block, pages are compressed separately.<br></br> +> **Default Value**: 1 (Optional)<br></br> +> `Config Param: WRITE_PARQUET_PAGE_SIZE`<br></br> + +--- + > #### compaction.delta_seconds > Max delta seconds time needed to trigger compaction, default 1 hour<br></br> > **Default Value**: 3600 (Optional)<br></br> @@ -1082,178 +1147,658 @@ Actual value will be obtained by invoking .toString() on the field value. Nested ## Write Client Configs {#WRITE_CLIENT} Internally, the Hudi datasource uses a RDD based HoodieWriteClient API to actually perform writes to storage. These configs provide deep control over lower level aspects like file sizing, compression, parallelism, compaction, write schema, cleaning etc. Although Hudi provides sane defaults, from time-time these configs may need to be tweaked to optimize for specific workloads. -### Consistency Guard Configurations {#Consistency-Guard-Configurations} +### Write commit callback configs {#Write-commit-callback-configs} -The consistency guard related config options, to help talk to eventually consistent object storage.(Tip: S3 is NOT eventually consistent anymore!) +Controls callback behavior into HTTP endpoints, to push notifications on commits on hudi tables. -`Config Class`: org.apache.hudi.common.fs.ConsistencyGuardConfig<br></br> -> #### hoodie.optimistic.consistency.guard.sleep_time_ms -> Amount of time (in ms), to wait after which we assume storage is consistent.<br></br> -> **Default Value**: 500 (Optional)<br></br> -> `Config Param: OPTIMISTIC_CONSISTENCY_GUARD_SLEEP_TIME_MS`<br></br> +`Config Class`: org.apache.hudi.config.HoodieWriteCommitCallbackConfig<br></br> +> #### hoodie.write.commit.callback.on +> Turn commit callback on/off. 
off by default.<br></br> +> **Default Value**: false (Optional)<br></br> +> `Config Param: TURN_CALLBACK_ON`<br></br> > `Since Version: 0.6.0`<br></br> --- -> #### hoodie.consistency.check.max_interval_ms -> Maximum amount of time (in ms), to wait for consistency checking.<br></br> -> **Default Value**: 20000 (Optional)<br></br> -> `Config Param: MAX_CHECK_INTERVAL_MS`<br></br> -> `Since Version: 0.5.0`<br></br> -> `Deprecated Version: 0.7.0`<br></br> - ---- - -> #### _hoodie.optimistic.consistency.guard.enable -> Enable consistency guard, which optimistically assumes consistency is achieved after a certain time period.<br></br> -> **Default Value**: false (Optional)<br></br> -> `Config Param: OPTIMISTIC_CONSISTENCY_GUARD_ENABLE`<br></br> +> #### hoodie.write.commit.callback.http.url +> Callback host to be sent along with callback messages<br></br> +> **Default Value**: N/A (Required)<br></br> +> `Config Param: CALLBACK_HTTP_URL`<br></br> > `Since Version: 0.6.0`<br></br> --- -> #### hoodie.consistency.check.enabled -> Enabled to handle S3 eventual consistency issue. This property is no longer required since S3 is now strongly consistent. Will be removed in the future releases.<br></br> -> **Default Value**: false (Optional)<br></br> -> `Config Param: ENABLE`<br></br> -> `Since Version: 0.5.0`<br></br> -> `Deprecated Version: 0.7.0`<br></br> +> #### hoodie.write.commit.callback.http.timeout.seconds +> Callback timeout in seconds. 3 by default<br></br> +> **Default Value**: 3 (Optional)<br></br> +> `Config Param: CALLBACK_HTTP_TIMEOUT_IN_SECONDS`<br></br> +> `Since Version: 0.6.0`<br></br> --- -> #### hoodie.consistency.check.max_checks -> Maximum number of consistency checks to perform, with exponential backoff.<br></br> -> **Default Value**: 6 (Optional)<br></br> -> `Config Param: MAX_CHECKS`<br></br> -> `Since Version: 0.5.0`<br></br> -> `Deprecated Version: 0.7.0`<br></br> +> #### hoodie.write.commit.callback.class +> Full path of callback class and must be a subclass of HoodieWriteCommitCallback class, org.apache.hudi.callback.impl.HoodieWriteCommitHttpCallback by default<br></br> +> **Default Value**: org.apache.hudi.callback.impl.HoodieWriteCommitHttpCallback (Optional)<br></br> +> `Config Param: CALLBACK_CLASS_NAME`<br></br> +> `Since Version: 0.6.0`<br></br> --- -> #### hoodie.consistency.check.initial_interval_ms -> Amount of time (in ms) to wait, before checking for consistency after an operation on storage.<br></br> -> **Default Value**: 400 (Optional)<br></br> -> `Config Param: INITIAL_CHECK_INTERVAL_MS`<br></br> -> `Since Version: 0.5.0`<br></br> -> `Deprecated Version: 0.7.0`<br></br> +> #### hoodie.write.commit.callback.http.api.key +> Http callback API key. hudi_write_commit_http_callback by default<br></br> +> **Default Value**: hudi_write_commit_http_callback (Optional)<br></br> +> `Config Param: CALLBACK_HTTP_API_KEY_VALUE`<br></br> +> `Since Version: 0.6.0`<br></br> --- -### Write Configurations {#Write-Configurations} +### Table Configurations {#Table-Configurations} -Configurations that control write behavior on Hudi tables. These can be directly passed down from even higher level frameworks (e.g Spark datasources, Flink sink) and utilities (e.g DeltaStreamer). +Configurations that persist across writes and read on a Hudi table like base, log file formats, table name, creation schema, table version layouts. 
Configurations are loaded from hoodie.properties, these properties are usually set during initializing a path as hoodie base path and rarely changes during the lifetime of the table. Writers/Queries' configurations are validated against these each time for compatibility. -`Config Class`: org.apache.hudi.config.HoodieWriteConfig<br></br> -> #### hoodie.combine.before.upsert -> When upserted records share same key, controls whether they should be first combined (i.e de-duplicated) before writing to storage. This should be turned off only if you are absolutely certain that there are no duplicates incoming, otherwise it can lead to duplicate keys and violate the uniqueness guarantees.<br></br> +`Config Class`: org.apache.hudi.common.table.HoodieTableConfig<br></br> +> #### hoodie.bootstrap.index.enable +> Whether or not, this is a bootstrapped table, with bootstrap base data and an mapping index defined, default true.<br></br> > **Default Value**: true (Optional)<br></br> -> `Config Param: COMBINE_BEFORE_UPSERT`<br></br> +> `Config Param: BOOTSTRAP_INDEX_ENABLE`<br></br> --- -> #### hoodie.write.markers.type -> Marker type to use. Two modes are supported: - DIRECT: individual marker file corresponding to each data file is directly created by the writer. - TIMELINE_SERVER_BASED: marker operations are all handled at the timeline service which serves as a proxy. New marker entries are batch processed and stored in a limited number of underlying files for efficiency. Note: timeline based markers are not yet supported for HDFS <br></br> -> **Default Value**: TIMELINE_SERVER_BASED (Optional)<br></br> -> `Config Param: MARKERS_TYPE`<br></br> -> `Since Version: 0.9.0`<br></br> +> #### hoodie.table.precombine.field +> Field used in preCombining before actual write. By default, when two records have the same key value, the largest value for the precombine field determined by Object.compareTo(..), is picked.<br></br> +> **Default Value**: N/A (Required)<br></br> +> `Config Param: PRECOMBINE_FIELD`<br></br> --- -> #### hoodie.consistency.check.max_interval_ms -> Max time to wait between successive attempts at performing consistency checks<br></br> -> **Default Value**: 300000 (Optional)<br></br> -> `Config Param: MAX_CONSISTENCY_CHECK_INTERVAL_MS`<br></br> +> #### hoodie.table.partition.fields +> Fields used to partition the table. Concatenated values of these fields are used as the partition path, by invoking toString()<br></br> +> **Default Value**: N/A (Required)<br></br> +> `Config Param: PARTITION_FIELDS`<br></br> --- -> #### hoodie.embed.timeline.server.port -> Port at which the timeline server listens for requests. When running embedded in each writer, it picks a free port and communicates to all the executors. This should rarely be changed.<br></br> -> **Default Value**: 0 (Optional)<br></br> -> `Config Param: EMBEDDED_TIMELINE_SERVER_PORT_NUM`<br></br> +> #### hoodie.populate.meta.fields +> When enabled, populates all meta fields. When disabled, no meta fields are populated and incremental queries will not be functional. 
This is only meant to be used for append only/immutable data for batch processing<br></br> +> **Default Value**: true (Optional)<br></br> +> `Config Param: POPULATE_META_FIELDS`<br></br> --- -> #### hoodie.write.meta.key.prefixes -> Comma separated metadata key prefixes to override from latest commit during overlapping commits via multi writing<br></br> -> **Default Value**: (Optional)<br></br> -> `Config Param: WRITE_META_KEY_PREFIXES`<br></br> +> #### hoodie.compaction.payload.class +> Payload class to use for performing compactions, i.e merge delta logs with current base file and then produce a new base file.<br></br> +> **Default Value**: org.apache.hudi.common.model.OverwriteWithLatestAvroPayload (Optional)<br></br> +> `Config Param: PAYLOAD_CLASS_NAME`<br></br> --- -> #### hoodie.table.base.file.format -> <br></br> -> **Default Value**: PARQUET (Optional)<br></br> -> `Config Param: BASE_FILE_FORMAT`<br></br> +> #### hoodie.archivelog.folder +> path under the meta folder, to store archived timeline instants at.<br></br> +> **Default Value**: archived (Optional)<br></br> +> `Config Param: ARCHIVELOG_FOLDER`<br></br> --- -> #### hoodie.avro.schema.validate -> Validate the schema used for the write against the latest schema, for backwards compatibility.<br></br> -> **Default Value**: false (Optional)<br></br> -> `Config Param: AVRO_SCHEMA_VALIDATE_ENABLE`<br></br> +> #### hoodie.bootstrap.index.class +> Implementation to use, for mapping base files to bootstrap base file, that contain actual data.<br></br> +> **Default Value**: org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex (Optional)<br></br> +> `Config Param: BOOTSTRAP_INDEX_CLASS_NAME`<br></br> --- -> #### hoodie.write.buffer.limit.bytes -> Size of in-memory buffer used for parallelizing network reads and lake storage writes.<br></br> -> **Default Value**: 4194304 (Optional)<br></br> -> `Config Param: WRITE_BUFFER_LIMIT_BYTES_VALUE`<br></br> +> #### hoodie.table.type +> The table type for the underlying data, for this write. This can’t change between writes.<br></br> +> **Default Value**: COPY_ON_WRITE (Optional)<br></br> +> `Config Param: TYPE`<br></br> --- -> #### hoodie.insert.shuffle.parallelism -> Parallelism for inserting records into the table. Inserts can shuffle data before writing to tune file sizes and optimize the storage layout.<br></br> -> **Default Value**: 1500 (Optional)<br></br> -> `Config Param: INSERT_PARALLELISM_VALUE`<br></br> +> #### hoodie.datasource.write.partitionpath.urlencode +> Should we url encode the partition path value, before creating the folder structure.<br></br> +> **Default Value**: false (Optional)<br></br> +> `Config Param: URL_ENCODE_PARTITIONING`<br></br> --- -> #### hoodie.embed.timeline.server.async -> Controls whether or not, the requests to the timeline server are processed in asynchronous fashion, potentially improving throughput.<br></br> +> #### hoodie.datasource.write.hive_style_partitioning +> Flag to indicate whether to use Hive style partitioning. +If set true, the names of partition folders follow <partition_column_name>=<partition_value> format. +By default false (the names of partition folders are only partition values)<br></br> > **Default Value**: false (Optional)<br></br> -> `Config Param: EMBEDDED_TIMELINE_SERVER_USE_ASYNC_ENABLE`<br></br> +> `Config Param: HIVE_STYLE_PARTITIONING_ENABLE`<br></br> --- -> #### hoodie.rollback.parallelism -> Parallelism for rollback of commits. 
Rollbacks perform delete of files or logging delete blocks to file groups on storage in parallel.<br></br> -> **Default Value**: 100 (Optional)<br></br> -> `Config Param: ROLLBACK_PARALLELISM_VALUE`<br></br> +> #### hoodie.table.keygenerator.class +> Key Generator class property for the hoodie table<br></br> +> **Default Value**: N/A (Required)<br></br> +> `Config Param: KEY_GENERATOR_CLASS_NAME`<br></br> --- -> #### hoodie.write.status.storage.level -> Write status objects hold metadata about a write (stats, errors), that is not yet committed to storage. This controls the how that information is cached for inspection by clients. We rarely expect this to be changed.<br></br> -> **Default Value**: MEMORY_AND_DISK_SER (Optional)<br></br> -> `Config Param: WRITE_STATUS_STORAGE_LEVEL_VALUE`<br></br> +> #### hoodie.table.version +> Version of table, used for running upgrade/downgrade steps between releases with potentially breaking/backwards compatible changes.<br></br> +> **Default Value**: ZERO (Optional)<br></br> +> `Config Param: VERSION`<br></br> --- -> #### hoodie.writestatus.class -> Subclass of org.apache.hudi.client.WriteStatus to be used to collect information about a write. Can be overridden to collection additional metrics/statistics about the data if needed.<br></br> -> **Default Value**: org.apache.hudi.client.WriteStatus (Optional)<br></br> -> `Config Param: WRITE_STATUS_CLASS_NAME`<br></br> +> #### hoodie.table.base.file.format +> Base file format to store all the base file data.<br></br> +> **Default Value**: PARQUET (Optional)<br></br> +> `Config Param: BASE_FILE_FORMAT`<br></br> --- -> #### hoodie.base.path -> Base path on lake storage, under which all the table data is stored. Always prefix it explicitly with the storage scheme (e.g hdfs://, s3:// etc). Hudi stores all the main meta-data about commits, savepoints, cleaning audit logs etc in .hoodie directory under this base path directory.<br></br> +> #### hoodie.bootstrap.base.path +> Base path of the dataset that needs to be bootstrapped as a Hudi table<br></br> > **Default Value**: N/A (Required)<br></br> -> `Config Param: BASE_PATH`<br></br> +> `Config Param: BOOTSTRAP_BASE_PATH`<br></br> --- -> #### hoodie.allow.empty.commit -> Whether to allow generation of empty commits, even if no data was written in the commit. It's useful in cases where extra metadata needs to be published regardless e.g tracking source offsets when ingesting data<br></br> -> **Default Value**: true (Optional)<br></br> -> `Config Param: ALLOW_EMPTY_COMMIT`<br></br> +> #### hoodie.table.create.schema +> Schema used when creating the table, for the first time.<br></br> +> **Default Value**: N/A (Required)<br></br> +> `Config Param: CREATE_SCHEMA`<br></br> --- -> #### hoodie.bulkinsert.user.defined.partitioner.class -> If specified, this class will be used to re-partition records before they are bulk inserted. This can be used to sort, pack, cluster data optimally for common query patterns. 
For now we support a build-in user defined bulkinsert partitioner org.apache.hudi.execution.bulkinsert.RDDCustomColumnsSortPartitioner which can does sorting based on specified column values set by hoodie.bulkinsert.user.defined.partitioner.sort.columns<br></br> +> #### hoodie.timeline.layout.version +> Version of timeline used, by the table.<br></br> > **Default Value**: N/A (Required)<br></br> -> `Config Param: BULKINSERT_USER_DEFINED_PARTITIONER_CLASS_NAME`<br></br> +> `Config Param: TIMELINE_LAYOUT_VERSION`<br></br> + +--- + +> #### hoodie.table.name +> Table name that will be used for registering with Hive. Needs to be same across runs.<br></br> +> **Default Value**: N/A (Required)<br></br> +> `Config Param: NAME`<br></br> + +--- + +> #### hoodie.table.recordkey.fields +> Columns used to uniquely identify the table. Concatenated values of these fields are used as the record key component of HoodieKey.<br></br> +> **Default Value**: N/A (Required)<br></br> +> `Config Param: RECORDKEY_FIELDS`<br></br> + +--- + +> #### hoodie.table.log.file.format +> Log format used for the delta logs.<br></br> +> **Default Value**: HOODIE_LOG (Optional)<br></br> +> `Config Param: LOG_FILE_FORMAT`<br></br> + +--- + +### Memory Configurations {#Memory-Configurations} + +Controls memory usage for compaction and merges, performed internally by Hudi. + +`Config Class`: org.apache.hudi.config.HoodieMemoryConfig<br></br> +> #### hoodie.memory.merge.fraction +> This fraction is multiplied with the user memory fraction (1 - spark.memory.fraction) to get a final fraction of heap space to use during merge<br></br> +> **Default Value**: 0.6 (Optional)<br></br> +> `Config Param: MAX_MEMORY_FRACTION_FOR_MERGE`<br></br> + +--- + +> #### hoodie.memory.dfs.buffer.max.size +> Property to control the max memory for dfs input stream buffer size<br></br> +> **Default Value**: 16777216 (Optional)<br></br> +> `Config Param: MAX_DFS_STREAM_BUFFER_SIZE`<br></br> + +--- + +> #### hoodie.memory.writestatus.failure.fraction +> Property to control how what fraction of the failed record, exceptions we report back to driver. Default is 10%. If set to 100%, with lot of failures, this can cause memory pressure, cause OOMs and mask actual data errors.<br></br> +> **Default Value**: 0.1 (Optional)<br></br> +> `Config Param: WRITESTATUS_FAILURE_FRACTION`<br></br> + +--- + +> #### hoodie.memory.compaction.fraction +> HoodieCompactedLogScanner reads logblocks, converts records to HoodieRecords and then merges these log blocks and records. At any point, the number of entries in a log block can be less than or equal to the number of entries in the corresponding parquet file. This can lead to OOM in the Scanner. Hence, a spillable map helps alleviate the memory pressure. 
Use this config to set the max allowable inMemory footprint of the spillable map<br></br> +> **Default Value**: 0.6 (Optional)<br></br> +> `Config Param: MAX_MEMORY_FRACTION_FOR_COMPACTION`<br></br> + +--- + +> #### hoodie.memory.merge.max.size +> Maximum amount of memory used for merge operations, before spilling to local storage.<br></br> +> **Default Value**: 1073741824 (Optional)<br></br> +> `Config Param: MAX_MEMORY_FOR_MERGE`<br></br> + +--- + +> #### hoodie.memory.spillable.map.path +> Default file path prefix for spillable map<br></br> +> **Default Value**: /tmp/ (Optional)<br></br> +> `Config Param: SPILLABLE_MAP_BASE_PATH`<br></br> + +--- + +> #### hoodie.memory.compaction.max.size +> Maximum amount of memory used for compaction operations, before spilling to local storage.<br></br> +> **Default Value**: N/A (Required)<br></br> +> `Config Param: MAX_MEMORY_FOR_COMPACTION`<br></br> + +--- + +### Storage Configs {#Storage-Configs} + +Configurations that control aspects around writing, sizing, reading base and log files. + +`Config Class`: org.apache.hudi.config.HoodieStorageConfig<br></br> +> #### hoodie.logfile.data.block.max.size +> LogFile Data block max size. This is the maximum size allowed for a single data block to be appended to a log file. This helps to make sure the data appended to the log file is broken up into sizable blocks to prevent from OOM errors. This size should be greater than the JVM memory.<br></br> +> **Default Value**: 268435456 (Optional)<br></br> +> `Config Param: LOGFILE_DATA_BLOCK_MAX_SIZE`<br></br> + +--- + +> #### hoodie.parquet.outputTimestampType +> Sets spark.sql.parquet.outputTimestampType. Parquet timestamp type to use when Spark writes data to Parquet files.<br></br> +> **Default Value**: TIMESTAMP_MILLIS (Optional)<br></br> +> `Config Param: PARQUET_OUTPUT_TIMESTAMP_TYPE`<br></br> + +--- + +> #### hoodie.orc.stripe.size +> Size of the memory buffer in bytes for writing<br></br> +> **Default Value**: 67108864 (Optional)<br></br> +> `Config Param: ORC_STRIPE_SIZE`<br></br> + +--- + +> #### hoodie.orc.block.size +> ORC block size, recommended to be aligned with the target file size.<br></br> +> **Default Value**: 125829120 (Optional)<br></br> +> `Config Param: ORC_BLOCK_SIZE`<br></br> + +--- + +> #### hoodie.orc.compression.codec +> Compression codec to use for ORC base files.<br></br> +> **Default Value**: ZLIB (Optional)<br></br> +> `Config Param: ORC_COMPRESSION_CODEC_NAME`<br></br> + +--- + +> #### hoodie.parquet.max.file.size +> Target size for parquet files produced by Hudi write phases. For DFS, this needs to be aligned with the underlying filesystem block size for optimal performance.<br></br> +> **Default Value**: 125829120 (Optional)<br></br> +> `Config Param: PARQUET_MAX_FILE_SIZE`<br></br> + +--- + +> #### hoodie.hfile.max.file.size +> Target file size for HFile base files.<br></br> +> **Default Value**: 125829120 (Optional)<br></br> +> `Config Param: HFILE_MAX_FILE_SIZE`<br></br> + +--- + +> #### hoodie.parquet.writeLegacyFormat.enabled +> Sets spark.sql.parquet.writeLegacyFormat. If true, data will be written in a way of Spark 1.4 and earlier. For example, decimal values will be written in Parquet's fixed-length byte array format which other systems such as Apache Hive and Apache Impala use. If false, the newer format in Parquet will be used. 
For example, decimals will be written in int-based format.<br></br> +> **Default Value**: false (Optional)<br></br> +> `Config Param: PARQUET_WRITE_LEGACY_FORMAT_ENABLED`<br></br> + +--- + +> #### hoodie.parquet.block.size +> Parquet RowGroup size. It's recommended to make this large enough that scan costs can be amortized by packing enough column values into a single row group.<br></br> +> **Default Value**: 125829120 (Optional)<br></br> +> `Config Param: PARQUET_BLOCK_SIZE`<br></br> + +--- + +> #### hoodie.logfile.max.size +> LogFile max size. This is the maximum size allowed for a log file before it is rolled over to the next version.<br></br> +> **Default Value**: 1073741824 (Optional)<br></br> +> `Config Param: LOGFILE_MAX_SIZE`<br></br> + +--- + +> #### hoodie.parquet.dictionary.enabled +> Whether to use dictionary encoding<br></br> +> **Default Value**: true (Optional)<br></br> +> `Config Param: PARQUET_DICTIONARY_ENABLED`<br></br> + +--- + +> #### hoodie.hfile.block.size +> Lower values increase the size of metadata tracked within HFile, but can offer potentially faster lookup times.<br></br> +> **Default Value**: 1048576 (Optional)<br></br> +> `Config Param: HFILE_BLOCK_SIZE`<br></br> + +--- + +> #### hoodie.parquet.page.size +> Parquet page size. Page is the unit of read within a parquet file. Within a block, pages are compressed separately.<br></br> +> **Default Value**: 1048576 (Optional)<br></br> +> `Config Param: PARQUET_PAGE_SIZE`<br></br> + +--- + +> #### hoodie.hfile.compression.algorithm +> Compression codec to use for hfile base files.<br></br> +> **Default Value**: GZ (Optional)<br></br> +> `Config Param: HFILE_COMPRESSION_ALGORITHM_NAME`<br></br> + +--- + +> #### hoodie.orc.max.file.size +> Target file size for ORC base files.<br></br> +> **Default Value**: 125829120 (Optional)<br></br> +> `Config Param: ORC_FILE_MAX_SIZE`<br></br> + +--- + +> #### hoodie.logfile.to.parquet.compression.ratio +> Expected additional compression as records move from log files to parquet. Used for merge_on_read table to send inserts into log files & control the size of compacted parquet file.<br></br> +> **Default Value**: 0.35 (Optional)<br></br> +> `Config Param: LOGFILE_TO_PARQUET_COMPRESSION_RATIO_FRACTION`<br></br> + +--- + +> #### hoodie.parquet.compression.ratio +> Expected compression of parquet data used by Hudi, when it tries to size new parquet files. Increase this value, if bulk_insert is producing smaller than expected sized files<br></br> +> **Default Value**: 0.1 (Optional)<br></br> +> `Config Param: PARQUET_COMPRESSION_RATIO_FRACTION`<br></br> + +--- + +> #### hoodie.parquet.compression.codec +> Compression Codec for parquet files<br></br> +> **Default Value**: gzip (Optional)<br></br> +> `Config Param: PARQUET_COMPRESSION_CODEC_NAME`<br></br> + +--- + +### Metadata Configs {#Metadata-Configs} + +Configurations used by the Hudi Metadata Table. This table maintains the metadata about a given Hudi table (e.g file listings) to avoid overhead of accessing cloud storage, during queries. + +`Config Class`: org.apache.hudi.common.config.HoodieMetadataConfig<br></br> +> #### hoodie.metadata.compact.max.delta.commits +> Controls how often the metadata table is compacted.<br></br> +> **Default Value**: 10 (Optional)<br></br> +> `Config Param: COMPACT_NUM_DELTA_COMMITS`<br></br> +> `Since Version: 0.7.0`<br></br> + +--- + +> #### hoodie.assume.date.partitioning +> Should HoodieWriteClient assume the data is partitioned by dates, i.e three levels from base path. 
This is a stop-gap to support tables created by versions < 0.3.1. Will be removed eventually<br></br> +> **Default Value**: false (Optional)<br></br> +> `Config Param: ASSUME_DATE_PARTITIONING`<br></br> +> `Since Version: 0.3.0`<br></br> + +--- + +> #### hoodie.metadata.metrics.enable +> Enable publishing of metrics around metadata table.<br></br> +> **Default Value**: false (Optional)<br></br> +> `Config Param: METRICS_ENABLE`<br></br> +> `Since Version: 0.7.0`<br></br> + +--- + +> #### hoodie.metadata.cleaner.commits.retained +> Controls retention/history for metadata table.<br></br> +> **Default Value**: 3 (Optional)<br></br> +> `Config Param: CLEANER_COMMITS_RETAINED`<br></br> +> `Since Version: 0.7.0`<br></br> + +--- + +> #### _hoodie.metadata.ignore.spurious.deletes +> There are cases when extra files are requested to be deleted from metadata table which was never added before. This configdetermines how to handle such spurious deletes<br></br> +> **Default Value**: true (Optional)<br></br> +> `Config Param: IGNORE_SPURIOUS_DELETES`<br></br> +> `Since Version: 0.10.10`<br></br> + +--- + +> #### hoodie.file.listing.parallelism +> Parallelism to use, when listing the table on lake storage.<br></br> +> **Default Value**: 200 (Optional)<br></br> +> `Config Param: FILE_LISTING_PARALLELISM_VALUE`<br></br> +> `Since Version: 0.7.0`<br></br> + +--- + +> #### hoodie.metadata.populate.meta.fields +> When enabled, populates all meta fields. When disabled, no meta fields are populated.<br></br> +> **Default Value**: true (Optional)<br></br> +> `Config Param: POPULATE_META_FIELDS`<br></br> +> `Since Version: 0.10.0`<br></br> + +--- + +> #### hoodie.metadata.enable.full.scan.log.files +> Enable full scanning of log files while reading log records. If disabled, hudi does look up of only interested entries.<br></br> +> **Default Value**: true (Optional)<br></br> +> `Config Param: ENABLE_FULL_SCAN_LOG_FILES`<br></br> +> `Since Version: 0.10.0`<br></br> + +--- + +> #### hoodie.metadata.enable +> Enable the internal metadata table which serves table metadata like level file listings<br></br> +> **Default Value**: false (Optional)<br></br> +> `Config Param: ENABLE`<br></br> +> `Since Version: 0.7.0`<br></br> + +--- + +> #### hoodie.metadata.clean.async +> Enable asynchronous cleaning for metadata table<br></br> +> **Default Value**: false (Optional)<br></br> +> `Config Param: ASYNC_CLEAN_ENABLE`<br></br> +> `Since Version: 0.7.0`<br></br> + +--- + +> #### hoodie.metadata.keep.max.commits +> Controls the archival of the metadata table’s timeline.<br></br> +> **Default Value**: 30 (Optional)<br></br> +> `Config Param: MAX_COMMITS_TO_KEEP`<br></br> +> `Since Version: 0.7.0`<br></br> + +--- + +> #### hoodie.metadata.insert.parallelism +> Parallelism to use when inserting to the metadata table<br></br> +> **Default Value**: 1 (Optional)<br></br> +> `Config Param: INSERT_PARALLELISM_VALUE`<br></br> +> `Since Version: 0.7.0`<br></br> + +--- + +> #### hoodie.metadata.dir.filter.regex +> Directories matching this regex, will be filtered out when initializing metadata table from lake storage for the first time.<br></br> +> **Default Value**: (Optional)<br></br> +> `Config Param: DIR_FILTER_REGEX`<br></br> +> `Since Version: 0.7.0`<br></br> + +--- + +> #### hoodie.metadata.keep.min.commits +> Controls the archival of the metadata table’s timeline.<br></br> +> **Default Value**: 20 (Optional)<br></br> +> `Config Param: MIN_COMMITS_TO_KEEP`<br></br> +> `Since Version: 0.7.0`<br></br> + +--- + +### Consistency Guard 
Configurations {#Consistency-Guard-Configurations} + +The consistency guard related config options, to help talk to eventually consistent object storage.(Tip: S3 is NOT eventually consistent anymore!) + +`Config Class`: org.apache.hudi.common.fs.ConsistencyGuardConfig<br></br> +> #### hoodie.optimistic.consistency.guard.sleep_time_ms +> Amount of time (in ms), to wait after which we assume storage is consistent.<br></br> +> **Default Value**: 500 (Optional)<br></br> +> `Config Param: OPTIMISTIC_CONSISTENCY_GUARD_SLEEP_TIME_MS`<br></br> +> `Since Version: 0.6.0`<br></br> + +--- + +> #### hoodie.consistency.check.max_interval_ms +> Maximum amount of time (in ms), to wait for consistency checking.<br></br> +> **Default Value**: 20000 (Optional)<br></br> +> `Config Param: MAX_CHECK_INTERVAL_MS`<br></br> +> `Since Version: 0.5.0`<br></br> +> `Deprecated Version: 0.7.0`<br></br> + +--- + +> #### _hoodie.optimistic.consistency.guard.enable +> Enable consistency guard, which optimistically assumes consistency is achieved after a certain time period.<br></br> +> **Default Value**: false (Optional)<br></br> +> `Config Param: OPTIMISTIC_CONSISTENCY_GUARD_ENABLE`<br></br> +> `Since Version: 0.6.0`<br></br> + +--- + +> #### hoodie.consistency.check.enabled +> Enabled to handle S3 eventual consistency issue. This property is no longer required since S3 is now strongly consistent. Will be removed in the future releases.<br></br> +> **Default Value**: false (Optional)<br></br> +> `Config Param: ENABLE`<br></br> +> `Since Version: 0.5.0`<br></br> +> `Deprecated Version: 0.7.0`<br></br> + +--- + +> #### hoodie.consistency.check.max_checks +> Maximum number of consistency checks to perform, with exponential backoff.<br></br> +> **Default Value**: 6 (Optional)<br></br> +> `Config Param: MAX_CHECKS`<br></br> +> `Since Version: 0.5.0`<br></br> +> `Deprecated Version: 0.7.0`<br></br> + +--- + +> #### hoodie.consistency.check.initial_interval_ms +> Amount of time (in ms) to wait, before checking for consistency after an operation on storage.<br></br> +> **Default Value**: 400 (Optional)<br></br> +> `Config Param: INITIAL_CHECK_INTERVAL_MS`<br></br> +> `Since Version: 0.5.0`<br></br> +> `Deprecated Version: 0.7.0`<br></br> + +--- + +### Write Configurations {#Write-Configurations} + +Configurations that control write behavior on Hudi tables. These can be directly passed down from even higher level frameworks (e.g Spark datasources, Flink sink) and utilities (e.g DeltaStreamer). + +`Config Class`: org.apache.hudi.config.HoodieWriteConfig<br></br> +> #### hoodie.combine.before.upsert +> When upserted records share same key, controls whether they should be first combined (i.e de-duplicated) before writing to storage. This should be turned off only if you are absolutely certain that there are no duplicates incoming, otherwise it can lead to duplicate keys and violate the uniqueness guarantees.<br></br> +> **Default Value**: true (Optional)<br></br> +> `Config Param: COMBINE_BEFORE_UPSERT`<br></br> + +--- + +> #### hoodie.write.markers.type +> Marker type to use. Two modes are supported: - DIRECT: individual marker file corresponding to each data file is directly created by the writer. - TIMELINE_SERVER_BASED: marker operations are all handled at the timeline service which serves as a proxy. New marker entries are batch processed and stored in a limited number of underlying files for efficiency. If HDFS is used or timeline server is disabled, DIRECT markers are used as fallback even if this is configure. 
For Spark struct [...] +> **Default Value**: TIMELINE_SERVER_BASED (Optional)<br></br> +> `Config Param: MARKERS_TYPE`<br></br> +> `Since Version: 0.9.0`<br></br> + +--- + +> #### hoodie.consistency.check.max_interval_ms +> Max time to wait between successive attempts at performing consistency checks<br></br> +> **Default Value**: 300000 (Optional)<br></br> +> `Config Param: MAX_CONSISTENCY_CHECK_INTERVAL_MS`<br></br> + +--- + +> #### hoodie.embed.timeline.server.port +> Port at which the timeline server listens for requests. When running embedded in each writer, it picks a free port and communicates to all the executors. This should rarely be changed.<br></br> +> **Default Value**: 0 (Optional)<br></br> +> `Config Param: EMBEDDED_TIMELINE_SERVER_PORT_NUM`<br></br> + +--- + +> #### hoodie.table.base.file.format +> <br></br> +> **Default Value**: PARQUET (Optional)<br></br> +> `Config Param: BASE_FILE_FORMAT`<br></br> + +--- + +> #### hoodie.avro.schema.validate +> Validate the schema used for the write against the latest schema, for backwards compatibility.<br></br> +> **Default Value**: false (Optional)<br></br> +> `Config Param: AVRO_SCHEMA_VALIDATE_ENABLE`<br></br> + +--- + +> #### hoodie.write.buffer.limit.bytes +> Size of in-memory buffer used for parallelizing network reads and lake storage writes.<br></br> +> **Default Value**: 4194304 (Optional)<br></br> +> `Config Param: WRITE_BUFFER_LIMIT_BYTES_VALUE`<br></br> + +--- + +> #### hoodie.insert.shuffle.parallelism +> Parallelism for inserting records into the table. Inserts can shuffle data before writing to tune file sizes and optimize the storage layout.<br></br> +> **Default Value**: 200 (Optional)<br></br> +> `Config Param: INSERT_PARALLELISM_VALUE`<br></br> + +--- + +> #### hoodie.embed.timeline.server.async +> Controls whether or not, the requests to the timeline server are processed in asynchronous fashion, potentially improving throughput.<br></br> +> **Default Value**: false (Optional)<br></br> +> `Config Param: EMBEDDED_TIMELINE_SERVER_USE_ASYNC_ENABLE`<br></br> + +--- + +> #### hoodie.rollback.parallelism +> Parallelism for rollback of commits. Rollbacks perform delete of files or logging delete blocks to file groups on storage in parallel.<br></br> +> **Default Value**: 100 (Optional)<br></br> +> `Config Param: ROLLBACK_PARALLELISM_VALUE`<br></br> + +--- + +> #### hoodie.write.status.storage.level +> Write status objects hold metadata about a write (stats, errors), that is not yet committed to storage. This controls the how that information is cached for inspection by clients. We rarely expect this to be changed.<br></br> +> **Default Value**: MEMORY_AND_DISK_SER (Optional)<br></br> +> `Config Param: WRITE_STATUS_STORAGE_LEVEL_VALUE`<br></br> + +--- + +> #### hoodie.writestatus.class +> Subclass of org.apache.hudi.client.WriteStatus to be used to collect information about a write. Can be overridden to collection additional metrics/statistics about the data if needed.<br></br> +> **Default Value**: org.apache.hudi.client.WriteStatus (Optional)<br></br> +> `Config Param: WRITE_STATUS_CLASS_NAME`<br></br> + +--- + +> #### hoodie.base.path +> Base path on lake storage, under which all the table data is stored. Always prefix it explicitly with the storage scheme (e.g hdfs://, s3:// etc). 
Hudi stores all the main meta-data about commits, savepoints, cleaning audit logs etc in .hoodie directory under this base path directory.<br></br> +> **Default Value**: N/A (Required)<br></br> +> `Config Param: BASE_PATH`<br></br> + +--- + +> #### hoodie.allow.empty.commit +> Whether to allow generation of empty commits, even if no data was written in the commit. It's useful in cases where extra metadata needs to be published regardless e.g tracking source offsets when ingesting data<br></br> +> **Default Value**: true (Optional)<br></br> +> `Config Param: ALLOW_EMPTY_COMMIT`<br></br> + +--- + +> #### hoodie.bulkinsert.user.defined.partitioner.class +> If specified, this class will be used to re-partition records before they are bulk inserted. This can be used to sort, pack, cluster data optimally for common query patterns. For now we support a build-in user defined bulkinsert partitioner org.apache.hudi.execution.bulkinsert.RDDCustomColumnsSortPartitioner which can does sorting based on specified column values set by hoodie.bulkinsert.user.defined.partitioner.sort.columns<br></br> +> **Default Value**: N/A (Required)<br></br> +> `Config Param: BULKINSERT_USER_DEFINED_PARTITIONER_CLASS_NAME`<br></br> --- @@ -1278,6 +1823,14 @@ Configurations that control write behavior on Hudi tables. These can be directly --- +> #### hoodie.fileid.prefix.provider.class +> File Id Prefix provider class, that implements `org.apache.hudi.fileid.FileIdPrefixProvider`<br></br> +> **Default Value**: org.apache.hudi.table.RandomFileIdPrefixProvider (Optional)<br></br> +> `Config Param: FILEID_PREFIX_PROVIDER_CLASS`<br></br> +> `Since Version: 0.10.0`<br></br> + +--- + > #### hoodie.fail.on.timeline.archiving > Timeline archiving removes older instants from the timeline, after each > write operation, to minimize metadata overhead. Controls whether or not, the > write should be failed as well, if such archiving fails.<br></br> > **Default Value**: true (Optional)<br></br> @@ -1338,14 +1891,14 @@ Configurations that control write behavior on Hudi tables. These can be directly > #### hoodie.bulkinsert.shuffle.parallelism > For large initial imports using bulk_insert operation, controls the > parallelism to use for sort modes or custom partitioning donebefore writing > records to the table.<br></br> -> **Default Value**: 1500 (Optional)<br></br> +> **Default Value**: 200 (Optional)<br></br> > `Config Param: BULKINSERT_PARALLELISM_VALUE`<br></br> --- > #### hoodie.delete.shuffle.parallelism > Parallelism used for “delete” operation. Delete operations also performs > shuffles, similar to upsert operation.<br></br> -> **Default Value**: 1500 (Optional)<br></br> +> **Default Value**: 200 (Optional)<br></br> > `Config Param: DELETE_PARALLELISM_VALUE`<br></br> --- @@ -1423,7 +1976,7 @@ Configurations that control write behavior on Hudi tables. These can be directly > #### hoodie.upsert.shuffle.parallelism > Parallelism to use for upsert operation on the table. Upserts can shuffle > data to perform index lookups, file sizing, bin packing records > optimallyinto file groups.<br></br> -> **Default Value**: 1500 (Optional)<br></br> +> **Default Value**: 200 (Optional)<br></br> > `Config Param: UPSERT_PARALLELISM_VALUE`<br></br> --- @@ -1436,8 +1989,8 @@ Configurations that control write behavior on Hudi tables. These can be directly --- > #### hoodie.rollback.using.markers -> Enables a more efficient mechanism for rollbacks based on the marker files generated during the writes. 
Turned off by default.<br></br> -> **Default Value**: false (Optional)<br></br> +> Enables a more efficient mechanism for rollbacks based on the marker files generated during the writes. Turned on by default.<br></br> +> **Default Value**: true (Optional)<br></br> > `Config Param: ROLLBACK_USING_MARKERS_ENABLE`<br></br> --- @@ -1479,11 +2032,18 @@ Configurations that control write behavior on Hudi tables. These can be directly > #### hoodie.finalize.write.parallelism > Parallelism for the write finalization internal operation, which involves > removing any partially written files from lake storage, before committing > the write. Reduce this value, if the high number of tasks incur delays for > smaller tables or low latency writes.<br></br> -> **Default Value**: 1500 (Optional)<br></br> +> **Default Value**: 200 (Optional)<br></br> > `Config Param: FINALIZE_WRITE_PARALLELISM_VALUE`<br></br> --- +> #### hoodie.merge.small.file.group.candidates.limit +> Limits number of file groups, whose base file satisfies small-file limit, to consider for appending records during upsert operation. Only applicable to MOR tables<br></br> +> **Default Value**: 1 (Optional)<br></br> +> `Config Param: MERGE_SMALL_FILE_GROUP_CANDIDATES_LIMIT`<br></br> + +--- + > #### hoodie.client.heartbeat.interval_in_ms > Writers perform heartbeats to indicate liveness. Controls how often (in ms), > such heartbeats are registered to lake storage.<br></br> > **Default Value**: 60000 (Optional)<br></br> @@ -1536,7 +2096,7 @@ By default false (the names of partition folders are only partition values)<br>< > #### hoodie.datasource.write.partitionpath.field > Partition path field. Value to be used at the partitionPath component of > HoodieKey. Actual value ontained by invoking .toString()<br></br> -> **Default Value**: partitionpath (Optional)<br></br> +> **Default Value**: N/A (Required)<br></br> > `Config Param: PARTITIONPATH_FIELD_NAME`<br></br> --- @@ -1563,7 +2123,7 @@ Configurations that control indexing behavior (when HBase based indexing is enab --- > #### hoodie.hbase.index.update.partition.path -> Only applies if index type is HBASE. When an already existing record is upserted to a new partition compared to whats in storage, this config when set, will delete old record in old paritition and will insert it as new record in new partition.<br></br> +> Only applies if index type is HBASE. When an already existing record is upserted to a new partition compared to whats in storage, this config when set, will delete old record in old partition and will insert it as new record in new partition.<br></br> > **Default Value**: false (Optional)<br></br> > `Config Param: UPDATE_PARTITION_PATH_ENABLE`<br></br> @@ -1705,48 +2265,96 @@ Configurations that control indexing behavior (when HBase based indexing is enab --- -### Write commit callback configs {#Write-commit-callback-configs} +### Write commit pulsar callback configs {#Write-commit-pulsar-callback-configs} -Controls callback behavior into HTTP endpoints, to push notifications on commits on hudi tables. +Controls notifications sent to pulsar, on events happening to a hudi table. -`Config Class`: org.apache.hudi.config.HoodieWriteCommitCallbackConfig<br></br> -> #### hoodie.write.commit.callback.on -> Turn commit callback on/off. 
off by default.<br></br> -> **Default Value**: false (Optional)<br></br> -> `Config Param: TURN_CALLBACK_ON`<br></br> -> `Since Version: 0.6.0`<br></br> +`Config Class`: org.apache.hudi.utilities.callback.pulsar.HoodieWriteCommitPulsarCallbackConfig<br></br> +> #### hoodie.write.commit.callback.pulsar.operation-timeout +> Duration to wait for an operation to complete.<br></br> +> **Default Value**: 30s (Optional)<br></br> +> `Config Param: OPERATION_TIMEOUT`<br></br> +> `Since Version: 0.11.0`<br></br> --- -> #### hoodie.write.commit.callback.http.url -> Callback host to be sent along with callback messages<br></br> +> #### hoodie.write.commit.callback.pulsar.topic +> Pulsar topic name to publish timeline activity into.<br></br> > **Default Value**: N/A (Required)<br></br> -> `Config Param: CALLBACK_HTTP_URL`<br></br> -> `Since Version: 0.6.0`<br></br> +> `Config Param: TOPIC`<br></br> +> `Since Version: 0.11.0`<br></br> --- -> #### hoodie.write.commit.callback.http.timeout.seconds -> Callback timeout in seconds. 3 by default<br></br> -> **Default Value**: 3 (Optional)<br></br> -> `Config Param: CALLBACK_HTTP_TIMEOUT_IN_SECONDS`<br></br> -> `Since Version: 0.6.0`<br></br> +> #### hoodie.write.commit.callback.pulsar.producer.block-if-queue-full +> When the queue is full, the send call blocks instead of throwing an exception.<br></br> +> **Default Value**: true (Optional)<br></br> +> `Config Param: PRODUCER_BLOCK_QUEUE_FULL`<br></br> +> `Since Version: 0.11.0`<br></br> --- -> #### hoodie.write.commit.callback.class -> Full path of callback class and must be a subclass of HoodieWriteCommitCallback class, org.apache.hudi.callback.impl.HoodieWriteCommitHttpCallback by default<br></br> -> **Default Value**: org.apache.hudi.callback.impl.HoodieWriteCommitHttpCallback (Optional)<br></br> -> `Config Param: CALLBACK_CLASS_NAME`<br></br> -> `Since Version: 0.6.0`<br></br> +> #### hoodie.write.commit.callback.pulsar.producer.send-timeout +> The timeout for each send operation to Pulsar.<br></br> +> **Default Value**: 30s (Optional)<br></br> +> `Config Param: PRODUCER_SEND_TIMEOUT`<br></br> +> `Since Version: 0.11.0`<br></br> --- -> #### hoodie.write.commit.callback.http.api.key -> Http callback API key. 
hudi_write_commit_http_callback by default<br></br> -> **Default Value**: hudi_write_commit_http_callback (Optional)<br></br> -> `Config Param: CALLBACK_HTTP_API_KEY_VALUE`<br></br> -> `Since Version: 0.6.0`<br></br> +> #### hoodie.write.commit.callback.pulsar.broker.service.url +> Server's url of pulsar cluster, to be used for publishing commit metadata.<br></br> +> **Default Value**: N/A (Required)<br></br> +> `Config Param: BROKER_SERVICE_URL`<br></br> +> `Since Version: 0.11.0`<br></br> + +--- + +> #### hoodie.write.commit.callback.pulsar.keepalive-interval +> Duration of keeping alive interval for each client broker connection.<br></br> +> **Default Value**: 30s (Optional)<br></br> +> `Config Param: KEEPALIVE_INTERVAL`<br></br> +> `Since Version: 0.11.0`<br></br> + +--- + +> #### hoodie.write.commit.callback.pulsar.producer.pending-total-size +> The maximum number of pending messages across partitions.<br></br> +> **Default Value**: 50000 (Optional)<br></br> +> `Config Param: PRODUCER_PENDING_SIZE`<br></br> +> `Since Version: 0.11.0`<br></br> + +--- + +> #### hoodie.write.commit.callback.pulsar.request-timeout +> Duration of waiting for completing a request.<br></br> +> **Default Value**: 60s (Optional)<br></br> +> `Config Param: REQUEST_TIMEOUT`<br></br> +> `Since Version: 0.11.0`<br></br> + +--- + +> #### hoodie.write.commit.callback.pulsar.producer.pending-queue-size +> The maximum size of a queue holding pending messages.<br></br> +> **Default Value**: 1000 (Optional)<br></br> +> `Config Param: PRODUCER_PENDING_QUEUE_SIZE`<br></br> +> `Since Version: 0.11.0`<br></br> + +--- + +> #### hoodie.write.commit.callback.pulsar.producer.route-mode +> Message routing logic for producers on partitioned topics.<br></br> +> **Default Value**: RoundRobinPartition (Optional)<br></br> +> `Config Param: PRODUCER_ROUTE_MODE`<br></br> +> `Since Version: 0.11.0`<br></br> + +--- + +> #### hoodie.write.commit.callback.pulsar.connection-timeout +> Duration of waiting for a connection to a broker to be established.<br></br> +> **Default Value**: 10s (Optional)<br></br> +> `Config Param: CONNECTION_TIMEOUT`<br></br> +> `Since Version: 0.11.0`<br></br> --- @@ -1809,7 +2417,7 @@ Configs that control locking mechanisms required for concurrency control betwee --- > #### hoodie.write.lock.zookeeper.lock_key -> Key name under base_path at which to create a ZNode and acquire lock. Final path on zk will look like base_path/lock_key. We recommend setting this to the table name<br></br> +> Key name under base_path at which to create a ZNode and acquire lock. Final path on zk will look like base_path/lock_key. If this parameter is not set, we would set it as the table name<br></br> > **Default Value**: N/A (Required)<br></br> > `Config Param: ZK_LOCK_KEY`<br></br> > `Since Version: 0.8.0`<br></br> @@ -1998,6 +2606,13 @@ Configurations that control compaction (merging of log files onto a new base fil --- +> #### hoodie.compaction.logfile.size.threshold +> Only if the log file size is greater than the threshold in bytes, the file group will be compacted.<br></br> +> **Default Value**: 0 (Optional)<br></br> +> `Config Param: COMPACTION_LOG_FILE_SIZE_THRESHOLD`<br></br> + +--- + > #### hoodie.clean.async > Only applies when hoodie.clean.automatic is turned on. 
When turned on runs > cleaner async with writing, which can speed up overall write > performance.<br></br> > **Default Value**: false (Optional)<br></br> @@ -2068,6 +2683,13 @@ Configurations that control compaction (merging of log files onto a new base fil --- +> #### hoodie.archive.automatic +> When enabled, the archival table service is invoked immediately after each commit, to archive commits if we cross a maximum value of commits. It's recommended to enable this, to ensure number of active commits is bounded.<br></br> +> **Default Value**: true (Optional)<br></br> +> `Config Param: AUTO_ARCHIVE`<br></br> + +--- + > #### hoodie.copyonwrite.insert.auto.split > Config to control whether we control insert split sizes automatically based > on average record sizes. It's recommended to keep this turned on, since hand > tuning is otherwise extremely cumbersome.<br></br> > **Default Value**: true (Optional)<br></br> @@ -2124,6 +2746,13 @@ Configurations that control compaction (merging of log files onto a new base fil --- +> #### hoodie.archive.delete.parallelism +> Parallelism for deleting archived hoodie commits.<br></br> +> **Default Value**: 100 (Optional)<br></br> +> `Config Param: DELETE_ARCHIVED_INSTANT_PARALLELISM_VALUE`<br></br> + +--- + > #### hoodie.copyonwrite.insert.split.size > Number of inserts assigned for each partition/bucket for writing. We based > the default on writing out 100MB files, with at least 1kb records (100K > records per file), and over provision to 500K. As long as auto-tuning of > splits is turned on, this only affects the first write, where there is no > history to learn record sizes from.<br></br> > **Default Value**: 500000 (Optional)<br></br> @@ -2206,216 +2835,38 @@ Configurations that control how file metadata is stored by Hudi, for transaction --- -> #### hoodie.filesystem.view.remote.port -> Port to serve file system view queries, when remote. We expect this to be rarely hand configured.<br></br> -> **Default Value**: 26754 (Optional)<br></br> -> `Config Param: REMOTE_PORT_NUM`<br></br> - ---- - -> #### hoodie.filesystem.view.spillable.bootstrap.base.file.mem.fraction -> Fraction of the file system view memory, to be used for holding mapping to bootstrap base files.<br></br> -> **Default Value**: 0.05 (Optional)<br></br> -> `Config Param: BOOTSTRAP_BASE_FILE_MEM_FRACTION`<br></br> - ---- - -> #### hoodie.filesystem.view.spillable.clustering.mem.fraction -> Fraction of the file system view memory, to be used for holding clustering related metadata.<br></br> -> **Default Value**: 0.01 (Optional)<br></br> -> `Config Param: SPILLABLE_CLUSTERING_MEM_FRACTION`<br></br> - ---- - -> #### hoodie.filesystem.view.rocksdb.base.path -> Path on local storage to use, when storing file system view in embedded kv store/rocksdb.<br></br> -> **Default Value**: /tmp/hoodie_timeline_rocksdb (Optional)<br></br> -> `Config Param: ROCKSDB_BASE_PATH`<br></br> - ---- - -> #### hoodie.filesystem.view.incr.timeline.sync.enable -> Controls whether or not, the file system view is incrementally updated as new actions are performed on the timeline.<br></br> -> **Default Value**: false (Optional)<br></br> -> `Config Param: INCREMENTAL_TIMELINE_SYNC_ENABLE`<br></br> - ---- - -### Table Configurations {#Table-Configurations} - -Configurations that persist across writes and read on a Hudi table like base, log file formats, table name, creation schema, table version layouts. 
Configurations are loaded from hoodie.properties, these properties are usually set during initializing a path as hoodie base path and rarely changes during the lifetime of the table. Writers/Queries' configurations are validated against these each time for compatibility. - -`Config Class`: org.apache.hudi.common.table.HoodieTableConfig<br></br> -> #### hoodie.bootstrap.index.enable -> Whether or not, this is a bootstrapped table, with bootstrap base data and an mapping index defined.<br></br> -> **Default Value**: N/A (Required)<br></br> -> `Config Param: BOOTSTRAP_INDEX_ENABLE`<br></br> - ---- - -> #### hoodie.table.precombine.field -> Field used in preCombining before actual write. By default, when two records have the same key value, the largest value for the precombine field determined by Object.compareTo(..), is picked.<br></br> -> **Default Value**: N/A (Required)<br></br> -> `Config Param: PRECOMBINE_FIELD`<br></br> - ---- - -> #### hoodie.table.partition.fields -> Fields used to partition the table. Concatenated values of these fields are used as the partition path, by invoking toString()<br></br> -> **Default Value**: N/A (Required)<br></br> -> `Config Param: PARTITION_FIELDS`<br></br> - ---- - -> #### hoodie.populate.meta.fields -> When enabled, populates all meta fields. When disabled, no meta fields are populated and incremental queries will not be functional. This is only meant to be used for append only/immutable data for batch processing<br></br> -> **Default Value**: true (Optional)<br></br> -> `Config Param: POPULATE_META_FIELDS`<br></br> - ---- - -> #### hoodie.compaction.payload.class -> Payload class to use for performing compactions, i.e merge delta logs with current base file and then produce a new base file.<br></br> -> **Default Value**: org.apache.hudi.common.model.OverwriteWithLatestAvroPayload (Optional)<br></br> -> `Config Param: PAYLOAD_CLASS_NAME`<br></br> - ---- - -> #### hoodie.archivelog.folder -> path under the meta folder, to store archived timeline instants at.<br></br> -> **Default Value**: archived (Optional)<br></br> -> `Config Param: ARCHIVELOG_FOLDER`<br></br> - ---- - -> #### hoodie.bootstrap.index.class -> Implementation to use, for mapping base files to bootstrap base file, that contain actual data.<br></br> -> **Default Value**: org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex (Optional)<br></br> -> `Config Param: BOOTSTRAP_INDEX_CLASS_NAME`<br></br> - ---- - -> #### hoodie.table.type -> The table type for the underlying data, for this write. 
This can’t change between writes.<br></br> -> **Default Value**: COPY_ON_WRITE (Optional)<br></br> -> `Config Param: TYPE`<br></br> - ---- - -> #### hoodie.table.keygenerator.class -> Key Generator class property for the hoodie table<br></br> -> **Default Value**: N/A (Required)<br></br> -> `Config Param: KEY_GENERATOR_CLASS_NAME`<br></br> - ---- - -> #### hoodie.table.version -> Version of table, used for running upgrade/downgrade steps between releases with potentially breaking/backwards compatible changes.<br></br> -> **Default Value**: ZERO (Optional)<br></br> -> `Config Param: VERSION`<br></br> - ---- - -> #### hoodie.table.base.file.format -> Base file format to store all the base file data.<br></br> -> **Default Value**: PARQUET (Optional)<br></br> -> `Config Param: BASE_FILE_FORMAT`<br></br> - ---- - -> #### hoodie.bootstrap.base.path -> Base path of the dataset that needs to be bootstrapped as a Hudi table<br></br> -> **Default Value**: N/A (Required)<br></br> -> `Config Param: BOOTSTRAP_BASE_PATH`<br></br> - ---- - -> #### hoodie.table.create.schema -> Schema used when creating the table, for the first time.<br></br> -> **Default Value**: N/A (Required)<br></br> -> `Config Param: CREATE_SCHEMA`<br></br> - ---- - -> #### hoodie.timeline.layout.version -> Version of timeline used, by the table.<br></br> -> **Default Value**: N/A (Required)<br></br> -> `Config Param: TIMELINE_LAYOUT_VERSION`<br></br> - ---- - -> #### hoodie.table.name -> Table name that will be used for registering with Hive. Needs to be same across runs.<br></br> -> **Default Value**: N/A (Required)<br></br> -> `Config Param: NAME`<br></br> - ---- - -> #### hoodie.table.recordkey.fields -> Columns used to uniquely identify the table. Concatenated values of these fields are used as the record key component of HoodieKey.<br></br> -> **Default Value**: N/A (Required)<br></br> -> `Config Param: RECORDKEY_FIELDS`<br></br> - ---- - -> #### hoodie.table.log.file.format -> Log format used for the delta logs.<br></br> -> **Default Value**: HOODIE_LOG (Optional)<br></br> -> `Config Param: LOG_FILE_FORMAT`<br></br> - ---- - -### Memory Configurations {#Memory-Configurations} - -Controls memory usage for compaction and merges, performed internally by Hudi. - -`Config Class`: org.apache.hudi.config.HoodieMemoryConfig<br></br> -> #### hoodie.memory.merge.fraction -> This fraction is multiplied with the user memory fraction (1 - spark.memory.fraction) to get a final fraction of heap space to use during merge<br></br> -> **Default Value**: 0.6 (Optional)<br></br> -> `Config Param: MAX_MEMORY_FRACTION_FOR_MERGE`<br></br> - ---- - -> #### hoodie.memory.dfs.buffer.max.size -> Property to control the max memory for dfs input stream buffer size<br></br> -> **Default Value**: 16777216 (Optional)<br></br> -> `Config Param: MAX_DFS_STREAM_BUFFER_SIZE`<br></br> - ---- - -> #### hoodie.memory.writestatus.failure.fraction -> Property to control how what fraction of the failed record, exceptions we report back to driver. Default is 10%. If set to 100%, with lot of failures, this can cause memory pressure, cause OOMs and mask actual data errors.<br></br> -> **Default Value**: 0.1 (Optional)<br></br> -> `Config Param: WRITESTATUS_FAILURE_FRACTION`<br></br> - +> #### hoodie.filesystem.view.remote.port +> Port to serve file system view queries, when remote. 
We expect this to be rarely hand configured.<br></br> +> **Default Value**: 26754 (Optional)<br></br> +> `Config Param: REMOTE_PORT_NUM`<br></br> + --- -> #### hoodie.memory.compaction.fraction -> HoodieCompactedLogScanner reads logblocks, converts records to HoodieRecords and then merges these log blocks and records. At any point, the number of entries in a log block can be less than or equal to the number of entries in the corresponding parquet file. This can lead to OOM in the Scanner. Hence, a spillable map helps alleviate the memory pressure. Use this config to set the max allowable inMemory footprint of the spillable map<br></br> -> **Default Value**: 0.6 (Optional)<br></br> -> `Config Param: MAX_MEMORY_FRACTION_FOR_COMPACTION`<br></br> +> #### hoodie.filesystem.view.spillable.bootstrap.base.file.mem.fraction +> Fraction of the file system view memory, to be used for holding mapping to bootstrap base files.<br></br> +> **Default Value**: 0.05 (Optional)<br></br> +> `Config Param: BOOTSTRAP_BASE_FILE_MEM_FRACTION`<br></br> --- -> #### hoodie.memory.merge.max.size -> Maximum amount of memory used for merge operations, before spilling to local storage.<br></br> -> **Default Value**: 1073741824 (Optional)<br></br> -> `Config Param: MAX_MEMORY_FOR_MERGE`<br></br> +> #### hoodie.filesystem.view.spillable.clustering.mem.fraction +> Fraction of the file system view memory, to be used for holding clustering related metadata.<br></br> +> **Default Value**: 0.01 (Optional)<br></br> +> `Config Param: SPILLABLE_CLUSTERING_MEM_FRACTION`<br></br> --- -> #### hoodie.memory.spillable.map.path -> Default file path prefix for spillable map<br></br> -> **Default Value**: /tmp/ (Optional)<br></br> -> `Config Param: SPILLABLE_MAP_BASE_PATH`<br></br> +> #### hoodie.filesystem.view.rocksdb.base.path +> Path on local storage to use, when storing file system view in embedded kv store/rocksdb.<br></br> +> **Default Value**: /tmp/hoodie_timeline_rocksdb (Optional)<br></br> +> `Config Param: ROCKSDB_BASE_PATH`<br></br> --- -> #### hoodie.memory.compaction.max.size -> Maximum amount of memory used for compaction operations, before spilling to local storage.<br></br> -> **Default Value**: N/A (Required)<br></br> -> `Config Param: MAX_MEMORY_FOR_COMPACTION`<br></br> +> #### hoodie.filesystem.view.incr.timeline.sync.enable +> Controls whether or not, the file system view is incrementally updated as new actions are performed on the timeline.<br></br> +> **Default Value**: false (Optional)<br></br> +> `Config Param: INCREMENTAL_TIMELINE_SYNC_ENABLE`<br></br> --- @@ -2557,116 +3008,6 @@ Configurations that control indexing behavior, which tags incoming records as ei --- -### Storage Configs {#Storage-Configs} - -Configurations that control aspects around writing, sizing, reading base and log files. - -`Config Class`: org.apache.hudi.config.HoodieStorageConfig<br></br> -> #### hoodie.logfile.data.block.max.size -> LogFile Data block max size. This is the maximum size allowed for a single data block to be appended to a log file. This helps to make sure the data appended to the log file is broken up into sizable blocks to prevent from OOM errors. 
This size should be greater than the JVM memory.<br></br> -> **Default Value**: 268435456 (Optional)<br></br> -> `Config Param: LOGFILE_DATA_BLOCK_MAX_SIZE`<br></br> - ---- - -> #### hoodie.orc.stripe.size -> Size of the memory buffer in bytes for writing<br></br> -> **Default Value**: 67108864 (Optional)<br></br> -> `Config Param: ORC_STRIPE_SIZE`<br></br> - ---- - -> #### hoodie.orc.block.size -> ORC block size, recommended to be aligned with the target file size.<br></br> -> **Default Value**: 125829120 (Optional)<br></br> -> `Config Param: ORC_BLOCK_SIZE`<br></br> - ---- - -> #### hoodie.orc.compression.codec -> Compression codec to use for ORC base files.<br></br> -> **Default Value**: ZLIB (Optional)<br></br> -> `Config Param: ORC_COMPRESSION_CODEC_NAME`<br></br> - ---- - -> #### hoodie.parquet.max.file.size -> Target size for parquet files produced by Hudi write phases. For DFS, this needs to be aligned with the underlying filesystem block size for optimal performance.<br></br> -> **Default Value**: 125829120 (Optional)<br></br> -> `Config Param: PARQUET_MAX_FILE_SIZE`<br></br> - ---- - -> #### hoodie.hfile.max.file.size -> Target file size for HFile base files.<br></br> -> **Default Value**: 125829120 (Optional)<br></br> -> `Config Param: HFILE_MAX_FILE_SIZE`<br></br> - ---- - -> #### hoodie.parquet.block.size -> Parquet RowGroup size. It's recommended to make this large enough that scan costs can be amortized by packing enough column values into a single row group.<br></br> -> **Default Value**: 125829120 (Optional)<br></br> -> `Config Param: PARQUET_BLOCK_SIZE`<br></br> - ---- - -> #### hoodie.logfile.max.size -> LogFile max size. This is the maximum size allowed for a log file before it is rolled over to the next version.<br></br> -> **Default Value**: 1073741824 (Optional)<br></br> -> `Config Param: LOGFILE_MAX_SIZE`<br></br> - ---- - -> #### hoodie.hfile.block.size -> Lower values increase the size of metadata tracked within HFile, but can offer potentially faster lookup times.<br></br> -> **Default Value**: 1048576 (Optional)<br></br> -> `Config Param: HFILE_BLOCK_SIZE`<br></br> - ---- - -> #### hoodie.parquet.page.size -> Parquet page size. Page is the unit of read within a parquet file. Within a block, pages are compressed separately.<br></br> -> **Default Value**: 1048576 (Optional)<br></br> -> `Config Param: PARQUET_PAGE_SIZE`<br></br> - ---- - -> #### hoodie.hfile.compression.algorithm -> Compression codec to use for hfile base files.<br></br> -> **Default Value**: GZ (Optional)<br></br> -> `Config Param: HFILE_COMPRESSION_ALGORITHM_NAME`<br></br> - ---- - -> #### hoodie.orc.max.file.size -> Target file size for ORC base files.<br></br> -> **Default Value**: 125829120 (Optional)<br></br> -> `Config Param: ORC_FILE_MAX_SIZE`<br></br> - ---- - -> #### hoodie.logfile.to.parquet.compression.ratio -> Expected additional compression as records move from log files to parquet. Used for merge_on_read table to send inserts into log files & control the size of compacted parquet file.<br></br> -> **Default Value**: 0.35 (Optional)<br></br> -> `Config Param: LOGFILE_TO_PARQUET_COMPRESSION_RATIO_FRACTION`<br></br> - ---- - -> #### hoodie.parquet.compression.ratio -> Expected compression of parquet data used by Hudi, when it tries to size new parquet files. 
Increase this value, if bulk_insert is producing smaller than expected sized files<br></br> -> **Default Value**: 0.1 (Optional)<br></br> -> `Config Param: PARQUET_COMPRESSION_RATIO_FRACTION`<br></br> - ---- - -> #### hoodie.parquet.compression.codec -> Compression Codec for parquet files<br></br> -> **Default Value**: gzip (Optional)<br></br> -> `Config Param: PARQUET_COMPRESSION_CODEC_NAME`<br></br> - ---- - ### Clustering Configs {#Clustering-Configs} Configurations that control the clustering table service in hudi, which optimizes the storage layout for better query performance by sorting and sizing data files. @@ -2674,7 +3015,7 @@ Configurations that control the clustering table service in hudi, which optimize `Config Class`: org.apache.hudi.config.HoodieClusteringConfig<br></br> > #### hoodie.clustering.preserve.commit.metadata > When rewriting data, preserves existing hoodie_commit_time<br></br> -> **Default Value**: false (Optional)<br></br> +> **Default Value**: true (Optional)<br></br> > `Config Param: PRESERVE_COMMIT_METADATA`<br></br> > `Since Version: 0.9.0`<br></br> @@ -2688,6 +3029,22 @@ Configurations that control the clustering table service in hudi, which optimize --- +> #### hoodie.layout.optimize.curve.build.method +> Controls how data is sampled to build the space-filling curves. Two methods: `direct`, `sample`. The direct method is faster than sampling, but the sample method produces a better data layout.<br></br> +> **Default Value**: direct (Optional)<br></br> +> `Config Param: LAYOUT_OPTIMIZE_CURVE_BUILD_METHOD`<br></br> +> `Since Version: 0.10.0`<br></br> + +--- + +> #### hoodie.clustering.rollback.pending.replacecommit.on.conflict +> If updates are allowed to file groups pending clustering, then set this config to rollback failed or pending clustering instants. Pending clustering will be rolled back ONLY IF there is conflict between incoming upsert and filegroup to be clustered. Please exercise caution while setting this config, especially when clustering is done very frequently. This could lead to race condition in rare scenarios, for example, when the clustering completes after instants are fetched but before rol [...] +> **Default Value**: false (Optional)<br></br> +> `Config Param: ROLLBACK_PENDING_CLUSTERING_ON_CONFLICT`<br></br> +> `Since Version: 0.10.0`<br></br> + +--- + > #### hoodie.clustering.async.max.commits > Config to control frequency of async clustering<br></br> > **Default Value**: 4 (Optional)<br></br> @@ -2696,6 +3053,14 @@ Configurations that control the clustering table service in hudi, which optimize --- +> #### hoodie.layout.optimize.data.skipping.enable +> Enable data skipping by collecting statistics once layout optimization is complete.<br></br> +> **Default Value**: true (Optional)<br></br> +> `Config Param: LAYOUT_OPTIMIZE_DATA_SKIPPING_ENABLE`<br></br> +> `Since Version: 0.10.0`<br></br> + +--- + > #### hoodie.clustering.inline.max.commits > Config to control frequency of clustering planning<br></br> > **Default Value**: 4 (Optional)<br></br> @@ -2704,6 +3069,14 @@ Configurations that control the clustering table service in hudi, which optimize --- +> #### hoodie.layout.optimize.enable +> Enable use of z-ordering/space-filling curves to optimize the layout of the table to boost query performance. 
This parameter takes precedence over clustering strategy set using hoodie.clustering.execution.strategy.class<br></br> +> **Default Value**: false (Optional)<br></br> +> `Config Param: LAYOUT_OPTIMIZE_ENABLE`<br></br> +> `Since Version: 0.10.0`<br></br> + +--- + > #### hoodie.clustering.plan.strategy.target.file.max.bytes > Each group can produce 'N' > (CLUSTERING_MAX_GROUP_SIZE/CLUSTERING_TARGET_FILE_SIZE) output file > groups<br></br> > **Default Value**: 1073741824 (Optional)<br></br> @@ -2753,170 +3126,77 @@ Configurations that control the clustering table service in hudi, which optimize --- > #### hoodie.clustering.plan.strategy.class -> Config to provide a strategy class (subclass of ClusteringPlanStrategy) to create clustering plan i.e select what file groups are being clustered. Default strategy, looks at the last N (determined by hoodie.clustering.plan.strategy.daybased.lookback.partitions) day based partitions picks the small file slices within those partitions.<br></br> -> **Default Value**: org.apache.hudi.client.clustering.plan.strategy.SparkRecentDaysClusteringPlanStrategy (Optional)<br></br> +> Config to provide a strategy class (subclass of ClusteringPlanStrategy) to create clustering plan i.e select what file groups are being clustered. Default strategy, looks at the clustering small file size limit (determined by hoodie.clustering.plan.strategy.small.file.limit) to pick the small file slices within partitions for clustering.<br></br> +> **Default Value**: org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy (Optional)<br></br> > `Config Param: PLAN_STRATEGY_CLASS_NAME`<br></br> > `Since Version: 0.7.0`<br></br> --- -> #### hoodie.clustering.updates.strategy -> Determines how to handle updates, deletes to file groups that are under clustering. Default strategy just rejects the update<br></br> -> **Default Value**: org.apache.hudi.client.clustering.update.strategy.SparkRejectUpdateStrategy (Optional)<br></br> -> `Config Param: UPDATES_STRATEGY`<br></br> -> `Since Version: 0.7.0`<br></br> - ---- - -> #### hoodie.clustering.inline -> Turn on inline clustering - clustering will be run after each write operation is complete<br></br> -> **Default Value**: false (Optional)<br></br> -> `Config Param: INLINE_CLUSTERING`<br></br> -> `Since Version: 0.7.0`<br></br> - ---- - -> #### hoodie.clustering.plan.strategy.sort.columns -> Columns to sort the data by when clustering<br></br> -> **Default Value**: N/A (Required)<br></br> -> `Config Param: PLAN_STRATEGY_SORT_COLUMNS`<br></br> -> `Since Version: 0.7.0`<br></br> - ---- - -> #### hoodie.clustering.plan.strategy.daybased.lookback.partitions -> Number of partitions to list to create ClusteringPlan<br></br> -> **Default Value**: 2 (Optional)<br></br> -> `Config Param: DAYBASED_LOOKBACK_PARTITIONS`<br></br> -> `Since Version: 0.7.0`<br></br> - ---- - -### Common Configurations {#Common-Configurations} - -The following set of configurations are common across Hudi. - -`Config Class`: org.apache.hudi.common.config.HoodieCommonConfig<br></br> -> #### hoodie.common.diskmap.compression.enabled -> Turn on compression for BITCASK disk map used by the External Spillable Map<br></br> -> **Default Value**: true (Optional)<br></br> -> `Config Param: DISK_MAP_BITCASK_COMPRESSION_ENABLED`<br></br> - ---- - -> #### hoodie.common.spillable.diskmap.type -> When handling input data that cannot be held in memory, to merge with a file on storage, a spillable diskmap is employed. 
By default, we use a persistent hashmap based loosely on bitcask, that offers O(1) inserts, lookups. Change this to `ROCKS_DB` to prefer using rocksDB, for handling the spill.<br></br> -> **Default Value**: BITCASK (Optional)<br></br> -> `Config Param: SPILLABLE_DISK_MAP_TYPE`<br></br> - ---- - -### Metadata Configs {#Metadata-Configs} - -Configurations used by the Hudi Metadata Table. This table maintains the metadata about a given Hudi table (e.g file listings) to avoid overhead of accessing cloud storage, during queries. - -`Config Class`: org.apache.hudi.common.config.HoodieMetadataConfig<br></br> -> #### hoodie.metadata.compact.max.delta.commits -> Controls how often the metadata table is compacted.<br></br> -> **Default Value**: 24 (Optional)<br></br> -> `Config Param: COMPACT_NUM_DELTA_COMMITS`<br></br> -> `Since Version: 0.7.0`<br></br> - ---- - -> #### hoodie.assume.date.partitioning -> Should HoodieWriteClient assume the data is partitioned by dates, i.e three levels from base path. This is a stop-gap to support tables created by versions < 0.3.1. Will be removed eventually<br></br> -> **Default Value**: false (Optional)<br></br> -> `Config Param: ASSUME_DATE_PARTITIONING`<br></br> -> `Since Version: 0.3.0`<br></br> - ---- - -> #### hoodie.metadata.validate -> Validate contents of metadata table on each access; e.g against the actual listings from lake storage<br></br> -> **Default Value**: false (Optional)<br></br> -> `Config Param: VALIDATE_ENABLE`<br></br> -> `Since Version: 0.7.0`<br></br> - ---- - -> #### hoodie.metadata.metrics.enable -> Enable publishing of metrics around metadata table.<br></br> -> **Default Value**: false (Optional)<br></br> -> `Config Param: METRICS_ENABLE`<br></br> -> `Since Version: 0.7.0`<br></br> - ---- - -> #### hoodie.metadata.cleaner.commits.retained -> Controls retention/history for metadata table.<br></br> -> **Default Value**: 3 (Optional)<br></br> -> `Config Param: CLEANER_COMMITS_RETAINED`<br></br> -> `Since Version: 0.7.0`<br></br> - ---- - -> #### hoodie.file.listing.parallelism -> Parallelism to use, when listing the table on lake storage.<br></br> -> **Default Value**: 1500 (Optional)<br></br> -> `Config Param: FILE_LISTING_PARALLELISM_VALUE`<br></br> -> `Since Version: 0.7.0`<br></br> +> #### hoodie.layout.optimize.build.curve.sample.size +> when settinghoodie.layout.optimize.curve.build.method to `sample`, the amount of sampling to be done.Large sample size leads to better results, at the expense of more memory usage.<br></br> +> **Default Value**: 200000 (Optional)<br></br> +> `Config Param: LAYOUT_OPTIMIZE_BUILD_CURVE_SAMPLE_SIZE`<br></br> +> `Since Version: 0.10.0`<br></br> --- -> #### hoodie.metadata.enable -> Enable the internal metadata table which serves table metadata like level file listings<br></br> -> **Default Value**: false (Optional)<br></br> -> `Config Param: ENABLE`<br></br> +> #### hoodie.clustering.updates.strategy +> Determines how to handle updates, deletes to file groups that are under clustering. 
Default strategy just rejects the update<br></br> +> **Default Value**: org.apache.hudi.client.clustering.update.strategy.SparkRejectUpdateStrategy (Optional)<br></br> +> `Config Param: UPDATES_STRATEGY`<br></br> > `Since Version: 0.7.0`<br></br> --- -> #### hoodie.metadata.sync.enable -> Enable syncing of metadata table from actions on the dataset<br></br> -> **Default Value**: true (Optional)<br></br> -> `Config Param: SYNC_ENABLE`<br></br> -> `Since Version: 0.9.0`<br></br> +> #### hoodie.layout.optimize.strategy +> Type of layout optimization to be applied, current only supports `z-order` and `hilbert` curves.<br></br> +> **Default Value**: z-order (Optional)<br></br> +> `Config Param: LAYOUT_OPTIMIZE_STRATEGY`<br></br> +> `Since Version: 0.10.0`<br></br> --- -> #### hoodie.metadata.clean.async -> Enable asynchronous cleaning for metadata table<br></br> +> #### hoodie.clustering.inline +> Turn on inline clustering - clustering will be run after each write operation is complete<br></br> > **Default Value**: false (Optional)<br></br> -> `Config Param: ASYNC_CLEAN_ENABLE`<br></br> +> `Config Param: INLINE_CLUSTERING`<br></br> > `Since Version: 0.7.0`<br></br> --- -> #### hoodie.metadata.keep.max.commits -> Controls the archival of the metadata table’s timeline.<br></br> -> **Default Value**: 30 (Optional)<br></br> -> `Config Param: MAX_COMMITS_TO_KEEP`<br></br> +> #### hoodie.clustering.plan.strategy.sort.columns +> Columns to sort the data by when clustering<br></br> +> **Default Value**: N/A (Required)<br></br> +> `Config Param: PLAN_STRATEGY_SORT_COLUMNS`<br></br> > `Since Version: 0.7.0`<br></br> --- -> #### hoodie.metadata.insert.parallelism -> Parallelism to use when inserting to the metadata table<br></br> -> **Default Value**: 1 (Optional)<br></br> -> `Config Param: INSERT_PARALLELISM_VALUE`<br></br> +> #### hoodie.clustering.plan.strategy.daybased.lookback.partitions +> Number of partitions to list to create ClusteringPlan<br></br> +> **Default Value**: 2 (Optional)<br></br> +> `Config Param: DAYBASED_LOOKBACK_PARTITIONS`<br></br> > `Since Version: 0.7.0`<br></br> --- -> #### hoodie.metadata.dir.filter.regex -> Directories matching this regex, will be filtered out when initializing metadata table from lake storage for the first time.<br></br> -> **Default Value**: (Optional)<br></br> -> `Config Param: DIR_FILTER_REGEX`<br></br> -> `Since Version: 0.7.0`<br></br> +### Common Configurations {#Common-Configurations} + +The following set of configurations are common across Hudi. + +`Config Class`: org.apache.hudi.common.config.HoodieCommonConfig<br></br> +> #### hoodie.common.diskmap.compression.enabled +> Turn on compression for BITCASK disk map used by the External Spillable Map<br></br> +> **Default Value**: true (Optional)<br></br> +> `Config Param: DISK_MAP_BITCASK_COMPRESSION_ENABLED`<br></br> --- -> #### hoodie.metadata.keep.min.commits -> Controls the archival of the metadata table’s timeline.<br></br> -> **Default Value**: 20 (Optional)<br></br> -> `Config Param: MIN_COMMITS_TO_KEEP`<br></br> -> `Since Version: 0.7.0`<br></br> +> #### hoodie.common.spillable.diskmap.type +> When handling input data that cannot be held in memory, to merge with a file on storage, a spillable diskmap is employed. By default, we use a persistent hashmap based loosely on bitcask, that offers O(1) inserts, lookups. 
Change this to `ROCKS_DB` to prefer using rocksDB, for handling the spill.<br></br> +> **Default Value**: BITCASK (Optional)<br></br> +> `Config Param: SPILLABLE_DISK_MAP_TYPE`<br></br> --- @@ -3012,11 +3292,19 @@ These set of configs are used to enable monitoring and reporting of keyHudi stat Enables reporting on Hudi metrics using the Datadog reporter type. Hudi publishes metrics on every commit, clean, rollback etc. -`Config Class`: org.apache.hudi.config.HoodieMetricsDatadogConfig<br></br> -> #### hoodie.metrics.datadog.api.key.skip.validation -> Before sending metrics via Datadog API, whether to skip validating Datadog API key or not. Default to false.<br></br> -> **Default Value**: false (Optional)<br></br> -> `Config Param: API_KEY_SKIP_VALIDATION`<br></br> +`Config Class`: org.apache.hudi.config.metrics.HoodieMetricsDatadogConfig<br></br> +> #### hoodie.metrics.datadog.api.timeout.seconds +> Datadog API timeout in seconds. Default to 3.<br></br> +> **Default Value**: 3 (Optional)<br></br> +> `Config Param: API_TIMEOUT_IN_SECONDS`<br></br> +> `Since Version: 0.6.0`<br></br> + +--- + +> #### hoodie.metrics.datadog.metric.prefix +> Datadog metric prefix to be prepended to each metric name with a dot as delimiter. For example, if it is set to foo, foo. will be prepended.<br></br> +> **Default Value**: N/A (Required)<br></br> +> `Config Param: METRIC_PREFIX_VALUE`<br></br> > `Since Version: 0.6.0`<br></br> --- @@ -3029,6 +3317,14 @@ Enables reporting on Hudi metrics using the Datadog reporter type. Hudi publishe --- +> #### hoodie.metrics.datadog.api.key.skip.validation +> Before sending metrics via Datadog API, whether to skip validating Datadog API key or not. Default to false.<br></br> +> **Default Value**: false (Optional)<br></br> +> `Config Param: API_KEY_SKIP_VALIDATION`<br></br> +> `Since Version: 0.6.0`<br></br> + +--- + > #### hoodie.metrics.datadog.metric.host > Datadog metric host to be sent along with metrics data.<br></br> > **Default Value**: N/A (Required)<br></br> @@ -3037,18 +3333,18 @@ Enables reporting on Hudi metrics using the Datadog reporter type. Hudi publishe --- -> #### hoodie.metrics.datadog.metric.prefix -> Datadog metric prefix to be prepended to each metric name with a dot as delimiter. For example, if it is set to foo, foo. will be prepended.<br></br> -> **Default Value**: N/A (Required)<br></br> -> `Config Param: METRIC_PREFIX_VALUE`<br></br> +> #### hoodie.metrics.datadog.report.period.seconds +> Datadog reporting period in seconds. Default to 30.<br></br> +> **Default Value**: 30 (Optional)<br></br> +> `Config Param: REPORT_PERIOD_IN_SECONDS`<br></br> > `Since Version: 0.6.0`<br></br> --- -> #### hoodie.metrics.datadog.api.timeout.seconds -> Datadog API timeout in seconds. Default to 3.<br></br> -> **Default Value**: 3 (Optional)<br></br> -> `Config Param: API_TIMEOUT_IN_SECONDS`<br></br> +> #### hoodie.metrics.datadog.api.key +> Datadog API key<br></br> +> **Default Value**: N/A (Required)<br></br> +> `Config Param: API_KEY`<br></br> > `Since Version: 0.6.0`<br></br> --- @@ -3069,29 +3365,71 @@ Enables reporting on Hudi metrics using the Datadog reporter type. Hudi publishe --- -> #### hoodie.metrics.datadog.report.period.seconds -> Datadog reporting period in seconds. Default to 30.<br></br> -> **Default Value**: 30 (Optional)<br></br> -> `Config Param: REPORT_PERIOD_IN_SECONDS`<br></br> -> `Since Version: 0.6.0`<br></br> +### Metrics Configurations {#Metrics-Configurations} ---- +Enables reporting on Hudi metrics. 
Hudi publishes metrics on every commit, clean, rollback etc. The following sections list the supported reporters. -> #### hoodie.metrics.datadog.api.key -> Datadog API key<br></br> +`Config Class`: org.apache.hudi.config.metrics.HoodieMetricsConfig<br></br> +> #### hoodie.metrics.executor.enable +> <br></br> > **Default Value**: N/A (Required)<br></br> -> `Config Param: API_KEY`<br></br> +> `Config Param: EXECUTOR_METRICS_ENABLE`<br></br> +> `Since Version: 0.7.0`<br></br> + +--- + +> #### hoodie.metrics.reporter.type +> Type of metrics reporter.<br></br> +> **Default Value**: GRAPHITE (Optional)<br></br> +> `Config Param: METRICS_REPORTER_TYPE_VALUE`<br></br> +> `Since Version: 0.5.0`<br></br> + +--- + +> #### hoodie.metrics.reporter.class +> <br></br> +> **Default Value**: (Optional)<br></br> +> `Config Param: METRICS_REPORTER_CLASS_NAME`<br></br> > `Since Version: 0.6.0`<br></br> --- +> #### hoodie.metrics.on +> Turn on/off metrics reporting. off by default.<br></br> +> **Default Value**: false (Optional)<br></br> +> `Config Param: TURN_METRICS_ON`<br></br> +> `Since Version: 0.5.0`<br></br> + +--- + +### Metrics Configurations for Jmx {#Metrics-Configurations-for-Jmx} + +Enables reporting on Hudi metrics using Jmx. Hudi publishes metrics on every commit, clean, rollback etc. + +`Config Class`: org.apache.hudi.config.metrics.HoodieMetricsJmxConfig<br></br> +> #### hoodie.metrics.jmx.host +> Jmx host to connect to<br></br> +> **Default Value**: localhost (Optional)<br></br> +> `Config Param: JMX_HOST_NAME`<br></br> +> `Since Version: 0.5.1`<br></br> + +--- + +> #### hoodie.metrics.jmx.port +> Jmx port to connect to<br></br> +> **Default Value**: 9889 (Optional)<br></br> +> `Config Param: JMX_PORT_NUM`<br></br> +> `Since Version: 0.5.1`<br></br> + +--- + ### Metrics Configurations for Prometheus {#Metrics-Configurations-for-Prometheus} Enables reporting on Hudi metrics using Prometheus. Hudi publishes metrics on every commit, clean, rollback etc. -`Config Class`: org.apache.hudi.config.HoodieMetricsPrometheusConfig<br></br> +`Config Class`: org.apache.hudi.config.metrics.HoodieMetricsPrometheusConfig<br></br> > #### hoodie.metrics.pushgateway.random.job.name.suffix -> <br></br> +> Whether the pushgateway name need a random suffix , default true.<br></br> > **Default Value**: true (Optional)<br></br> > `Config Param: PUSHGATEWAY_RANDOM_JOBNAME_SUFFIX`<br></br> > `Since Version: 0.6.0`<br></br> @@ -3107,7 +3445,7 @@ Enables reporting on Hudi metrics using Prometheus. Hudi publishes metrics on e --- > #### hoodie.metrics.pushgateway.delete.on.shutdown -> <br></br> +> Delete the pushgateway info or not when job shutdown, true by default.<br></br> > **Default Value**: true (Optional)<br></br> > `Config Param: PUSHGATEWAY_DELETE_ON_SHUTDOWN_ENABLE`<br></br> > `Since Version: 0.6.0`<br></br> @@ -3139,82 +3477,79 @@ Enables reporting on Hudi metrics using Prometheus. Hudi publishes metrics on e --- > #### hoodie.metrics.pushgateway.host -> Hostname of the prometheus push gateway<br></br> +> Hostname of the prometheus push gateway.<br></br> > **Default Value**: localhost (Optional)<br></br> > `Config Param: PUSHGATEWAY_HOST_NAME`<br></br> > `Since Version: 0.6.0`<br></br> --- -### Metrics Configurations {#Metrics-Configurations} +### Metrics Configurations for Amazon CloudWatch {#Metrics-Configurations-for-Amazon-CloudWatch} -Enables reporting on Hudi metrics. Hudi publishes metrics on every commit, clean, rollback etc. The following sections list the supported reporters. 
+Enables reporting on Hudi metrics using Amazon CloudWatch. Hudi publishes metrics on every commit, clean, rollback etc. -`Config Class`: org.apache.hudi.config.HoodieMetricsConfig<br></br> -> #### hoodie.metrics.reporter.type -> Type of metrics reporter.<br></br> -> **Default Value**: GRAPHITE (Optional)<br></br> -> `Config Param: METRICS_REPORTER_TYPE_VALUE`<br></br> -> `Since Version: 0.5.0`<br></br> +`Config Class`: org.apache.hudi.config.HoodieMetricsCloudWatchConfig<br></br> +> #### hoodie.metrics.cloudwatch.report.period.seconds +> Reporting interval in seconds<br></br> +> **Default Value**: 60 (Optional)<br></br> +> `Config Param: REPORT_PERIOD_SECONDS`<br></br> +> `Since Version: 0.10.0`<br></br> --- -> #### hoodie.metrics.jmx.host -> Jmx host to connect to<br></br> -> **Default Value**: localhost (Optional)<br></br> -> `Config Param: JMX_HOST_NAME`<br></br> -> `Since Version: 0.5.1`<br></br> +> #### hoodie.metrics.cloudwatch.namespace +> Namespace of reporter<br></br> +> **Default Value**: Hudi (Optional)<br></br> +> `Config Param: METRIC_NAMESPACE`<br></br> +> `Since Version: 0.10.0`<br></br> --- -> #### hoodie.metrics.reporter.class -> <br></br> +> #### hoodie.metrics.cloudwatch.metric.prefix +> Metric prefix of reporter<br></br> > **Default Value**: (Optional)<br></br> -> `Config Param: METRICS_REPORTER_CLASS_NAME`<br></br> -> `Since Version: 0.6.0`<br></br> +> `Config Param: METRIC_PREFIX`<br></br> +> `Since Version: 0.10.0`<br></br> + +--- + +> #### hoodie.metrics.cloudwatch.maxDatumsPerRequest +> Max number of Datums per request<br></br> +> **Default Value**: 20 (Optional)<br></br> +> `Config Param: MAX_DATUMS_PER_REQUEST`<br></br> +> `Since Version: 0.10.0`<br></br> --- +### Metrics Configurations for Graphite {#Metrics-Configurations-for-Graphite} + +Enables reporting on Hudi metrics using Graphite. Hudi publishes metrics on every commit, clean, rollback etc. + +`Config Class`: org.apache.hudi.config.metrics.HoodieMetricsGraphiteConfig<br></br> > #### hoodie.metrics.graphite.port -> Graphite port to connect to<br></br> +> Graphite port to connect to.<br></br> > **Default Value**: 4756 (Optional)<br></br> > `Config Param: GRAPHITE_SERVER_PORT_NUM`<br></br> > `Since Version: 0.5.0`<br></br> --- -> #### hoodie.metrics.executor.enable -> <br></br> -> **Default Value**: N/A (Required)<br></br> -> `Config Param: EXECUTOR_METRICS_ENABLE`<br></br> -> `Since Version: 0.7.0`<br></br> - ---- - -> #### hoodie.metrics.jmx.port -> Jmx port to connect to<br></br> -> **Default Value**: 9889 (Optional)<br></br> -> `Config Param: JMX_PORT_NUM`<br></br> -> `Since Version: 0.5.1`<br></br> +> #### hoodie.metrics.graphite.report.period.seconds +> Graphite reporting period in seconds. Default to 30.<br></br> +> **Default Value**: 30 (Optional)<br></br> +> `Config Param: GRAPHITE_REPORT_PERIOD_IN_SECONDS`<br></br> +> `Since Version: 0.10.0`<br></br> --- > #### hoodie.metrics.graphite.host -> Graphite host to connect to<br></br> +> Graphite host to connect to.<br></br> > **Default Value**: localhost (Optional)<br></br> > `Config Param: GRAPHITE_SERVER_HOST_NAME`<br></br> > `Since Version: 0.5.0`<br></br> --- -> #### hoodie.metrics.on -> Turn on/off metrics reporting. off by default.<br></br> -> **Default Value**: false (Optional)<br></br> -> `Config Param: TURN_METRICS_ON`<br></br> -> `Since Version: 0.5.0`<br></br> - ---- - > #### hoodie.metrics.graphite.metric.prefix > Standard prefix applied to all metrics. 
This helps to add datacenter, > environment information for e.g<br></br> > **Default Value**: N/A (Required)<br></br> > `Config Param: GRAPHITE_METRIC_PREFIX_VALUE`<br></br> > `Since Version: 0.5.1`<br></br> --- @@ -3245,14 +3580,106 @@ Payload related configs, that can be leveraged to control merges based on specif --- -## Environment Config {#ENVIRONMENT_CONFIG} -Hudi supports passing configurations via a configuration file `hudi-default.conf` in which each line consists of a key and a value separated by whitespace or = sign. For example: -``` -hoodie.datasource.hive_sync.mode jdbc -hoodie.datasource.hive_sync.jdbcurl jdbc:hive2://localhost:10000 -hoodie.datasource.hive_sync.support_timestamp false -``` -It helps to have a central configuration file for your common cross job configurations/tunings, so all the jobs on your cluster can utilize it. It also works with Spark SQL DML/DDL, and helps avoid having to pass configs inside the SQL statements. +## Kafka Connect Configs {#KAFKA_CONNECT} +These configs are used by the Kafka Connect Sink Connector for writing Hudi tables + +### Kafka Sink Connect Configurations {#Kafka-Sink-Connect-Configurations} + +Configurations for Kafka Connect Sink Connector for Hudi. + +`Config Class`: org.apache.hudi.connect.writers.KafkaConnectConfigs<br></br> +> #### hoodie.kafka.coordinator.write.timeout.secs +> The timeout, after sending an END_COMMIT, for which the coordinator waits for the write statuses from all the partitions, before ignoring the current commit and starting a new commit.<br></br> +> **Default Value**: 300 (Optional)<br></br> +> `Config Param: COORDINATOR_WRITE_TIMEOUT_SECS`<br></br> + +--- + +> #### hoodie.meta.sync.classes +> Meta sync client tool; use a comma to separate multiple tools<br></br> +> **Default Value**: org.apache.hudi.hive.HiveSyncTool (Optional)<br></br> +> `Config Param: META_SYNC_CLASSES`<br></br> + +--- + +> #### hoodie.kafka.allow.commit.on.errors +> Commit even when some records failed to be written<br></br> +> **Default Value**: true (Optional)<br></br> +> `Config Param: ALLOW_COMMIT_ON_ERRORS`<br></br> + +--- + +> #### hoodie.meta.sync.enable +> Enable Meta Sync such as Hive<br></br> +> **Default Value**: false (Optional)<br></br> +> `Config Param: META_SYNC_ENABLE`<br></br> + +--- + +> #### hoodie.kafka.commit.interval.secs +> The interval at which Hudi will commit the records written to the files, making them consumable on the read-side.<br></br> +> **Default Value**: 60 (Optional)<br></br> +> `Config Param: COMMIT_INTERVAL_SECS`<br></br> + +--- + +> #### hoodie.kafka.control.topic +> Kafka topic name used by the Hudi Sink Connector for sending and receiving control messages. 
Not used for data records.<br></br> +> **Default Value**: hudi-control-topic (Optional)<br></br> +> `Config Param: CONTROL_TOPIC_NAME`<br></br> + +--- + +> #### bootstrap.servers +> The bootstrap servers for the Kafka Cluster.<br></br> +> **Default Value**: localhost:9092 (Optional)<br></br> +> `Config Param: KAFKA_BOOTSTRAP_SERVERS`<br></br> + +--- + +> #### hoodie.schemaprovider.class +> Subclass of org.apache.hudi.schema.SchemaProvider to attach schemas to input & target table data; built-in options: org.apache.hudi.schema.FilebasedSchemaProvider.<br></br> +> **Default Value**: org.apache.hudi.schema.FilebasedSchemaProvider (Optional)<br></br> +> `Config Param: SCHEMA_PROVIDER_CLASS`<br></br> + +> #### hoodie.kafka.compaction.async.enable +> Controls whether async compaction should be turned on for MOR table writing.<br></br> +> **Default Value**: true (Optional)<br></br> +> `Config Param: ASYNC_COMPACT_ENABLE`<br></br> -By default, Hudi would load the configuration file under `/etc/hudi/conf` directory. You can specify a different configuration directory location by setting the `HUDI_CONF_DIR` environment variable. +--- + +## Amazon Web Services Configs {#AWS} +Configurations specific to Amazon Web Services, used when Hudi accesses AWS resources such as Amazon DynamoDB (for locks) and Amazon CloudWatch (metrics). + +### Amazon Web Services Configs {#Amazon-Web-Services-Configs} + +Amazon Web Services configurations to access resources like Amazon DynamoDB (for locks), Amazon CloudWatch (metrics). + +`Config Class`: org.apache.hudi.config.HoodieAWSConfig<br></br> +> #### hoodie.aws.session.token +> AWS session token<br></br> +> **Default Value**: N/A (Required)<br></br> +> `Config Param: AWS_SESSION_TOKEN`<br></br> +> `Since Version: 0.10.0`<br></br> + +--- + +> #### hoodie.aws.access.key +> AWS access key id<br></br> +> **Default Value**: N/A (Required)<br></br> +> `Config Param: AWS_ACCESS_KEY`<br></br> +> `Since Version: 0.10.0`<br></br> + +--- + +> #### hoodie.aws.secret.key +> AWS secret key<br></br> +> **Default Value**: N/A (Required)<br></br> +> `Config Param: AWS_SECRET_KEY`<br></br> +> `Since Version: 0.10.0`<br></br> + +---
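As a usage note on the write configs documented above: they are typically supplied as Spark datasource options, using the literal config keys as option names. The sketch below is illustrative only and is not part of this commit; it assumes the Hudi Spark bundle is on the classpath, an input DataFrame with `uuid`, `ts` and `partition_col` columns, and example table/path names, and it adds the standard datasource record-key and precombine keys so that the default upsert operation is complete.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object HudiWriteConfigExample {
  def main(args: Array[String]): Unit = {
    // Assumes the Hudi Spark bundle is on the classpath.
    val spark = SparkSession.builder()
      .appName("hudi-write-config-example")
      .getOrCreate()

    // Illustrative source data; replace with your own DataFrame.
    val df = spark.read.json("/tmp/source_data")

    df.write.format("hudi")
      // Keys documented on this page; the values here are examples only.
      .option("hoodie.table.name", "example_table")
      .option("hoodie.datasource.write.recordkey.field", "uuid")      // standard datasource key
      .option("hoodie.datasource.write.precombine.field", "ts")       // standard datasource key
      .option("hoodie.datasource.write.partitionpath.field", "partition_col")
      .option("hoodie.upsert.shuffle.parallelism", "200")             // matches the new default above
      .option("hoodie.rollback.using.markers", "true")                // marker-based rollbacks, now default
      .option("hoodie.clean.async", "true")                           // run cleaning async with writing
      .mode(SaveMode.Append)
      .save("/tmp/hudi/example_table")

    spark.stop()
  }
}
```

In general, any other write client config listed on this page can be passed the same way, using its full key name as the option name.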