jonvex commented on code in PR #10612: URL: https://github.com/apache/hudi/pull/10612#discussion_r1496076240
##########
website/docs/configurations.md:
##########
@@ -127,59 +127,59 @@ Options useful for writing tables via
`write.format.option(...)`
[**Advanced Configs**](#Write-Options-advanced-configs)
-| Config Name | Default | Description |
-| ----------- | ------- | ----------- |
-| [hoodie.datasource.hive_sync.serde_properties](#hoodiedatasourcehive_syncserde_properties) | (N/A) | Serde properties to hive table.<br />`Config Param: HIVE_TABLE_SERDE_PROPERTIES` |
-| [hoodie.datasource.hive_sync.table_properties](#hoodiedatasourcehive_synctable_properties) | (N/A) | Additional properties to store with table.<br />`Config Param: HIVE_TABLE_PROPERTIES` |
-| [hoodie.datasource.overwrite.mode](#hoodiedatasourceoverwritemode) | (N/A) | Controls whether overwrite use dynamic or static mode, if not configured, respect spark.sql.sources.partitionOverwriteMode<br />`Config Param: OVERWRITE_MODE`<br />`Since Version: 0.14.0` |
-| [hoodie.datasource.write.partitions.to.delete](#hoodiedatasourcewritepartitionstodelete) | (N/A) | Comma separated list of partitions to delete. Allows use of wildcard *<br />`Config Param: PARTITIONS_TO_DELETE` |
-| [hoodie.datasource.write.table.name](#hoodiedatasourcewritetablename) | (N/A) | Table name for the datasource write. Also used to register the table into meta stores.<br />`Config Param: TABLE_NAME` |
-| [hoodie.datasource.compaction.async.enable](#hoodiedatasourcecompactionasyncenable) | true | Controls whether async compaction should be turned on for MOR table writing.<br />`Config Param: ASYNC_COMPACT_ENABLE` |
-| [hoodie.datasource.hive_sync.assume_date_partitioning](#hoodiedatasourcehive_syncassume_date_partitioning) | false | Assume partitioning is yyyy/MM/dd<br />`Config Param: HIVE_ASSUME_DATE_PARTITION` |
-| [hoodie.datasource.hive_sync.auto_create_database](#hoodiedatasourcehive_syncauto_create_database) | true | Auto create hive database if does not exists<br />`Config Param: HIVE_AUTO_CREATE_DATABASE` |
-| [hoodie.datasource.hive_sync.base_file_format](#hoodiedatasourcehive_syncbase_file_format) | PARQUET | Base file format for the sync.<br />`Config Param: HIVE_BASE_FILE_FORMAT` |
-| [hoodie.datasource.hive_sync.batch_num](#hoodiedatasourcehive_syncbatch_num) | 1000 | The number of partitions one batch when synchronous partitions to hive.<br />`Config Param: HIVE_BATCH_SYNC_PARTITION_NUM` |
-| [hoodie.datasource.hive_sync.bucket_sync](#hoodiedatasourcehive_syncbucket_sync) | false | Whether sync hive metastore bucket specification when using bucket index.The specification is 'CLUSTERED BY (trace_id) SORTED BY (trace_id ASC) INTO 65536 BUCKETS'<br />`Config Param: HIVE_SYNC_BUCKET_SYNC` |
-| [hoodie.datasource.hive_sync.create_managed_table](#hoodiedatasourcehive_synccreate_managed_table) | false | Whether to sync the table as managed table.<br />`Config Param: HIVE_CREATE_MANAGED_TABLE` |
-| [hoodie.datasource.hive_sync.database](#hoodiedatasourcehive_syncdatabase) | default | The name of the destination database that we should sync the hudi table to.<br />`Config Param: HIVE_DATABASE` |
-| [hoodie.datasource.hive_sync.ignore_exceptions](#hoodiedatasourcehive_syncignore_exceptions) | false | Ignore exceptions when syncing with Hive.<br />`Config Param: HIVE_IGNORE_EXCEPTIONS` |
-| [hoodie.datasource.hive_sync.partition_extractor_class](#hoodiedatasourcehive_syncpartition_extractor_class) | org.apache.hudi.hive.MultiPartKeysValueExtractor | Class which implements PartitionValueExtractor to extract the partition values, default 'org.apache.hudi.hive.MultiPartKeysValueExtractor'.<br />`Config Param: HIVE_PARTITION_EXTRACTOR_CLASS` |
-| [hoodie.datasource.hive_sync.partition_fields](#hoodiedatasourcehive_syncpartition_fields) |  | Field in the table to use for determining hive partition columns.<br />`Config Param: HIVE_PARTITION_FIELDS` |
-| [hoodie.datasource.hive_sync.password](#hoodiedatasourcehive_syncpassword) | hive | hive password to use<br />`Config Param: HIVE_PASS` |
-| [hoodie.datasource.hive_sync.skip_ro_suffix](#hoodiedatasourcehive_syncskip_ro_suffix) | false | Skip the _ro suffix for Read optimized table, when registering<br />`Config Param: HIVE_SKIP_RO_SUFFIX_FOR_READ_OPTIMIZED_TABLE` |
-| [hoodie.datasource.hive_sync.support_timestamp](#hoodiedatasourcehive_syncsupport_timestamp) | false | ‘INT64’ with original type TIMESTAMP_MICROS is converted to hive ‘timestamp’ type. Disabled by default for backward compatibility.<br />`Config Param: HIVE_SUPPORT_TIMESTAMP_TYPE` |
-| [hoodie.datasource.hive_sync.sync_as_datasource](#hoodiedatasourcehive_syncsync_as_datasource) | true | <br />`Config Param: HIVE_SYNC_AS_DATA_SOURCE_TABLE` |
-| [hoodie.datasource.hive_sync.sync_comment](#hoodiedatasourcehive_syncsync_comment) | false | Whether to sync the table column comments while syncing the table.<br />`Config Param: HIVE_SYNC_COMMENT` |
-| [hoodie.datasource.hive_sync.table](#hoodiedatasourcehive_synctable) | unknown | The name of the destination table that we should sync the hudi table to.<br />`Config Param: HIVE_TABLE` |
-| [hoodie.datasource.hive_sync.use_jdbc](#hoodiedatasourcehive_syncuse_jdbc) | true | Use JDBC when hive synchronization is enabled<br />`Config Param: HIVE_USE_JDBC` |
-| [hoodie.datasource.hive_sync.use_pre_apache_input_format](#hoodiedatasourcehive_syncuse_pre_apache_input_format) | false | Flag to choose InputFormat under com.uber.hoodie package instead of org.apache.hudi package. Use this when you are in the process of migrating from com.uber.hoodie to org.apache.hudi. Stop using this after you migrated the table definition to org.apache.hudi input format<br />`Config Param: HIVE_USE_PRE_APACHE_INPUT_FORMAT` |
-| [hoodie.datasource.hive_sync.username](#hoodiedatasourcehive_syncusername) | hive | hive user name to use<br />`Config Param: HIVE_USER` |
-| [hoodie.datasource.insert.dup.policy](#hoodiedatasourceinsertduppolicy) | none | When operation type is set to "insert", users can optionally enforce a dedup policy. This policy will be employed when records being ingested already exists in storage. Default policy is none and no action will be taken. Another option is to choose "drop", on which matching records from incoming will be dropped and the rest will be ingested. Third option is "fail" which will fail the write operation when same records are re-ingested. In other words, a given record as deduced by the key generation policy can be ingested only once to the target table of interest.<br />`Config Param: INSERT_DUP_POLICY`<br />`Since Version: 0.14.0` |
-| [hoodie.datasource.meta_sync.condition.sync](#hoodiedatasourcemeta_syncconditionsync) | false | If true, only sync on conditions like schema change or partition change.<br />`Config Param: HIVE_CONDITIONAL_SYNC` |
-| [hoodie.datasource.write.commitmeta.key.prefix](#hoodiedatasourcewritecommitmetakeyprefix) | _ | Option keys beginning with this prefix, are automatically added to the commit/deltacommit metadata. This is useful to store checkpointing information, in a consistent way with the hudi timeline<br />`Config Param: COMMIT_METADATA_KEYPREFIX` |
-| [hoodie.datasource.write.drop.partition.columns](#hoodiedatasourcewritedroppartitioncolumns) | false | When set to true, will not write the partition columns into hudi. By default, false.<br />`Config Param: DROP_PARTITION_COLUMNS` |
-| [hoodie.datasource.write.insert.drop.duplicates](#hoodiedatasourcewriteinsertdropduplicates) | false | If set to true, records from the incoming dataframe will not overwrite existing records with the same key during the write operation. This config is deprecated as of 0.14.0. Please use hoodie.datasource.insert.dup.policy instead.<br />`Config Param: INSERT_DROP_DUPS` |
-| [hoodie.datasource.write.keygenerator.class](#hoodiedatasourcewritekeygeneratorclass) | org.apache.hudi.keygen.SimpleKeyGenerator | Key generator class, that implements `org.apache.hudi.keygen.KeyGenerator`<br />`Config Param: KEYGENERATOR_CLASS_NAME` |
-| [hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled](#hoodiedatasourcewritekeygeneratorconsistentlogicaltimestampenabled) | false | When set to true, consistent value will be generated for a logical timestamp type column, like timestamp-millis and timestamp-micros, irrespective of whether row-writer is enabled. Disabled by default so as not to break the pipeline that deploy either fully row-writer path or non row-writer path. For example, if it is kept disabled then record key of timestamp type with value `2016-12-29 09:54:00` will be written as timestamp `2016-12-29 09:54:00.0` in row-writer path, while it will be written as long value `1483023240000000` in non row-writer path. If enabled, then the timestamp value will be written in both the cases.<br />`Config Param: KEYGENERATOR_CONSISTENT_LOGICAL_TIMESTAMP_ENABLED` |
-| [hoodie.datasource.write.new.columns.nullable](#hoodiedatasourcewritenewcolumnsnullable) | false | When a non-nullable column is added to datasource during a write operation, the write operation will fail schema compatibility check. Set this option to true will make the newly added column nullable to successfully complete the write operation.<br />`Config Param: MAKE_NEW_COLUMNS_NULLABLE`<br />`Since Version: 0.14.0` |
-| [hoodie.datasource.write.partitionpath.urlencode](#hoodiedatasourcewritepartitionpathurlencode) | false | Should we url encode the partition path value, before creating the folder structure.<br />`Config Param: URL_ENCODE_PARTITIONING` |
-| [hoodie.datasource.write.payload.class](#hoodiedatasourcewritepayloadclass) | org.apache.hudi.common.model.OverwriteWithLatestAvroPayload | Payload class used. Override this, if you like to roll your own merge logic, when upserting/inserting. This will render any value set for PRECOMBINE_FIELD_OPT_VAL in-effective<br />`Config Param: PAYLOAD_CLASS_NAME` |
-| [hoodie.datasource.write.reconcile.schema](#hoodiedatasourcewritereconcileschema) | false | This config controls how writer's schema will be selected based on the incoming batch's schema as well as existing table's one. When schema reconciliation is DISABLED, incoming batch's schema will be picked as a writer-schema (therefore updating table's schema). When schema reconciliation is ENABLED, writer-schema will be picked such that table's schema (after txn) is either kept the same or extended, meaning that we'll always prefer the schema that either adds new columns or stays the same. This enables us, to always extend the table's schema during evolution and never lose the data (when, for ex, existing column is being dropped in a new batch)<br />`Config Param: RECONCILE_SCHEMA` |
-| [hoodie.datasource.write.record.merger.impls](#hoodiedatasourcewriterecordmergerimpls) | org.apache.hudi.common.model.HoodieAvroRecordMerger | List of HoodieMerger implementations constituting Hudi's merging strategy -- based on the engine used. These merger impls will filter by hoodie.datasource.write.record.merger.strategy Hudi will pick most efficient implementation to perform merging/combining of the records (during update, reading MOR table, etc)<br />`Config Param: RECORD_MERGER_IMPLS`<br />`Since Version: 0.13.0` |
-| [hoodie.datasource.write.record.merger.strategy](#hoodiedatasourcewriterecordmergerstrategy) | eeb8d96f-b1e4-49fd-bbf8-28ac514178e5 | Id of merger strategy. Hudi will pick HoodieRecordMerger implementations in hoodie.datasource.write.record.merger.impls which has the same merger strategy id<br />`Config Param: RECORD_MERGER_STRATEGY`<br />`Since Version: 0.13.0` |
-| [hoodie.datasource.write.row.writer.enable](#hoodiedatasourcewriterowwriterenable) | true | When set to true, will perform write operations directly using the spark native `Row` representation, avoiding any additional conversion costs.<br />`Config Param: ENABLE_ROW_WRITER` |
-| [hoodie.datasource.write.streaming.checkpoint.identifier](#hoodiedatasourcewritestreamingcheckpointidentifier) | default_single_writer | A stream identifier used for HUDI to fetch the right checkpoint(`batch id` to be more specific) corresponding this writer. Please note that keep the identifier an unique value for different writer if under multi-writer scenario. If the value is not set, will only keep the checkpoint info in the memory. This could introduce the potential issue that the job is restart(`batch id` is lost) while spark checkpoint write fails, causing spark will retry and rewrite the data.<br />`Config Param: STREAMING_CHECKPOINT_IDENTIFIER`<br />`Since Version: 0.13.0` |
-| [hoodie.datasource.write.streaming.disable.compaction](#hoodiedatasourcewritestreamingdisablecompaction) | false | By default for MOR table, async compaction is enabled with spark streaming sink. By setting this config to true, we can disable it and the expectation is that, users will schedule and execute compaction in a different process/job altogether. Some users may wish to run it separately to manage resources across table services and regular ingestion pipeline and so this could be preferred on such cases.<br />`Config Param: STREAMING_DISABLE_COMPACTION`<br />`Since Version: 0.14.0` |
-| [hoodie.datasource.write.streaming.ignore.failed.batch](#hoodiedatasourcewritestreamingignorefailedbatch) | false | Config to indicate whether to ignore any non exception error (e.g. writestatus error) within a streaming microbatch. Turning this on, could hide the write status errors while the spark checkpoint moves ahead.So, would recommend users to use this with caution.<br />`Config Param: STREAMING_IGNORE_FAILED_BATCH` |
-| [hoodie.datasource.write.streaming.retry.count](#hoodiedatasourcewritestreamingretrycount) | 3 | Config to indicate how many times streaming job should retry for a failed micro batch.<br />`Config Param: STREAMING_RETRY_CNT` |
-| [hoodie.datasource.write.streaming.retry.interval.ms](#hoodiedatasourcewritestreamingretryintervalms) | 2000 | Config to indicate how long (by millisecond) before a retry should issued for failed microbatch<br />`Config Param: STREAMING_RETRY_INTERVAL_MS` |
-| [hoodie.meta.sync.client.tool.class](#hoodiemetasyncclienttoolclass) | org.apache.hudi.hive.HiveSyncTool | Sync tool class name used to sync to metastore. Defaults to Hive.<br />`Config Param: META_SYNC_CLIENT_TOOL_CLASS_NAME` |
+| Config Name | Default | Description |
Review Comment:
configs have been updated so I reverted my changes from the configs pages
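For context, the options in the quoted table are the kind passed through `write.format("hudi").option(...)` / `.options(...)` as described at the top of the hunk. A minimal sketch of assembling a few of them in PySpark style; the table name, policy value, and base path are hypothetical, and the actual write call is shown only as a comment since it needs a running SparkSession:

```python
# Sketch: building a Hudi write-options map from entries in the table above.
# All values are illustrative, not recommendations; "trips" and the base path
# are hypothetical. Spark datasource option values are passed as strings.
hudi_options = {
    "hoodie.datasource.write.table.name": "trips",         # hypothetical table name
    "hoodie.datasource.hive_sync.database": "default",     # table's documented default
    "hoodie.datasource.insert.dup.policy": "drop",         # one of none / drop / fail
    "hoodie.datasource.write.streaming.retry.count": "3",  # table's documented default
}

# With a SparkSession this would be applied roughly as:
#   df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
for key, value in hudi_options.items():
    assert key.startswith("hoodie.datasource.") and isinstance(value, str)
```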
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
