jonvex commented on code in PR #10612: URL: https://github.com/apache/hudi/pull/10612#discussion_r1496076240
##########
website/docs/configurations.md:
##########
@@ -127,59 +127,59 @@ Options useful for writing tables via
`write.format.option(...)`
[**Advanced Configs**](#Write-Options-advanced-configs)
-| Config Name | Default | Description |
-| ----------- | ------- | ----------- |
-| [hoodie.datasource.hive_sync.serde_properties](#hoodiedatasourcehive_syncserde_properties) | (N/A) | Serde properties to hive table.<br />`Config Param: HIVE_TABLE_SERDE_PROPERTIES` |
-| [hoodie.datasource.hive_sync.table_properties](#hoodiedatasourcehive_synctable_properties) | (N/A) | Additional properties to store with table.<br />`Config Param: HIVE_TABLE_PROPERTIES` |
-| [hoodie.datasource.overwrite.mode](#hoodiedatasourceoverwritemode) | (N/A) | Controls whether overwrite use dynamic or static mode, if not configured, respect spark.sql.sources.partitionOverwriteMode<br />`Config Param: OVERWRITE_MODE`<br />`Since Version: 0.14.0` |
-| [hoodie.datasource.write.partitions.to.delete](#hoodiedatasourcewritepartitionstodelete) | (N/A) | Comma separated list of partitions to delete. Allows use of wildcard *<br />`Config Param: PARTITIONS_TO_DELETE` |
-| [hoodie.datasource.write.table.name](#hoodiedatasourcewritetablename) | (N/A) | Table name for the datasource write. Also used to register the table into meta stores.<br />`Config Param: TABLE_NAME` |
-| [hoodie.datasource.compaction.async.enable](#hoodiedatasourcecompactionasyncenable) | true | Controls whether async compaction should be turned on for MOR table writing.<br />`Config Param: ASYNC_COMPACT_ENABLE` |
-| [hoodie.datasource.hive_sync.assume_date_partitioning](#hoodiedatasourcehive_syncassume_date_partitioning) | false | Assume partitioning is yyyy/MM/dd<br />`Config Param: HIVE_ASSUME_DATE_PARTITION` |
-| [hoodie.datasource.hive_sync.auto_create_database](#hoodiedatasourcehive_syncauto_create_database) | true | Auto create hive database if does not exists<br />`Config Param: HIVE_AUTO_CREATE_DATABASE` |
-| [hoodie.datasource.hive_sync.base_file_format](#hoodiedatasourcehive_syncbase_file_format) | PARQUET | Base file format for the sync.<br />`Config Param: HIVE_BASE_FILE_FORMAT` |
-| [hoodie.datasource.hive_sync.batch_num](#hoodiedatasourcehive_syncbatch_num) | 1000 | The number of partitions one batch when synchronous partitions to hive.<br />`Config Param: HIVE_BATCH_SYNC_PARTITION_NUM` |
-| [hoodie.datasource.hive_sync.bucket_sync](#hoodiedatasourcehive_syncbucket_sync) | false | Whether sync hive metastore bucket specification when using bucket index.The specification is 'CLUSTERED BY (trace_id) SORTED BY (trace_id ASC) INTO 65536 BUCKETS'<br />`Config Param: HIVE_SYNC_BUCKET_SYNC` |
-| [hoodie.datasource.hive_sync.create_managed_table](#hoodiedatasourcehive_synccreate_managed_table) | false | Whether to sync the table as managed table.<br />`Config Param: HIVE_CREATE_MANAGED_TABLE` |
-| [hoodie.datasource.hive_sync.database](#hoodiedatasourcehive_syncdatabase) | default | The name of the destination database that we should sync the hudi table to.<br />`Config Param: HIVE_DATABASE` |
-| [hoodie.datasource.hive_sync.ignore_exceptions](#hoodiedatasourcehive_syncignore_exceptions) | false | Ignore exceptions when syncing with Hive.<br />`Config Param: HIVE_IGNORE_EXCEPTIONS` |
-| [hoodie.datasource.hive_sync.partition_extractor_class](#hoodiedatasourcehive_syncpartition_extractor_class) | org.apache.hudi.hive.MultiPartKeysValueExtractor | Class which implements PartitionValueExtractor to extract the partition values, default 'org.apache.hudi.hive.MultiPartKeysValueExtractor'.<br />`Config Param: HIVE_PARTITION_EXTRACTOR_CLASS` |
-| [hoodie.datasource.hive_sync.partition_fields](#hoodiedatasourcehive_syncpartition_fields) |  | Field in the table to use for determining hive partition columns.<br />`Config Param: HIVE_PARTITION_FIELDS` |
-| [hoodie.datasource.hive_sync.password](#hoodiedatasourcehive_syncpassword) | hive | hive password to use<br />`Config Param: HIVE_PASS` |
-| [hoodie.datasource.hive_sync.skip_ro_suffix](#hoodiedatasourcehive_syncskip_ro_suffix) | false | Skip the _ro suffix for Read optimized table, when registering<br />`Config Param: HIVE_SKIP_RO_SUFFIX_FOR_READ_OPTIMIZED_TABLE` |
-| [hoodie.datasource.hive_sync.support_timestamp](#hoodiedatasourcehive_syncsupport_timestamp) | false | ‘INT64’ with original type TIMESTAMP_MICROS is converted to hive ‘timestamp’ type. Disabled by default for backward compatibility.<br />`Config Param: HIVE_SUPPORT_TIMESTAMP_TYPE` |
-| [hoodie.datasource.hive_sync.sync_as_datasource](#hoodiedatasourcehive_syncsync_as_datasource) | true | <br />`Config Param: HIVE_SYNC_AS_DATA_SOURCE_TABLE` |
-| [hoodie.datasource.hive_sync.sync_comment](#hoodiedatasourcehive_syncsync_comment) | false | Whether to sync the table column comments while syncing the table.<br />`Config Param: HIVE_SYNC_COMMENT` |
-| [hoodie.datasource.hive_sync.table](#hoodiedatasourcehive_synctable) | unknown | The name of the destination table that we should sync the hudi table to.<br />`Config Param: HIVE_TABLE` |
-| [hoodie.datasource.hive_sync.use_jdbc](#hoodiedatasourcehive_syncuse_jdbc) | true | Use JDBC when hive synchronization is enabled<br />`Config Param: HIVE_USE_JDBC` |
-| [hoodie.datasource.hive_sync.use_pre_apache_input_format](#hoodiedatasourcehive_syncuse_pre_apache_input_format) | false | Flag to choose InputFormat under com.uber.hoodie package instead of org.apache.hudi package. Use this when you are in the process of migrating from com.uber.hoodie to org.apache.hudi. Stop using this after you migrated the table definition to org.apache.hudi input format<br />`Config Param: HIVE_USE_PRE_APACHE_INPUT_FORMAT` |
-| [hoodie.datasource.hive_sync.username](#hoodiedatasourcehive_syncusername) | hive | hive user name to use<br />`Config Param: HIVE_USER` |
-| [hoodie.datasource.insert.dup.policy](#hoodiedatasourceinsertduppolicy) | none | When operation type is set to "insert", users can optionally enforce a dedup policy. This policy will be employed when records being ingested already exists in storage. Default policy is none and no action will be taken. Another option is to choose "drop", on which matching records from incoming will be dropped and the rest will be ingested. Third option is "fail" which will fail the write operation when same records are re-ingested. In other words, a given record as deduced by the key generation policy can be ingested only once to the target table of interest.<br />`Config Param: INSERT_DUP_POLICY`<br />`Since Version: 0.14.0` |
-| [hoodie.datasource.meta_sync.condition.sync](#hoodiedatasourcemeta_syncconditionsync) | false | If true, only sync on conditions like schema change or partition change.<br />`Config Param: HIVE_CONDITIONAL_SYNC` |
-| [hoodie.datasource.write.commitmeta.key.prefix](#hoodiedatasourcewritecommitmetakeyprefix) | _ | Option keys beginning with this prefix, are automatically added to the commit/deltacommit metadata. This is useful to store checkpointing information, in a consistent way with the hudi timeline<br />`Config Param: COMMIT_METADATA_KEYPREFIX` |
-| [hoodie.datasource.write.drop.partition.columns](#hoodiedatasourcewritedroppartitioncolumns) | false | When set to true, will not write the partition columns into hudi. By default, false.<br />`Config Param: DROP_PARTITION_COLUMNS` |
-| [hoodie.datasource.write.insert.drop.duplicates](#hoodiedatasourcewriteinsertdropduplicates) | false | If set to true, records from the incoming dataframe will not overwrite existing records with the same key during the write operation. This config is deprecated as of 0.14.0. Please use hoodie.datasource.insert.dup.policy instead.<br />`Config Param: INSERT_DROP_DUPS` |
-| [hoodie.datasource.write.keygenerator.class](#hoodiedatasourcewritekeygeneratorclass) | org.apache.hudi.keygen.SimpleKeyGenerator | Key generator class, that implements `org.apache.hudi.keygen.KeyGenerator`<br />`Config Param: KEYGENERATOR_CLASS_NAME` |
-| [hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled](#hoodiedatasourcewritekeygeneratorconsistentlogicaltimestampenabled) | false | When set to true, consistent value will be generated for a logical timestamp type column, like timestamp-millis and timestamp-micros, irrespective of whether row-writer is enabled. Disabled by default so as not to break the pipeline that deploy either fully row-writer path or non row-writer path. For example, if it is kept disabled then record key of timestamp type with value `2016-12-29 09:54:00` will be written as timestamp `2016-12-29 09:54:00.0` in row-writer path, while it will be written as long value `1483023240000000` in non row-writer path. If enabled, then the timestamp value will be written in both the cases.<br />`Config Param: KEYGENERATOR_CONSISTENT_LOGICAL_TIMESTAMP_ENABLED` |
-| [hoodie.datasource.write.new.columns.nullable](#hoodiedatasourcewritenewcolumnsnullable) | false | When a non-nullable column is added to datasource during a write operation, the write operation will fail schema compatibility check. Set this option to true will make the newly added column nullable to successfully complete the write operation.<br />`Config Param: MAKE_NEW_COLUMNS_NULLABLE`<br />`Since Version: 0.14.0` |
-| [hoodie.datasource.write.partitionpath.urlencode](#hoodiedatasourcewritepartitionpathurlencode) | false | Should we url encode the partition path value, before creating the folder structure.<br />`Config Param: URL_ENCODE_PARTITIONING` |
-| [hoodie.datasource.write.payload.class](#hoodiedatasourcewritepayloadclass) | org.apache.hudi.common.model.OverwriteWithLatestAvroPayload | Payload class used. Override this, if you like to roll your own merge logic, when upserting/inserting. This will render any value set for PRECOMBINE_FIELD_OPT_VAL in-effective<br />`Config Param: PAYLOAD_CLASS_NAME` |
-| [hoodie.datasource.write.reconcile.schema](#hoodiedatasourcewritereconcileschema) | false | This config controls how writer's schema will be selected based on the incoming batch's schema as well as existing table's one. When schema reconciliation is DISABLED, incoming batch's schema will be picked as a writer-schema (therefore updating table's schema). When schema reconciliation is ENABLED, writer-schema will be picked such that table's schema (after txn) is either kept the same or extended, meaning that we'll always prefer the schema that either adds new columns or stays the same. This enables us, to always extend the table's schema during evolution and never lose the data (when, for ex, existing column is being dropped in a new batch)<br />`Config Param: RECONCILE_SCHEMA` |
-| [hoodie.datasource.write.record.merger.impls](#hoodiedatasourcewriterecordmergerimpls) | org.apache.hudi.common.model.HoodieAvroRecordMerger | List of HoodieMerger implementations constituting Hudi's merging strategy -- based on the engine used. These merger impls will filter by hoodie.datasource.write.record.merger.strategy Hudi will pick most efficient implementation to perform merging/combining of the records (during update, reading MOR table, etc)<br />`Config Param: RECORD_MERGER_IMPLS`<br />`Since Version: 0.13.0` |
-| [hoodie.datasource.write.record.merger.strategy](#hoodiedatasourcewriterecordmergerstrategy) | eeb8d96f-b1e4-49fd-bbf8-28ac514178e5 | Id of merger strategy. Hudi will pick HoodieRecordMerger implementations in hoodie.datasource.write.record.merger.impls which has the same merger strategy id<br />`Config Param: RECORD_MERGER_STRATEGY`<br />`Since Version: 0.13.0` |
-| [hoodie.datasource.write.row.writer.enable](#hoodiedatasourcewriterowwriterenable) | true | When set to true, will perform write operations directly using the spark native `Row` representation, avoiding any additional conversion costs.<br />`Config Param: ENABLE_ROW_WRITER` |
-| [hoodie.datasource.write.streaming.checkpoint.identifier](#hoodiedatasourcewritestreamingcheckpointidentifier) | default_single_writer | A stream identifier used for HUDI to fetch the right checkpoint(`batch id` to be more specific) corresponding this writer. Please note that keep the identifier an unique value for different writer if under multi-writer scenario. If the value is not set, will only keep the checkpoint info in the memory. This could introduce the potential issue that the job is restart(`batch id` is lost) while spark checkpoint write fails, causing spark will retry and rewrite the data.<br />`Config Param: STREAMING_CHECKPOINT_IDENTIFIER`<br />`Since Version: 0.13.0` |
-| [hoodie.datasource.write.streaming.disable.compaction](#hoodiedatasourcewritestreamingdisablecompaction) | false | By default for MOR table, async compaction is enabled with spark streaming sink. By setting this config to true, we can disable it and the expectation is that, users will schedule and execute compaction in a different process/job altogether. Some users may wish to run it separately to manage resources across table services and regular ingestion pipeline and so this could be preferred on such cases.<br />`Config Param: STREAMING_DISABLE_COMPACTION`<br />`Since Version: 0.14.0` |
-| [hoodie.datasource.write.streaming.ignore.failed.batch](#hoodiedatasourcewritestreamingignorefailedbatch) | false | Config to indicate whether to ignore any non exception error (e.g. writestatus error) within a streaming microbatch. Turning this on, could hide the write status errors while the spark checkpoint moves ahead.So, would recommend users to use this with caution.<br />`Config Param: STREAMING_IGNORE_FAILED_BATCH` |
-| [hoodie.datasource.write.streaming.retry.count](#hoodiedatasourcewritestreamingretrycount) | 3 | Config to indicate how many times streaming job should retry for a failed micro batch.<br />`Config Param: STREAMING_RETRY_CNT` |
-| [hoodie.datasource.write.streaming.retry.interval.ms](#hoodiedatasourcewritestreamingretryintervalms) | 2000 | Config to indicate how long (by millisecond) before a retry should issued for failed microbatch<br />`Config Param: STREAMING_RETRY_INTERVAL_MS` |
-| [hoodie.meta.sync.client.tool.class](#hoodiemetasyncclienttoolclass) | org.apache.hudi.hive.HiveSyncTool | Sync tool class name used to sync to metastore. Defaults to Hive.<br />`Config Param: META_SYNC_CLIENT_TOOL_CLASS_NAME` |
+| Config Name | Default | Description |
Review Comment:
configs have been updated so I reverted my changes from the configs pages
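For context, the options in the quoted table are the kind passed through `write.format("hudi").option(...)` / `.options(...)` as described at the top of the hunk. A minimal sketch of assembling a few of them in PySpark style; the table name, policy value, and base path are hypothetical, and the actual write call is shown only as a comment since it needs a running SparkSession:

```python
# Sketch: building a Hudi write-options map from entries in the table above.
# All values are illustrative, not recommendations; "trips" and the base path
# are hypothetical. Spark datasource option values are passed as strings.
hudi_options = {
    "hoodie.datasource.write.table.name": "trips",         # hypothetical table name
    "hoodie.datasource.hive_sync.database": "default",     # table's documented default
    "hoodie.datasource.insert.dup.policy": "drop",         # one of none / drop / fail
    "hoodie.datasource.write.streaming.retry.count": "3",  # table's documented default
}

# With a SparkSession this would be applied roughly as:
#   df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
for key, value in hudi_options.items():
    assert key.startswith("hoodie.datasource.") and isinstance(value, str)
```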
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
