[GitHub] [hudi] pratyakshsharma commented on a diff in pull request #4927: [HUDI-3524] - Docs for basic config page

GitBox Sat, 07 May 2022 04:04:29 -0700


pratyakshsharma commented on code in PR #4927:
URL: https://github.com/apache/hudi/pull/4927#discussion_r867339082



##########
website/docs/basic_configurations.md:
##########
@@ -0,0 +1,750 @@
+---
+title: Basic Configurations
+toc: true
+---
+
+This page covers the basic configurations you may use to write/read Hudi 
tables. This page only features a subset of the
+most frequently used configurations. For a full list of all configs, please 
visit the [All Configurations](/docs/configurations) page.
+
+- [**Spark Datasource Configs**](#SPARK_DATASOURCE): These configs control the 
Hudi Spark Datasource, providing ability to define keys/partitioning, pick out 
the write operation, specify how to merge records or choosing query type to 
read.
+- [**Flink Sql Configs**](#FLINK_SQL): These configs control the Hudi Flink 
SQL source/sink connectors, providing ability to define record keys, pick out 
the write operation, specify how to merge records, enable/disable asynchronous 
compaction or choosing query type to read.
+- [**Write Client Configs**](#WRITE_CLIENT): Internally, the Hudi datasource 
uses a RDD based HoodieWriteClient API to actually perform writes to storage. 
These configs provide deep control over lower level aspects like file sizing, 
compression, parallelism, compaction, write schema, cleaning etc. Although Hudi 
provides sane defaults, from time-time these configs may need to be tweaked to 
optimize for specific workloads.
+- [**Metrics Configs**](#METRICS): These set of configs are used to enable 
monitoring and reporting of keyHudi stats and metrics.
+- [**Record Payload Config**](#RECORD_PAYLOAD): This is the lowest level of 
customization offered by Hudi. Record payloads define how to produce new values 
to upsert based on incoming new record and stored old record. Hudi provides 
default implementations such as OverwriteWithLatestAvroPayload which simply 
update table with the latest/last-written record. This can be overridden to a 
custom class extending HoodieRecordPayload class, on both datasource and 
WriteClient levels.
+
+## Spark Datasource Configs {#SPARK_DATASOURCE}
+These configs control the Hudi Spark Datasource, providing ability to define 
keys/partitioning, pick out the write operation, specify how to merge records 
or choosing query type to read.
+
+### Read Options {#Read-Options}
+
+Options useful for reading tables via `read.format.option(...)`
+
+
+`Config Class`: org.apache.hudi.DataSourceOptions.scala<br></br>
+> #### hoodie.datasource.query.type
+> Whether data needs to be read, in incremental mode (new data since an 
instantTime) (or) Read Optimized mode (obtain latest view, based on base files) 
(or) Snapshot mode (obtain latest view, by merging base and (if any) log 
files)<br></br>
+> **Default Value**: snapshot (Optional)<br></br>
+> `Config Param: QUERY_TYPE`<br></br>
+
+---
+
+### Write Options {#Write-Options}
+
+You can pass down any of the WriteClient level configs directly using 
`options()` or `option(k,v)` methods.
+
+```java
+inputDF.write()
+.format("org.apache.hudi")
+.options(clientOpts) // any of the Hudi client opts can be passed in as well
+.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
+.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
+.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
+.option(HoodieWriteConfig.TABLE_NAME, tableName)
+.mode(SaveMode.Append)
+.save(basePath);
+```
+
+Options useful for writing tables via `write.format.option(...)`
+
+
+`Config Class`: org.apache.hudi.DataSourceOptions.scala<br></br>
+
+> #### hoodie.datasource.write.operation
+> Whether to do upsert, insert or bulkinsert for the write operation. Use 
bulkinsert to load new data into a table, and there on use upsert/insert. bulk 
insert uses a disk based write path to scale to load large inputs without need 
to cache it.<br></br>
+> **Default Value**: upsert (Optional)<br></br>
+> `Config Param: OPERATION`<br></br>
+
+---
+
+> #### hoodie.datasource.write.table.type
+> The table type for the underlying data, for this write. This can’t change 
between writes.<br></br>
+> **Default Value**: COPY_ON_WRITE (Optional)<br></br>
+> `Config Param: TABLE_TYPE`<br></br>
+
+---
+
+> #### hoodie.datasource.write.table.name
+> Table name for the datasource write. Also used to register the table into 
meta stores.<br></br>
+> **Default Value**: N/A (Required)<br></br>
+> `Config Param: TABLE_NAME`<br></br>
+
+---
+
+> #### hoodie.datasource.write.recordkey.field
+> Record key field. Value to be used as the `recordKey` component of 
`HoodieKey`.
+Actual value will be obtained by invoking .toString() on the field value. 
Nested fields can be specified using
+the dot notation eg: `a.b.c`<br></br>
+> **Default Value**: uuid (Optional)<br></br>
+> `Config Param: RECORDKEY_FIELD`<br></br>
+
+---
+
+> #### hoodie.datasource.write.partitionpath.field
+> Partition path field. Value to be used at the partitionPath component of 
HoodieKey. Actual value ontained by invoking .toString()<br></br>
+> **Default Value**: N/A (Required)<br></br>
+> `Config Param: PARTITIONPATH_FIELD`<br></br>
+
+---
+
+> #### hoodie.datasource.write.keygenerator.class
+> Key generator class, that implements 
`org.apache.hudi.keygen.KeyGenerator`<br></br>
+> **Default Value**: org.apache.hudi.keygen.SimpleKeyGenerator 
(Optional)<br></br>
+> `Config Param: KEYGENERATOR_CLASS_NAME`<br></br>
+
+---
+
+> #### hoodie.datasource.write.precombine.field
+> Field used in preCombining before actual write. When two records have the 
same key value, we will pick the one with the largest value for the precombine 
field, determined by Object.compareTo(..)<br></br>
+> **Default Value**: ts (Optional)<br></br>
+> `Config Param: PRECOMBINE_FIELD`<br></br>
+
+---
+
+> #### hoodie.datasource.write.payload.class
+> Payload class used. Override this, if you like to roll your own merge logic, 
when upserting/inserting. This will render any value set for 
PRECOMBINE_FIELD_OPT_VAL in-effective<br></br>
+> **Default Value**: 
org.apache.hudi.common.model.OverwriteWithLatestAvroPayload (Optional)<br></br>
+> `Config Param: PAYLOAD_CLASS_NAME`<br></br>
+
+---
+
+> #### hoodie.datasource.write.partitionpath.urlencode
+> Should we url encode the partition path value, before creating the folder 
structure.<br></br>
+> **Default Value**: false (Optional)<br></br>
+> `Config Param: URL_ENCODE_PARTITIONING`<br></br>
+
+---
+
+> #### hoodie.datasource.hive_sync.enable
+> When set to true, register/sync the table to Apache Hive metastore<br></br>
+> **Default Value**: false (Optional)<br></br>
+> `Config Param: HIVE_SYNC_ENABLED`<br></br>
+
+---
+
+> #### hoodie.datasource.hive_sync.mode
+> Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql.<br></br>
+> **Default Value**: N/A (Required)<br></br>
+> `Config Param: HIVE_SYNC_MODE`<br></br>
+
+---
+
+> #### hoodie.datasource.write.hive_style_partitioning
+> Flag to indicate whether to use Hive style partitioning.
+If set true, the names of partition folders follow 
<partition_column_name>=<partition_value> format.
+By default false (the names of partition folders are only partition 
values)<br></br>
+> **Default Value**: false (Optional)<br></br>
+> `Config Param: HIVE_STYLE_PARTITIONING`<br></br>
+
+---
+
+> #### hoodie.datasource.hive_sync.partition_fields
+> Field in the table to use for determining hive partition columns.<br></br>
+> **Default Value**:  (Optional)<br></br>
+> `Config Param: HIVE_PARTITION_FIELDS`<br></br>
+
+---
+
+> #### hoodie.datasource.hive_sync.partition_extractor_class
+> Class which implements PartitionValueExtractor to extract the partition 
values, default 'SlashEncodedDayPartitionValueExtractor'.<br></br>
+> **Default Value**: 
org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor (Optional)<br></br>
+> `Config Param: HIVE_PARTITION_EXTRACTOR_CLASS`<br></br>
+
+---
+
+> #### hoodie.datasource.hive_sync.partition_fields
+> Field in the table to use for determining hive partition columns.<br></br>
+> **Default Value**:  (Optional)<br></br>
+> `Config Param: HIVE_PARTITION_FIELDS`<br></br>
+
+---
+
+## Flink Sql Configs {#FLINK_SQL}
+These configs control the Hudi Flink SQL source/sink connectors, providing 
ability to define record keys, pick out the write operation, specify how to 
merge records, enable/disable asynchronous compaction or choosing query type to 
read.
+
+### Flink Options {#Flink-Options}
+
+> #### path
+> Base path for the target hoodie table.
+The path would be created if it does not exist,
+otherwise a Hoodie table expects to be initialized successfully<br></br>
+> **Default Value**: N/A (Required)<br></br>
+> `Config Param: PATH`<br></br>
+
+---
+
+> #### hoodie.table.name
+> Table name to register to Hive metastore<br></br>
+> **Default Value**: N/A (Required)<br></br>
+> `Config Param: TABLE_NAME`<br></br>
+
+---
+
+
+> #### table.type
+> Type of table to write. COPY_ON_WRITE (or) MERGE_ON_READ<br></br>
+> **Default Value**: COPY_ON_WRITE (Optional)<br></br>
+> `Config Param: TABLE_TYPE`<br></br>
+
+---
+
+> #### write.operation
+> The write operation, that this write should do<br></br>
+> **Default Value**: upsert (Optional)<br></br>
+> `Config Param: OPERATION`<br></br>
+
+---
+
+> #### write.tasks
+> Parallelism of tasks that do actual write, default is 4<br></br>
+> **Default Value**: 4 (Optional)<br></br>
+> `Config Param: WRITE_TASKS`<br></br>
+
+---
+
+> #### write.bucket_assign.tasks
+> Parallelism of tasks that do bucket assign, default is the parallelism of 
the execution environment<br></br>
+> **Default Value**: N/A (Required)<br></br>
+> `Config Param: BUCKET_ASSIGN_TASKS`<br></br>
+
+---
+
+> #### write.precombine
+> Flag to indicate whether to drop duplicates before insert/upsert.
+By default these cases will accept duplicates, to gain extra performance:
+1) insert operation;
+2) upsert for MOR table, the MOR table deduplicate on reading<br></br>
+> **Default Value**: false (Optional)<br></br>
+> `Config Param: PRE_COMBINE`<br></br>
+
+---
+
+> #### read.tasks
+> Parallelism of tasks that do actual read, default is 4<br></br>
+> **Default Value**: 4 (Optional)<br></br>
+> `Config Param: READ_TASKS`<br></br>
+
+---
+
+> #### read.start-commit
+> Start commit instant for reading, the commit time format should be 
'yyyyMMddHHmmss', by default reading from the latest instant for streaming 
read<br></br>
+> **Default Value**: N/A (Required)<br></br>
+> `Config Param: READ_START_COMMIT`<br></br>
+
+---
+
+> #### read.streaming.enabled
+> Whether to read as streaming source, default false<br></br>
+> **Default Value**: false (Optional)<br></br>
+> `Config Param: READ_AS_STREAMING`<br></br>
+
+---
+
+> #### compaction.tasks
+> Parallelism of tasks that do actual compaction, default is 4<br></br>
+> **Default Value**: 4 (Optional)<br></br>
+> `Config Param: COMPACTION_TASKS`<br></br>
+
+---
+
+> #### hoodie.datasource.write.hive_style_partitioning
+> Whether to use Hive style partitioning.
+If set true, the names of partition folders follow 
&lt;partition_column_name&gt;=&lt;partition_value&gt; format.
+By default false (the names of partition folders are only partition 
values)<br></br>
+> **Default Value**: false (Optional)<br></br>
+> `Config Param: HIVE_STYLE_PARTITIONING`<br></br>
+
+---
+
+> #### hive_sync.enable
+> Asynchronously sync Hive meta to HMS, default false<br></br>
+> **Default Value**: false (Optional)<br></br>
+> `Config Param: HIVE_SYNC_ENABLED`<br></br>
+
+---
+
+> #### hive_sync.mode
+> Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql, default 
'jdbc'<br></br>
+> **Default Value**: jdbc (Optional)<br></br>
+> `Config Param: HIVE_SYNC_MODE`<br></br>
+
+---
+
+>  #### hive_sync.table
+>  Table name for hive sync, default 'unknown'<br></br>

Review Comment:
   From end user perspective, I find this similar to `hoodie.table.name`. Can 
we point out the differences in a better way?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [hudi] pratyakshsharma commented on a diff in pull request #4927: [HUDI-3524] - Docs for basic config page

Reply via email to