pratyakshsharma commented on code in PR #4927:
URL: https://github.com/apache/hudi/pull/4927#discussion_r867338139
##########
website/docs/basic_configurations.md:
##########
@@ -0,0 +1,750 @@
+---
+title: Basic Configurations
+toc: true
+---
+
+This page covers the basic configurations you may use to write/read Hudi tables. It features only a subset of the most frequently used configurations. For a full list of all configs, please visit the [All Configurations](/docs/configurations) page.
+- [**Spark Datasource Configs**](#SPARK_DATASOURCE): These configs control the Hudi Spark Datasource, providing the ability to define keys/partitioning, pick out the write operation, specify how records are merged, or choose the query type to read.
+- [**Flink SQL Configs**](#FLINK_SQL): These configs control the Hudi Flink SQL source/sink connectors, providing the ability to define record keys, pick out the write operation, specify how records are merged, enable/disable asynchronous compaction, or choose the query type to read.
+- [**Write Client Configs**](#WRITE_CLIENT): Internally, the Hudi datasource uses an RDD-based HoodieWriteClient API to actually perform writes to storage. These configs provide deep control over lower-level aspects like file sizing, compression, parallelism, compaction, write schema, cleaning, etc. Although Hudi provides sane defaults, from time to time these configs may need to be tweaked to optimize for specific workloads.
+- [**Metrics Configs**](#METRICS): This set of configs is used to enable monitoring and reporting of key Hudi stats and metrics.
+- [**Record Payload Config**](#RECORD_PAYLOAD): This is the lowest level of customization offered by Hudi. Record payloads define how to produce new values to upsert, based on the incoming new record and the stored old record. Hudi provides default implementations such as OverwriteWithLatestAvroPayload, which simply updates the table with the latest/last-written record. This can be overridden with a custom class extending the HoodieRecordPayload class, at both the datasource and WriteClient levels (see the sketch after this list).
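+
+As a minimal sketch of the last point (the class name `com.example.MyPayload` is hypothetical, and the `hoodie.datasource.write.payload.class` key is assumed to match this Hudi version):
+
+```java
+// Hypothetical custom payload: com.example.MyPayload extends HoodieRecordPayload.
+// Pointing the writer at it overrides the default OverwriteWithLatestAvroPayload.
+inputDF.write()
+  .format("org.apache.hudi")
+  .option("hoodie.datasource.write.payload.class", "com.example.MyPayload")
+  .option(HoodieWriteConfig.TABLE_NAME, tableName)
+  .mode(SaveMode.Append)
+  .save(basePath);
+```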
+
+## Spark Datasource Configs {#SPARK_DATASOURCE}
+These configs control the Hudi Spark Datasource, providing the ability to define keys/partitioning, pick out the write operation, specify how records are merged, or choose the query type to read.
+
+### Read Options {#Read-Options}
+
+Options useful for reading tables via `read.format.option(...)`
+
+
+`Config Class`: org.apache.hudi.DataSourceOptions.scala<br></br>
+> #### hoodie.datasource.query.type
+> Whether data needs to be read in incremental mode (new data since an instantTime), Read Optimized mode (obtain the latest view, based on base files), or Snapshot mode (obtain the latest view, by merging base and (if any) log files)<br></br>
+> **Default Value**: snapshot (Optional)<br></br>
+> `Config Param: QUERY_TYPE`<br></br>
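+
+For example, a sketch of an incremental read (assuming the `DataSourceReadOptions.QUERY_TYPE_OPT_KEY()`/`QUERY_TYPE_INCREMENTAL_OPT_VAL()`/`BEGIN_INSTANTTIME_OPT_KEY()` accessors of this Hudi version; the instant time below is illustrative):
+
+```java
+// Read only records written after the given commit instant.
+Dataset<Row> incrementalDF = spark.read()
+  .format("org.apache.hudi")
+  .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY(), DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL())
+  .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY(), "20220101000000")
+  .load(basePath);
+```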
+
+---
+
+### Write Options {#Write-Options}
+
+You can pass down any of the WriteClient-level configs directly using the `options()` or `option(k,v)` methods.
+
+```java
+inputDF.write()
+  .format("org.apache.hudi")
+  .options(clientOpts) // any of the Hudi client opts can be passed in as well
+  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
+  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
+  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
+  .option(HoodieWriteConfig.TABLE_NAME, tableName)
+  .mode(SaveMode.Append)
+  .save(basePath);
+```
+
+Options useful for writing tables via `write.format.option(...)`
+
+
+`Config Class`: org.apache.hudi.DataSourceOptions.scala<br></br>
+
+> #### hoodie.datasource.write.operation
+> Whether to do upsert, insert or bulkinsert for the write operation. Use bulkinsert to load new data into a table, and there on use upsert/insert. Bulk insert uses a disk-based write path that scales to large inputs without needing to cache them.<br></br>
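+
+For example, a sketch of an initial bulk insert followed by upserts (assuming the `DataSourceWriteOptions.OPERATION_OPT_KEY()` accessor and the `bulk_insert`/`upsert` values of this Hudi version):
+
+```java
+// Initial load: scale out via the disk-based bulk insert path.
+inputDF.write()
+  .format("org.apache.hudi")
+  .option(DataSourceWriteOptions.OPERATION_OPT_KEY(), "bulk_insert")
+  .option(HoodieWriteConfig.TABLE_NAME, tableName)
+  .mode(SaveMode.Overwrite)
+  .save(basePath);
+
+// Subsequent writes: upsert the incoming changes.
+updatesDF.write()
+  .format("org.apache.hudi")
+  .option(DataSourceWriteOptions.OPERATION_OPT_KEY(), "upsert")
+  .option(HoodieWriteConfig.TABLE_NAME, tableName)
+  .mode(SaveMode.Append)
+  .save(basePath);
+```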
Review Comment:
there on -> thereafter?