This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
     new 4ce0db3b93 [DOCS] update broken links (#5333)
4ce0db3b93 is described below

commit 4ce0db3b93967158b5e854d8230d71a38e221c77
Author: Bhavani Sudha Saktheeswaran <2179254+bhasu...@users.noreply.github.com>
AuthorDate: Mon Apr 18 16:22:51 2022 -0700

    [DOCS] update broken links (#5333)

    Co-authored-by: Bhavani Sudha Saktheeswaran <sudha@vmacs.local>
---
 website/docs/clustering.md | 20 ++++++++++----------
 website/docs/concurrency_control.md | 10 +++++-----
 website/docs/deployment.md | 8 ++++----
 website/docs/faq.md | 8 ++++----
 website/docs/flink-quick-start-guide.md | 2 +-
 website/docs/flink_configuration.md | 2 +-
 website/docs/hoodie_cleaner.md | 2 +-
 website/docs/hoodie_deltastreamer.md | 8 ++++----
 website/docs/key_generation.md | 2 +-
 website/docs/metrics.md | 10 +++++-----
 website/docs/performance.md | 8 ++++----
 website/docs/query_engine_setup.md | 2 +-
 website/docs/querying_data.md | 8 ++++----
 website/docs/quick-start-guide.md | 12 ++++++------
 website/docs/use_cases.md | 4 ++--
 website/docs/write_operations.md | 2 +-
 website/docs/writing_data.md | 22 +++++++++++-----------
 17 files changed, 65 insertions(+), 65 deletions(-)

diff --git a/website/docs/clustering.md b/website/docs/clustering.md
index f210a15b1b..9e157de785 100644
--- a/website/docs/clustering.md
+++ b/website/docs/clustering.md
@@ -12,7 +12,7 @@ Apache Hudi brings stream processing to big data, providing fresh data while bei
## Clustering Architecture
-At a high level, Hudi provides different operations such as insert/upsert/bulk_insert through it’s write client API to be able to write data to a Hudi table. To be able to choose a trade-off between file size and ingestion speed, Hudi provides a knob `hoodie.parquet.small.file.limit` to be able to configure the smallest allowable file size. Users are able to configure the small file [soft limit](https://hudi.apache.org/docs/configurations#compactionSmallFileSize) to `0` to force new data [...]
+At a high level, Hudi provides different operations such as insert/upsert/bulk_insert through it’s write client API to be able to write data to a Hudi table. To be able to choose a trade-off between file size and ingestion speed, Hudi provides a knob `hoodie.parquet.small.file.limit` to be able to configure the smallest allowable file size. Users are able to configure the small file [soft limit](https://hudi.apache.org/docs/configurations/#hoodieparquetsmallfilelimit) to `0` to force new [...]
@@ -95,12 +95,12 @@ broadly classified into three types: clustering plan strategy, execution strateg
This strategy comes into play while creating clustering plan. It helps to decide what file groups should be clustered. Let's look at different plan strategies that are available with Hudi. Note that these strategies are easily pluggable
-using this [config](/docs/next/configurations#hoodieclusteringplanstrategyclass).
+using this [config](/docs/configurations#hoodieclusteringplanstrategyclass).
1. `SparkSizeBasedClusteringPlanStrategy`: It selects file slices based on
- the [small file limit](/docs/next/configurations/#hoodieclusteringplanstrategysmallfilelimit)
+ the [small file limit](/docs/configurations/#hoodieclusteringplanstrategysmallfilelimit)
of base files and creates clustering groups upto max file size allowed per group. The max size can be specified using
- this [config](/docs/next/configurations/#hoodieclusteringplanstrategymaxbytespergroup). This
+ this [config](/docs/configurations/#hoodieclusteringplanstrategymaxbytespergroup). This
strategy is useful for stitching together medium-sized files into larger ones to reduce lot of files spread across cold partitions.
2. `SparkRecentDaysClusteringPlanStrategy`: It looks back previous 'N' days partitions and creates a plan that will
@@ -122,12 +122,12 @@ All the strategies are partition-aware and the latter two are still bound by the
### Execution Strategy
After building the clustering groups in the planning phase, Hudi applies execution strategy, for each group, primarily
-based on sort columns and size. The strategy can be specified using this [config](/docs/next/configurations/#hoodieclusteringexecutionstrategyclass).
+based on sort columns and size. The strategy can be specified using this [config](/docs/configurations/#hoodieclusteringexecutionstrategyclass).
`SparkSortAndSizeExecutionStrategy` is the default strategy. Users can specify the columns to sort the data by, when clustering using
-this [config](/docs/next/configurations/#hoodieclusteringplanstrategysortcolumns). Apart from
-that, we can also set [max file size](/docs/next/configurations/#hoodieparquetmaxfilesize)
+this [config](/docs/configurations/#hoodieclusteringplanstrategysortcolumns). Apart from
+that, we can also set [max file size](/docs/configurations/#hoodieparquetmaxfilesize)
for the parquet files produced due to clustering. The strategy uses bulk insert to write data into new files, in which case, Hudi implicitly uses a partitioner that does sorting based on specified columns. In this way, the strategy changes the data layout in a way that not only improves query performance but also balance rewrite overhead automatically.
@@ -135,19 +135,19 @@ the data layout in a way that not only improves query performance but also balan
Now this strategy can be executed either as a single spark job or multiple jobs depending on number of clustering groups created in the planning phase. By default, Hudi will submit multiple spark jobs and union the results. In case you want to force Hudi to use single spark job, set the execution strategy
-class [config](/docs/next/configurations/#hoodieclusteringexecutionstrategyclass)
+class [config](/docs/configurations/#hoodieclusteringexecutionstrategyclass)
to `SingleSparkJobExecutionStrategy`.
### Update Strategy
Currently, clustering can only be scheduled for tables/partitions not receiving any concurrent updates. By default,
-the [config for update strategy](/docs/next/configurations/#hoodieclusteringupdatesstrategy) is
+the [config for update strategy](/docs/configurations/#hoodieclusteringupdatesstrategy) is
set to ***SparkRejectUpdateStrategy***. If some file group has updates during clustering then it will reject updates and throw an exception. However, in some use-cases updates are very sparse and do not touch most file groups. The default strategy to simply reject updates does not seem fair. In such use-cases, users can set the config to ***SparkAllowUpdateStrategy***.
We discussed the critical strategy configurations. All other configurations related to clustering are
-listed [here](/docs/next/configurations/#Clustering-Configs). Out of this list, a few
+listed [here](/docs/configurations/#Clustering-Configs). Out of this list, a few
configurations that will be very useful are:
| Config key | Remarks | Default |
diff --git a/website/docs/concurrency_control.md b/website/docs/concurrency_control.md
index a9a0d5860c..e71cb4a8f2 100644
--- a/website/docs/concurrency_control.md
+++ b/website/docs/concurrency_control.md
@@ -19,13 +19,13 @@ between multiple table service writers and readers. Additionally, using MVCC, Hu
the same Hudi Table. Hudi supports `file level OCC`, i.e., for any 2 commits (or writers) happening to the same table, if they do not have writes to overlapping files being changed, both writers are allowed to succeed. This feature is currently *experimental* and requires either Zookeeper or HiveMetastore to acquire locks.
-It may be helpful to understand the different guarantees provided by [write operations](/docs/writing_data#write-operations) via Hudi datasource or the delta streamer.
+It may be helpful to understand the different guarantees provided by [write operations](/docs/write_operations/) via Hudi datasource or the delta streamer.
## Single Writer Guarantees
- *UPSERT Guarantee*: The target table will NEVER show duplicates.
- - *INSERT Guarantee*: The target table wilL NEVER have duplicates if [dedup](/docs/configurations#INSERT_DROP_DUPS_OPT_KEY) is enabled.
- - *BULK_INSERT Guarantee*: The target table will NEVER have duplicates if [dedup](/docs/configurations#INSERT_DROP_DUPS_OPT_KEY) is enabled.
+ - *INSERT Guarantee*: The target table wilL NEVER have duplicates if [dedup](/docs/configurations#hoodiedatasourcewriteinsertdropduplicates) is enabled.
+ - *BULK_INSERT Guarantee*: The target table will NEVER have duplicates if [dedup](/docs/configurations#hoodiedatasourcewriteinsertdropduplicates) is enabled.
- *INCREMENTAL PULL Guarantee*: Data consumption and checkpoints are NEVER out of order.
## Multi Writer Guarantees
@@ -33,8 +33,8 @@ It may be helpful to understand the different guarantees provided by [write oper
With multiple writers using OCC, some of the above guarantees change as follows
- *UPSERT Guarantee*: The target table will NEVER show duplicates.
-- *INSERT Guarantee*: The target table MIGHT have duplicates even if [dedup](/docs/configurations#INSERT_DROP_DUPS_OPT_KEY) is enabled.
-- *BULK_INSERT Guarantee*: The target table MIGHT have duplicates even if [dedup](/docs/configurations#INSERT_DROP_DUPS_OPT_KEY) is enabled.
+- *INSERT Guarantee*: The target table MIGHT have duplicates even if [dedup](/docs/configurations#hoodiedatasourcewriteinsertdropduplicates) is enabled.
+- *BULK_INSERT Guarantee*: The target table MIGHT have duplicates even if [dedup](/docs/configurations#hoodiedatasourcewriteinsertdropduplicates) is enabled.
- *INCREMENTAL PULL Guarantee*: Data consumption and checkpoints MIGHT be out of order due to multiple writer jobs finishing at different times.
## Enabling Multi Writing
diff --git a/website/docs/deployment.md b/website/docs/deployment.md
index a33c30a951..739480205d 100644
--- a/website/docs/deployment.md
+++ b/website/docs/deployment.md
@@ -25,9 +25,9 @@ With Merge_On_Read Table, Hudi ingestion needs to also take care of compacting d
### DeltaStreamer
-[DeltaStreamer](/docs/writing_data#deltastreamer) is the standalone utility to incrementally pull upstream changes from varied sources such as DFS, Kafka and DB Changelogs and ingest them to hudi tables. It runs as a spark application in 2 modes.
+[DeltaStreamer](/docs/hoodie_deltastreamer#deltastreamer) is the standalone utility to incrementally pull upstream changes from varied sources such as DFS, Kafka and DB Changelogs and ingest them to hudi tables. It runs as a spark application in 2 modes.
- - **Run Once Mode** : In this mode, Deltastreamer performs one ingestion round which includes incrementally pulling events from upstream sources and ingesting them to hudi table. Background operations like cleaning old file versions and archiving hoodie timeline are automatically executed as part of the run. For Merge-On-Read tables, Compaction is also run inline as part of ingestion unless disabled by passing the flag "--disable-compaction". By default, Compaction is run inline for eve [...]
+ - **Run Once Mode** : In this mode, Deltastreamer performs one ingestion round which includes incrementally pulling events from upstream sources and ingesting them to hudi table. Background operations like cleaning old file versions and archiving hoodie timeline are automatically executed as part of the run. For Merge-On-Read tables, Compaction is also run inline as part of ingestion unless disabled by passing the flag "--disable-compaction". By default, Compaction is run inline for eve [...]
Here is an example invocation for reading from kafka topic in a single-run mode and writing to Merge On Read table type in a yarn cluster.
@@ -126,7 +126,7 @@ Here is an example invocation for reading from kafka topic in a continuous mode
### Spark Datasource Writer Jobs
-As described in [Writing Data](/docs/writing_data#datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inline.ma [...]
+As described in [Writing Data](/docs/writing_data#spark-datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inl [...]
Here is an example invocation using spark datasource
@@ -144,7 +144,7 @@ inputDF.write()
## Upgrading
-New Hudi releases are listed on the [releases page](/releases), with detailed notes which list all the changes, with highlights in each release.
+New Hudi releases are listed on the [releases page](/releases/download), with detailed notes which list all the changes, with highlights in each release.
At the end of the day, Hudi is a storage system and with that comes a lot of responsibilities, which we take seriously. As general guidelines,
diff --git a/website/docs/faq.md b/website/docs/faq.md
index c675788561..cee9e583e5 100644
--- a/website/docs/faq.md
+++ b/website/docs/faq.md
@@ -83,7 +83,7 @@ At a high level, Hudi is based on MVCC design that writes data to versioned parq
### What are some ways to write a Hudi dataset?
-Typically, you obtain a set of partial updates/inserts from your source and issue [write operations](https://hudi.apache.org/docs/writing_data/) against a Hudi dataset. If you ingesting data from any of the standard sources like Kafka, or tailing DFS, the [delta streamer](https://hudi.apache.org/docs/writing_data/#deltastreamer) tool is invaluable and provides an easy, self-managed solution to getting data written into Hudi. You can also write your own code to capture data from a custom [...]
+Typically, you obtain a set of partial updates/inserts from your source and issue [write operations](https://hudi.apache.org/docs/write_operations/) against a Hudi dataset. If you ingesting data from any of the standard sources like Kafka, or tailing DFS, the [delta streamer](https://hudi.apache.org/docs/hoodie_deltastreamer#deltastreamer) tool is invaluable and provides an easy, self-managed solution to getting data written into Hudi. You can also write your own code to capture data fr [...]
### How is a Hudi job deployed?
@@ -225,7 +225,7 @@ set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat
### Can I register my Hudi dataset with Apache Hive metastore?
-Yes. This can be performed either via the standalone [Hive Sync tool](https://hudi.apache.org/docs/writing_data/#syncing-to-hive) or using options in [deltastreamer](https://github.com/apache/hudi/blob/d3edac4612bde2fa9deca9536801dbc48961fb95/docker/demo/sparksql-incremental.commands#L50) tool or [datasource](https://hudi.apache.org/docs/configurations#hoodiedatasourcehive_syncenable).
+Yes. This can be performed either via the standalone [Hive Sync tool](https://hudi.apache.org/docs/syncing_metastore#hive-sync-tool) or using options in [deltastreamer](https://github.com/apache/hudi/blob/d3edac4612bde2fa9deca9536801dbc48961fb95/docker/demo/sparksql-incremental.commands#L50) tool or [datasource](https://hudi.apache.org/docs/configurations#hoodiedatasourcehive_syncenable).
### How does the Hudi indexing work & what are its benefits?
@@ -255,7 +255,7 @@ That said, for obvious reasons of not blocking ingesting for compaction, you may
### What performance/ingest latency can I expect for Hudi writing?
-The speed at which you can write into Hudi depends on the [write operation](https://hudi.apache.org/docs/writing_data/) and some trade-offs you make along the way like file sizing. Just like how databases incur overhead over direct/raw file I/O on disks, Hudi operations may have overhead from supporting database like features compared to reading/writing raw DFS files. That said, Hudi implements advanced techniques from database literature to keep these minimal. User is encouraged to ha [...]
+The speed at which you can write into Hudi depends on the [write operation](https://hudi.apache.org/docs/write_operations) and some trade-offs you make along the way like file sizing. Just like how databases incur overhead over direct/raw file I/O on disks, Hudi operations may have overhead from supporting database like features compared to reading/writing raw DFS files. That said, Hudi implements advanced techniques from database literature to keep these minimal. User is encouraged to [...]
| Storage Type | Type of workload | Performance | Tips |
|-------|--------|--------|--------|
@@ -364,7 +364,7 @@ spark.read.parquet("your_data_set/path/to/month").limit(n) // Limit n records
.save(basePath);
```
-For merge on read table, you may want to also try scheduling and running compaction jobs. You can run compaction directly using spark submit on org.apache.hudi.utilities.HoodieCompactor or by using [HUDI CLI](https://hudi.apache.org/docs/deployment/#cli).
+For merge on read table, you may want to also try scheduling and running compaction jobs. You can run compaction directly using spark submit on org.apache.hudi.utilities.HoodieCompactor or by using [HUDI CLI](https://hudi.apache.org/docs/cli).
### If I keep my file versions at 1, with this configuration will i be able to do a roll back (to the last commit) when write fail?
diff --git a/website/docs/flink-quick-start-guide.md b/website/docs/flink-quick-start-guide.md
index a723b8ed7b..daec4ba0b5 100644
--- a/website/docs/flink-quick-start-guide.md
+++ b/website/docs/flink-quick-start-guide.md
@@ -31,7 +31,7 @@ Start a standalone Flink cluster within hadoop environment. Before you start up
the cluster, we suggest to config the cluster as follows:
- in `$FLINK_HOME/conf/flink-conf.yaml`, add config option `taskmanager.numberOfTaskSlots: 4`
-- in `$FLINK_HOME/conf/flink-conf.yaml`, [add other global configurations according to the characteristics of your task](#flink-configuration)
+- in `$FLINK_HOME/conf/flink-conf.yaml`, [add other global configurations according to the characteristics of your task](flink_configuration#global-configurations)
- in `$FLINK_HOME/conf/workers`, add item `localhost` as 4 lines so that there are 4 workers on the local cluster
Now starts the cluster:
diff --git a/website/docs/flink_configuration.md b/website/docs/flink_configuration.md
index ba7853d7cd..d615281a6b 100644
--- a/website/docs/flink_configuration.md
+++ b/website/docs/flink_configuration.md
@@ -60,7 +60,7 @@ allocated with enough memory, we can try to set these memory options.
| `write.bucket_assign.tasks` | The parallelism of bucket assigner operators. No default value, using Flink `parallelism.default` | [`parallelism.default`](#parallelism) | Increases the parallelism also increases the number of buckets, thus the number of small files (small buckets) |
| `write.index_boostrap.tasks` | The parallelism of index bootstrap. Increasing parallelism can speed up the efficiency of the bootstrap stage. The bootstrap stage will block checkpointing. Therefore, it is necessary to set more checkpoint failure tolerance times. Default using Flink `parallelism.default` | [`parallelism.default`](#parallelism) | It only take effect when `index.bootsrap.enabled` is `true` |
| `read.tasks` | The parallelism of read operators (batch and stream). Default `4` | `4` | |
-| `compaction.tasks` | The parallelism of online compaction. Default `4` | `4` | `Online compaction` will occupy the resources of the write task. It is recommended to use [`offline compaction`](#offline-compaction) |
+| `compaction.tasks` | The parallelism of online compaction. Default `4` | `4` | `Online compaction` will occupy the resources of the write task. It is recommended to use [`offline compaction`](/docs/compaction/#flink-offline-compaction) |
### Compaction
diff --git a/website/docs/hoodie_cleaner.md b/website/docs/hoodie_cleaner.md
index 41956f566c..10f1aa2450 100644
--- a/website/docs/hoodie_cleaner.md
+++ b/website/docs/hoodie_cleaner.md
@@ -47,7 +47,7 @@ hoodie.clean.async=true
```
### CLI
-You can also use [Hudi CLI](https://hudi.apache.org/docs/deployment#cli) to run Hoodie Cleaner.
+You can also use [Hudi CLI](/docs/cli) to run Hoodie Cleaner.
CLI provides the below commands for cleaner service:
- `cleans show`
diff --git a/website/docs/hoodie_deltastreamer.md b/website/docs/hoodie_deltastreamer.md
index f212f57859..3c49bd2bbf 100644
--- a/website/docs/hoodie_deltastreamer.md
+++ b/website/docs/hoodie_deltastreamer.md
@@ -374,7 +374,7 @@ frequent `file handle` switching.
:::note
The parallelism of `bulk_insert` is specified by `write.tasks`. The parallelism will affect the number of small files. In theory, the parallelism of `bulk_insert` is the number of `bucket`s (In particular, when each bucket writes to maximum file size, it
-will rollover to the new file handle. Finally, `the number of files` >= [`write.bucket_assign.tasks`](#parallelism)).
+will rollover to the new file handle. Finally, `the number of files` >= [`write.bucket_assign.tasks`](/docs/configurations#writebucket_assigntasks).
:::
#### Options
@@ -382,9 +382,9 @@ will rollover to the new file handle. Finally, `the number of files` >= [`write.
| Option Name | Required | Default | Remarks |
| ----------- | ------- | ------- | ------- |
| `write.operation` | `true` | `upsert` | Setting as `bulk_insert` to open this function |
-| `write.tasks` | `false` | `4` | The parallelism of `bulk_insert`, `the number of files` >= [`write.bucket_assign.tasks`](#parallelism) |
-| `write.bulk_insert.shuffle_by_partition` | `false` | `true` | Whether to shuffle data according to the partition field before writing. Enabling this option will reduce the number of small files, but there may be a risk of data skew |
-| `write.bulk_insert.sort_by_partition` | `false` | `true` | Whether to sort data according to the partition field before writing. Enabling this option will reduce the number of small files when a write task writes multiple partitions |
+| `write.tasks` | `false` | `4` | The parallelism of `bulk_insert`, `the number of files` >= [`write.bucket_assign.tasks`](/docs/configurations#writebucket_assigntasks) |
+| `write.bulk_insert.shuffle_by_partition` | `false` | `true` | Whether to shuffle data according to the partition field before writing. Enabling this option will reduce the number of small files, but there may be a risk of data skew |
+| `write.bulk_insert.sort_by_partition` | `false` | `true` | Whether to sort data according to the partition field before writing. Enabling this option will reduce the number of small files when a write task writes multiple partitions |
| `write.sort.memory` | `false` | `128` | Available managed memory of sort operator. default `128` MB |
### Index Bootstrap
diff --git a/website/docs/key_generation.md b/website/docs/key_generation.md
index f20e4d77a1..1dcb020645 100644
--- a/website/docs/key_generation.md
+++ b/website/docs/key_generation.md
@@ -17,7 +17,7 @@ Hudi provides several key generators out of the box that users can use based on
implementation for users to implement and use their own KeyGenerator. This page goes over all different types of key generators that are readily available to use.
-[Here](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/KeyGenerator.java)
+[Here](https://github.com/apache/hudi/blob/6f9b02decb5bb2b83709b1b6ec04a97e4d102c11/hudi-common/src/main/java/org/apache/hudi/keygen/KeyGenerator.java)
is the interface for KeyGenerator in Hudi for your reference.
Before diving into different types of key generators, let's go over some of the common configs required to be set for
diff --git a/website/docs/metrics.md b/website/docs/metrics.md
index 17441447fa..4a831d7981 100644
--- a/website/docs/metrics.md
+++ b/website/docs/metrics.md
@@ -6,7 +6,7 @@ toc: true
last_modified_at: 2020-06-20T15:59:57-04:00
---
-In this section, we will introduce the `MetricsReporter` and `HoodieMetrics` in Hudi. You can view the metrics-related configurations [here](configurations#metrics-configs).
+In this section, we will introduce the `MetricsReporter` and `HoodieMetrics` in Hudi. You can view the metrics-related configurations [here](configurations#METRICS).
## MetricsReporter
@@ -17,7 +17,7 @@ MetricsReporter provides APIs for reporting `HoodieMetrics` to user-specified ba
JmxMetricsReporter is an implementation of JMX reporter, which used to report JMX metrics.
#### Configurations
-The following is an example of `JmxMetricsReporter`. More detaile configurations can be referenced [here](configurations#jmx).
+The following is an example of `JmxMetricsReporter`. More detailed configurations can be referenced [here](configurations#Metrics-Configurations-for-Jmx).
```properties
hoodie.metrics.on=true
@@ -37,7 +37,7 @@ As configured above, JmxMetricsReporter will started JMX server on port 4001. We
MetricsGraphiteReporter is an implementation of Graphite reporter, which connects to a Graphite server, and send `HoodieMetrics` to it.
#### Configurations
-The following is an example of `MetricsGraphiteReporter`. More detaile configurations can be referenced [here](configurations#graphite).
+The following is an example of `MetricsGraphiteReporter`. More detaile configurations can be referenced [here](configurations#Metrics-Configurations-for-Graphite).
```properties
hoodie.metrics.on=true
@@ -58,7 +58,7 @@ DatadogMetricsReporter is an implementation of Datadog reporter.
A reporter which publishes metric values to Datadog monitoring service via Datadog HTTP API.
#### Configurations
-The following is an example of `DatadogMetricsReporter`. More detailed configurations can be referenced [here](configurations#datadog).
+The following is an example of `DatadogMetricsReporter`. More detailed configurations can be referenced [here](configurations#Metrics-Configurations-for-Datadog-reporter).
```properties
hoodie.metrics.on=true
@@ -138,7 +138,7 @@ tuned are in the `HoodieMetricsCloudWatchConfig` class.
Allows users to define a custom metrics reporter.
#### Configurations
-The following is an example of `UserDefinedMetricsReporter`. More detailed configurations can be referenced [here](configurations#user-defined-reporter).
+The following is an example of `UserDefinedMetricsReporter`. More detailed configurations can be referenced [here](configurations#Metrics-Configurations).
```properties
hoodie.metrics.on=true
diff --git a/website/docs/performance.md b/website/docs/performance.md
index 53152730bd..db78a7f25b 100644
--- a/website/docs/performance.md
+++ b/website/docs/performance.md
@@ -14,12 +14,12 @@ column statistics etc. Even on some cloud data stores, there is often cost to li
Here are some ways to efficiently manage the storage of your Hudi tables.
-- The [small file handling feature](/docs/configurations#compactionSmallFileSize) in Hudi, profiles incoming workload
+- The [small file handling feature](/docs/configurations/#hoodieparquetsmallfilelimit) in Hudi, profiles incoming workload
and distributes inserts to existing file groups instead of creating new file groups, which can lead to small files.
-- Cleaner can be [configured](/docs/configurations#retainCommits) to clean up older file slices, more or less aggressively depending on maximum time for queries to run & lookback needed for incremental pull
-- User can also tune the size of the [base/parquet file](/docs/configurations#limitFileSize), [log files](/docs/configurations#logFileMaxSize) & expected [compression ratio](/docs/configurations#parquetCompressionRatio),
+- Cleaner can be [configured](/docs/configurations#hoodiecleanercommitsretained) to clean up older file slices, more or less aggressively depending on maximum time for queries to run & lookback needed for incremental pull
+- User can also tune the size of the [base/parquet file](/docs/configurations#hoodieparquetmaxfilesize), [log files](/docs/configurations#hoodielogfilemaxsize) & expected [compression ratio](/docs/configurations#hoodieparquetcompressionratio),
such that sufficient number of inserts are grouped into the same file group, resulting in well sized base files ultimately.
-- Intelligently tuning the [bulk insert parallelism](/docs/configurations#withBulkInsertParallelism), can again in nicely sized initial file groups. It is in fact critical to get this right, since the file groups
+- Intelligently tuning the [bulk insert parallelism](/docs/configurations#hoodiebulkinsertshuffleparallelism), can again in nicely sized initial file groups. It is in fact critical to get this right, since the file groups
once created cannot be deleted, but simply expanded as explained before.
- For workloads with heavy updates, the [merge-on-read table](/docs/concepts#merge-on-read-table) provides a nice mechanism for ingesting quickly into smaller files and then later merging them into larger base files via compaction.
diff --git a/website/docs/query_engine_setup.md b/website/docs/query_engine_setup.md
index d89a96d042..8d555dae3e 100644
--- a/website/docs/query_engine_setup.md
+++ b/website/docs/query_engine_setup.md
@@ -64,7 +64,7 @@ To query Hudi tables on Trino, please place the `hudi-presto-bundle` jar into th
## Hive
In order for Hive to recognize Hudi tables and query correctly,
-- the HiveServer2 needs to be provided with the `hudi-hadoop-mr-bundle-x.y.z-SNAPSHOT.jar` in its [aux jars path](https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_mc_hive_udf#concept_nc3_mms_lr). This will ensure the input format
+- the HiveServer2 needs to be provided with the `hudi-hadoop-mr-bundle-x.y.z-SNAPSHOT.jar` in its [aux jars path](https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_mc_hive_udf.html#concept_nc3_mms_lr). This will ensure the input format
classes with its dependencies are available for query planning & execution.
- For MERGE_ON_READ tables, additionally the bundle needs to be put on the hadoop/hive installation across the cluster, so that queries can pick up the custom RecordReader as well.
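[Editor's aside, not part of the committed diff above.] For readers following the file-sizing anchors that the performance.md hunk now points at, here is a minimal, illustrative Scala sketch of setting those knobs on a Spark datasource write. The table name, path, `inputDF` dataframe and all values are placeholders in the spirit of the quickstart examples, not settings prescribed by this patch.

```scala
// Illustrative sketch only, not part of the committed diff.
// The option keys below are the ones the updated performance.md anchors resolve to;
// the values, inputDF and basePath are placeholders.
import org.apache.spark.sql.{DataFrame, SaveMode}

def writeWithSizingKnobs(inputDF: DataFrame, basePath: String): Unit = {
  inputDF.write.format("hudi").
    option("hoodie.table.name", "hudi_trips_cow").
    option("hoodie.datasource.write.recordkey.field", "uuid").
    option("hoodie.datasource.write.partitionpath.field", "region/country/city").
    option("hoodie.parquet.small.file.limit", "104857600").     // ~100 MB: files below this are candidates for bin-packing new inserts
    option("hoodie.parquet.max.file.size", "125829120").        // ~120 MB target size for base/parquet files
    option("hoodie.logfile.max.size", "1073741824").            // ~1 GB cap per MOR log file
    option("hoodie.parquet.compression.ratio", "0.35").         // expected compression ratio used when sizing files (example value)
    option("hoodie.cleaner.commits.retained", "10").            // how many commits of older file slices the cleaner keeps
    option("hoodie.bulkinsert.shuffle.parallelism", "200").     // shapes the initial file groups created by bulk_insert
    mode(SaveMode.Append).
    save(basePath)
}
```

The byte values are deliberately arbitrary; the point is simply to show which write-time options the renamed configuration anchors correspond to.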
diff --git a/website/docs/querying_data.md b/website/docs/querying_data.md
index c516708e7d..1b5cee0d5b 100644
--- a/website/docs/querying_data.md
+++ b/website/docs/querying_data.md
@@ -49,7 +49,7 @@ spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hu
```
For examples, refer to [Incremental Queries](/docs/quick-start-guide#incremental-query) in the Spark quickstart.
-Please refer to [configurations](/docs/configurations#spark-datasource) section, to view all datasource options.
+Please refer to [configurations](/docs/configurations#SPARK_DATASOURCE) section, to view all datasource options.
Additionally, `HoodieReadClient` offers the following functionality using Hudi's implicit indexing.
@@ -170,16 +170,16 @@ would ensure Map Reduce execution is chosen for a Hive query, which combines par
separated) and calls InputFormat.listStatus() only once with all those partitions.
## PrestoDB
-To setup PrestoDB for querying Hudi, see the [Query Engine Setup](/docs/query_engine_setup#PrestoDB) page.
+To setup PrestoDB for querying Hudi, see the [Query Engine Setup](/docs/query_engine_setup#prestodb) page.
## Trino
-To setup Trino for querying Hudi, see the [Query Engine Setup](/docs/query_engine_setup#Trino) page.
+To setup Trino for querying Hudi, see the [Query Engine Setup](/docs/query_engine_setup#trino) page.
## Impala (3.4 or later)
### Snapshot Query
-Impala is able to query Hudi Copy-on-write table as an [EXTERNAL TABLE](https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/impala_tables#external_tables) on HDFS.
+Impala is able to query Hudi Copy-on-write table as an [EXTERNAL TABLE](https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/impala_tables.html#external_tables) on HDFS.
To create a Hudi read optimized table on Impala:
```
diff --git a/website/docs/quick-start-guide.md b/website/docs/quick-start-guide.md
index 6446016254..51d0b838f7 100644
--- a/website/docs/quick-start-guide.md
+++ b/website/docs/quick-start-guide.md
@@ -412,12 +412,12 @@ df.write.format("hudi").
:::info
`mode(Overwrite)` overwrites and recreates the table if it already exists. You can check the data generated under `/tmp/hudi_trips_cow/<region>/<country>/<city>/`. We provided a record key
-(`uuid` in [schema](https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L58)), partition field (`region/country/city`) and combine logic (`ts` in
-[schema](https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L58)) to ensure trip records are unique within each partition. For more info, refer to
-[Modeling data stored in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi)
+(`uuid` in [schema](https://github.com/apache/hudi/blob/2e6e302efec2fa848ded4f88a95540ad2adb7798/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L60)), partition field (`region/country/city`) and combine logic (`ts` in
+[schema](https://github.com/apache/hudi/blob/2e6e302efec2fa848ded4f88a95540ad2adb7798/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L60)) to ensure trip records are unique within each partition. For more info, refer to
+[Modeling data stored in Hudi](https://hudi.apache.org/learn/faq/#how-do-i-model-the-data-stored-in-hudi)
and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/writing_data).
Here we are using the default write operation : `upsert`. If you have a workload without updates, you can also issue
-`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/writing_data#write-operations)
+`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/write_operations)
:::
</TabItem>
@@ -453,7 +453,7 @@ You can check the data generated under `/tmp/hudi_trips_cow/<region>/<country>/<
[Modeling data stored in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi)
and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/writing_data).
Here we are using the default write operation : `upsert`. If you have a workload without updates, you can also issue
-`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/writing_data#write-operations)
+`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/write_operations)
:::
</TabItem>
@@ -1117,7 +1117,7 @@ more details please refer to [procedures](/docs/next/procedures).
You can also do the quickstart by [building hudi yourself](https://github.com/apache/hudi#building-apache-hudi-from-source),
and using `--jars <path to hudi_code>/packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.1?-*.*.*-SNAPSHOT.jar` in the spark-shell command above
-instead of `--packages org.apache.hudi:hudi-spark3.1.2-bundle_2.12:0.10.1`. Hudi also supports scala 2.12. Refer [build with scala 2.12](https://github.com/apache/hudi#build-with-scala-212)
+instead of `--packages org.apache.hudi:hudi-spark3.1.2-bundle_2.12:0.10.1`. Hudi also supports scala 2.12. Refer [build with scala 2.12](https://github.com/apache/hudi#build-with-different-spark-versions)
for more info.
Also, we used Spark here to show case the capabilities of Hudi. However, Hudi can support multiple table types/query types and
diff --git a/website/docs/use_cases.md b/website/docs/use_cases.md
index 3758d7208e..f3fabdf04d 100644
--- a/website/docs/use_cases.md
+++ b/website/docs/use_cases.md
@@ -15,7 +15,7 @@ This blog post outlines this use case in more depth - https://hudi.apache.org/bl
### Near Real-Time Ingestion
-Ingesting data from OLTP sources like (event logs, databases, external sources) into a [Data Lake](http://martinfowler.com/bliki/DataLake) is a common problem,
+Ingesting data from OLTP sources like (event logs, databases, external sources) into a [Data Lake](http://martinfowler.com/bliki/DataLake.html) is a common problem,
that is unfortunately solved in a piecemeal fashion, using a medley of ingestion tools. This "raw data" layer of the data lake often forms the bedrock on which more value is created.
@@ -27,7 +27,7 @@ even moderately big installations store billions of rows. It goes without saying
are needed if ingestion is to keep up with the typically high update volumes. Even for immutable data sources like [Kafka](https://kafka.apache.org), there is often a need to de-duplicate the incoming events against what's stored on DFS.
-Hudi achieves this by [employing indexes](http://hudi.apache.org/blog/hudi-indexing-mechanisms/) of different kinds, quickly and efficiently.
+Hudi achieves this by [employing indexes](http://hudi.apache.org/blog/2020/11/11/hudi-indexing-mechanisms/) of different kinds, quickly and efficiently.
All of this is seamlessly achieved by the Hudi DeltaStreamer tool, which is maintained in tight integration with rest of the code and we are always trying to add more capture sources, to make this easier for the users. The tool also has a continuous mode, where it
diff --git a/website/docs/write_operations.md b/website/docs/write_operations.md
index ccdac23350..746a93d057 100644
--- a/website/docs/write_operations.md
+++ b/website/docs/write_operations.md
@@ -37,7 +37,7 @@ Hudi supports implementing two types of deletes on data stored in Hudi tables, b
## Writing path
The following is an inside look on the Hudi write path and the sequence of events that occur during a write.
-1. [Deduping](/docs/configurations/#writeinsertdeduplicate)
+1. [Deduping](/docs/configurations#hoodiecombinebeforeinsert)
1. First your input records may have duplicate keys within the same batch and duplicates need to be combined or reduced by key.
2. [Index Lookup](/docs/next/indexing)
1. Next, an index lookup is performed to try and match the input records to identify which file groups they belong to.
diff --git a/website/docs/writing_data.md b/website/docs/writing_data.md
index 15fcc4d66b..8765222b21 100644
--- a/website/docs/writing_data.md
+++ b/website/docs/writing_data.md
@@ -9,7 +9,7 @@ import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
In this section, we will cover ways to ingest new changes from external sources or even other Hudi tables.
-The two main tools available are the [DeltaStreamer](#deltastreamer) tool, as well as the [Spark Hudi datasource](#datasource-writer).
+The two main tools available are the [DeltaStreamer](/docs/hoodie_deltastreamer#deltastreamer) tool, as well as the [Spark Hudi datasource](#spark-datasource-writer).
## Spark Datasource Writer
@@ -31,7 +31,7 @@ Default value: `"partitionpath"`<br/>
**PRECOMBINE_FIELD_OPT_KEY** (Required): When two records within the same batch have the same key value, the record with the largest value from the field specified will be choosen. If you are using default payload of OverwriteWithLatestAvroPayload for HoodieRecordPayload (`WRITE_PAYLOAD_CLASS`), an incoming record will always takes precendence compared to the one in storage ignoring this `PRECOMBINE_FIELD_OPT_KEY`. <br/>
Default value: `"ts"`<br/>
-**OPERATION_OPT_KEY**: The [write operations](#write-operations) to use.<br/>
+**OPERATION_OPT_KEY**: The [write operations](/docs/write_operations) to use.<br/>
Available values:<br/>
`UPSERT_OPERATION_OPT_VAL` (default), `BULK_INSERT_OPERATION_OPT_VAL`, `INSERT_OPERATION_OPT_VAL`, `DELETE_OPERATION_OPT_VAL`
@@ -39,7 +39,7 @@ Available values:<br/>
[`COW_TABLE_TYPE_OPT_VAL`](/docs/concepts#copy-on-write-table) (default), [`MOR_TABLE_TYPE_OPT_VAL`](/docs/concepts#merge-on-read-table)
-**KEYGENERATOR_CLASS_OPT_KEY**: Refer to [Key Generation](#key-generation) section below.
+**KEYGENERATOR_CLASS_OPT_KEY**: Refer to [Key Generation](/docs/key_generation) section below.
**HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY**: If using hive, specify if the table should or should not be partitioned.<br/>
Available values:<br/>
@@ -88,12 +88,12 @@ df.write.format("hudi").
:::info
`mode(Overwrite)` overwrites and recreates the table if it already exists. You can check the data generated under `/tmp/hudi_trips_cow/<region>/<country>/<city>/`. We provided a record key
-(`uuid` in [schema](https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L58)), partition field (`region/country/city`) and combine logic (`ts` in
-[schema](https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L58)) to ensure trip records are unique within each partition. For more info, refer to
-[Modeling data stored in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi)
+(`uuid` in [schema](https://github.com/apache/hudi/blob/6f9b02decb5bb2b83709b1b6ec04a97e4d102c11/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L60)), partition field (`region/country/city`) and combine logic (`ts` in
+[schema](https://github.com/apache/hudi/blob/6f9b02decb5bb2b83709b1b6ec04a97e4d102c11/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L60)) to ensure trip records are unique within each partition. For more info, refer to
+[Modeling data stored in Hudi](https://hudi.apache.org/learn/faq/#how-do-i-model-the-data-stored-in-hudi)
and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/writing_data).
Here we are using the default write operation : `upsert`. If you have a workload without updates, you can also issue
-`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/writing_data#write-operations)
+`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/write_operations)
:::
</TabItem>
@@ -124,12 +124,12 @@ df.write.format("hudi").
:::info
`mode(Overwrite)` overwrites and recreates the table if it already exists. You can check the data generated under `/tmp/hudi_trips_cow/<region>/<country>/<city>/`. We provided a record key
-(`uuid` in [schema](https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L58)), partition field (`region/country/city`) and combine logic (`ts` in
-[schema](https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L58)) to ensure trip records are unique within each partition. For more info, refer to
-[Modeling data stored in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi)
+(`uuid` in [schema](https://github.com/apache/hudi/blob/2e6e302efec2fa848ded4f88a95540ad2adb7798/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L60)), partition field (`region/country/city`) and combine logic (`ts` in
+[schema](https://github.com/apache/hudi/blob/2e6e302efec2fa848ded4f88a95540ad2adb7798/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L60)) to ensure trip records are unique within each partition. For more info, refer to
+[Modeling data stored in Hudi](https://hudi.apache.org/learn/faq/#how-do-i-model-the-data-stored-in-hudi)
and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/writing_data).
Here we are using the default write operation : `upsert`. If you have a workload without updates, you can also issue
-`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/writing_data#write-operations)
+`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/write_operations)
:::
</TabItem>
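[Editor's aside, not part of the committed diff above.] Since the writing_data.md and concurrency_control.md hunks repeatedly point at the write operation, precombine and insert-dedup configurations, here is a small illustrative Scala sketch of those datasource options. `inputDF`, the table name, the path and the chosen values are placeholders, not something this patch prescribes.

```scala
// Illustrative sketch only, not part of the committed diff.
// Shows OPERATION_OPT_KEY, PRECOMBINE_FIELD_OPT_KEY and the insert dedup flag as plain
// string keys; inputDF and the target path are placeholders.
import org.apache.spark.sql.{DataFrame, SaveMode}

def insertWithDedup(inputDF: DataFrame, basePath: String): Unit = {
  inputDF.write.format("hudi").
    option("hoodie.table.name", "hudi_trips_cow").
    option("hoodie.datasource.write.operation", "insert").            // upsert (default), insert, bulk_insert or delete
    option("hoodie.datasource.write.recordkey.field", "uuid").
    option("hoodie.datasource.write.partitionpath.field", "region/country/city").
    option("hoodie.datasource.write.precombine.field", "ts").         // largest value wins when a batch has duplicate keys
    option("hoodie.datasource.write.insert.drop.duplicates", "true"). // the dedup flag discussed in the guarantees above
    mode(SaveMode.Append).
    save(basePath)
}
```

As the concurrency_control.md text in this diff notes, the dedup flag only guarantees no duplicates for a single writer; with multiple concurrent writers, INSERT and BULK_INSERT may still admit duplicates.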