boneanxs commented on pull request #4905: URL: https://github.com/apache/hudi/pull/4905#issuecomment-1054044282
> parameters.get(DataSourceWriteOptions.ASYNC_CLUSTERING_ENABLE().key()) while instantiating writeConfig may or may not work. Even though we have added alternate keys, not sure what exactly gets returned as part of DataSourceWriteOptions.ASYNC_CLUSTERING_ENABLE().key(). I think `DataSourceWriteOptions.ASYNC_CLUSTERING_ENABLE().key()` returned `hoodie.clustering.async.enabled` Also, after landing https://github.com/apache/hudi/pull/4828, there are two kinds of situations from users side, 1. If users enable clustering by `DataSourceOptions.ASYNC_CLUSTERING_ENABLE.key` or `HoodieClusteringConfig.ASYNC_CLUSTERING_ENABLE.key`, the clustering service can be enabled, because in the configMap, it will store `hoodie.clustering.async.enabled -> true` ```scala df.option(DataSourceOptions.ASYNC_CLUSTERING_ENABLE.key, true) // or df.option(HoodieClusteringConfig.ASYNC_CLUSTERING_ENABLE.key, true) ``` 2. But if users enable clustering by key's value directly(which means using `hoodie.datasource.clustering.async.enable` or `hoodie.clustering.async.enabled`), there could be some differences, ```scala // will not start clustering because codes // parameters.get(DataSourceOptions.ASYNC_CLUSTERING_ENABLE.key).exists(r => r.toBoolean) in method // isAsyncClusteringEnabled use value hoodie.clustering.async.enabled to check. df.option("hoodie.datasource.clustering.async.enable", true) // will work df.option("hoodie.clustering.async.enabled", true) ``` from DataSourceUtils ```java public static HoodieWriteConfig createHoodieConfig(String schemaStr, String basePath, String tblName, Map<String, String> parameters) { boolean asyncClusteringEnabled = Boolean.parseBoolean(parameters.get(DataSourceWriteOptions.ASYNC_CLUSTERING_ENABLE().key())); return builder.forTable(tblName) .withCompactionConfig(HoodieCompactionConfig.newBuilder() .withPayloadClass(parameters.get(DataSourceWriteOptions.PAYLOAD_CLASS_NAME().key())) .withInlineCompaction(inlineCompact).build()) .withClusteringConfig(HoodieClusteringConfig.newBuilder() .withInlineClustering(inlineClusteringEnabled) .withAsyncClustering(asyncClusteringEnabled).build()) .withPayloadConfig(HoodiePayloadConfig.newBuilder().withPayloadOrderingField(parameters.get(DataSourceWriteOptions.PRECOMBINE_FIELD().key())) .build()) // override above with Hoodie configs specified as options. .withProps(parameters).build(); ``` if we set `df.option("hoodie.datasource.clustering.async.enable", true)`, `HoodieWriteConfig` will both contain `hoodie.datasource.clustering.async.enable -> true` and `hoodie.clustering.async.enabled -> false`, because it uses `parameters.get(DataSourceWriteOptions.ASYNC_CLUSTERING_ENABLE().key()))` which equals to `parameters.get("hoodie.clustering.async.enabled")` to check async clustering is enabled or not, which of cause returns false, while it will also save all parameters to `HoodieWriteConfig` by `.withProps(parameters).build()`, so `hoodie.datasource.clustering.async.enable -> true` will also be saved. As method `HoodieWriteConfig.isAsyncClusteringEnabled` first use key `hoodie.clustering.async.enabled` to check, then use alternative key `hoodie.datasource.clustering.async.enable` to check, so it will be false. From my understanding, we can remove these parameters related checks in the pr(Looks these checks also is unnecessary), and only use `HoodieWriteConfig.isAsyncClusteringEnabled` to check if async clustering is enabled or not. In this way, following usages `DataSourceWriteOptions.ASYNC_CLUSTERING_ENABLE().key`, `HoodieClusteringConfig.ASYNC_CLUSTERING_ENABLE().key`, `hoodie.datasource.clustering.async.enable` and `hoodie.clustering.async.enabled` all works. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
