[GitHub] [hudi] boneanxs commented on pull request #4905: [WIP] Fix if user specify key "hoodie.datasource.clustering.async.enable" directly, async clustering not work

GitBox Mon, 28 Feb 2022 01:14:18 -0800


boneanxs commented on pull request #4905:
URL: https://github.com/apache/hudi/pull/4905#issuecomment-1054044282



   > parameters.get(DataSourceWriteOptions.ASYNC_CLUSTERING_ENABLE().key()) 
while instantiating writeConfig may or may not work. Even though we have added 
alternate keys, not sure what exactly gets returned as part of 
DataSourceWriteOptions.ASYNC_CLUSTERING_ENABLE().key().
   
   I think `DataSourceWriteOptions.ASYNC_CLUSTERING_ENABLE().key()` returned 
`hoodie.clustering.async.enabled`
   
   Also, after landing https://github.com/apache/hudi/pull/4828,  there are two 
kinds of situations from users side, 
   
   1. If users enable clustering by 
`DataSourceOptions.ASYNC_CLUSTERING_ENABLE.key` or 
`HoodieClusteringConfig.ASYNC_CLUSTERING_ENABLE.key`, the clustering service 
can be enabled, because in the configMap, it will store 
`hoodie.clustering.async.enabled -> true`
   ```scala
   df.option(DataSourceOptions.ASYNC_CLUSTERING_ENABLE.key, true)
   // or
   df.option(HoodieClusteringConfig.ASYNC_CLUSTERING_ENABLE.key, true)
   ```
   
   2. But if users enable clustering by key's value directly(which means using 
`hoodie.datasource.clustering.async.enable` or 
`hoodie.clustering.async.enabled`), there could be some differences,
   
   ```scala
   // will not start clustering because codes
   // parameters.get(DataSourceOptions.ASYNC_CLUSTERING_ENABLE.key).exists(r => 
r.toBoolean) in method
   // isAsyncClusteringEnabled use value hoodie.clustering.async.enabled to 
check. 
   df.option("hoodie.datasource.clustering.async.enable", true)
   // will work
   df.option("hoodie.clustering.async.enabled", true)
   ```
   from DataSourceUtils
   
   ```java
   public static HoodieWriteConfig createHoodieConfig(String schemaStr, String 
basePath,
         String tblName, Map<String, String> parameters) {
   
       boolean asyncClusteringEnabled = 
Boolean.parseBoolean(parameters.get(DataSourceWriteOptions.ASYNC_CLUSTERING_ENABLE().key()));
   
   return builder.forTable(tblName)
           .withCompactionConfig(HoodieCompactionConfig.newBuilder()
               
.withPayloadClass(parameters.get(DataSourceWriteOptions.PAYLOAD_CLASS_NAME().key()))
               .withInlineCompaction(inlineCompact).build())
           .withClusteringConfig(HoodieClusteringConfig.newBuilder()
               .withInlineClustering(inlineClusteringEnabled)
               .withAsyncClustering(asyncClusteringEnabled).build())
           
.withPayloadConfig(HoodiePayloadConfig.newBuilder().withPayloadOrderingField(parameters.get(DataSourceWriteOptions.PRECOMBINE_FIELD().key()))
               .build())
           // override above with Hoodie configs specified as options.
           .withProps(parameters).build();
   ```
   if we set `df.option("hoodie.datasource.clustering.async.enable", true)`, 
`HoodieWriteConfig` will both contain 
`hoodie.datasource.clustering.async.enable -> true` and 
`hoodie.clustering.async.enabled -> false`, because it uses 
`parameters.get(DataSourceWriteOptions.ASYNC_CLUSTERING_ENABLE().key()))` which 
equals to `parameters.get("hoodie.clustering.async.enabled")` to check async 
clustering is enabled or not, which of cause returns false, while it will also 
save all parameters to `HoodieWriteConfig` by `.withProps(parameters).build()`, 
so `hoodie.datasource.clustering.async.enable -> true` will also be saved. As 
method `HoodieWriteConfig.isAsyncClusteringEnabled` first use key 
`hoodie.clustering.async.enabled` to check, then use alternative key 
`hoodie.datasource.clustering.async.enable` to check, so it will be false.
   
   From my understanding, we can remove these parameters related checks in the 
pr(Looks these checks also is unnecessary), and only use 
`HoodieWriteConfig.isAsyncClusteringEnabled` to check if async clustering is 
enabled or not. In this way, following usages 
`DataSourceWriteOptions.ASYNC_CLUSTERING_ENABLE().key`, 
`HoodieClusteringConfig.ASYNC_CLUSTERING_ENABLE().key`, 
`hoodie.datasource.clustering.async.enable` and 
`hoodie.clustering.async.enabled` all works.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [hudi] boneanxs commented on pull request #4905: [WIP] Fix if user specify key "hoodie.datasource.clustering.async.enable" directly, async clustering not work

Reply via email to