yuzhaojing commented on PR #4309: URL: https://github.com/apache/hudi/pull/4309#issuecomment-1120182685
> Unless I am missing something here, can't the "storage" + "scheduler" + execute pieces, directly reuse what we have already.
>
> just writing down ideas in my head to see how we are different.
>
> a) Introduce a config `hoodie.skip.table.services` to `HoodieWriteConfig` which will make all writers skip any scheduling + execution of table services, if `true`. Writers throw an error if a lock provider is not configured and `hoodie.skip.table.services=true`
>
> b) TableManagement server takes as input the uris of the Hudi metastore (so we have a clean dependency. Writers talk to metastore, the table management service picks up tables from metastore) or if we don't want to require the metastore, then we need to register tables as described here already. This can be a CLI command or an idempotent call from a write client. (I am fine either way)
>
> c) Then, for every write/commit on the table, the table management server is notified. Need some poll I think, or since push can be lost (we could do a smarter hybrid?). In response, the table management server will schedule relevant table services, right onto the table's timeline and notify a separate execution component/thread can start executing it.
>
> d) We still need to find a way to do HA here. Unlike the metastore, we need to shard by table here, since we probably want just one server to do the scheduling/execution for every table?
>
> Let me know if that makes sense!

I think we need to implement the storage and the scheduling parts, but the execution part can directly reuse what we already have, such as `HoodieCompactor` and `HoodieClusteringJob`.

a) Totally agree, that's exactly what I had in mind.

b) When the metastore is used, my idea is to detect tables through a callback: when a Hudi table commits an instant, the TableManagement hook in the metastore triggers scheduling of the table services for that table. This avoids the load that listing tables would put on TableManagement. If we don't want to use the metastore, then we need to support CLI commands or idempotent calls from the write client, so that registration information, such as the queue, can be changed without restarting the job.

c) Totally agree. We can push only when the metastore is not enabled, and list the corresponding Hudi table when TableManagement has not received a notification for a long time.

d) I think we need multiple stateless instances that can handle requests for all tables, with each instance handling only the requests it receives. We can use an optimistic mechanism to prevent concurrent scheduling or execution. We can put this in phase 2.
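
To make the optimistic mechanism in d) concrete, here is a minimal sketch, not a proposed implementation. It assumes every instance can reach some shared store with an atomic put-if-absent; the in-memory map stands in for that store, and `TableServiceClaims`, `tryClaim`, and `release` are hypothetical names that do not exist in Hudi.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, not Hudi code: all names here are made up for illustration.
final class TableServiceClaims {
    // Stand-in for storage shared by all instances.
    // Key: table + "@" + instant, value: id of the instance holding the claim.
    private final Map<String, String> claims;

    TableServiceClaims(Map<String, String> sharedClaims) {
        this.claims = sharedClaims;
    }

    // Atomically claim (table, instant) for instanceId. Returns true when this
    // instance holds the claim, false when another instance claimed it first.
    boolean tryClaim(String instanceId, String table, String instant) {
        String owner = claims.putIfAbsent(table + "@" + instant, instanceId);
        return owner == null || owner.equals(instanceId);
    }

    // Remove the claim only if instanceId still owns it, so a stale instance
    // cannot drop a claim held by someone else.
    boolean release(String instanceId, String table, String instant) {
        return claims.remove(table + "@" + instant, instanceId);
    }

    public static void main(String[] args) {
        Map<String, String> shared = new ConcurrentHashMap<>();
        TableServiceClaims view = new TableServiceClaims(shared);

        // Two instances receive a request for the same table and instant.
        System.out.println(view.tryClaim("instance-1", "db.orders", "001")); // true
        System.out.println(view.tryClaim("instance-2", "db.orders", "001")); // false

        // Once instance-1 finishes and releases, the key can be claimed again.
        view.release("instance-1", "db.orders", "001");
        System.out.println(view.tryClaim("instance-2", "db.orders", "001")); // true
    }
}
```

The instance that loses the claim simply drops the request, so no instance needs to own a table and no sharding is required. Claim expiry for crashed instances is left out of this sketch.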
