yuzhaojing commented on PR #4309:
URL: https://github.com/apache/hudi/pull/4309#issuecomment-1120182685

   > Unless I am missing something here, can't the "storage" + "scheduler" + 
execute pieces, directly reuse what we have already.
   > 
   > just writing down ideas in my head to see how we are different.
   > 
   > a) Introduce a config `hoodie.skip.table.services` to `HoodieWriteConfig` 
which will make all writers skip any scheduling + execution of table services, 
if `true`. Writers throw an error if a lock provider is not configured and 
`hoodie.skip.table.services=true`
   > 
   > b) TableManagement server takes as input the URIs of the Hudi metastore 
(so we have a clean dependency. Writers talk to metastore, the table management 
service picks up tables from metastore) or if we don't want to require the 
metastore, then we need to register tables as described here already. This can 
be a CLI command or an idempotent call from a write client. (I am fine either 
way)
   > 
   > c) Then, for every write/commit on the table, the table management server 
is notified. We need some polling too, I think, since a push can be lost (we 
could do a smarter hybrid?). In response, the table management server will 
schedule relevant table services right onto the table's timeline and notify a 
separate execution component/thread, which can then start executing them.
   > 
   > d) We still need to find a way to do HA here. Unlike the metastore, we 
need to shard by table here, since we probably want just one server to do the 
scheduling/execution for every table?
   > 
   > Let me know if that makes sense!
   
   I think we need to implement the storage and the scheduling part, but the 
execution part can directly reuse what we already have, such as 
`HoodieCompactor` and `HoodieClusteringJob`.
   
   a) Totally agree, that's exactly what I think.
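   
   A minimal sketch of the check described in a), assuming plain 
`java.util.Properties` as the config carrier. `TableServiceConfigCheck` is a 
hypothetical helper, `hoodie.skip.table.services` is the key proposed in this 
thread, and `hoodie.write.lock.provider` is assumed to be the lock provider 
key; none of this is the actual `HoodieWriteConfig` code.

```java
import java.util.Properties;

// Hypothetical helper, not part of Hudi: sketches the validation proposed in a).
final class TableServiceConfigCheck {
    // Key proposed in this thread; it may not exist under this name in Hudi.
    static final String SKIP_TABLE_SERVICES = "hoodie.skip.table.services";
    // Assumed lock provider key of the write config.
    static final String LOCK_PROVIDER = "hoodie.write.lock.provider";

    // True when the writer should skip scheduling and executing table services.
    static boolean skipTableServices(Properties props) {
        return Boolean.parseBoolean(props.getProperty(SKIP_TABLE_SERVICES, "false"));
    }

    // Fails fast when table services are delegated but no lock provider is set.
    static void validate(Properties props) {
        String lockProvider = props.getProperty(LOCK_PROVIDER, "").trim();
        if (skipTableServices(props) && lockProvider.isEmpty()) {
            throw new IllegalStateException(
                SKIP_TABLE_SERVICES + "=true requires " + LOCK_PROVIDER + " to be configured");
        }
    }
}
```

   A writer would call `validate` once at startup and then consult 
`skipTableServices` before any inline scheduling or execution.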
   
   b) When the metastore is used, my idea is to detect table changes through a 
callback. When a Hudi table commits an instant, the TableManagement hook in the 
metastore triggers scheduling of the table services for that table. This avoids 
the load that TableManagement would otherwise create by listing tables. If we 
don't want to require the metastore, then we need to support CLI commands or 
idempotent calls from the write client, so that registration information (such 
as the queue) can be changed without restarting the job.
   
   c) Totally agree. We can push notifications only when the metastore is not 
enabled, and fall back to listing the corresponding Hudi table when 
TableManagement has not received a notification for it for a long time.
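   
   The push-plus-fallback idea in c) could look roughly like the sketch below. 
`NotificationTracker`, its method names, and the staleness threshold are all 
hypothetical; the sketch only tracks when each registered table last pushed a 
notification and reports the tables that should be listed instead.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical tracker: push notifications are the normal path, listing is the fallback.
final class NotificationTracker {
    private final Map<String, Instant> lastNotified = new ConcurrentHashMap<>();
    private final Duration staleAfter;

    NotificationTracker(Duration staleAfter) {
        this.staleAfter = staleAfter;
    }

    // Registration is idempotent: a repeated call does not reset the timestamp.
    void register(String tablePath, Instant now) {
        lastNotified.putIfAbsent(tablePath, now);
    }

    // Called when a writer pushes a commit notification for the table.
    void onCommitNotification(String tablePath, Instant now) {
        lastNotified.put(tablePath, now);
    }

    // Tables with no notification for longer than staleAfter; these get listed.
    List<String> tablesToList(Instant now) {
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, Instant> entry : lastNotified.entrySet()) {
            if (Duration.between(entry.getValue(), now).compareTo(staleAfter) > 0) {
                stale.add(entry.getKey());
            }
        }
        return stale;
    }
}
```

   A periodic task would call `tablesToList` and list the timeline only for the 
returned tables, so a lost push delays scheduling by at most the threshold plus 
one polling interval.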
   
   d) I think we need multiple stateless instances that together handle 
requests for all tables, where each instance only handles the requests it 
receives. We can use an optimistic mechanism to prevent concurrent scheduling 
or execution. We can put this in phase 2.
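   
   One possible shape for the optimistic mechanism in d), as a sketch only. 
`OptimisticClaims` is hypothetical, and the in-memory map stands in for 
whatever shared store the instances would use; the assumption is that the 
store offers an atomic create-if-absent operation.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical claim registry: the first instance to claim a (table, instant) pair wins.
final class OptimisticClaims {
    // Stand-in for a shared store with atomic create-if-absent semantics.
    private final ConcurrentMap<String, String> owners = new ConcurrentHashMap<>();

    private static String key(String tablePath, String instantTime) {
        return tablePath + "@" + instantTime;
    }

    // Returns true if this instance owns the claim, either newly or from an earlier call.
    boolean tryClaim(String tablePath, String instantTime, String instanceId) {
        String previous = owners.putIfAbsent(key(tablePath, instantTime), instanceId);
        return previous == null || previous.equals(instanceId);
    }

    // Removes the claim only if it is still held by the given instance.
    void release(String tablePath, String instantTime, String instanceId) {
        owners.remove(key(tablePath, instantTime), instanceId);
    }
}
```

   Any instance can receive a request; it proceeds with scheduling or execution 
only when `tryClaim` returns true and otherwise drops the request, which keeps 
the instances themselves stateless.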

