kbuci commented on code in PR #12856: URL: https://github.com/apache/hudi/pull/12856#discussion_r2025326598
##########
rfc/rfc-79/rfc-90.md:
##########
@@ -0,0 +1,310 @@
+<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE
+file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to
+You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License
+at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific
+language governing permissions and limitations under the License. -->
+
+# Add support for cancellable clustering table service plans
+
+## Proposers
+
+Krishen Bhan (kbuci)
+
+## Approvers
+
+Sivabalan Narayanan (nsivabalan)
+
+## Status
+
+In Progress
+
+JIRA: HUDI-7946
+
+## Abstract
+
+Clustering is a table service used to optimize the table/file layout in HUDI in order to speed up read queries. Currently,
+ingestion writers will abort if they attempt to write to the same data targeted by a pending clustering write.
+As a result, clustering table service plans can indirectly delay ingestion writes from updating a dataset with recent data.
+Furthermore, a clustering plan that isn't executed to completion for a long time (due to repeated failures, application
+misconfiguration, or insufficient resources) will degrade the read/write performance of a dataset by delaying clean,
+archival, and metadata table compaction. This is because currently HUDI clustering plans, upon being scheduled, must be
+executed to completion. This RFC proposes support for "cancellable" clustering plans. Such cancellable clustering plans
+will give HUDI an avenue to fully cancel a clustering plan, allowing other table service and ingestion writers to proceed
+and avoiding starvation, based on user needs.
+
+## Background
+
+### Current state of execution of table service operations in Hudi
+
+As of now, the table service operations `COMPACT` and `CLUSTER` are implicitly "immutable" plans by default, meaning
+that once a plan is scheduled, it will stay a pending instant until a caller invokes the table service execute API on
+the table service instant and successfully completes it (referred to as "executing" a table service). Specifically, if an
+execution fails after transitioning the instant to inflight, the next execution attempt will implicitly create and execute
+a rollback plan (which will delete all new instant/data files) but will keep the table service plan, and then the table
+service will be re-attempted. This process will repeat until the instant is completed. The visualization below captures
+these transitions at a high level.
+
+
+
+## Goals

Review Comment:
> Granularity of cancellation: Right now, cancellation is at the entire plan level. Could we consider more granular partial cancellation of only the file groups that conflict with an ingestion writer? This would be more complex but could preserve partial progress in some advanced scenarios.

We could update this RFC to handle that by making the file in `.cancel` non-empty if only some file groups are "cancelled", and continuing the commit in that case.

My main concern, though, is that I'm not sure of the best way to implement that on the clustering side, since a committed clustering where some file groups were "cancelled" would have replaced files in the final `.replacecommit` that don't appear in the clustering `replacecommit.requested` plan. So I'm not sure whether that might cause some confusion or other issues.
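To make that concrete, here is a minimal sketch of how an executor could interpret a marker file under the `.cancel` folder. This is purely illustrative (the class, method names, and directory layout are hypothetical, and it uses plain `java.nio` rather than Hudi's storage abstractions): an empty marker means the whole plan is cancelled, while a non-empty marker lists the cancelled file group IDs and the commit proceeds with the rest.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch only: interprets a marker file under
// <table>/.hoodie/.cancel/<instantTime>. Names and layout are illustrative,
// not Hudi's actual implementation.
public class CancelMarkerCheck {

  /** What the clustering executor should do after inspecting the marker. */
  public enum Decision { PROCEED_ALL, PROCEED_PARTIAL, ABORT_ALL }

  public static Decision check(Path cancelDir, String instantTime,
                               Set<String> plannedFileGroups,
                               Set<String> cancelledFileGroupsOut) throws IOException {
    Path marker = cancelDir.resolve(instantTime);
    if (!Files.exists(marker)) {
      return Decision.PROCEED_ALL;          // no cancellation requested
    }
    List<String> lines = Files.readAllLines(marker);
    if (lines.isEmpty()) {
      return Decision.ABORT_ALL;            // empty marker: whole plan cancelled
    }
    // Non-empty marker: only the listed file groups are cancelled, so the
    // commit can continue with whatever file groups remain.
    cancelledFileGroupsOut.addAll(lines);
    Set<String> remaining = new HashSet<>(plannedFileGroups);
    remaining.removeAll(cancelledFileGroupsOut);
    return remaining.isEmpty() ? Decision.ABORT_ALL : Decision.PROCEED_PARTIAL;
  }
}
```

The open question above still applies: a `PROCEED_PARTIAL` commit would produce a `.replacecommit` whose replaced files are a strict subset of the requested plan, and anything that reads the plan would need to tolerate that mismatch.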
> Applicability to other Table Services: I agree compaction is typically not as big a concern for blocking ingestion, but from a user's perspective, having a consistent approach to cancellation could be valuable across all table services. Moreover, I was wondering if introducing an abort state in the timeline that can be applicable for all actions might be more useful instead of having a cancellable action with an abort state. Any action could be aborted due to usual conflict resolution or force quit (cancelled).

For some background context, when I first brainstormed this RFC I did not have an `.aborted` state in my design; instead, I was thinking that a clustering plan that was "cancelled" could be rolled back completely, similar to ingestion. But after discussing with @nsivabalan and other HUDI committers, I got the impression that HUDI users/committers wanted to move away from having non-archival operations delete instants from the active timeline forever, and instead wanted a different system/format where the "aborted" plan would remain on the active timeline, the idea being that this would make debugging and future implementation changes easier. Because of this, I added an `.aborted` state to this RFC to align it with that goal.

Going back to your original question: as part of implementing the `.aborted` state, I was thinking of enabling HUDI to consider `.aborted` a valid state for all action types, since the implementation should involve ignoring data files/log file blocks that are associated with an aborted instant, the same way inflight instants are handled. In addition, `.aborted` instants should be deleted during archival (as mentioned in the rollout plan).

In terms of implementing cancellation for compaction, though: if log blocks were added to a file group associated with an inflight compaction plan, and said plan was aborted, I'm not sure whether any other changes are needed to handle those log files. In 1.x I think this might not be a big issue, but we would need to verify. Because of this complication (and the fact that the motivation isn't as strong, as far as I know), I was thinking of adding clean/compaction as an optional or later addition.
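To illustrate the reader-side handling described above, here is a rough sketch with stand-in types (not Hudi's real `HoodieInstant`/`HoodieTimeline` classes) of how `.aborted` could be filtered out of the visible set exactly like inflight instants, while archival later reclaims those instants:

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch with stand-in types (not Hudi's real HoodieInstant /
// HoodieTimeline classes): an ABORTED instant is excluded from the visible
// set exactly like an inflight one, and is reclaimed later by archival.
public class AbortedInstantHandling {

  enum State { REQUESTED, INFLIGHT, COMPLETED, ABORTED }

  record Instant(String time, String action, State state) {}

  /** Only completed instants contribute visible data files / log blocks. */
  static List<Instant> visibleInstants(List<Instant> activeTimeline) {
    return activeTimeline.stream()
        .filter(i -> i.state() == State.COMPLETED)
        .collect(Collectors.toList());
  }

  /** Aborted instants stay on the active timeline (for debuggability) until
   *  archival deletes them, as described in the rollout plan. */
  static List<Instant> abortedInstantsToArchive(List<Instant> activeTimeline) {
    return activeTimeline.stream()
        .filter(i -> i.state() == State.ABORTED)
        .collect(Collectors.toList());
  }
}
```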
> Conflict Resolution vs. Hard Cancel: Instead of automatically canceling every inflight plan that conflicts, an alternative is to "wait or forcibly proceed" if the concurrency model supports it (such as a custom lock with ingestion writer preference). But that naturally re-introduces the risk of ingestion blocking. So your approach is simpler and safer from ingestion's perspective, but it might be worth stating explicitly that partial or "smarter" concurrency resolution is out of scope.

Sure, let me clarify in the RFC that we are opting for a simpler/safer approach where, if a clustering plan is cancellable, we assume that ingestion should take "precedence".

> Performance Overheads - The design calls for heartbeats, cancel folder checks, new transactions, etc. In practice, these operations should be relatively light overhead compared to typical big Spark tasks, but it is good to note a minimal overhead might exist.

Yes, these checks are lightweight (during ingestion we already read all inflight clustering plans while holding a table lock). But let me state at the end that there will now be a bit more overhead during clustering execution, due to taking a lock and checking for a heartbeat.
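As a rough illustration of where that overhead sits, the sketch below shows the extra commit-time step for a cancellable clustering plan. The interfaces are placeholders standing in for Hudi's actual lock provider and heartbeat client (all names are hypothetical); only the control flow is the point:

```java
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the extra commit-time step for a cancellable
// clustering plan. The interfaces are placeholders for Hudi's actual lock
// provider and heartbeat client; only the control flow is the point here.
public class CancellableClusteringCommit {

  interface HeartbeatChecker {
    // e.g. true if the heartbeat file for this instant was touched recently
    boolean isHeartbeatActive(String instantTime);
  }

  interface CancelMarkerStore {
    // true if some writer dropped a marker for this instant under .cancel
    boolean isCancelRequested(String instantTime);
  }

  private final Lock tableLock = new ReentrantLock();

  /** Returns true if the commit went through, false if the plan should
   *  transition to .aborted instead. */
  boolean tryCommit(String clusteringInstant, HeartbeatChecker heartbeats,
                    CancelMarkerStore cancels, Runnable finalizeCommit) {
    tableLock.lock(); // small, bounded cost relative to the clustering job itself
    try {
      if (cancels.isCancelRequested(clusteringInstant)) {
        return false; // an ingestion writer asked us to cancel
      }
      if (!heartbeats.isHeartbeatActive(clusteringInstant)) {
        return false; // our heartbeat lapsed, so a writer may have treated us as dead
      }
      finalizeCommit.run();
      return true;
    } finally {
      tableLock.unlock();
    }
  }
}
```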