JoaoJandre commented on PR #12758:
URL: https://github.com/apache/cloudstack/pull/12758#issuecomment-4092974611

   > ## Testing
   > 
   > **The patch coverage is 5.59%.** The classes that handle the most critical 
operations are almost completely untested.
   
   Sadly, essentially no PRs in this project cover a substantial amount of code with unit tests. I feel we should not arbitrarily decide which PRs must have coverage and which do not. Perhaps we should discuss these standards and create a protocol for future reference, to be enforced in all PRs opened to the project.
   
   **However**, I will add more unit tests for the critical operations, and I hope that all PRs are held to the same standard from now on.
   
   > **This is a 12k+ line PR with no Marvin tests.** The manual test matrices 
in the PR description are good to have, but without automated integration 
tests, there is no way to catch regressions. Future changes to the VM snapshot 
code, volume lifecycle, or backup framework could silently break KNIB (or the 
other way around) with no automated detection.
   
   This feels like another standard that is being arbitrarily brought up. Again, perhaps we should discuss these standards and create a protocol for future reference and enforcement.
   
   Anyway, I will add a few Marvin tests as well.
   
   > If the secondary storage selector allows clean separation of backup 
storage from everything else, this can work. But this needs to be clearly 
documented — operators must understand that without proper selector 
configuration, backups will share space and capacity tracking with templates, 
ISOs, and snapshots.
   
   I plan to add documentation explaining this; I will work on a PR for the docs and link it here.
   
   > The secondary storage selector JavaScript engine has had security issues — 
see CVE-2025-59302. While this is currently blocked by a global setting, 
extending the selector framework to also cover backups increases the surface 
area. This should at minimum be documented with a note about the possible 
impact.
   
   As far as I know, the security issues have already been addressed: https://github.com/apache/cloudstack/issues/12523#issuecomment-3853102387. Furthermore, this is only accessible to Root Admins, so I hope they can avoid exploiting their own environment.
   
   > Backups are generally expected to be stored in a separate, remote location 
from the primary infrastructure. If the secondary storage is co-located, the 
value of backups (surviving an infrastructure failure) is reduced. Was this 
tested with any meaningful network latency between the compute hosts / SSVM and 
a remote storage used for backups? What latencies were observed?
    
   This is the same as for the concept of backup repositories in the NAS plugin. It is up to the operators entering the configurations (appliances) into ACS.
   
   > The `BackupError` state has no recovery path — no admin API, no background 
self-healing task, nothing. The spec says an operator must read the logs and 
recover the VM manually. In a real deployment, a management server restart or 
agent timeout during a backup window could put many VMs into this state at 
once. The tenants can't start, stop, migrate, or snapshot their VMs until an 
admin manually fixes each one. At the very least, this limitation and its 
operational impact should be clearly documented.
   
   A management server restart will never put the VM in this state. However, an 
agent restart or timeout could.
   
   I fully understand that this state is scary, and I would not use it if it were not necessary. However, it was created to protect users from themselves. If an unexpected error occurs after which ACS has no way of knowing the current state of the VM, it is best to stop the user from trying to operate on VMs in that state.
   
   In a perfect world, ACS would be able to detect all possible inconsistencies and fix them; however, this could be very dangerous: if a bug were to occur and ACS "fixed" a VM in the wrong way, it could make the situation much worse. That is why we decided to take a more defensive approach and avoid letting users operate on VMs with inconsistent states.
   
   We could create a "self-healing" process that scans VMs in this state and, if they fit a few criteria, fixes them. I would like to implement such a mechanism, but its nonexistence should not stop this PR from moving forward; again, users of ACS do not need to use this plugin if they do not trust it. This type of inconsistency already happens with snapshot and bitmap corruptions that need to be manually addressed, and with the NAS backup plugin, and so on. For example, see https://github.com/apache/cloudstack/issues/12821 and https://github.com/apache/cloudstack/issues/12829.
   
   In any case, I will document the reason this state exists, a few possible 
reasons for the VM to be in this state, and a general checklist on how to go 
about fixing it.
   
   > ### Changes to Existing Features
   > 
   > This PR modifies 224 files. Many of the changes are not isolated to the 
KNIB plugin — they touch shared infrastructure.
   > Affected areas
   > 
   > VM state machine (VirtualMachine.java, UserVmManagerImpl.java) — two new 
states added. Every component that checks VM state is affected.
   
   Could you point out where the issue is here?
   
   > VM snapshot strategy (KvmFileBasedStorageVmSnapshotStrategy.java, 82 new 
lines, 0% coverage) — the disk-only VM snapshot revert, create, and delete 
paths are modified to account for backup deltas. This is an existing feature 
that other users depend on. A bug here breaks VM snapshots for everyone, not 
just KNIB users.
   
   I wrote 100% of the implementation here: I made the spec, did the implementation, and I am the one supporting it. If I introduce a bug here, I will fix it. I will also add a few unit tests to the affected processes.
   
   > Volume operations — volume attach, detach, destroy, and migrate paths are 
all modified to handle backup deltas.
   
   The main path was not changed. An exception was added so that a few extra processes are executed only when using KNIB, and only if the VM uses backups. Is this a problem?
   
   > Secondary storage selectors (HeuristicType.java) — extended for backups.
   
   Yes.
   
   > I don't have a strong opinion on the exact name, but I find "native" a bit 
unclear — native to what? KVM? CloudStack? The hypervisor? Something more 
descriptive of the actual mechanism would help operators understand what 
they're enabling. I'll leave this to the community to decide.
   
   Native to ACS, as explained in https://github.com/apache/cloudstack/pull/12758#issuecomment-4069106439; you need to fully read what is written there. The use of the word "native" is very literal: we used it to represent that ACS itself executes the backup processes for the KVM hypervisor. Anyway, I am open to changing the name; thanks to you guys, I realize that another name might be best. What do you think of the names proposed here: https://github.com/apache/cloudstack/pull/12758#issuecomment-4082573985?
   
   > I'd also note that the [coding 
conventions](https://cwiki.apache.org/confluence/display/CLOUDSTACK/Coding+conventions)
 document was written by the community and will keep evolving. Citing it as a 
definitive authority on what is or isn't acceptable doesn't fully capture how 
we work — the conventions are a living document, and community consensus on 
what is good for the project's future carries equal weight.
   
   Therefore, claiming that "Native" goes against the convention is also 
pointless, correct?
   
   > ## Separate APIs vs Reusing Existing APIs
   > 
   > I would strongly prefer reusing the existing backup offering APIs rather 
than creating parallel ones. The existing `importBackupOffering` API could be 
extended — for example, making `externalId` optional or using some other value 
like `internal`, `native` or whatever makes sense as value for this. This keeps 
a single unified offering model in the framework, the UI, and the documentation.
   
   While I was not the one who introduced the backup framework, looking at its design, it was clearly intended to have as little configuration as possible on the ACS side, leaving these details to the backup provider. If we add these parameters to the import backup offering API, I am sure a lot of users will be confused when the parameters do nothing for Veeam and Dell Networker. I did not intend to warp the original design of configuring the offerings on the provider side and only importing them into ACS.
   
   
   This is why I created the concept of native offerings. Since with KNIB (and NAS) the provider is ACS, the configurations can still be made on the "backup provider", and the import API will follow the same logic it always had.
   
   
   Since the community is strongly against creating these APIs, I shall follow the consensus and remove them; the concept of "native" backup offerings will no longer exist. I will add a single new API, called `createBackupOffering`. As the name suggests, it will create a new backup offering, with all the necessary parameters plus the parameters that KNIB needs. This API will do nothing for other providers for now.
   On the GUI, it will be a new button on the same page as the current backup offerings.
   Regarding the table for native backup offerings, it will be absorbed into the backup offering details table.
   
   I strongly oppose using the `importBackupOffering` API to create a definition of a backup offering; that is not what the semantics of the API name suggest, and it is confusing. It should be used to import offerings that were already defined externally.
   
   > ## On Reusing and Improving Existing Code
   > 
   > One of the strengths of a community project is that we, as a community, 
collectively own and improve the codebase. _None of us are working on code that 
"belongs" to us individually_ — we are all changing, refactoring, and improving 
code that someone else wrote years ago, sometimes by contributors who are no 
longer active in the project. That is how open source works and how the project 
stays healthy.
   >
   > When existing code has gaps or quality concerns, the community expectation 
is that we improve it — not build a parallel implementation alongside it. If 
the existing NAS backup plugin has shortcomings (no incremental support, no 
compression, no validation), those are opportunities for contribution that 
benefit all users of the existing plugin.
   >
   > Building KNIB as a completely separate provider with its own offering 
APIs, its own DB tables, and its own storage model means the NAS plugin's gaps 
remain unaddressed, and now we have two native backup implementations to 
maintain. This also sets a precedent: if the reasoning is "the existing code 
doesn't meet our standards, so we'll build our own," then any contributor could 
justify creating parallel implementations of any subsystem. Over time, this 
fragments the codebase rather than strengthening it.
   >
   > The Apache Ethos emphasizes that community contributions should strengthen 
the commons, not create parallel silos. I'd encourage us all — including myself 
— to keep this principle in mind.
   
   There is a multitude of reasons why KNIB was developed as a separate plugin instead of building on NAS. I will outline the most important ones:
   1. The spec was done and early development was already underway internally when NAS was first proposed to the community. Since from the beginning we could see that the two were going in very different directions, I decided to continue the KNIB implementation.
   2. NAS and KNIB have vastly different approaches to many concepts. To make NAS work like KNIB and have the same features, I would have to transform it into KNIB, which I am sure the community would be against.
   3. One plugin does not have to substitute the other. Both existing is a good thing: since they work in different ways, users have more freedom to choose which plugin best fits their needs.
   
   > ## Security
   > ### Validation Command Injection
   > 
   > The `backupValidationCommand` VM setting is user-writable by default. From 
what I can see, the command and arguments are put into a JSON string using 
`String.format()`:
   > 
   > ```java
   > String GUEST_EXEC_COMMAND = "{\"execute\": \"guest-exec\", 
\"arguments\":{\"path\":\"%s\",\"arg\":%s,\"capture-output\":true}}";
   > ```
   
   While it is user-writable by default, validation is a feature that only Root Admins can enable. I certainly could make it an Admin-only setting by default, but I am sure that most who want to use this feature will let their users edit it, because validation is inherently VM-dependent: most of the time, only the owner of the VM knows what needs to be validated.
   
   > Example of how this can be exploited
   >
   > If the path value is not sanitized, a crafted command string can break out of the JSON structure. For example:
   >
   > ```java
   > String script = "validation.sh;\nls /etc/\n\",\"extra_param\":\"injected_value";
   > String command = String.format(GUEST_EXEC_COMMAND, script, "[]");
   > ```
   >
   > This would produce malformed JSON that could be interpreted differently by the QEMU guest agent.
   
   I tested providing the string `"validation.sh;\nls /etc/\n\",\"extra_param\":\"injected_value"` in the `backupValidationCommand` setting; however, no injection occurred, only an error:
   ```
   cannot parse json {"execute": "guest-exec", "arguments":{"path":""validation.sh;\nls /etc/\n\",\"extra_param\":\"injected_value"","arg":[],"capture-output":true}}: lexical error: invalid char in json text.
   ```
   Did you test this?
   
   Furthermore, this JSON is passed directly to QEMU and is used as part of the user's backup validation process. Are you concerned that users will sabotage their own backup validations?
   
   If you find a vulnerability and an example of how to exploit it, I will 
address it. But please make sure it is really a vulnerability first.
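
   That said, if escaping is ever deemed necessary, the fix is small. Below is a hypothetical sketch (the class and method names are mine, not current ACS code) of escaping the user-supplied path before it goes into the format string; a real implementation would more likely reuse an existing JSON library rather than hand-rolled escaping:

   ```java
   // Hypothetical sketch, not ACS code: escaping the user-supplied path before
   // formatting keeps the resulting guest-exec JSON well-formed regardless of
   // what the user puts into backupValidationCommand.
   public class GuestExecEscapeDemo {
       public static final String GUEST_EXEC_COMMAND =
               "{\"execute\": \"guest-exec\", \"arguments\":{\"path\":\"%s\",\"arg\":%s,\"capture-output\":true}}";

       // Minimal JSON string escaping: quotes, backslashes, and control characters.
       public static String jsonEscape(String s) {
           StringBuilder sb = new StringBuilder(s.length());
           for (char c : s.toCharArray()) {
               switch (c) {
                   case '"':  sb.append("\\\""); break;
                   case '\\': sb.append("\\\\"); break;
                   case '\n': sb.append("\\n");  break;
                   case '\r': sb.append("\\r");  break;
                   case '\t': sb.append("\\t");  break;
                   default:
                       if (c < 0x20) {
                           sb.append(String.format("\\u%04x", (int) c));
                       } else {
                           sb.append(c);
                       }
               }
           }
           return sb.toString();
       }

       public static void main(String[] args) {
           // The reviewer's crafted string: once escaped, it is just an odd
           // (nonexistent) file path inside a syntactically valid JSON string.
           String script = "validation.sh;\nls /etc/\n\",\"extra_param\":\"injected_value";
           System.out.println(String.format(GUEST_EXEC_COMMAND, jsonEscape(script), "[]"));
       }
   }
   ```

   With escaping in place, the crafted input can no longer terminate the `path` string or smuggle in extra keys; the guest agent would simply receive a path that does not exist.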
   
   > ### Validation Resource Exhaustion
   > 
   > I don't see a hard cap on the total validation lifecycle time. A user 
could repeatedly trigger backups on stopped VMs with validation-enabled 
offerings, causing validation VMs to be created on compute hosts. Combined with 
`enforce.resource.limit.on.backup.validation.vm` defaulting to `false` and 
`backup.validation.max.concurrent.operations` defaulting to `0` (unlimited), 
this could lead to resource exhaustion on compute hosts, affecting other 
tenants.
   > 
   > Repeated backup commands on stopped VMs could potentially cause Libvirt to 
become unresponsive on the host, which would then impact all VMs on that host.
   
   Since `backup.validation.max.concurrent.operations.per.host` is 1 by default, at most one validation may occur on each host at a time.
   
   Furthermore, this is a question of sensible default configurations; the current defaults were chosen based on feedback from users. I am certainly open to changing them. Do you think we should further constrain the default validation configuration? We could change `enforce.resource.limit.on.backup.validation.vm` to true by default, for example.

