Hi Sumit,

Sorry for the late reply. Let me share my opinion here; I will also add it to the JIRA ticket shortly.
As we discussed, the problem is that a rogue client can currently write blocks to DataNodes other than those in the Pipeline information that Ozone Manager provides to the client. This is true in both secure and non-secure environments. As Neil mentioned, this might compromise a container when SCM checks the replicas, determines which containers are over-replicated, and decides which excess replicas to delete. If a rogue client writes a container to 3 nodes (even via the STANDALONE replication type) and properly syncs these writes, the bcsid associated with the container might go above the one in the good containers, take precedence, and potentially cause the old valid data to be removed.

As this can happen in a non-secure environment, I strongly believe we should not touch the tokens. That does not solve the problem at all, because tokens are present only in a secured environment.

I think the solution is within SCM. If a DN does not have the container yet (it does not have a valid replica of the container), then an ICR is triggered at container creation. While that ICR is processed, the container should be marked as an invalid replica, and SCM should issue a delete container command to the DataNode that reported the invalid container. (We should be able to determine that the container is invalid during ICR processing, as SCM should know which container belongs to which Pipeline, and if the DN is not part of the Pipeline it should not report creation of a container with that container ID.)

If possible, Ozone Manager should also refuse the write and the metadata update, based on information provided by SCM (either by caching the in-flight write Pipelines and comparing them with the Pipelines reported by the client at the end of the write, or by directly checking the write location with SCM to validate the write).
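To make the two proposed checks concrete, here is a minimal sketch in Python. It is not Ozone code: `ScmState`, `process_icr`, `om_accepts_commit`, and the action strings are hypothetical names used only for illustration, and it assumes SCM can look up the Pipeline of a container ID and the member DataNodes of a Pipeline.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ScmState:
    """Hypothetical, simplified view of what SCM knows."""
    # container ID -> ID of the Pipeline the container was allocated on
    container_pipeline: Dict[int, str] = field(default_factory=dict)
    # Pipeline ID -> IDs of the DataNodes that are members of the Pipeline
    pipeline_members: Dict[str, Set[str]] = field(default_factory=dict)


def process_icr(scm: ScmState, datanode_id: str, container_id: int) -> str:
    """Decide what to do with a replica reported as newly created in an ICR."""
    pipeline_id = scm.container_pipeline.get(container_id)
    if pipeline_id is None:
        # Container ID unknown to SCM: treated as invalid in this sketch.
        return "DELETE_REPLICA"
    if datanode_id not in scm.pipeline_members.get(pipeline_id, set()):
        # The reporting DN is not part of the container's Pipeline, so it
        # should never have created this container: mark the replica invalid.
        return "DELETE_REPLICA"
    return "ACCEPT_REPLICA"


def om_accepts_commit(allocated_nodes: Set[str], reported_nodes: Set[str]) -> bool:
    """Commit-time check: the client must report only nodes it was given."""
    return bool(reported_nodes) and reported_nodes <= allocated_nodes
```

For example, with container 1 allocated on a Pipeline of dn1, dn2 and dn3, `process_icr(scm, "dn9", 1)` returns `"DELETE_REPLICA"`, while the same report from dn1 is accepted. Neither check depends on tokens, so the same logic applies in secure and non-secure environments.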
I believe we should not include this information in the tokens, as we gain nothing from it once proper measures to deal with such rogue clients are implemented. Here is why: if SCM instructs the DN within 2 heartbeats to remove the rogue container, then rogue clients will have 2 HB of time (1 min by default if no container creation happens in between the 2 HB, but it does happen, so less than 1 min) to occupy cluster space with garbage data. To do that they need access permission in the first place, and if they have access permission, they can write garbage to valid locations anyway. So the only thing we need to prevent is messing up the container space and the OM metadata, and that is achieved by the proposed check in ICR processing and by the check when the client commits the write to OM.

Regards,
Pifta

Sumit Agrawal <sumitagra...@cloudera.com.invalid> wrote (on Tue, Nov 29, 2022 at 7:20):

> Hi Devs,
>
> Related to HDDS-7454 <https://issues.apache.org/jira/browse/HDDS-7454>,
> I need opinions on whether this requires handling, based on impact and
> complexity. A brief summary is given below; the same is present in Jira.
>
> Please share your opinion ...
>
> *For non-secure env* with a raw/malicious client, the cases are:
>
> 1) Writing to a new DN causes addition of a container and can cause data loss
> - Raised JIRA: HDDS-7552 <https://issues.apache.org/jira/browse/HDDS-7552>
>
> This will avoid writing the container to the DN, or delete it from the DN.
>
> 2) Writing a new block to a DN that already has the container causes
> additional blocks and consumes space
>
> Impact: additional space consumption
>
> Note: there is no way to control this in the current design, as OM and DN
> do not have any sync; a future solution may be needed, including Recon,
> which can have OM, SCM and DN information and mapping.
>
> 3) Writing with an unknown container to a DN, causing addition of a
> container - already handled by HDDS-3241
> <https://issues.apache.org/jira/browse/HDDS-3241>
>
> *For Secure env*, as in the current bug, I need opinions on whether this
> must be handled, based on impact:
>
> 1. Authorization of pipeline / DNs: currently it is not present as part of
> this bug. It is suggested to be added as part of the block token.
>
> Pros:
>
> - Avoids writing to a DN for which the write is not intended, and avoids the
> malicious impact of data loss and space consumption shown above for the
> non-secure env.
>
> Cons:
>
> - Needs code for adding the pipeline in token generation, passing, and
> validation at DNs
> - The code will be complex; EC has a different way of syncing, introducing
> complexity and failure points
>
> *Security impact if this is not handled:*
>
> - SCM needs to validate a new container using the ICR, which is async, and
> needs at least 2 heartbeats to notify the DN to avoid writing (30+ seconds).
> - During this time, the client can add a lot of block data
> - Exploitation is easy, but the client must be authorized to get block
> write permission
>
> --
> *Sumit Agrawal* | Senior Staff Engineer
> cloudera.com <https://www.cloudera.com>

--
Pifta