Thank you very much for the effort to move it back to the master.

I am +1 to merge it. It seems to be a relatively small change: only the replication manager in SCM and the container states are changed on master. The new features are well isolated, and it seems easy to test and keep working on them on master.

I would suggest creating a PR and running the full CI 2-3 times, but if it's as stable as master, it should be OK to merge.

Do we need a VOTE thread, as we had in Hadoop?

Marton

On 11/2/20 12:34 PM, Stephen O'Donnell wrote:
Hi All,

Has anyone else had a chance to look at this, and have any comments on the
overall approach?

Thanks,

Stephen.

On Tue, Oct 27, 2020 at 10:39 AM Lin, Yiqun <yiq...@ebay.com.invalid> wrote:

> There is no limitation on the DN side for this. I need to check the SCM
> read path to ensure nodes which are DECOMMISSIONING or ENTERING_MAINTENANCE
> are still returned when OM requests the block locations. I agree this is
> important and we need to ensure these nodes can still be read.
It would be better if we had a corresponding test case to cover this
scenario. We can add this after the merge :).

> At the moment no. I feel this is something we should control in Replication
> Manager rather than decommissioning. We already have seen issues with RM
> where too many in-flight replication commands are sent to the DNs, which
> cannot complete them in time, and then more get scheduled etc. Each DN has
> a replication limit, so I think we need to enhance RM to hold back the
> commands until the DNs have capacity to service them. We may also want to
> give priority to under replicated containers due to a dead node rather than
> decommissioning containers etc.
Agreed, we could do this enhancement at the RM level.
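
To make the idea concrete, here is a minimal sketch of holding back replication commands until the target DN has capacity, with dead-node work scheduled ahead of decommission work. It is only an illustration under stated assumptions: the class, field, and constant names are hypothetical and are not taken from the Replication Manager code, and the per-DN limit is a made-up parameter.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

# Hypothetical priorities: a lower value is dispatched first.
PRIORITY_DEAD_NODE = 0       # under-replicated because a node died
PRIORITY_DECOMMISSION = 1    # under-replicated because a node is leaving


@dataclass(order=True)
class ReplicationTask:
    priority: int
    seq: int                                  # keeps FIFO order within a priority
    container_id: int = field(compare=False)
    target_dn: str = field(compare=False)


class ThrottledScheduler:
    """Queues replication tasks and sends them only while the target DN
    is below its in-flight limit (all names here are illustrative)."""

    def __init__(self, per_dn_limit):
        self.per_dn_limit = per_dn_limit
        self.pending = []        # heap of ReplicationTask
        self.in_flight = {}      # DN name -> number of commands sent
        self._seq = count()

    def submit(self, container_id, target_dn, priority):
        task = ReplicationTask(priority, next(self._seq), container_id, target_dn)
        heapq.heappush(self.pending, task)

    def dispatch(self):
        """Return the tasks that can be sent now; keep the rest queued."""
        sent, held = [], []
        while self.pending:
            task = heapq.heappop(self.pending)
            if self.in_flight.get(task.target_dn, 0) < self.per_dn_limit:
                self.in_flight[task.target_dn] = self.in_flight.get(task.target_dn, 0) + 1
                sent.append(task)
            else:
                held.append(task)
        for task in held:
            heapq.heappush(self.pending, task)
        return sent

    def complete(self, task):
        """Free one slot on the DN once it reports the task as finished."""
        self.in_flight[task.target_dn] -= 1
```

With a limit of one command per DN, a decommission task and a dead-node task aimed at the same DN would be sent one at a time, the dead-node task first, and the second only after `complete()` is called for the first.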

> That is certainly something that can be added, and I would see as one of
> the "usability enhancements" I mentioned. What we can do is create a new
> epic Jira for "post branch merge enhancements" and start collecting these
> suggestions there?
Makes sense to me.
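
For the node list file suggestion, a rough sketch of what a wrapper could look like until the CLI supports it directly. The file format (one host per line, `#` comments) and the helper names are assumptions for illustration only; the sub-command names are the ones quoted in this thread, and passing the hosts as separate arguments is assumed rather than confirmed.

```python
from pathlib import Path

# Sub-commands quoted in this thread; the host-file format is an assumption.
ACTIONS = ("decommission", "maintenance", "recommission")


def read_node_list(path):
    """Return host names from a file with one host per line.

    Blank lines and anything after '#' are ignored (hypothetical format).
    """
    nodes = []
    for line in Path(path).read_text().splitlines():
        host = line.split("#", 1)[0].strip()
        if host:
            nodes.append(host)
    return nodes


def build_command(action, nodes):
    """Build the argument list for `ozone admin datanode <action> <nodes...>`."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    if not nodes:
        raise ValueError("node list is empty")
    return ["ozone", "admin", "datanode", action, *nodes]
```

The returned list could be handed to `subprocess.run` by an admin script; the sketch stops short of executing anything.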

Thanks Stephen for the comments, I don't have further comments now.

Thanks,
Yiqun


On 2020/10/27, 5:35 PM, "Stephen O'Donnell" <sodonn...@cloudera.com.INVALID>
wrote:

     Hi Yiqun,

     Thanks for taking a look.

     > Can the container data be read by the client when the container node is
     > in DECOMMISSIONING/DECOMMISSIONED state? If the container cannot be
     > accessed, we could lose containers in a short time when multiple nodes
     > are decommissioning.

     There is no limitation on the DN side for this. I need to check the SCM
     read path to ensure nodes which are DECOMMISSIONING or ENTERING_MAINTENANCE
     are still returned when OM requests the block locations. I agree this is
     important and we need to ensure these nodes can still be read.

     > Do we have the rate limitation control for the node decommission?

     At the moment no. I feel this is something we should control in Replication
     Manager rather than decommissioning. We already have seen issues with RM
     where too many in-flight replication commands are sent to the DNs, which
     cannot complete them in time, and then more get scheduled etc. Each DN has
     a replication limit, so I think we need to enhance RM to hold back the
     commands until the DNs have capacity to service them. We may also want to
     give priority to under replicated containers due to a dead node rather than
     decommissioning containers etc.

     > For the above command usage, will we support passing the nodes via an
     > input node list file? That would be useful for admin users of this
     > feature.

     That is certainly something that can be added, and I would see as one of
     the "usability enhancements" I mentioned. What we can do is create a new
     epic Jira for "post branch merge enhancements" and start collecting these
     suggestions there?

     Thanks,

     Stephen.


     On Tue, Oct 27, 2020 at 7:09 AM Lin, Yiqun <yiq...@ebay.com.invalid>
wrote:

     > Hi Stephen,
     >
     > I haven't reviewed much of the decommission feature code, but I had a
     > look at the overview doc you attached.
     >
     > Just some questions and comments from me:
     >
     > * Can the container data be read by the client when the container node
     > is in DECOMMISSIONING/DECOMMISSIONED state? If the container cannot be
     > accessed, we could lose containers in a short time when multiple nodes
     > are decommissioning.
     > * Do we have rate limiting for node decommission? With a large number
     > of nodes decommissioned concurrently, lots of closed containers will be
     > in replication, and I think this can impact the performance of SCM.
     >
     > Minor suggestion:
     > ozone admin datanode decommission <list of nodes to remove>
     > ozone admin datanode maintenance <list of nodes to put to maintenance>
     > ozone admin datanode recommission <list of nodes to recommission>
     >
     > For the above command usage, will we support passing the nodes via an
     > input node list file? That would be useful for admin users of this
     > feature.
     >
     > Thanks,
     > Yiqun
     >
     > On 2020/10/27, 2:09 AM, "Stephen O'Donnell" <sodonn...@cloudera.com.INVALID>
     > wrote:
     >
     >     Someone reported that the attachment did not come through - perhaps the
     >     mailing list strips out attachments?
     >
     >     I have attached it to the HDDS-1880 Jira - here is the direct link:
     >
     >     https://issues.apache.org/jira/secure/attachment/13014144/Decommission%20and%20Maintenance%20Overview.pdf
     >
     >     Thanks,
     >
     >     Stephen.
     >
     >     On Mon, Oct 26, 2020 at 5:47 PM Stephen O'Donnell <sodonn...@cloudera.com>
     >     wrote:
     >
     >     > Hi All,
     >     >
     >     > I am pleased to announce the Datanode Decommission and Maintenance
     >     > feature for Ozone - https://issues.apache.org/jira/browse/HDDS-1880
     >     >
     >     > The feature is working in Integration tests and also via docker-compose.
     >     > There is still some work to improve monitoring and usability, but I
     >     > believe the feature is now complete enough to merge into master and
     >     > continue development there.
     >     >
     >     > I would like to use this thread to discuss the feature and agree on
     >     > whether we can merge it into master. To help with the discussion, I
     >     > have attached a short document describing the major changes.
     >     >
     >     > The decommission changes are all on the branch HDDS-1880-Decom.
     >     >
     >     > Please reply here with any questions and comments.
     >     >
     >     > Thanks,
     >     >
     >     > Stephen.
     >     >
     >
     >
     > ---------------------------------------------------------------------
     > To unsubscribe, e-mail: ozone-dev-unsubscr...@hadoop.apache.org
     > For additional commands, e-mail: ozone-dev-h...@hadoop.apache.org
     >
     >




---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@ozone.apache.org
For additional commands, e-mail: dev-h...@ozone.apache.org
