What's the impact of being in recovery mode with LNET health?

On 06/03/2020 21:12, "lustre-discuss on behalf of Chris Horn" 
<[email protected] on behalf of [email protected]> wrote:
    
    > lneterror: 10164:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked())
    > lpni <address> added to recovery queue.  Health = 900
    
    The message means that the health value of a remote peer interface has been
    decremented and, as a result, the interface has been put into recovery mode.
    This mechanism is part of the LNet health feature.
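
Since these messages can show up in large numbers, a per-interface tally can help show whether a few peers account for most of them. The following is a minimal Python sketch, not an official tool. The regular expression is based only on the message quoted above, so real logs may need a different pattern, and the NIDs in `sample` are hypothetical.

```python
import re
from collections import Counter

# Pattern derived from the quoted message; prefix and spacing in real logs may differ.
RECOVERY_RE = re.compile(
    r"lpni\s+(\S+)\s+added to recovery queue\.\s+Health\s*=\s*(\d+)"
)

def tally_recovery_events(lines):
    """Return (event count per peer NI, lowest health value seen per peer NI)."""
    counts = Counter()
    lowest = {}
    for line in lines:
        m = RECOVERY_RE.search(line)
        if not m:
            continue  # not a recovery-queue message
        nid, health = m.group(1), int(m.group(2))
        counts[nid] += 1
        lowest[nid] = min(health, lowest.get(nid, health))
    return counts, lowest

# Hypothetical log lines for illustration only.
sample = [
    "LNetError: (peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) "
    "lpni 10.0.0.1@o2ib added to recovery queue.  Health = 900",
    "LNetError: (peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) "
    "lpni 10.0.0.1@o2ib added to recovery queue.  Health = 800",
    "LNetError: (peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) "
    "lpni 10.0.0.2@tcp added to recovery queue.  Health = 900",
    "some unrelated kernel message",
]

counts, lowest = tally_recovery_events(sample)
for nid, n in counts.most_common():
    print(nid, n, lowest[nid])
```

Feeding it the lines of a saved kernel log instead of `sample` would give the same tally for a real system.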
    
    Health values are decremented when a PUT or GET fails. Usually other
    messages in the log tell you more about the specific failure. Depending on
    your network type, you should see messages from socklnd or o2iblnd. Network
    congestion could certainly lead to message timeouts, which would in turn
    result in interfaces being placed into recovery mode.
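
As a rough illustration of the mechanism described above, here is a toy Python model, not LNet code. It assumes a maximum health value of 1000 and a decrement of 100 per failed PUT or GET. Both numbers are assumptions chosen to match the "Health = 900" in the quoted message, and the class and method names are hypothetical.

```python
MAX_HEALTH = 1000   # assumed ceiling, consistent with "Health = 900" after one failure
SENSITIVITY = 100   # assumed decrement applied per failed PUT or GET

class PeerNI:
    """Toy stand-in for a remote peer interface tracked by the health feature."""

    def __init__(self, nid):
        self.nid = nid
        self.health = MAX_HEALTH
        self.in_recovery = False

    def record_failure(self):
        """Decrement health (never below 0) and mark the interface as in recovery."""
        self.health = max(0, self.health - SENSITIVITY)
        self.in_recovery = True
        return f"lpni {self.nid} added to recovery queue.  Health = {self.health}"

peer = PeerNI("10.0.0.1@o2ib")   # hypothetical NID
print(peer.record_failure())
```

The model leaves out how an interface leaves recovery mode, since the thread does not describe it.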
    
    Chris Horn
    
    On 3/6/20, 8:59 AM, "lustre-discuss on behalf of Michael Di Domenico" 
<[email protected] on behalf of [email protected]> 
wrote:
    
        along with the aforementioned error, i also see these at the same time
    
        lustreerror: 9675:0:(obd_config.c:1428:class_modify_config())
        <...>-clilov-<...>; failed to send uevent qos_threshold_rr=100
    
        On Fri, Mar 6, 2020 at 9:39 AM Michael Di Domenico
        <[email protected]> wrote:
        >
        > On Fri, Mar 6, 2020 at 9:36 AM Degremont, Aurelien 
<[email protected]> wrote:
        > >
        > > Did you see any actual error on your system?
        > >
        > > Because there is a patch that just decreases the verbosity
        > > level of such messages, which look like they could be ignored.
        > > https://jira.whamcloud.com/browse/LU-13071
        > > https://review.whamcloud.com/#/c/37718/
        >
        > thanks.  it's not entirely clear just yet.  i'm trying to track down a
        > "slow jobs" issue.  i see these messages everywhere, so it might be a
        > non issue or a sign of something more pressing.
        _______________________________________________
        lustre-discuss mailing list
        [email protected]
        
        http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
    
    
    
    
    
    
