Hi Colin, I have a small drawing that represents the setup; it's attached.
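In case it is useful alongside the drawing, the LNet view of a multirail setup can be captured from each server with the standard lnetctl/lctl tools (a sketch; the client NID below is taken from the error message quoted later in this thread):

  # List the local NIDs on this server (one per rail, e.g. o2ib4 and tcp)
  lctl list_nids

  # Show the configured LNet networks and their interfaces
  lnetctl net show

  # Show the multirail peer table as this server sees it
  lnetctl peer show

  # Check that a client is reachable over the IB rail
  lctl ping 172.23.53.214@o2ib4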
On Fri, 19 Nov 2021 at 22:49, Colin Faber <[email protected]> wrote:
> Hi Koos,
>
> One thing you mentioned that I should have picked up on sooner, was "The
> servers are connected in a multirail network, because some clients are on
> IB and the other clients are on ethernet"
>
> Can you describe your topology? How are the various elements connected to
> each other?
>
> -cf
>
> On Fri, Nov 19, 2021 at 5:38 AM Meijering, Koos <[email protected]> wrote:
>
>> One more addition: I also saw the following message on the OSS that had
>> the OST before the failover:
>> Nov 19 12:43:59 dh4-oss01 kernel: LustreError: 137-5: muse-OST0001_UUID:
>> not available for connect from 172.23.53.214@o2ib4 (no target). If you
>> are running an HA pair check that the target is mounted on the other server.
>>
>> On Fri, 19 Nov 2021 at 12:01, Meijering, Koos <[email protected]> wrote:
>>
>>> Hi Colin,
>>>
>>> I've attached 3 log files here: 1 from the metadata server and 2 from
>>> the object stores.
>>> Before these logs start the filesystem was working; then I requested
>>> the cluster to fail over muse-OST0001 from oss01 to oss02.
>>>
>>> On Thu, 18 Nov 2021 at 17:11, Colin Faber <[email protected]> wrote:
>>>
>>>> Hi Koos,
>>>>
>>>> First thing -- it's generally a bad idea to run newer server versions
>>>> with older clients (the opposite isn't true).
>>>>
>>>> Second -- do you have any logging that you can share from the client
>>>> itself? (dmesg, syslog, etc)
>>>>
>>>> A quick test may be to run 2.12.7 clients against your cluster to
>>>> verify that there is no interop problem.
>>>>
>>>> -cf
>>>>
>>>> On Thu, Nov 18, 2021 at 8:58 AM Meijering, Koos via lustre-discuss <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> We have built a Lustre cluster server environment on CentOS 7 and
>>>>> Lustre 2.12.7.
>>>>> The clients are using 2.12.5.
>>>>> The setup is 3 clusters for a 3 PB filesystem.
>>>>> One cluster is a two-node cluster built for the MGS and MDTs.
>>>>> The other two clusters are also two-node clusters, used for the OSTs.
>>>>> The cluster framework is working as expected.
>>>>>
>>>>> The servers are connected in a multirail network, because some clients
>>>>> are on IB and the other clients are on ethernet.
>>>>>
>>>>> But we have the following problem: when an OST fails over to the
>>>>> second node, the clients are unable to contact the OST that is started
>>>>> on the other node.
>>>>> The OST recovery status is "waiting for clients".
>>>>> When we fail it back, it starts working again and the recovery status
>>>>> is "complete".
>>>>>
>>>>> We tried to abort the recovery, but that does not work.
>>>>>
>>>>> We used these documents to build the cluster:
>>>>> https://wiki.lustre.org/Creating_the_Lustre_Management_Service_(MGS)
>>>>> https://wiki.lustre.org/Creating_the_Lustre_Metadata_Service_(MDS)
>>>>> https://wiki.lustre.org/Creating_Lustre_Object_Storage_Services_(OSS)
>>>>> https://wiki.lustre.org/Creating_Pacemaker_Resources_for_Lustre_Storage_Services
>>>>>
>>>>> I'm not sure what the next steps must be to find the problem and where
>>>>> to look.
>>>>>
>>>>> Best regards,
>>>>> Koos Meijering
>>>>>
>>>>> ........................................................................
>>>>> HPC Team
>>>>> Rijksuniversiteit Groningen
>>>>> ........................................................................
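For reference, the "no target" error combined with a recovery stuck on "waiting for clients" can be narrowed down on the OSS that takes over the OST (a sketch; muse-OST0001 is the target named in the thread):

  # On the failover node (oss02): confirm the OST is actually attached
  # there after Pacemaker moves it -- "no target" means clients reached
  # this node but the target was not mounted on it
  lctl dl | grep OST0001
  mount -t lustre

  # Watch the recovery state and the count of reconnected clients
  lctl get_param obdfilter.muse-OST0001.recovery_status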
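Since aborting recovery came up, the two usual approaches are sketched below (the device number comes from the first column of 'lctl dl'; the device path and mount point are placeholders):

  # Option 1: abort recovery on an already-mounted target
  lctl --device 12 abort_recovery

  # Option 2: skip recovery entirely when mounting on the failover node
  mount -t lustre -o abort_recov /dev/mapper/ost0001 /mnt/ost0001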
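On the client side, the logs Colin asked for can be gathered along these lines (a sketch; output paths are placeholders):

  # Capture kernel messages and the Lustre debug log while reproducing
  # the failed reconnect to the failed-over OST
  dmesg > /tmp/client-dmesg.txt
  lctl dk /tmp/client-lustre-debug.txt

  # Show which OST imports the client currently considers disconnected
  lctl get_param osc.*.state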
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
