One more addition: I also got the following message on the OSS that had the OST
before the failover:
Nov 19 12:43:59 dh4-oss01 kernel: LustreError: 137-5: muse-OST0001_UUID:
not available for connect from 172.23.53.214@o2ib4 (no target). If you are
running an HA pair check that the target is mounted on the other server.
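That 137-5 "no target" error suggests the client is either contacting the wrong server for the target or the target is not actually mounted where the client expects it. A minimal diagnostic sketch (the target name and NID below are taken from the log line above; adjust them to your setup, and note these commands require a live Lustre cluster):

```shell
# On the OSS that should now own the target: is muse-OST0001 mounted
# and registered as a local device?
mount -t lustre
lctl dl

# What does recovery look like on that OST?
lctl get_param obdfilter.muse-OST0001.recovery_status

# Can the OSS reach the client's NID over LNet (and vice versa)?
lctl ping 172.23.53.214@o2ib4

# On a client: what import state does it report for this OST,
# and which NID is it trying to connect to?
lctl get_param osc.muse-OST0001-*.state
lctl get_param osc.muse-OST0001-*.import
```

If the client's import still points only at the oss01 NID, it may be worth verifying that both failover NIDs were recorded for the target (e.g. the `--servicenode`/`--failnode` settings used at format time), since with multi-rail that is a common place for this symptom to originate.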

On Fri, 19 Nov 2021 at 12:01, Meijering, Koos <[email protected]> wrote:

> Hi Colin,
>
> I've attached three log files: one from the metadata server and two from
> the object stores.
> Before these logs start, the filesystem was working; then I requested the
> cluster to fail over muse-OST0001 from oss01 to oss02.
>
>
> On Thu, 18 Nov 2021 at 17:11, Colin Faber <[email protected]> wrote:
>
>> Hi Koos,
>>
>> First thing -- it's generally a bad idea to run newer server versions
>> with older clients (the opposite isn't true).
>>
>> Second -- do you have any logging that you can share from the client
>> itself? (dmesg, syslog, etc)
>>
>> A quick test may be to run 2.12.7 clients against your cluster to verify
>> that there is no interop problem.
>>
>> -cf
>>
>>
>> On Thu, Nov 18, 2021 at 8:58 AM Meijering, Koos via lustre-discuss <
>> [email protected]> wrote:
>>
>>> Hi all,
>>>
>>> We have built a Lustre server environment on CentOS 7 with Lustre
>>> 2.12.7.
>>> The clients are using 2.12.5.
>>> The setup is three clusters for a 3 PB filesystem:
>>> One is a two-node cluster built for the MGS and MDTs.
>>> The other two are also two-node clusters, used for the OSTs.
>>> The cluster framework is working as expected.
>>>
>>> The servers are connected in a multi-rail network, because some clients
>>> are on InfiniBand and the other clients are on Ethernet.
>>>
>>> But we have the following problem: when an OST fails over to the
>>> second node, the clients are unable to contact the OST that is started
>>> on the other node.
>>> The OST recovery status is "waiting for clients".
>>> When we fail it back, it starts working again and the recovery status
>>> is "complete".
>>>
>>> We tried to abort the recovery but that does not work.
>>>
>>> We used these documents to build the cluster:
>>> https://wiki.lustre.org/Creating_the_Lustre_Management_Service_(MGS)
>>> https://wiki.lustre.org/Creating_the_Lustre_Metadata_Service_(MDS)
>>> https://wiki.lustre.org/Creating_Lustre_Object_Storage_Services_(OSS)
>>>
>>> https://wiki.lustre.org/Creating_Pacemaker_Resources_for_Lustre_Storage_Services
>>>
>>> I'm not sure what the next steps should be to find the problem, or
>>> where to look.
>>>
>>> Best regards
>>> Koos Meijering
>>> ........................................................................
>>> HPC Team
>>> Rijksuniversiteit Groningen
>>> ........................................................................
>>> _______________________________________________
>>> lustre-discuss mailing list
>>> [email protected]
>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>>
>>
