Hello,

This is for a Lustre 2.10.3 file system with a single MDS and three OSSes. The MDS has a separate MGT and MDT, both mounted on it, and each OSS has 5 OSTs that do not fail over between hosts. We use ZFS as the backend for all of the Lustre targets.
Here is the layout of the ZFS pool digdug-meta on our MDS server, containing both the MGT and MDT:

NAME                       USED  AVAIL  REFER  MOUNTPOINT
digdug-meta                268G   453G    96K  /digdug-meta
digdug-meta/lustre2-mdt0   266G   453G   266G  /digdug-meta/lustre2-mdt0
digdug-meta/mgs           4.10M   453G  4.10M  /digdug-meta/mgs

Yesterday, while attempting to add a new MDS server to act as a failover node for the MGT and MDT, I stopped the entire file system and all of the targets on the MDS (MGT and MDT) and OSSes. The new MDS server is 192.168.2.13@o2ib1, and the current MDS server is 192.168.2.14@o2ib1.

I then ran the following commands on the MGT and MDT:

# tunefs.lustre --verbose --writeconf --erase-params --servicenode=192.168.2.13@o2ib1 --servicenode=192.168.2.14@o2ib1 digdug-meta/mgs
# tunefs.lustre --verbose --writeconf --erase-params --mgsnode=192.168.2.13@o2ib1 --mgsnode=192.168.2.14@o2ib1 --servicenode=192.168.2.13@o2ib1 --servicenode=192.168.2.14@o2ib1 digdug-meta/lustre2-mdt0

I ran tunefs.lustre on each of the OSTs too, following the pattern:

# tunefs.lustre --verbose --writeconf --erase-params --mgsnode=192.168.2.13@o2ib1 --mgsnode=192.168.2.14@o2ib1 --servicenode=<OSS NID> digdug-ost#/lustre2

After making that change, I started the MGT and MDT on the original MDS, which worked fine at first; I then started all of the OSTs and even mounted a client. But when I tried to bring up the MGT and MDT on the new MDS node, 192.168.2.13@o2ib1, it didn't work. I decided to bring the MGT and MDT back up on the original MDS and figure it out later, but now I can't get the MDT to mount on the original MDS either. I'm getting the following set of errors when trying to mount the MDT after the MGT has been mounted:

May 19 13:53:09 mds02 systemd: Starting SYSV: Part of the lustre file system....
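For completeness, this is roughly how I verified (and would verify again) what parameters each target ended up with after the writeconf. This is only a sketch: it assumes the pools are imported and the targets unmounted, and the OST dataset names (digdug-ost1/lustre2, digdug-ost2/lustre2) are placeholders for our real ones. --dryrun only prints the stored configuration and does not touch the device:

```shell
# On the MDS: print the stored parameters for the MGT and MDT without
# modifying anything on disk.
tunefs.lustre --dryrun digdug-meta/mgs
tunefs.lustre --dryrun digdug-meta/lustre2-mdt0

# On each OSS: same check for every OST dataset (names here are
# placeholders for the real digdug-ost#/lustre2 datasets).
for ds in digdug-ost1/lustre2 digdug-ost2/lustre2; do
    tunefs.lustre --dryrun "$ds"
done
```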
May 19 13:53:09 mds02 lustre: Mounting digdug-meta/mgs on /mnt/lustre/local/MGS
May 19 13:53:09 mds02 lustre: mount.lustre: according to /etc/mtab digdug-meta/mgs is already mounted on /mnt/lustre/local/MGS
May 19 13:53:11 mds02 lustre: Mounting digdug-meta/lustre2-mdt0 on /mnt/lustre/local/lustre2-MDT0000
May 19 13:53:11 mds02 kernel: Lustre: MGS: Logs for fs lustre2 were removed by user request. All servers must be restarted in order to regenerate the logs: rc = 0
May 19 13:53:12 mds02 kernel: LustreError: 14135:0:(llog_osd.c:262:llog_osd_read_header()) lustre2-MDT0000-osd: bad log lustre2-MDT0000 [0xa:0x7b:0x0] header magic: 0x0 (expected 0x10645539)
May 19 13:53:12 mds02 kernel: LustreError: 14135:0:(llog_osd.c:262:llog_osd_read_header()) Skipped 1 previous similar message
May 19 13:53:12 mds02 kernel: LustreError: 14135:0:(mgc_request.c:1897:mgc_llog_local_copy()) MGC192.168.2.14@o2ib1: failed to copy remote log lustre2-MDT0000: rc = -5
May 19 13:53:12 mds02 kernel: LustreError: 13a-8: Failed to get MGS log lustre2-MDT0000 and no local copy.
May 19 13:53:12 mds02 kernel: LustreError: 15c-8: MGC192.168.2.14@o2ib1: The configuration from log 'lustre2-MDT0000' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
May 19 13:53:12 mds02 kernel: LustreError: 14135:0:(obd_mount_server.c:1373:server_start_targets()) failed to start server lustre2-MDT0000: -2
May 19 13:53:12 mds02 kernel: LustreError: 14135:0:(obd_mount_server.c:1866:server_fill_super()) Unable to start targets: -2
May 19 13:53:12 mds02 kernel: LustreError: 14135:0:(obd_mount_server.c:1576:server_put_super()) no obd lustre2-MDT0000
May 19 13:53:12 mds02 kernel: Lustre: server umount lustre2-MDT0000 complete
May 19 13:53:12 mds02 kernel: LustreError: 14135:0:(obd_mount.c:1506:lustre_fill_super()) Unable to mount  (-2)
May 19 13:53:12 mds02 lustre: mount.lustre: mount digdug-meta/lustre2-mdt0 at /mnt/lustre/local/lustre2-MDT0000 failed: No such file or directory
May 19 13:53:12 mds02 lustre: Is the MGS specification correct?
May 19 13:53:12 mds02 lustre: Is the filesystem name correct?
May 19 13:53:12 mds02 lustre: If upgrading, is the copied client log valid? (see upgrade docs)
May 19 13:53:13 mds02 systemd: lustre.service: control process exited, code=exited status=2
May 19 13:53:13 mds02 systemd: Failed to start SYSV: Part of the lustre file system..
May 19 13:53:13 mds02 systemd: Unit lustre.service entered failed state.
May 19 13:53:13 mds02 systemd: lustre.service failed.

This morning we also discovered that the ZFS pool containing the MGT and MDT has a permanent error, which may also be impacting our ability to mount the MDT:

# zpool status -v digdug-meta
  pool: digdug-meta
 state: ONLINE
status: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: none requested
config:

        NAME                        STATE     READ WRITE CKSUM
        digdug-meta                 ONLINE       0     0    70
          mirror-0                  ONLINE       0     0   141
            scsi-35000c5003017156b  ONLINE       0     0   141
            scsi-35000c500301715e7  ONLINE       0     0   141
            scsi-35000c5003017158b  ONLINE       0     0   141
            scsi-35000c500301716a3  ONLINE       0     0   141
          mirror-1                  ONLINE       0     0     1
            scsi-35000c5003017155f  ONLINE       0     0     1
            scsi-35000c500301715a7  ONLINE       0     0     1
            scsi-35000c5003017159b  ONLINE       0     0     1
            scsi-35000c5003017158f  ONLINE       0     0     1

errors: Permanent errors have been detected in the following files:

        digdug-meta/lustre2-mdt0:/oi.10/0xa:0x7b:0x0

I'm not sure what my next steps would be to recover this file system, if it is recoverable at all, and I would greatly appreciate any help from this group.

Thank you in advance,

Bob Torgerson
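Since the "scan: none requested" line shows no scrub has ever been run on this pool, this is a sketch of the diagnostics I'm considering before doing anything destructive. I'd welcome confirmation from the list that a scrub is safe at this point:

```shell
# Scrub the pool to get a complete picture of the corruption; the error
# count above may grow once every block has been read back.
zpool scrub digdug-meta

# Re-check the permanent-error list after the scrub completes.
zpool status -v digdug-meta
```

The one permanently damaged object, 0xa:0x7b:0x0, is the same FID that the mount failure complains about ("bad log lustre2-MDT0000 [0xa:0x7b:0x0]"), so my understanding is that only a configuration llog is corrupt rather than user metadata. Whether a fresh writeconf on all targets would regenerate that log is exactly what I'm hoping someone here can confirm.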
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
