While walking through the code, I found an LNet module parameter, local_nid_dist_zero. Setting it to 0 resolves the issue. Just putting it here in case anyone is searching for the same thing in the future.
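For anyone who wants to apply it, here is a minimal sketch of setting the parameter, assuming the stock lnet kernel module (the modprobe.d file name below is arbitrary, and the runtime write only works if your build exports the parameter writable; otherwise set it in modprobe.d and reload lnet):

  # persistent: picked up the next time the lnet module is loaded
  echo "options lnet local_nid_dist_zero=0" | sudo tee /etc/modprobe.d/lnet.conf

  # runtime: check the current value and, if writable, change it
  cat /sys/module/lnet/parameters/local_nid_dist_zero
  echo 0 | sudo tee /sys/module/lnet/parameters/local_nid_dist_zero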
On Wed, 6 Nov 2024 at 13:39, Backer <backer.k...@gmail.com> wrote:

> Hi Chris,
>
> Thank you for looking into this. I agree. In cloud and other kinds of
> on-prem networks, a floating IP is a real mechanism for providing HA, and
> I am attempting to make it work. Since the IP move happens in under a
> second in these environments, the failover completes within a few seconds
> and clients hardly notice any delay. This optimization is undesirable in
> such an environment. If no parameter already exists to change this
> behaviour, how can I make it work in this environment? I wonder if it
> requires a code change; if so, I could look into it if someone can help
> with some pointers.
>
> Regards
>
> Aboo
>
> On Wed, Nov 6, 2024 at 11:05 AM Horn, Chris <chris.h...@hpe.com> wrote:
>
>> Here the failover is designed in such a way that the IP address moves
>> (fails over) with the OST and becomes active on the other server.
>>
>> This is probably the source of your problem. I would suggest assigning
>> unique IP addresses to each OSS.
>>
>> Chris Horn
>>
>> *From: *lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on
>> behalf of Backer <backer.k...@gmail.com>
>> *Date: *Tuesday, November 5, 2024 at 10:19 PM
>> *To: *Backer via lustre-discuss <lustre-discuss@lists.lustre.org>,
>> lustre-de...@lists.lustre.org <lustre-de...@lists.lustre.org>
>> *Subject: *Re: [lustre-discuss] Lustre switching to loop back lnet
>> interface when it is not desired
>>
>> Any ideas on how to avoid using 0@lo as failover_nids? Please see below.
>>
>> On Tue, 5 Nov 2024 at 12:34, Backer <backer.k...@gmail.com> wrote:
>>
>> Hi,
>>
>> I am mounting the Lustre file system on the OSS. Some of the OSTs are
>> locally attached to the OSS.
>>
>> The failover IP on the OST is "10.99.100.152". It is a local LNet NID on
>> the OSS. However, when the client mounts it, the import automatically
>> changes to 0@lo. That is undesirable here because when this OST fails
>> over to another server, the client keeps trying to connect to 0@lo
>> although the OST is no longer on the same host. This makes the client
>> filesystem mount hang forever.
>>
>> Here the failover is designed in such a way that the IP address moves
>> (fails over) with the OST and becomes active on the other server.
>>
>> How can I make the import point to the real IP and not the loopback
>> (so that the failover works)?
>>
>> [oss000 ~]$ lfs df
>> UUID                 1K-blocks        Used   Available Use% Mounted on
>> fs-MDT0000_UUID       29068444       25692    26422344   1% /mnt/fs[MDT:0]
>> fs-OST0000_UUID       50541812    30160292    17743696  63% /mnt/fs[OST:0]
>> fs-OST0001_UUID       50541812    29301740    18602248  62% /mnt/fs[OST:1]
>> fs-OST0002_UUID       50541812    29356508    18547480  62% /mnt/fs[OST:2]
>> fs-OST0003_UUID       50541812     8822980    39081008  19% /mnt/fs[OST:3]
>>
>> filesystem_summary:  202167248    97641520    93974432  51% /mnt/fs
>>
>> [oss000 ~]$ df -h
>> Filesystem                  Size  Used Avail Use% Mounted on
>> devtmpfs                     30G     0   30G   0% /dev
>> tmpfs                        30G  8.1M   30G   1% /dev/shm
>> tmpfs                        30G   25M   30G   1% /run
>> tmpfs                        30G     0   30G   0% /sys/fs/cgroup
>> /dev/mapper/ocivolume-root   36G   17G   19G  48% /
>> /dev/sdc2                  1014M  637M  378M  63% /boot
>> /dev/mapper/ocivolume-oled   10G  2.5G  7.6G  25% /var/oled
>> /dev/sdc1                   100M  5.1M   95M   6% /boot/efi
>> tmpfs                       5.9G     0  5.9G   0% /run/user/987
>> tmpfs                       5.9G     0  5.9G   0% /run/user/0
>> /dev/sdb                     49G   28G   18G  62% /fs-OST0001
>> /dev/sda                     49G   29G   17G  63% /fs-OST0000
>> tmpfs                       5.9G     0  5.9G   0% /run/user/1000
>> 10.99.100.221@tcp1:/fs      193G   94G   90G  51% /mnt/fs
>>
>> [oss000 ~]$ sudo tunefs.lustre --dryrun /dev/sda
>> checking for existing Lustre data: found
>>
>> Read previous values:
>> Target:     fs-OST0000
>> Index:      0
>> Lustre FS:  fs
>> Mount type: ldiskfs
>> Flags:      0x1002
>>             (OST no_primnode )
>> Persistent mount opts: ,errors=remount-ro
>> Parameters: mgsnode=10.99.100.221@tcp1 failover.node=10.99.100.152@tcp1,10.99.100.152@tcp1
>>
>> Permanent disk data:
>> Target:     fs-OST0000
>> Index:      0
>> Lustre FS:  fs
>> Mount type: ldiskfs
>> Flags:      0x1002
>>             (OST no_primnode )
>> Persistent mount opts: ,errors=remount-ro
>> Parameters: mgsnode=10.99.100.221@tcp1 failover.node=10.99.100.152@tcp1,10.99.100.152@tcp1
>>
>> exiting before disk write.
>>
>> [oss000 proc]# cat /proc/fs/lustre/osc/fs-OST0000-osc-ffff89c57672e000/import
>> import:
>>     name: fs-OST0000-osc-ffff89c57672e000
>>     target: fs-OST0000_UUID
>>     state: IDLE
>>     connect_flags: [ write_grant, server_lock, version, request_portal,
>>                      max_byte_per_rpc, early_lock_cancel, adaptive_timeouts, lru_resize,
>>                      alt_checksum_algorithm, fid_is_enabled, version_recovery, grant_shrink,
>>                      full20, layout_lock, 64bithash, object_max_bytes, jobstats, einprogress,
>>                      grant_param, lvb_type, short_io, lfsck, bulk_mbits, second_flags,
>>                      lockaheadv2, increasing_xid, client_encryption, lseek, reply_mbits ]
>>     connect_data:
>>         flags: 0xa0425af2e3440078
>>         instance: 39
>>         target_version: 2.15.3.0
>>         initial_grant: 8437760
>>         max_brw_size: 4194304
>>         grant_block_size: 4096
>>         grant_inode_size: 32
>>         grant_max_extent_size: 67108864
>>         grant_extent_tax: 24576
>>         cksum_types: 0xf7
>>         max_object_bytes: 17592186040320
>>     import_flags: [ replayable, pingable, connect_tried ]
>>     connection:
>>         failover_nids: [ 0@lo, 0@lo ]
>>         current_connection: 0@lo
>>         connection_attempts: 1
>>         generation: 1
>>         in-progress_invalidations: 0
>>         idle: 36 sec
>>     rpcs:
>>         inflight: 0
>>         unregistering: 0
>>         timeouts: 0
>>         avg_waittime: 2627 usec
>>     service_estimates:
>>         services: 1 sec
>>         network: 1 sec
>>     transactions:
>>         last_replay: 0
>>         peer_committed: 0
>>         last_checked: 0
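(As a sketch of a quick check after changing the behaviour: the same import shown above can be read through lctl instead of the full /proc path; failover_nids and current_connection should then show the configured 10.99.100.152@tcp1 NID rather than 0@lo.)

  sudo lctl get_param osc.fs-OST0000-osc-*.import | grep -E 'failover_nids|current_connection'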