We encountered this in testing some time ago and already have a bug filed (I don't recall the number right now), and we should have a patch soonish, if not already. The gist of the problem is the changelog registration limits (integer type) plus some padding, which together result in an artificially low limit.
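As a side note for anyone reading the client log quoted below: the "-16" is a negated Linux errno value, and 16 is EBUSY. A minimal sketch of decoding such return codes with Python's standard library follows; the helper name `describe_rc` is made up for illustration and is not part of any Lustre tool.

```python
import errno
import os


def describe_rc(rc: int) -> str:
    """Map a kernel-style negative return code to 'NAME: message'.

    describe_rc is a hypothetical helper for illustration only.
    """
    code = abs(rc)
    # errno.errorcode maps numeric errno values to their symbolic names.
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror returns the platform's human-readable message.
    return f"{name}: {os.strerror(code)}"


if __name__ == "__main__":
    # The value reported by class_setup() and lustre_fill_super() in the log.
    print(describe_rc(-16))
```

This only names the error; it says nothing about why the MDC setup returned EBUSY.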
On Thu, Jul 4, 2019, 6:42 AM Matt Rásó-Barnett <[email protected]> wrote:

> I just tried out this configuration and was able to reproduce what Scott
> saw on 2.12.2.
>
> I couldn't see a Jira ticket for this though so I've opened one a new
> one: https://jira.whamcloud.com/browse/LU-12506
>
> Cheers,
> --
> Matt Rásó-Barnett
> University of Cambridge
>
> On Wed, May 22, 2019 at 08:02:59AM +0000, Andreas Dilger wrote:
> >Scott, if you haven't already done so, it is probably best to file a
> >ticket in Jira with the details. Please include the client
> >syslog/dmesg as well as a Lustre debug log ("lctl dk /tmp/debug") so
> >that the problem can be isolated.
> >
> >During DNE development we tested with up to 128 MDTs in AWS, but
> >haven't tested that many MDTs in some time.
> >
> >Cheers, Andreas
> >
> >On May 8, 2019, at 12:28, White, Scott F <[email protected]> wrote:
> >>
> >> We’ve been testing DNE Phase II and tried scaling the number of
> >> MDSes(one MDT each for all of our tests) very high, but when we did
> >> that, we couldn’t mount the filesystem on a client. After trial and
> >> error, we discovered that we were unable to mount the filesystem when
> >> there were 56 MDSes. 55 MDSes mounted without issue, and it appears
> >> any number below that will mount. This failure at 56 MDSes was
> >> replicable across different nodes being used for the MDSes, all of
> >> which were tested with working configurations, so it doesn’t seem to
> >> be a bad server.
> >>
> >> Here’s the error info we saw in dmesg on the client:
> >>
> >> LustreError: 28880:0:(obd_config.c:559:class_setup()) setup
> >> lustre-MDT0037-mdc-ffff95923d31b000 failed (-16)
> >> LustreError: 28880:0:(obd_config.c:1836:class_config_llog_handler())
> >> MGCx.x.x.x@o2ib: cfg command failed: rc = -16
> >> Lustre: cmd=cf003 0:lustre-MDT0037-mdc 1:lustre-MDT0037_UUID
> >> 2:x.x.x.x@o2ib
> >> LustreError: 15c-8: MGCx.x.x.x@o2ib: The configuration from log
> >> 'lustre-client' failed (-16). This may be the result of communication
> >> errors between this node and the MGS, a bad configuration, or other
> >> errors. See the syslog for more information.
> >> LustreError: 28858:0:(obd_config.c:610:class_cleanup()) Device 58 not
> >> setup
> >> Lustre: Unmounted lustre-client
> >> LustreError: 28858:0:(obd_mount.c:1608:lustre_fill_super()) Unable to
> >> mount (-16)
> >>
> >> OS: CentOS 7.6.1810
> >> Kernel: 3.10.0-957.5.1.el7.x86_64
> >> Lustre: 2.12.1
> >> Network card: Qlogic InfiniPath_QLE7340
> >>
> >> Other things to note for completeness’ sake: this happened with both
> >> ldiskfs and zfs backfstypes, and these tests were using files in
> >> memory as the backing devices.
> >>
> >> Is there something I’m missing as to why more than 56 MDSes won’t
> >> mount?
> >>
> >> Thanks,
> >> Scott White
> >> Scientist, HPC
> >> Los Alamos National Laboratory
> >>
> >> _______________________________________________
> >> lustre-discuss mailing list
> >> [email protected]
> >> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> >
> >Cheers, Andreas
> >--
> >Andreas Dilger
> >Principal Lustre Architect
> >Whamcloud
