This sounds similar to LU-11613: https://jira.whamcloud.com/browse/LU-11613
If so, it was fixed for us by upgrading to 2.10.6. You may be able to work
around it by disabling quotas.

Regards,
Marion

> From: Steve Barnet <[email protected]>
> To: "[email protected]" <[email protected]>
> Date: Wed, 23 Jan 2019 13:48:29 -0600
> Subject: [lustre-discuss] Filesystem started crashing recently
>
> Hi all,
>
> Since early last summer, we have been running a 2.10.4 filesystem
> pretty much without incident. Then about 2 weeks ago, it started
> crashing for no immediately obvious reason. There are no indications
> of hardware related problems in the logs, and the work loads have
> not changed significantly as far as we can tell.
>
> I can't rule out hardware or system performance problems, but if
> that is the case, there are no obvious pointers as to what those
> would be. We had one workload that seemed to trigger the problem
> (a couple dozen jobs running du on parts of the filesystem), but
> that had been running for months, and even after we killed that
> we had a couple crashes.
>
> Since the first crash (on 7 January) we have experienced
> these crashes sporadically. Sometimes days between crashes,
> other times, hours.
>
> The symptoms are the filesystem becoming unresponsive, and a
> load spike on the MDS and one OSS (we have 8x OSS). The OSS
> affected seems to be somewhat random. In the system logs, we see
> hung_task timeouts and stack traces, followed shortly by lustre-log
> dumps. The only real commonality I have seen is that on the MDS,
> the first hung task is in jbd2_journal_commit_transaction.
>
> To recover the filesystems, I have done e2fsck on the MDT,
> and any affected OSTs. They have come back cleanly every time.
>
> I have attached snippets of the log files at the time of
> the most recent crash.
>
> A high level summary of our system:
>
> MDS (1x) & OSS (8x)
> OS: CentOS Linux release 7.6.1810 (Core)
> kernel: 3.10.0-862.2.3.el7_lustre.x86_64
> Lustre: 2.10.4 (ldiskfs)
>
> Clients: a mix, but predominantly CentOS 7.x running 2.10.4
>
> Any insights would be greatly appreciated. There are lots of
> logs, so if they would be helpful, I can certainly make them
> available. In particular, that first lustre-log is pretty big,
> so I just grabbed the lines in closest proximity to the crash.
>
> Also, if there's a way to get more debugging level
> information from lustre, I'm happy to try that as well.
>
> And I realize this is all at a very high level, so I'll be
> happy to provide any additional info needed to help me figure
> this out.
>
> Thanks much for taking the time!
>
> Best,
>
> ---Steve
>

_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
