it is difficult to push the MDS to err in this particular way. Is it advisable 
to increase the likelihood and frequency of dirfrag operations by tweaking 
some of the parameters mentioned here: If so, what would reasonable 
values be, keeping in mind that we are already in a pilot production phase and 
need to maintain the integrity of user data?

Is there any counter showing if such operations happened at all?

> Dear Yan,
> OK, I will try to trigger the problem again and dump the information 
> requested. Since it is not easy to get into this situation and I usually need 
> to resolve it fast (it's not a test system), is there anything else worth 
> capturing?


ceph daemon mds.x dump_ops_in_flight
ceph daemon mds.x dump cache /tmp/cachedump.x

> I will get back as soon as it happens again.
> In the meantime, I would be grateful if you could shed some light on the 
> following questions:
> - Is there a way to cancel an individual operation in the queue? It is a bit 
> harsh to have to fail an MDS for that.


> - What is the fragmentdir operation doing in a single MDS setup? I thought 
> this was only relevant if multiple MDS daemons are active on a file system.

It splits a large directory into smaller parts.

> > [...]
> > This time I captured the MDS ops list (log output does not really contain 
> > more info than this list). It contains 12 ops and I will include it here in 
> > full length (hope this is acceptable):
> >
> Your issues were caused by stuck internal op fragmentdir.  Can you
> dump mds cache and send the output to us?
