We have been using OCFS2 for a couple of years and have had a number of issues pop up. Some of them seem resolved, but we are still concerned because the system still seems a bit fragile.
Several times we have had various OCFS2 volumes become unresponsive or slow. We have also run into the "wants too many credits" error a few times, which seems to have been fixed by increasing the journal size on the volume causing the issue (I may have made the journals bigger than they really need to be, at 256MB, but I want to avoid the credits problem). The slowness/unresponsiveness issues seem to have been solved by increasing the cluster size (especially on largish volumes).

But there are still a few concerns. The major one is that when a volume becomes unresponsive, it causes a cascade effect: servers that simply have that volume NFS mounted, but are not using it, will have problems because commands like df will hang on that volume. I understand that the NFS server is trying to return the current free space for the volume and cannot get it because the volume is unresponsive, but I think it would be better if a cached version of the free space could be returned instead while the volume is unresponsive.

When a server does hang a volume (probably due to locks), what is the best procedure for finding the server that is causing the issue and the root cause of the problem? I have the scanlocks scripts, and have gotten better at determining which server is the problem and, to some extent, the program or directory involved, but to me it is still not an exact science. Are there any suggestions about the best way to do this? Ideally, it would be nice if I could get the systems to detect this on their own and either fence themselves or reboot.

Any help would be appreciated.

Thanks,
Andy
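P.S. For the self-detection idea, this is roughly the kind of watchdog I have been sketching out. It is only a minimal sketch: the mount points and timeout are placeholders, and the actual reboot/fence step is left commented out until I trust the detection. The probe runs "stat -f" in a child process so the watchdog itself never blocks on a statfs() against a hung volume; if the child does not return within the timeout, the mount is treated as hung.

#!/usr/bin/env python3
# Minimal watchdog sketch: probe each listed mount point with a statfs-style
# check run in a child process, and treat a timeout as "volume hung".
# MOUNTS and TIMEOUT are placeholders for this example.
import subprocess
import sys
import syslog

MOUNTS = ["/mnt/ocfs2_vol1", "/mnt/ocfs2_vol2"]   # hypothetical mount points
TIMEOUT = 30                                      # seconds before calling it hung

def mount_responds(path, timeout=TIMEOUT):
    """Return True if 'stat -f' on the mount point completes within timeout."""
    proc = subprocess.Popen(["stat", "-f", path],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    try:
        return proc.wait(timeout=timeout) == 0
    except subprocess.TimeoutExpired:
        # The child is almost certainly stuck in the kernel (D state) on the
        # hung volume, so don't try to reap it -- just report the hang.
        return False

def main():
    hung = [m for m in MOUNTS if not mount_responds(m)]
    for m in hung:
        syslog.syslog(syslog.LOG_CRIT, "mount appears hung: %s" % m)
        # Once the detection is trusted, the self-fence/reboot would go here,
        # e.g. subprocess.call(["/sbin/reboot", "-f"]) -- left disabled for now.
    sys.exit(1 if hung else 0)

if __name__ == "__main__":
    main()

Run from cron every few minutes, the non-zero exit status (or the syslog entry) would at least flag the hang early, even before deciding whether to let the node fence itself.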