Re: UFS Crash and directories now missing

Jerome Herman Mon, 30 Apr 2012 15:44:32 -0700

On 30/04/2012 19:23, Eitan Adler wrote:

On 30 April 2012 07:36, Robert Bonomi <bon...@mail.r-bonomi.com> wrote:

A competent, "not stupid", sysadmin would know these things.  And not
'remove all doubt' (in the words of Abraham Lincoln), by raising such
nonsense questions.

A competent sysadmin would ask questions when they don't know the
answer bringing up possibilities they thought about.
A stupid sysadmin would yell at someone asking a question claiming
they should have known the answer.

I must admit that Robert Bonomi's tone was highly insulting for this list, and though I completely condemn the form of his post, I cannot say I disagree with the content.

There are quite a lot of things that are wrong with Alejandro Imass' post and analysis.

The first thing is that he did not give his setup in one go. It took quite a while to figure out what happened, what system he was using and how he was using it.

At first he had to hard reboot an unresponsive system, then at reboot he would have lost all of his jails.
Then it appeared that all the jails were inside another jail and that the unresponsiveness came from MySQL.

Then we learn that all his daemons are inside jails.
Then we learn that ftp-proxy is not.
Then we learn that the jails are not handled manually but through EZJail.

Then we are told that the problem with MySQL is known and comes from a client using TigerCRM with too much data.

There are literally dozens of little pieces of important knowledge all over the thread. And you have to read it all to make sure you have the global view. Not really a good start.

It is OK to forget to mention a thing or two, discarding what you think is irrelevant to the problem at hand, but it is not OK to force people who are trying to help you to read 50+ posts to learn about the basics of your installation.

What is even more irritating is the fact that Alejandro Imass ignores pretty much anything that would point toward a human mistake. Most posts implying a possible bad use of jails/nullfs/ezjail are ignored or answered by a simple "I have done everything by the book". Now from my experience someone with 6 servers, each containing multiple jails, will not do everything by the book every time. It might be that Alejandro is exceptional, but it is more likely that at least one if not more of these jails was not made "by the book". Nothing to blame anyone for here, we all get tired/bored/overconfident sometimes - but refusing to admit the very possibility of a human mistake won't help at all in finding a solution.

Reading the thread I realized that my suggestion that he might have over-used "ln" had been discarded as "stupid", but the information came a lot later in answer to another post. Of course in the meantime I learned that he was using ezjail, which, if I had known earlier, would have made me wonder if he had not overused nullfs or ln. He furthermore discarded the possibility saying that he did not think that ezjail was using links, just nullfs. Well too bad, ezjail is massively using links, at least for basejail, and sometimes for ports trees or perl setups depending on which guide you are using as your reference.

During the thread he pretty much bashed anyone who tried to tell him that no amount of jail/ezjail/nullfs/journal screw-up could have resulted in the entire content of the jails being moved into another completely unrelated directory node. If one jail had moved it would already have been extraordinary, with the odds against it happening so cleanly that fsck would find nothing already orders of magnitude worse than the odds against winning the national lottery. But all of them ? Not a chance. He finally admitted that he had very little knowledge about UFS and fsck, but still managed to do it in a quite offensive way.

That was basically the point where I decided to stop trying to help him. I think others felt the same. This problem is quite interesting in itself, and I think a lot of the most talented people on this list would have been on it but were repelled by the attitude.

On the other hand Alejandro Imass pretty much jumped on anything that would be a third party interaction, from someone hacking into his box to a potential nullfs bug that might result in a PR.

Now the thing is that EZJail makes use of the "system immutable flag" quite a lot for its config files, resulting in quite a lot of files being impossible to delete or move unless the box is running at kern.securelevel 0. This renders the whole "jails moved on their own" theory even more improbable.
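
For anyone who wants to check the flag point on their own box, here is a minimal sketch in Python. It walks a directory tree and lists the files carrying the system immutable flag (schg). The helper names has_system_immutable and flagged_files are mine, not part of ezjail or FreeBSD, and the st_flags field is only filled in on BSD-like systems, so the sketch falls back to 0 elsewhere.

```python
import os
import stat


def has_system_immutable(flags):
    """True if the st_flags value has the system immutable (schg) bit set."""
    return bool(flags & stat.SF_IMMUTABLE)


def flagged_files(top):
    """Yield paths of files under `top` that carry the schg flag.

    Assumption: st_flags is only present on BSD-like systems; on other
    platforms getattr() falls back to 0 and nothing is reported.
    """
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                flags = getattr(os.lstat(path), "st_flags", 0)
            except OSError:
                continue  # file vanished or is unreadable, skip it
            if has_system_immutable(flags):
                yield path
```

Something like list(flagged_files("/usr/jails/basejail")) would then show which files are protected, where /usr/jails/basejail is only a guess at the jail root and has to be adapted to the actual setup.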


After so much ranting, I would feel bad not to try to help a little :
Here are the facts :

- In a jail, MySQL was grabbing all the CPU and making the box non responsive. This is due to TigerCRM making requests to a too huge database.

        -> The jail was working

        -> Unless all the data were in memory at this time (improbable), it means that access path/nullfs/EZJail were OK at this time.

- After a forced reboot all the jails were gone, or more exactly moved inside another jail. fsck saw no error on the disk.

        -> The disk was in a stable state at reboot, the directory and file structure was consistent.

- Jails contained in the apache jail were in an OK state and could be archived and restored

        -> The data structure of the hard drive was clean, and file contents were OK.


From all this here is what we can safely assume :

a) The box was not hacked, or at least the hacker did not move the jails around. This is confirmed by MySQL working and doing enough I/O to stall the box from inside a jail that was later seen as moved.

b) The hard reboot did not cause a problem, it revealed it. Since fsck ran fine and the data were preserved we can pretty safely assume that there was no data or system corruption caused by the hard reboot.


Things to investigate :

- When was the last time this box was rebooted normally ? Did it go fine ? Were the jails created at this time ?

- What happens if you deactivate the jail that "survived" and reboot normally, would the other jails contained in it start ? If you deactivate the jail but leave the nullfs mapping on and try to restart EZJail ? Do the other jails start ?

- What is the content of the different fstab.* and of the EZJail conf ? Does any of it point inside the jail that survived the reboot ?
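
The last check can be scripted. Below is a minimal sketch in Python, assuming the per-jail fstab files use the usual fstab layout (device first, mount point second). The function names, the /etc/fstab.* pattern and the jail root you pass in are my assumptions, to be adapted to the real installation.

```python
import glob
import os


def entries_inside(fstab_text, root):
    """Return (device, mountpoint) pairs whose device or mount point
    lies inside `root`, from fstab-formatted text."""
    root = os.path.normpath(root)
    hits = []
    for line in fstab_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split columns
        if len(fields) < 2:
            continue  # blank, comment-only or malformed line
        device, mountpoint = fields[0], fields[1]
        for path in (device, mountpoint):
            norm = os.path.normpath(path)
            if norm == root or norm.startswith(root + os.sep):
                hits.append((device, mountpoint))
                break
    return hits


def scan_files(pattern, root):
    """Apply entries_inside() to every file matching `pattern`.

    Both arguments are site specific, e.g. "/etc/fstab.*" and the root
    directory of the jail that survived the reboot.
    """
    report = {}
    for name in sorted(glob.glob(pattern)):
        with open(name) as fh:
            hits = entries_inside(fh.read(), root)
        if hits:
            report[name] = hits
    return report
```

Any fstab file that shows up in the result of scan_files() has a source or mount point inside the surviving jail, which would support the idea that the "moved" jails lived there from the beginning.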

Unfortunately, since the server was "corrected", we probably won't have a satisfying answer. But honestly the probability of a system bug is really low. Very likely the "moved" jails were inside the surviving jail from the beginning, and a mix of nullfs remap and lack of reboot masked this fact for a while.


_______________________________________________
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to "freebsd-questions-unsubscr...@freebsd.org"
