Roger Leigh wrote:
> green wrote:
> > Tim Nelson wrote:
> > > On occasion, we find that a filesystem error is bad enough that
> > > instead of auto{matically|magically} fixing the issue and continuing
> > > to boot, the system hangs, needing a root password entered for a
> > > manual fsck to be run.
> > >
> > > My question is thus: How do I prevent that requirement to log in and
> > > run fsck manually? Is there some parameter that can be set? Or am I
> > > going about this the completely wrong way?
> >
> > You mentioned the FSCKFIX option; according to the rcS(5) man page,
> > setting it to "yes" in /etc/default/rcS will do what you want. This
> > causes fsck to be run with -y instead of -p, which is somewhat
> > dangerous but will hopefully, in your case, successfully repair the
> > filesystem.
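For anyone who wants to flip the setting, this is roughly what the relevant lines in /etc/default/rcS look like (a sketch; the variable name and semantics are from rcS(5), the comments are mine):

```shell
# /etc/default/rcS (sysvinit-era Debian) -- excerpt, sketch only

# Default behaviour: the boot scripts run fsck with -p (preen),
# which fixes only safe, minor problems and drops to a root
# shell on anything serious.
#FSCKFIX=no

# With yes, the boot scripts run fsck with -y instead, answering
# "yes" to every repair question so an unattended machine can
# finish booting.
FSCKFIX=yes
```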
I always set FSCKFIX=yes in /etc/default/rcS and think that it is the best default.

> From a usability point of view, there have been many requests
> over the years to make FSCKFIX=yes the default. However, from
> a safety point of view, this is not fine due to the risk of
> unrecoverable data corruption if it does the wrong thing.

The problem is that only a vanishingly small number of people know how to drive a filesystem debugger and repair a broken filesystem better than the automated tools can. Many more people operate headless computers as servers. If you, the reader of this message, are one of the elite souls who can manually fix a corrupted filesystem, that is awesome. But the rest of us don't have the knowledge and skills to do so.

For that large majority of users, the current default of FSCKFIX=no is a problem, because it results in a system that won't boot without a human on the console to manually answer yes to fsck's questions. On a desktop you are right there and just do it. On a server you need to get on the console. If the machine is in a data center, that typically means a support request in the simple case; but for many of us it would mean a long drive to another city in order to physically touch the hardware, attach a console, and answer yes. For most of us, FSCKFIX=yes is a much better default.

> We would prefer the admin to take responsibility for any needed
> actions prior to fsck (imaging the disc, backups, etc.)

I must object to this. I do not personally know anyone in the real world, anyone I could eat lunch with, who has the skills to manually repair a corrupted filesystem. I am confident that if it were possible to conduct a fair poll of the readers of debian-user, an extremely small percentage would have that skill. (I am sure that some do; you, reading this, might be one of those few.) And yet we are all using Debian systems.
The vast majority of us must count on the automated fsck to repair the filesystem. If the filesystem needs an fsck, most of us would answer yes and let it proceed. And if the fsck is unable to do the repair, we would fall back to restoring from backup; I know that in that case I would create a new filesystem and restore from backup. A backup is always still needed for safety, and RAID does not remove the need for it.

And so I object to the idea that making the admin choose whether to fsck gives the admin any real choice. It really isn't much of a choice; I don't think it is any real choice at all. It feels simply like a way to offload and deflect blame: it is now always possible to point fingers at the local admin. "If you lost data, it is your fault, because you pushed the button." And yet for most admins that is the only thing they can do. I myself would push the button.

About the only other option would be to make a bit copy of each disk in the system and save those copies off. Then, if something goes wrong, the admin can send those full bit copies to someone who has the skills to possibly recover the data. That would certainly always be a safe recipe whenever a system crashes and needs an fsck.

So say you have a simple system with two 1T disks in a RAID1 mirror. You would need only two more 1T disks to make a full bit copy of both, an additional system on which to mount those disks and use as a host for making the copies, and a few hours of your time to physically pull the disks and do the copying, being careful not to make an additional mistake that turns a simple problem into a complicated one. With a full copy you could repeat the restoration several times using different techniques and definitely increase your odds of success. If you ever need to fsck the system, the safest recipe would be to always do this.
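To make that "full bit copy first" recipe concrete, here is a sketch using ordinary files as stand-ins for the disks. On a real system the if=/of= paths would be the actual block devices, which is destructive if you get them backwards, so the device names in the comment are purely illustrative:

```shell
#!/bin/sh
# Sketch of the bit-copy recipe, using files in place of real disks.
# On real hardware you would copy the block devices themselves, e.g.
#   dd if=/dev/sda of=/dev/sdc bs=4M conv=noerror,sync
# (device names illustrative only -- double-check them before running!)
set -e

# Stand-ins for the two RAID1 member disks.
dd if=/dev/urandom of=disk1.img bs=1M count=4 2>/dev/null
dd if=/dev/urandom of=disk2.img bs=1M count=4 2>/dev/null

# Full bit copies onto the spare "disks".
dd if=disk1.img of=disk1-copy.img bs=1M 2>/dev/null
dd if=disk2.img of=disk2-copy.img bs=1M 2>/dev/null

# Verify the copies are bit-identical before touching the originals.
cmp disk1.img disk1-copy.img
cmp disk2.img disk2-copy.img
echo "copies verified"
```

Only after the copies are verified would you let fsck (or anything else) loose on the originals.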
But how many people reading debian-user have these resources of extra disks and systems? I could certainly do this for myself, but even I consider it too much trouble. I simply run the fsck and expect that 99 times out of 100 (numbers I just made up) it will succeed. In the unusual case where it cannot automatically repair the filesystem, I restore from backup. Restoring from backup in that rare case is easier than always doing the safest thing of making bit copies of the disks, and safer too: over 99 rounds of pulling, mounting, and copying disks I would almost certainly make a human error and break something else along the way.

> and in some setups e.g. software RAID, it's possible we might fsck
> parts of an unreconstructed RAID set and totally destroy it.

Could you say a few more words about how this might occur? I cannot think of a way for it to happen. Certainly I could force it to happen, but forcing it isn't the same as it happening accidentally, with a normal system configuration and an ordinary accident such as a power-loss crash. Of course, my not being able to think of a way doesn't mean it can't happen; I learn something new as often as possible, and I would like to be educated on how it might. It does not seem possible given the way Debian is structured, and if it is possible I would like to understand the failure case. Please explain how FSCKFIX=yes might cause a catastrophic loss.

> There are quite a few other pros and cons, but that's essentially
> the reason for it being opt-in; you take the responsibility for the
> small chance it might do rather bad things.

Every time we power on a system we take responsibility that something bad might happen. If we don't like that, we can choose not to turn the power on. Of course, that isn't a useful choice.

Bob