For the last few days I've been trying out a script (attached) that does something similar to the idea in my previous comment. It's a shell script that can be put in cron.daily and/or called from an @reboot cron job. The script checks each of your LVM-based filesystems in turn, least-recently-checked first, and won't start a new check once it has been running for more than 10 minutes.

The short version of the story is that fsck'ing a snapshot of a live filesystem is possible, but we might want to get at least a little input from LVM or FS developers first.

The main problem with this script is that it trips over temporary files. It's common for programs (via mkstemp(), I think) to create a temporary file, open it, then delete it. The inode that was previously associated with the file continues to exist so long as a file descriptor to it remains open. When a snapshot of the filesystem is created, though, nothing ever removes those inodes from the snapshot, so they become orphans. fsck notices this minor problem in the snapshot and flags the filesystem as needing to be checked.

Steps to reproduce this problem:

$ sudo /etc/init.d/mysql start # creates temporary files on my system
$ sudo lvcreate -L1024M -s /dev/your-volgroup/your-root-device
$ sudo fsck -v -n -f /dev/your-volgroup/lvol0
$ sudo lvremove /dev/your-volgroup/lvol0

fsck should complain about orphaned files. I get this:

$ sudo fsck -v -n -f /dev/nautilus/lvol0
fsck 1.40.8 (13-Mar-2008)
e2fsck 1.40.8 (13-Mar-2008)
Pass 1: Checking inodes, blocks, and sizes
Deleted inode 180229 has zero dtime.  Fix? no

Inodes that were part of a corrupted orphan linked list found.  Fix? no

Inode 180230 was part of the orphaned inode list.  IGNORED.
Inode 180231 was part of the orphaned inode list.  IGNORED.
Inode 180232 was part of the orphaned inode list.  IGNORED.
Inode 180233 was part of the orphaned inode list.  IGNORED.
Inode 180251 was part of the orphaned inode list.  IGNORED.
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Inode bitmap differences:  -(180229--180233) -180251
Fix? no

root: ********** WARNING: Filesystem still has errors **********

   23381 inodes used (8.92%)
     518 non-contiguous inodes (2.2%)
         # of inodes with ind/dind/tind blocks: 2563/15/0
  211424 blocks used (40.33%)
       0 bad blocks
       1 large file

   13390 regular files
    2902 directories
    1258 character device files
    4553 block device files
       1 fifo
      16 links
    1216 symbolic links (1137 fast symbolic links)
      46 sockets
--------
   23382 files

To my untrained eye, this could be argued to be a bug in ext2 or LVM (because they're not deleting inodes properly), or a bug in fsck (because it doesn't have an "errors remain, but who cares?" return code). Alternatively, it could be argued that the fsck script I've written should parse the output of fsck and decide which filesystem errors are really important.

I've gone as far as I can go with this idea. If someone with more of a clue is interested, could you suggest the best way of solving this issue?

	- Andrew

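For reference, the exit status that fsck hands back is a sum of bit flags, listed in fsck(8). The sketch below spells those flags out so the return-code point above is easier to follow. The function name describe_fsck_status is made up for illustration and is not part of the attached script. A read-only run like the one above (-n, errors found but not fixed) would presumably come back with bit 4 set, which says nothing about how serious the errors are.

```shell
#!/bin/sh
# Sketch only: describe_fsck_status is a hypothetical helper.
# The bit values and meanings are the ones documented in fsck(8).
describe_fsck_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    out=""
    # Test each documented bit in turn and collect the matching descriptions
    for bit in 1 2 4 8 16 32 128; do
        [ $(( status & bit )) -ne 0 ] || continue
        case $bit in
            1)   meaning="filesystem errors corrected" ;;
            2)   meaning="system should be rebooted" ;;
            4)   meaning="filesystem errors left uncorrected" ;;
            8)   meaning="operational error" ;;
            16)  meaning="usage or syntax error" ;;
            32)  meaning="checking canceled by user request" ;;
            128) meaning="shared-library error" ;;
        esac
        # Join multiple descriptions with ", "
        out="${out:+$out, }$meaning"
    done
    echo "$out"
}
```

Something like describe_fsck_status "$RETURN_VALUE" could be added to the mail that the script sends to root, so the number is explained rather than just quoted. It does not solve the orphaned-inode problem, since harmless and serious leftovers both show up as the same bit.
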
#!/bin/sh

# Check filesystems without rebooting, using LVM
# Andrew Sayers, 14 August 2008
# [EMAIL PROTECTED]
#
# This script aims to be FS-agnostic, although it currently calls "tune2fs" in
# two places, to reset the mount-count and check-time.

# What to tell the user if an error occurs
TITLE="Filesystem problem detected"
MESSAGE="Your hard disk has a problem. Please reboot your system to fix it"

check_filesystem() {

    # (I think) LVM escapes dashes in volume names by doubling them (--)
    # The following gets the volume group, even if it has --s in it
    export VOLDEV="$1"
    export VOLGROUP=$(echo "$VOLDEV" | sed -e 's/^\(\(\(\([^-]*\)--\)*\)[^-]*\)-\([^-].*\)/\1/' -e 's/--/-/g')
    export VOLUME=$(echo "$VOLDEV" | sed -e 's/^\(\(\(\([^-]*\)--\)*\)[^-]*\)-\([^-].*\)/\5/' -e 's/--/-/g')

    # Snapshot the volume; lvcreate prints the new snapshot's name in double quotes
    export BACKUP=$(lvcreate -L1024M -s "/dev/$VOLGROUP/$VOLUME" | cut -d\" -f2)

    if ERRORS=$(fsck -v -n -f "/dev/$VOLGROUP/$BACKUP" 2>&1)
    then

        tune2fs -T now -C 0 "/dev/mapper/$VOLDEV" >/dev/null
        lvremove -f "/dev/$VOLGROUP/$BACKUP" >/dev/null

        # Note: in the success case, success isn't reported until after tune2fs has completed
        # (in case tune2fs fails)
        touch "/var/cache/fsck/$VOLDEV"
        logger -p cron.info "snapshot fsck of \"/dev/$VOLGROUP/$VOLUME\" reported a healthy filesystem"

    else

        RETURN_VALUE=$?

        # TODO: check whether $BACKUP has gone away (due to too much FS activity), and handle that somehow

        # TODO: write a co-operating GUI app to handle messages something like:
        # notify-send -u critical -t 6000 --category=device.error "$TITLE" "$MESSAGE"

        # TODO: automatically remove $BACKUP after reboot

        # Note: in the failure case, failure must be reported before tune2fs has completed
        # (in case tune2fs fails)
        cat <<END | mail -s "$TITLE" root
$MESSAGE.

Once your system has been recovered, please do:

lvremove -f "/dev/$VOLGROUP/$BACKUP"

fsck returned return value $RETURN_VALUE while scanning /dev/$VOLGROUP/$VOLUME
The following errors were reported:

$ERRORS
END
        logger -p cron.alert "snapshot fsck of \"/dev/$VOLGROUP/$VOLUME\" reported a damaged filesystem - reboot to fix it"

        # Force an fsck on the next reboot (for vaguely sane systems)
        # 16000 is the highest allowed value for -C
        tune2fs -C 16000 "/dev/mapper/$VOLDEV" >/dev/null

    fi

}

# Create the directory that will remember which devices were most-recently checked
if ! [ -d /var/cache/fsck ]
then
    mkdir /var/cache/fsck
fi

# Add files for any block devices that have been created since the last time the program ran
cd /dev/mapper
for fs in *
do
    # checkable files must have exactly one '-' that isn't part of a doubled '--'
    # They must also not be swap partitions
    # Files with no dashes, or more than one dash, are internal LVM files
    # Files with "-cow" counterparts are copy-on-write snapshots
    if echo "$fs" | grep -q -- - && \
        ! echo "$fs" | grep -q '[^-]-[^-].*[^-]-[^-]' && \
        ! swapon -s | grep -q "^/dev/mapper/$fs" && \
        ! [ -e "$fs-cow" ]
    then
        # Newly created filesystems must have been created since the last run of this script
        if [ ! -e "/var/cache/fsck/$fs" ]
        then
            touch -d "$(date -d '-1 day')" "/var/cache/fsck/$fs"
        fi
    fi
done
cd - >/dev/null

# Delete files for any block devices that have been deleted since the last time the program ran
cd /var/cache/fsck
for fs in *
do
    if ! [ -e "/dev/mapper/$fs" ]
    then
        rm -f "$fs"
    fi
done
cd - >/dev/null

# Find the least-recently-fsck'd filesystem.
# Use the directory itself as the default
#
# A more intelligent solution might be to see which FS is nearest to its
# max-mount-count/interval-time, but that would be hard and FS-specific
#check_filesystem "$(ls -r -t /var/cache/fsck/ | head -1)"

# fsck all files, in order of which was least-recently checked
# If this takes longer than 10 minutes, it finishes the current FS then quits
STOP_DATE=$(date -d "+10 minutes" +%s)
ls -r -t /var/cache/fsck/ | while [ "$(date +%s)" -lt "$STOP_DATE" ] && read -r VOLUME
do
    check_filesystem "$VOLUME"
done

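The two sed expressions in check_filesystem are hard to read, so here is a minimal sketch of the same splitting done another way. It assumes, as the script's own comment does, that a /dev/mapper name is "group-volume" with any dash inside either name doubled. The function name split_mapper_name is hypothetical, and the use of "/" as a placeholder assumes that character can never appear in a volume group or logical volume name.

```shell
#!/bin/sh
# Sketch only: split_mapper_name is a hypothetical helper, not part of the
# script above. It assumes the doubled-dash escaping described in the script's
# comments, and that "/" never appears in a VG or LV name.
split_mapper_name() {
    # Hide every escaped dash ("--") behind a placeholder character
    protected=$(printf '%s\n' "$1" | sed 's|--|/|g')

    # Without a remaining single dash there is no group/volume separator
    case $protected in
        *-*) ;;
        *) return 1 ;;
    esac

    vg=${protected%%-*}   # everything before the first single dash
    lv=${protected#*-}    # everything after the first single dash

    # A second single dash means an internal name such as "...-cow"; reject it
    case $lv in
        *-*) return 1 ;;
    esac

    # Turn the placeholders back into literal dashes and print "group volume"
    printf '%s %s\n' \
        "$(printf '%s' "$vg" | tr '/' '-')" \
        "$(printf '%s' "$lv" | tr '/' '-')"
}
```

For example, split_mapper_name my--vg-my--lv prints "my-vg my-lv", and a name with no separator makes the function return non-zero. Asking LVM directly, for instance with lvs --noheadings -o vg_name,lv_name, may well be more robust than parsing names at all, but I haven't tried that in this script.
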
-- 
Ubuntu-devel-discuss mailing list
Ubuntu-devel-discuss@lists.ubuntu.com
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-devel-discuss