Dennis Peterson wrote:

>Don't scan every file every day - that makes no sense. Just scan files that
>have changed since the previous scan (google tripwire and similar tools).
And I replied:

>I'll have to think about this, as it's becoming a lot more complicated than I
>had expected.

After thinking about it, I still have misgivings about not scanning every file
every day: a file may not change day-to-day, but new virus signatures are added
all the time, and yesterday's file may contain today's newly recognized virus.
But, given the time needed to do a full scan, I have had to adopt a policy of
scanning only new or changed files.

I looked again at Tripwire and its ilk (e.g., Aide): they are very complicated
and large, and not really designed for this purpose. Thus, in true Open Source
tradition, I have written a surprisingly small Bash script to do the necessary
work to determine which files need to be scanned, based on timestamp, size,
inode, and hash. This script only needs the utilities that normally come with
most Linux (and some *BSD and commercial Unix) systems.

The operation of the script is as follows.

1. Traverse a directory tree (using 'find'), and for each regular file,
   compute its hash and tabulate the hash value along with the file's inode,
   size, block count, timestamp and name. Each line of the output file looks
   as follows (where I=inode, T=timestamp, N=size, B=block count, H=hash
   value):

   I:720897 T:799391846 N:1076 B:8 H:6df9f5744e96466c27477819978f07c5dbae671e /samba/Samba/MSDEV/SAMPLES/SDK/WINNT/REGMPAD/MAKEFILE

2. Compare, using 'diff', the new tabulation file (which has been sorted)
   with the previous tabulation file for the same directory. This gives a
   file listing the new and changed files in the directory tree (but not the
   deleted files).

3. Build a temporary directory tree containing links to the set of files to
   be scanned. The links are symbolic if the installed version of clamscan
   supports them (see below), otherwise they are hard links (with their
   attendant limitations). The temporary directory tree has exactly two
   levels; this is done merely to limit the sizes of the temporary
   directories.

4. Pass the entire temporary directory tree to clamscan to be scanned
   recursively. (When a lot of files are to be scanned, this is probably the
   most efficient approach, since there is a rather small limit on the length
   of a command line, and using clamd involves IPC.)

5. Take the output of clamscan and transform the names of any files that it
   flagged back to the original file names. This is trivial for symbolic
   links, but can be time-consuming for hard links (all the generated
   link-names have to be looked up in the diff file).

6. Clean up by removing the temp directories etc.; rename the transformed
   clamscan output by appending a timestamp to its name (so you can keep a
   history).

To go along with this script, I made a modified version of the clamscan
program which, when given a command line option, will follow symbolic links to
files (symlinks to directories are not needed by this script).

I have been using these for a while and am relatively satisfied.

Notes:

1. The script creates files and directories (e.g., clamscan results) whose
   names must be easy to read and parse and yet correspond to paths. It does
   this by unambiguously removing slashes and spaces: "/" -> "%.", " " -> "%_"
   and "%" -> "%%" (rather than by using the ugly URL transform). For example,
   the path "/Program Files/Killer%App/" would be converted to
   "%.Program%_Files%.Killer%%App%.".

2. Even if no files have changed, all the files must have their hashes
   recomputed, which takes a noticeable amount of time. I use the SHA1 hash,
   as it is cryptographically stronger than the faster MD5 (which means that
   even a clever virus would find it almost impossible to hide itself inside
   a previously legitimate file without changing the file's hash).

The script is included below: you should read it and adjust it to your
situation. The patches to clamav (0.88.4) follow the script. (Each diff lists
the modified file first and the .orig file second, so the lines marked "-",
and the first "!" group in the last hunk, are my additions; apply the diffs
with 'patch -R'.)
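Before the script itself, a small illustration of note 1. The name transform can also be sketched in pure Bash, without sed and tr. This is only a sketch under my own assumptions and is not part of the script below: the function names encode_path and decode_path are made up for the example, and the decoder is included only to show that the transform can be reversed unambiguously.

```shell
#!/bin/bash

# Hypothetical helper: encode a path so that it contains no slashes or spaces.
# '%' must be escaped first, otherwise the '%' introduced by the other two
# substitutions would be doubled as well.
encode_path() {
    local p=$1
    p=${p//'%'/%%}     # "%" -> "%%"
    p=${p//\//%.}      # "/" -> "%."
    p=${p// /%_}       # " " -> "%_"
    printf '%s\n' "$p"
}

# Hypothetical helper: reverse the transform by reading left to right and
# interpreting every two-character escape that starts with '%'.
decode_path() {
    local s=$1 out='' c
    while [ -n "$s" ]; do
        c=${s:0:1}
        if [ "$c" = '%' ]; then
            case ${s:1:1} in
                .) out+='/' ;;
                _) out+=' ' ;;
                %) out+='%' ;;
                *) return 1 ;;   # not a valid escape
            esac
            s=${s:2}
        else
            out+=$c
            s=${s:1}
        fi
    done
    printf '%s\n' "$out"
}

encode_path '/Program Files/Killer%App/'
decode_path '%.Program%_Files%.Killer%%App%.'
```

The first call prints the encoded form given in note 1, and the second call turns it back into the original path. The script below does the encoding with sed and tr instead, and never needs the decoder.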
===============================================================================
#!/bin/bash

# This script, when given one or more directories, constructs current lists
# of timestamps, sizes, inodes and hashes of each file and compares them
# with the previous lists to determine which files have changed and thus
# need to be scanned. It then constructs directories of links to those files
# and clamscans those directories. (Hardlinks are used if the available
# version of clamscan doesn't follow symbolic links.)
#
# Usage is: $0 working-directory directory-1 ...

# Copyright (C) 2006 Paul Kosinski <pk[at]iment[dot]com>
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.

# File characteristics to be output by find -- prefixed to output from file hash
PF='I:%-10i T:%-11T@ N:%-11s B:%-8b H:'

# Regex for SED -- extract inode and filename from file-hash list
RE='^I:\([0-9]\+\)\s\+\([TNBH]:[0-9a-f]\+\s\+\)\+'

# Max number of links in each subdir (keeps directories reasonable size)
MAXF=1000

# Hash function to be used (MD5 is faster than SHA1, but cryptographically weaker)
HASH='/usr/bin/sha1sum'

# Where clamscan is
CLAM='/opt/clamav/bin/clamscan'

# Can clamscan follow symlinks (for files): 0 - no; 1 - yes
FFSL=1

# Where each utility program is
LN='/bin/ln'
LS='/bin/ls'
MV='/bin/mv'
RM='/bin/rm'
TR='/usr/bin/tr'
CAT='/bin/cat'
SED='/usr/bin/sed'
DIFF='/usr/bin/diff'
ECHO='/bin/echo'
EXPR='/usr/bin/expr'
FIND='/usr/bin/find'
GREP='/usr/bin/grep'
NICE='/usr/bin/nice'
SORT='/usr/bin/sort'
STAT='/usr/bin/stat'
TAIL='/usr/bin/tail'
MKDIR='/bin/mkdir'

# Subr to rename a file according to its timestamp: "foo.bar" -> "foo.bar.060101-123456"
function rename_by_timestamp () {
    if [ -f "$1" ] ; then
        TS=`$STAT -c '%y' "$1" | $SED "s/\..\+$//; s/^[0-9][0-9]//; s/[-:]//g; s/[0-9][0-9]$//" | $TR ' ' '-'`
        $MV "$1" "$1.$TS"
    fi
}

# Subr to compute the new list of file characteristics: inode, timestamp, bytes, blocks and hash
function hash4scan () {
    echo "*** hash4scan $1"
    # Convert path to unambiguously remove slashes and spaces ('/' -> '%.', ' ' -> '%_' and '%' -> '%%')
    F=`$ECHO "$1" | $SED "s/\([/ %]\)/%\1/g" | $TR '/ %' '._%'`
    $MV "$WD/$F.new" "$WD/$F.old"
    $NICE $FIND "$1" -type f -printf "$PF" -exec $HASH \{\} \; | $NICE $SORT > "$WD/$F.new"
}

# Subr to compute list of files which have changed since last time and thus need to be scanned
function diff4scan () {
    echo "*** diff4scan $1"
    # Convert path to unambiguously remove slashes and spaces ('/' -> '%.', ' ' -> '%_' and '%' -> '%%')
    F=`$ECHO "$1" | $SED "s/\([/ %]\)/%\1/g" | $TR '/ %' '._%'`
    $NICE $DIFF -BbN -e "$WD/$F.old" "$WD/$F.new" | $NICE $GREP '^I:' | $NICE $SED -e "s/$RE/\1 /" > "$WD/$F.diff"
}

# Create directory and first level subdirectories that contain links to files to be scanned
function link4scan () {
    echo "*** link4scan $1"
    # Convert path to unambiguously remove slashes and spaces ('/' -> '%.', ' ' -> '%_' and '%' -> '%%')
    F=`$ECHO "$1" | $SED "s/\([/ %]\)/%\1/g" | $TR '/ %' '._%'`
    $MKDIR "$WD/$F"
    K='0'
    J='0'
    cd "$WD/$F"
    $CAT "$WD/$F.diff" | \
    while read I G ; do
        # Start a new numbered subdirectory every MAXF links
        if [ $J -eq 0 ] ; then
            K=`$EXPR $K + 1`
            $MKDIR "$WD/$F/$K"
            cd "$WD/$F/$K"
        fi
        J=`$EXPR $J + 1`
        # Symlinks are named by counter, hard links by inode
        if [ $FFSL -gt 0 ] ; then
            $LN -sf "$G" "$J"
        else
            $LN -f "$G" "$I"
        fi
        if [ $J -ge $MAXF ] ; then
            J=0
        fi
    done
}

# Run clamscan recursively over the directory of links
function clamscan () {
    echo "*** clam-scan $1"
    # Convert path to unambiguously remove slashes and spaces ('/' -> '%.', ' ' -> '%_' and '%' -> '%%')
    F=`$ECHO "$1" | $SED "s/\([/ %]\)/%\1/g" | $TR '/ %' '._%'`
    if [ $FFSL -gt 0 ] ; then
        $NICE $CLAM -ri --follow-file-symlinks "$WD/$F" > "$WD/$F.clam"
    else
        $NICE $CLAM -ri "$WD/$F" > "$WD/$F.clam"
    fi
}

# Map flagged link names back to the original file names, then clean up
function listvirs () {
    echo "*** list-virs $1"
    # Convert path to unambiguously remove slashes and spaces ('/' -> '%.', ' ' -> '%_' and '%' -> '%%')
    F=`$ECHO "$1" | $SED "s/\([/ %]\)/%\1/g" | $TR '/ %' '._%'`
    if [ $FFSL -gt 0 ] ; then
        $CAT "$WD/$F.clam" | $GREP 'FOUND$' | \
        while read VF T ; do
            VF=`echo "$VF" | $SED -e 's/:$//'`
            VF=`$LS -l "$VF" | $SED -e "s/^.* -> //"`
            echo "$VF $T" >> "$WD/$F.scan"
        done
    else
        $CAT "$WD/$F.clam" | $GREP 'FOUND$' | $TR '/' ' ' | \
        while read A B C D I T ; do
            I=`echo "$I" | $TR -c '0-9' ' '`
            $GREP "^$I" "$WD/$F.diff" | \
            while read I VF ; do
                echo "$VF $T" >> "$WD/$F.scan"
            done
        done
    fi
    $TAIL -9 "$WD/$F.clam" >> "$WD/$F.scan"
    $RM "$WD/$F.clam"
    $RM -rf "$WD/$F"
    rename_by_timestamp "$WD/$F.diff"
    rename_by_timestamp "$WD/$F.scan"
}

# main program

if [ $# -lt 2 ] ; then
    echo "Usage is: $0 working-directory directory-1 ..."
    exit 1
fi

# Construct absolute-path working directory to contain file-hash lists,
# diff output (i.e. files to be scanned), clamscan output etc.
X=`$ECHO "$PWD" | $SED 's|/|\\\\/|g'`
WD=`$ECHO "$1" | $SED "s/\/$//" | $SED "s/^\([^/]\)/"$X"\/\1/"`
shift

# Iterate over directories to be scanned
for D in "$@" ; do
    hash4scan "$D"
    diff4scan "$D"
    link4scan "$D"
    clamscan "$D"
    listvirs "$D"
done
===============================================================================
diff -c /src/clamav/clamav-0.88.4/libclamav/clamav.h /src/clamav/clamav-0.88.4/libclamav/clamav.h.orig
*** /src/clamav/clamav-0.88.4/libclamav/clamav.h Thu Sep 28 17:46:34 2006
--- /src/clamav/clamav-0.88.4/libclamav/clamav.h.orig Tue Dec 20 14:44:34 2005
***************
*** 76,89 ****
  #define CL_SCAN_MAILURL 256
  #define CL_SCAN_BLOCKMAX 512
-
- /* PRK Thu 28 Sep 2006 begin */
-
- #define CL_SCAN_FILESYMLINKS 0x10000000
-
- /* PRK Thu 28 Sep 2006 end */
-
-
  /* recommended options */
  #define CL_SCAN_STDOPT (CL_SCAN_ARCHIVE | CL_SCAN_MAIL | CL_SCAN_OLE2 | CL_SCAN_HTML | CL_SCAN_PE)
--- 76,81 ----
===============================================================================
diff -c /src/clamav/clamav-0.88.4/clamscan/clamscan.c /src/clamav/clamav-0.88.4/clamscan/clamscan.c.orig
*** /src/clamav/clamav-0.88.4/clamscan/clamscan.c Thu Sep 28 18:00:48 2006
--- /src/clamav/clamav-0.88.4/clamscan/clamscan.c.orig Mon Jan 9 12:46:05 2006
***************
*** 230,240 ****
      mprintf(" all .cvd and .db[2] files from DIR\n");
      mprintf(" --log=FILE -l FILE Save scan report to FILE\n");
      mprintf(" --recursive -r Scan subdirectories recursively\n");
-
-     /* PRK Thu 28 Sep 2006 begin */
-     mprintf(" --follow-file-symlinks Follow symlinks to files (only)\n");
-     /* PRK Thu 28 Sep 2006 end */
-
      mprintf(" --remove Remove infected files. Be careful!\n");
      mprintf(" --move=DIRECTORY Move infected files into DIRECTORY\n");
  #ifdef HAVE_REGEX_H
--- 230,235 ----
===============================================================================
diff -c /src/clamav/clamav-0.88.4/clamscan/manager.c /src/clamav/clamav-0.88.4/clamscan/manager.c.orig
*** /src/clamav/clamav-0.88.4/clamscan/manager.c Thu Sep 28 17:46:34 2006
--- /src/clamav/clamav-0.88.4/clamscan/manager.c.orig Mon Jan 9 12:46:23 2006
***************
*** 161,179 ****
      /* set options */
-
-
-     /* PRK Thu 28 Sep 2006 begin */
-
-     if(optl(opt, "follow-file-symlinks"))
-         options |= CL_SCAN_FILESYMLINKS;
-     else
-         options &= ~CL_SCAN_FILESYMLINKS;
-
-     /* PRK Thu 28 Sep 2006 end */
-
-
-
      if(optl(opt, "disable-archive") || optl(opt, "no-archive"))
          options &= ~CL_SCAN_ARCHIVE;
      else
--- 161,166 ----
===============================================================================
diff -c /src/clamav/clamav-0.88.4/clamscan/options.c /src/clamav/clamav-0.88.4/clamscan/options.c.orig
*** /src/clamav/clamav-0.88.4/clamscan/options.c Thu Sep 28 18:01:19 2006
--- /src/clamav/clamav-0.88.4/clamscan/options.c.orig Thu Jun 23 16:03:09 2005
***************
*** 114,124 ****
      {"tar", 2, 0, 0},
      {"tgz", 2, 0, 0},
      {"deb", 2, 0, 0},
-
-     /* PRK Thu 28 Sep 2006 begin */
-     {"follow-file-symlinks", 0, 0, 0},
-     /* PRK Thu 28 Sep 2006 end */
-
      {0, 0, 0, 0}
  };
--- 114,119 ----
===============================================================================
diff -c /src/clamav/clamav-0.88.4/clamscan/treewalk.c /src/clamav/clamav-0.88.4/clamscan/treewalk.c.orig
*** /src/clamav/clamav-0.88.4/clamscan/treewalk.c Thu Sep 28 18:26:06 2006
--- /src/clamav/clamav-0.88.4/clamscan/treewalk.c.orig Thu Dec 22 20:16:56 2005
***************
*** 40,63 ****
  #include "memory.h"
  #include "output.h"
-
- int checksymlink(const char *path)
- {
-     struct stat statbuf;
-
-     if(stat(path, &statbuf) == -1)
-         return -1;
-
-     if(S_ISDIR(statbuf.st_mode))
-         return 1;
-
-     if(S_ISREG(statbuf.st_mode))
-         return 2;
-
-     return 0;
- }
-
-
  int treewalk(const char *dirname, struct cl_node *root, const struct passwd *user, const struct optstruct *opt, const struct cl_limits *limits, int options, unsigned int depth)
  {
      DIR *dd;
--- 40,45 ----
***************
*** 128,138 ****
                  if(treewalk(fname, root, user, opt, limits, options, depth) == 1)
                      scanret++;
              } else {
!
!                 /* PRK Thu 28 Sep 2006 begin */
!                 if(S_ISREG(statbuf.st_mode) || ((options & CL_SCAN_FILESYMLINKS) && S_ISLNK(statbuf.st_mode) && (checksymlink(fname) == 2)))
!                 /* PRK Thu 28 Sep 2006 end */
!
                      scanret += scanfile(fname, root, user, opt, limits, options);
              }
          }
--- 110,116 ----
                  if(treewalk(fname, root, user, opt, limits, options, depth) == 1)
                      scanret++;
              } else {
!                 if(S_ISREG(statbuf.st_mode))
                      scanret += scanfile(fname, root, user, opt, limits, options);
              }
          }
_______________________________________________
http://lurker.clamav.net/list/clamav-users.html