I have two (related) enhancements to cp to propose.  Code is attached.

Enhancement 1: reading by blocks
--------------------------------
Reading and writing of files is typically fastest and most efficient when you read or write blocks of a certain size.  cp seems to understand this in that it uses the block size (according to stat()) of the source file as the buffer size for its copy.  But cp makes only a feeble attempt to read whole blocks.  If a read is short for any reason (the operating system is free to return less than the requested size any time it wants), all future reads are out of sync with the file's blocks.  Also, the size returned by stat() isn't necessarily the best size; modern filesystems are much more complicated than can be described by a single number like that.

My enhancement is: 1) add a --readsize option so the user can choose the buffer size (it still defaults to the stat() block size); 2) keep the reads synchronized to a multiple of the read size by filling in short reads, e.g. if a 4K read returns 3K, cp does a 1K read to resynchronize to a 4K boundary.  This is a minor change to the code -- it just involves replacing the file read call with a subroutine that loops until it gets it all.

Enhancement 2: handling of unreadable portions of source file
-------------------------------------------------------------

Today, if cp encounters an unreadable stretch of file, it just quits.  I have added two new alternatives, controlled by the new --errors option.  In both, cp searches ahead in the file until it finds a readable portion.  With --errors=zero, cp pretends it read zeroes in place of the unreadable portion.  With --errors=skip, cp pretends the unreadable bytes just didn't exist; the resulting file is shorter than the source.

Another new option, --errorgrain, tells how finely cp searches for the end of the bad area; it is the step size by which cp seeks forward, trying a read at each step.  The default is 512 bytes.  cp issues a warning at the end of each file copy telling how much data was lost.

This is mostly just embellishment of the new read subroutine I mentioned above.
There's also the option handling and the statistics reporting, but no structural changes to the existing copy logic.

Purpose of error handling
-------------------------

I need this because I have large cpio backup files that sometimes have media errors: a single sector is missing here and there from the file.  I want to copy the file to a good disk and then proceed to salvage all the data inside the backup file that is not affected by the missing sectors.  cpio can generally resynchronize quite well if the data after the error remains at the same offset, so I use cp --errors=zero.

I can do some of this with dd if I'm desperate, but dd is technically for a rather lower-level job -- directly driving a device driver -- not byte-stream files.  It doesn't, for example, deal with short reads in a byte-stream way.

I have other backup volumes that contain full images of the original filesystem; I restore those using cp --archive.  Again, if a single sector somewhere is bad, I'd rather have the cp complete on all the other files, and save whatever it can of the ruined file, than have to manually sort through the thousands of files in that directory and work around the bad ones.

--
Bryan Henderson                          Phone 408-621-2000
San Jose, California
cperror.patch
Description: Binary data
_______________________________________________
Bug-coreutils mailing list
[EMAIL PROTECTED]
http://mail.gnu.org/mailman/listinfo/bug-coreutils