On Wed, Jun 16, 2010 at 10:37 AM, <berg...@panix.com> wrote using a quoting
style so confusing that I'm just not going to quote any of it.

Bergman is right in that if you can seek() deep into the file and then find
the next \n you can continue from there.  There may be better ways to find
where to seek:

1.  If the lines contain dates (like a log file) and are in chronological
order, you could seek to byte (line number) * (average line length), then
read the date on the next full line to see where you landed.  Seek
backwards in 8K chunks if you overshot.

2.  If the lines don't contain dates, but you read the same file multiple
times, you can remember the byte location on the first run and seek there
next time.

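3.  Often due to formatting and fixed-length lines you know there is a
minimum line length.  Seek to byte (line number) * (minimum line length) and
read from there.  Counting lines from zero, that offset is never past the
start of the line you want, so you cannot overshoot.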
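
Option 1 could be sketched roughly like this in Python.  This is only a
sketch under assumptions: every line starts with a key that sorts in file
order (a timestamp, say), and key_of is a hypothetical helper you supply to
pull that key out of a line.  The function names are mine, not from any
library.

```python
import io

CHUNK = 8192  # back-off step: the "8K chunks" from option 1


def seek_to_line_start(f, offset):
    """Position f at the first line start at or after byte `offset`."""
    if offset <= 0:
        f.seek(0)
        return 0
    f.seek(offset - 1)
    f.readline()  # consume through the next b"\n"
    return f.tell()


def find_start(f, target, guess, key_of):
    """Return the offset of a line whose key is below `target` (or 0),
    so that a forward scan from there cannot miss the target."""
    pos = guess
    while True:
        start = seek_to_line_start(f, pos)
        line = f.readline()
        if start == 0 or (line and key_of(line) < target):
            return start
        pos = max(0, pos - CHUNK)  # overshot, or guessed past EOF: back up


def lines_from(f, target, guess, key_of):
    """Yield every line whose key is >= target, starting near `guess`."""
    f.seek(find_start(f, target, guess, key_of))
    for line in f:
        if key_of(line) >= target:
            yield line
```

The file has to be opened in binary mode ("rb") so that offsets are plain
byte counts.  The guess is (line number) * (average line length); a bad
guess only costs extra back-off steps or a longer forward scan.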

However, how big are these files?  If your average line length is 80 bytes
and your example is true (10000th line), then you are avoiding (80 * 10000 /
4096) = 195 block reads (plus metadata, etc).

Is saving ~200 block reads significant on this system?  A decent hard drive
does about 30 random blocks per second, minus some if it is an old system
and plus some if the reads are sequential.  So this is about 6 seconds saved
(200 / 30 is roughly 6.5).  (Do a real benchmark.)
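
If you want to plug in your own numbers, the same back-of-the-envelope
arithmetic looks like this in Python.  The function names are made up for
illustration, and the 4096-byte block and 30 blocks per second are just the
guesses from above, not measurements.

```python
def blocks_avoided(line_length, line_number, block_size=4096):
    """Whole blocks that sit in front of the target line."""
    return line_length * line_number // block_size


def seconds_saved(blocks, blocks_per_second=30):
    """Time to read that many blocks at the assumed rate."""
    return blocks / blocks_per_second


blocks = blocks_avoided(80, 10000)
print(blocks, seconds_saved(blocks))  # prints: 195 6.5
```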

6 seconds saved once a day isn't worth it.  If you are doing this 100 times
an hour, it is worth it.  Though, if you are doing this 100 times an hour it
might be better to get the files to be generated differently.

Tom

P.S.  Code tips:
1.  "skipping to the next \n" doesn't have to be a for loop of getc()s.
There's a library call that does it for you already, called fgets().  It
even stores the partial line in a buffer just in case you want it. :-)  (If
the line is longer than your buffer, fgets() stops early, so call it again
until the buffer ends in \n.)  The perl equivalent is $string = <FILEHANDLE>;
2.  Wouldn't it be nice if you could open a file, seek deep into it, then
exec() a new command that would use the file from where you left off?  You
can.  Remember that when you exec*() a command the new command inherits the
open file descriptors of the parent.  You can open a file, seek() to where
you want, then use dup2(fd, 0), where fd is the file's descriptor (fileno()
of a stdio handle), to close stdin and connect the file to stdin.  Now when
you exec*(), the new command has stdin positioned in the middle of that
file.  As long as the new command accepts input on stdin, you're done.  Now
you have a generic command that can be reused many ways:
    seek_then_exec 10000 cat          # output starting with the 10000th line.
    seek_then_exec 10000 'awk ...'    # awk starting with the 10000th line.
etc.
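
Here is a rough Python sketch of tip 2, using the os module's wrappers for
the same dup2() and exec*() calls.  Assumptions to note: this version takes
a file path and a byte offset rather than a line number (turning a line
number into an offset is what options 1 to 3 are for), and the helper name
open_at_line_start is made up.  One thing to watch for: a buffered read
(fgets() in C, readline() here) reads ahead, so the descriptor's real offset
ends up past the newline you stopped at and has to be put back before the
exec.

```python
import os
import sys


def open_at_line_start(path, offset):
    """Return a descriptor for `path` positioned at the first line start
    at or after byte `offset`."""
    with open(path, "rb") as f:
        if offset > 0:
            f.seek(offset - 1)
            f.readline()          # skip to just past the next b"\n"
        pos = f.tell()            # logical position, ignoring read-ahead
        fd = os.dup(f.fileno())   # survives the close at the end of `with`
    os.lseek(fd, pos, os.SEEK_SET)  # undo the buffered read-ahead
    return fd


def seek_then_exec(path, offset, argv):
    """Replace this process with argv, with the file as its stdin."""
    fd = open_at_line_start(path, offset)
    if fd != 0:
        os.dup2(fd, 0)            # closes stdin and puts the file there
        os.close(fd)
    os.execvp(argv[0], argv)      # does not return on success


if __name__ == "__main__" and len(sys.argv) >= 4:
    # Hypothetical usage: python seek_then_exec.py FILE OFFSET cat
    seek_then_exec(sys.argv[1], int(sys.argv[2]), sys.argv[3:])
```

Nothing here is specific to cat or awk; any command that reads stdin should
pick up from the new position.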
_______________________________________________
Discuss mailing list
Discuss@lopsa.org
http://lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
 http://lopsa.org/
