On Sun, Apr 20, 2008 at 8:55 PM, Mr. Shawn H. Corey
<[EMAIL PROTECTED]> wrote:
> On Sun, 2008-04-20 at 20:22 -0400, Chas. Owens wrote:
>  > No, you obviously don't know how it is implemented.  It seeks to the
>  > end of the file and reads it into a buffer where it searches for line
>  > endings.  It does not read the entire file until you reach the first
>  > line.
>  >
>
>  That's not the point.
>
>  Seeking to the end of a file means slowing down, because files are not
>  laid out linearly on disk.  Finding the end of a file means going through
>  all the segments allocated to it.  And even in the best of OSes, that
>  means reading a chain of sectors/clusters to reach the end of the file,
>  and then working your way backward from there.

In ext2, and most other file systems I am aware of, a file's data is
indexed by a tree of blocks.  This means seeking to the end of the file
is an O(log n) operation.  Given 4k blocks, a seek to the end of a 10
gigabyte file
would result in the reading of 22 inodes*.  An inode is around 128
bytes on ext2 so the amount of information that needs to be read is
slightly more than 2k**.  Now a better argument is that these inodes
are scattered around the disk so there is significant read head
movement, but that is not as much of an issue in modern
multiuser/multitasking machines (linear reads tend to be interrupted
anyway by other processes).  Also, some filesystems now keep portions
of the inode tree in memory, in addition to the canonical disk copies,
speeding up the lookup even more.
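
If you want to check the arithmetic behind the footnote, a throwaway
script with the same assumptions (10 gigabyte file, 4k blocks, nothing
measured) looks like this:

#!/usr/bin/perl
use strict;
use warnings;

# Same assumptions as footnote * below: a 10 gigabyte file split into
# 4k blocks, and the depth of a binary tree over those blocks.
my $blocks = (10 * 1024 * 1024 * 1024) / (4 * 1024);  # 2,621,440 blocks
my $depth  = log($blocks) / log(2);                   # about 21.3, round up to 22
printf "%d blocks, tree depth about %.1f\n", $blocks, $depth;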

snip
>  That involves time.
>
>  But since lines have variable lengths, you cannot predict how far to
>  back up.  In other words, it's a waste of time.
snip

Again, read the code.  It does not read the file a line at a time; it
reads in a large buffer (it looks like the default is 8k) from the
file and then searches for the line endings in that.
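
To make that concrete, here is a rough sketch of the same approach; it
is not the module's actual code, and the file name and chunk size are
just placeholders:

#!/usr/bin/perl
use strict;
use warnings;

# Seek back one chunk from the end, read it in a single call, and find
# the line endings in memory.  File name and chunk size are placeholders.
my $file  = "large.log";
my $chunk = 8 * 1024;        # 8k buffer, like the module's default

open my $fh, "<", $file or die "could not open $file: $!";
binmode $fh;

seek $fh, -$chunk, 2 or seek $fh, 0, 0;  # whence 2 = from end; rewind if file < 8k
read $fh, my $buf, $chunk;
my @lines = split /(?<=\n)/, $buf;       # keep the newlines on each piece
shift @lines if tell($fh) > $chunk;      # first piece is probably a partial line
print scalar(@lines), " line(s) in the last $chunk bytes\n";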

snip
>  The fastest way to do this is to read every line into Perl and disregard
>  everything not relevant.
snip

You may want to revisit file system design.  Try this Perl program on
a largish file*** and then tell me which is faster (if you only care
about items near the end of the file).

#!/usr/bin/perl

use strict;
use warnings;
use Benchmark qw<timeit timestr>;

die "usage: $0 file" unless @ARGV == 1;

my $file = shift;

open my $fh, "<", $file
        or die "could not open $file: $!";

binmode $fh;
$/ = \(4*1024); # give readline an advantage (read 4k records instead of lines)

# whence 2 seeks from the end of the file, whence 0 from the start
print
    "seek to end of file: ",   timestr(timeit 1, sub { seek $fh, 0, 2 }), "\n",
    "seek to start of file: ", timestr(timeit 1, sub { seek $fh, 0, 0 }), "\n",
    "read entire file: ",      timestr(timeit 1, sub { 1 while <$fh> }),  "\n";

snip
>  This is what Perl was designed for.  There are no magic tricks.  You
>  cannot fart before you eat beans.
snip

The next thing you will tell me is that the best possible sorting
algorithm is O(n log n) in its worst case****.  To paraphrase the
Bard, there are more things in Computer Science than are dreamt of in
your naive implementations.  Yes, this is what Perl was designed for,
and therefore it gives us the tools (like seek and
File::ReadBackwards) to deal with things efficiently.  Is it a good
idea to read an entire file with File::ReadBackwards?  No.  Is it a
good idea to use File::ReadBackwards to read the last 100m of a 100g
file?  Yes.
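
For the record, the tail-of-a-huge-file case looks roughly like this;
the file name and line count are made up for the example, but new and
readline are the module's normal interface:

#!/usr/bin/perl
use strict;
use warnings;
use File::ReadBackwards;

# Pull the last few lines out of a big log without reading the rest.
# The file name and line count are invented for the example.
my $file = "huge.log";
my $want = 10;

my $bw = File::ReadBackwards->new($file)
    or die "could not open $file: $!";

my @tail;
while (defined(my $line = $bw->readline)) {
    unshift @tail, $line;   # lines come back last-to-first, so re-reverse them
    last if @tail == $want;
}
print @tail;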

* ln((10 * 1,024 * 1,024 KB) / 4 KB) / ln(2) = 21.3219281, rounded up to 22
** 22 * 128 bytes = 2,816 bytes
*** I used an Ubuntu iso, about 700m, and got 0 seconds for both seeks
and about 32 wall clock seconds (0.60 usr + 1.92 sys = 2.52 CPU) for
the (cheating) readline, on HFS+ (journaled)
**** see radix sort for an O(n) sorting algorithm:
http://en.wikipedia.org/wiki/Radix_sort
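
A minimal sketch of the idea, on non-negative integers and one decimal
digit at a time (not from the thread, just to show the O(d * n) shape of
a non-comparison sort):

#!/usr/bin/perl
use strict;
use warnings;

# LSD radix sort: bucket by each decimal digit from least to most
# significant; d passes over n keys instead of n log n comparisons.
sub radix_sort {
    my @nums = @_;
    my $max = 0;
    for (@nums) { $max = $_ if $_ > $max }
    for (my $place = 1; $place <= $max; $place *= 10) {
        my @buckets = map { [] } 0 .. 9;
        push @{ $buckets[ int($_ / $place) % 10 ] }, $_ for @nums;
        @nums = map { @{$_} } @buckets;
    }
    return @nums;
}

print join(" ", radix_sort(170, 45, 75, 90, 802, 24, 2, 66)), "\n";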

-- 
Chas. Owens
wonkden.net
The most important skill a programmer can have is the ability to read.
