On Wed, Aug 11, 2010 at 06:14:55AM -0400, Joseph Xu wrote:
Hi everyone,
I was playing around with awk and ran into some surprising data for the
function match(s, r), which returns the position of the first match of
regular expression r in string s. Here's the test I ran:
$ yes | head -10000 | tr -d '\n' >/tmp/big
$ yes | head -1000000 | tr -d '\n' >/tmp/bigger
$ time awk '{for (i=1; i < 100; i++) { match($0, "y")} }' /tmp/big
real 0m0.056s
user 0m0.053s
sys 0m0.000s
$ time awk '{for (i=1; i < 100; i++) { match($0, "y")} }' /tmp/bigger
real 0m5.695s
user 0m5.140s
sys 0m0.553s
The difference is almost exactly 100x, which is the size difference
between the input files. It seems ridiculous that the amount of time
taken to match the first character of a string grows linearly with the
size of the string. The time it takes to load the contents of the file
does not contribute significantly to this increase.
You don't make sense. The second test performs the match two
orders of magnitude more times, so it should be two orders of
magnitude slower. It's not comparing a single string that is 100
times as long, but rather running the same test 100 times for
either 10⁴ or 10⁶ identical strings.
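Which of these two readings applies depends on how awk splits each file into records, and that can be checked directly. A minimal sketch, assuming a scratch file /tmp/sample (a hypothetical path, not part of the original test) holding the same kind of content as the yes | head | tr pipeline above, only smaller:

```shell
# Write 1000 "y" characters with no newlines, i.e. the same bytes that
# yes | head -1000 | tr -d '\n' would produce. /tmp/sample is a made-up path.
awk 'BEGIN { for (i = 0; i < 1000; i++) printf "y" }' > /tmp/sample

# Print how many records awk reads and the length of the last one.
awk '{ n = length($0) } END { print NR, n }' /tmp/sample
```

Many records of length 1 would mean the loop body runs once per line; a single long record would mean match() is called on one long string.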
Finally, trying Kernighan's One True Awk (from
http://www.cs.princeton.edu/~bwk/btl.mirror/awk.tar.gz)
This is the same as p9p's awk, except perhaps for some relatively
minor changes to one or the other over recent years.
So at least nawk's performance makes sense. To make things a little more
confusing, I tried matching on a non-existent pattern:
It does only if the pattern match takes about 1/10th of a
millisecond, which is beyond the granularity of your shell's
time function to determine.
--
Kris Maglione