On 08/11/2010 07:10 AM, Kris Maglione wrote:
> On Wed, Aug 11, 2010 at 04:01:27AM -0700, Robert Ransom wrote:
>> On Wed, 11 Aug 2010 06:41:30 -0400
>> Kris Maglione <maglion...@gmail.com> wrote:
>>
>>> On Wed, Aug 11, 2010 at 06:14:55AM -0400, Joseph Xu wrote:
>>>> Hi everyone,
>>>>
>>>> I was playing around with awk and ran into some surprising data for the
>>>> function match(s, r), which returns the position of the first match of
>>>> regular expression r in string s. Here's the test I ran:
>>>>
>>>> $ yes | head -10000 | tr -d '\n' >/tmp/big
>>>> $ yes | head -1000000 | tr -d '\n' >/tmp/bigger
>>>> $ time awk '{for (i=1; i < 100; i++) { match($0, "y")} }' /tmp/big
>>>>
>>>> real    0m0.056s
>>>> user    0m0.053s
>>>> sys     0m0.000s
>>>> $ time awk '{for (i=1; i < 100; i++) { match($0, "y")} }' /tmp/bigger
>>>>
>>>> real    0m5.695s
>>>> user    0m5.140s
>>>> sys     0m0.553s
>>>>
>>>> The difference is almost exactly 100x, which is the size difference
>>>> between the input files. It seems ridiculous that the amount of time
>>>> taken to match the first character of a string grows linearly with the
>>>> size of the string. The time it takes to load the contents of the file
>>>> does not contribute significantly to this increase.
>>>
>>> You don't make sense. The second test performs the match two
>>> orders of magnitude more times, so it should be two orders of
>>> magnitude slower. It's not comparing a single string that is 100
>>> times as long, but rather running the same test 100 times for
>>> either 10⁴ or 10⁶ identical strings.
>>
>> No, he stripped out the newlines with tr.
>
> He might have, if he used GNU tr. However, in that case, my
> results are nowhere as dramatic as his.
I was using GNU tr. The input files were single lines with 10000 or 1000000 y's, so I was doing 100 matches in each case (from the for loop) on the same line. I guess I should have made that more explicit, sorry. I'm interested in what results you're getting.
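To make the setup unambiguous, here is a minimal sketch of the test as described above (assuming GNU coreutils `yes`/`head`/`tr` and an awk with `match()`; the /tmp paths are the ones from the original post). The extra `awk` print just confirms each file really is one record of the stated length, with the first "y" at position 1:

    # Build the two single-line inputs exactly as in the original test.
    # `yes` emits "y\n" forever; `tr -d '\n'` joins everything into one record.
    yes | head -10000   | tr -d '\n' > /tmp/big      # 10000 y's, no newline
    yes | head -1000000 | tr -d '\n' > /tmp/bigger   # 1000000 y's, no newline

    # Sanity check: one record, expected length, match() finds "y" at 1.
    awk '{ print NR, length($0), match($0, "y") }' /tmp/big

    # Timing the 100-iteration loop then isolates the per-match cost
    # on a single long line, rather than across many short lines:
    time awk '{ for (i = 1; i < 100; i++) match($0, "y") }' /tmp/bigger

With that confirmed, any timing difference between the two files comes from the length of the single record, not from the number of records.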