On Wed, 11 Aug 2010 06:41:30 -0400 Kris Maglione <maglion...@gmail.com> wrote:
> On Wed, Aug 11, 2010 at 06:14:55AM -0400, Joseph Xu wrote:
> > Hi everyone,
> >
> > I was playing around with awk and ran into some surprising data for the
> > function match(s, r), which returns the position of the first match of
> > regular expression r in string s. Here's the test I ran:
> >
> > $ yes | head -10000 | tr -d '\n' >/tmp/big
> > $ yes | head -1000000 | tr -d '\n' >/tmp/bigger
> > $ time awk '{for (i=1; i < 100; i++) { match($0, "y")} }' /tmp/big
> >
> > real    0m0.056s
> > user    0m0.053s
> > sys     0m0.000s
> > $ time awk '{for (i=1; i < 100; i++) { match($0, "y")} }' /tmp/bigger
> >
> > real    0m5.695s
> > user    0m5.140s
> > sys     0m0.553s
> >
> > The difference is almost exactly 100x, which is the size difference
> > between the input files. It seems ridiculous that the amount of time
> > taken to match the first character of a string grows linearly with the
> > size of the string. The time it takes to load the contents of the file
> > does not contribute significantly to this increase.
>
> You don't make sense. The second test performs the match two
> orders of magnitude more times, so it should be two orders of
> magnitude slower. It's not comparing a single string that is 100
> times as long, but rather running the same test 100 times for
> either 10⁴ or 10⁶ identical strings.

No, he stripped out the newlines with tr.


Robert Ransom
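For what it's worth, Robert's point is easy to check: because tr -d '\n' deletes every newline, each file is one long line of 'y' characters, so awk reads it as a single record and the loop runs its 100 match() calls against that one string. A quick sketch (same commands as the original test, plus a record count; output shown assumes gawk-style handling of a final record with no trailing newline):

```shell
# Recreate the test inputs from the thread: tr -d '\n' deletes every
# newline, so each file ends up as one long line of 'y' characters.
yes | head -10000   | tr -d '\n' > /tmp/big
yes | head -1000000 | tr -d '\n' > /tmp/bigger

# awk therefore sees exactly one record per file; the loop in the
# original test runs its 100 match() calls on that single record.
awk '{ print NR, length($0) }' /tmp/big      # prints: 1 10000
awk '{ print NR, length($0) }' /tmp/bigger   # prints: 1 1000000
```

So the second file really is one string 100 times as long, not 100 times as many records, which is why the 100x slowdown points at match() scanning time rather than at repetition count.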