On Wed, 11 Aug 2010 06:41:30 -0400 Kris Maglione <maglion...@gmail.com> wrote:
> On Wed, Aug 11, 2010 at 06:14:55AM -0400, Joseph Xu wrote:
> > Hi everyone,
> >
> > I was playing around with awk and ran into some surprising data for the
> > function match(s, r), which returns the position of the first match of
> > regular expression r in string s. Here's the test I ran:
> >
> > $ yes | head -10000 | tr -d '\n' >/tmp/big
> > $ yes | head -1000000 | tr -d '\n' >/tmp/bigger
> > $ time awk '{for (i=1; i < 100; i++) { match($0, "y")} }' /tmp/big
> >
> > real    0m0.056s
> > user    0m0.053s
> > sys     0m0.000s
> > $ time awk '{for (i=1; i < 100; i++) { match($0, "y")} }' /tmp/bigger
> >
> > real    0m5.695s
> > user    0m5.140s
> > sys     0m0.553s
> >
> > The difference is almost exactly 100x, which is the size difference
> > between the input files. It seems ridiculous that the amount of time
> > taken to match the first character of a string grows linearly with the
> > size of the string. The time it takes to load the contents of the file
> > does not contribute significantly to this increase.
>
> You don't make sense. The second test performs the match two
> orders of magnitude more times, so it should be two orders of
> magnitude slower. It's not comparing a single string that is 100
> times as long, but rather running the same test 100 times for
> either 10⁴ or 10⁶ identical strings.

No, he stripped out the newlines with tr.


Robert Ransom
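For what it's worth, Robert's point is easy to check: because tr -d '\n' deletes every newline, each file is one long line of 'y' characters, so awk reads it as a single record and the loop runs its 100 match() calls against that one string. A quick sketch (same commands as the original test, plus a record count; output shown assumes gawk-style handling of a final record with no trailing newline):

```shell
# Recreate the test inputs from the thread: tr -d '\n' deletes every
# newline, so each file ends up as one long line of 'y' characters.
yes | head -10000   | tr -d '\n' > /tmp/big
yes | head -1000000 | tr -d '\n' > /tmp/bigger

# awk therefore sees exactly one record per file; the loop in the
# original test runs its 100 match() calls on that single record.
awk '{ print NR, length($0) }' /tmp/big      # prints: 1 10000
awk '{ print NR, length($0) }' /tmp/bigger   # prints: 1 1000000
```

So the second file really is one string 100 times as long, not 100 times as many records, which is why the 100x slowdown points at match() scanning time rather than at repetition count.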