On 08/11/2010 07:10 AM, Kris Maglione wrote:
> On Wed, Aug 11, 2010 at 04:01:27AM -0700, Robert Ransom wrote:
>> On Wed, 11 Aug 2010 06:41:30 -0400
>> Kris Maglione <maglion...@gmail.com> wrote:
>>
>>> On Wed, Aug 11, 2010 at 06:14:55AM -0400, Joseph Xu wrote:
>>>> Hi everyone,
>>>>
>>>> I was playing around with awk and ran into some surprising data for the  
>>>> function match(s, r), which returns the position of the first match of  
>>>> regular expression r in string s. Here's the test I ran:
>>>>
>>>> $ yes | head -10000 | tr -d '\n' >/tmp/big
>>>> $ yes | head -1000000 | tr -d '\n' >/tmp/bigger
>>>> $ time awk '{for (i=1; i < 100; i++) { match($0, "y")} }' /tmp/big
>>>>
>>>> real    0m0.056s
>>>> user    0m0.053s
>>>> sys     0m0.000s
>>>> $ time awk '{for (i=1; i < 100; i++) { match($0, "y")} }' /tmp/bigger
>>>>
>>>> real    0m5.695s
>>>> user    0m5.140s
>>>> sys     0m0.553s
>>>>
>>>> The difference is almost exactly 100x, which is the size difference  
>>>> between the input files. It seems ridiculous that the amount of time  
>>>> taken to match the first character of a string grows linearly with the  
>>>> size of the string. The time it takes to load the contents of the file  
>>>> does not contribute significantly to this increase.
>>>
>>> You don't make sense. The second test performs the match two 
>>> orders of magnitude more times, so it should be two orders of 
>>> magnitude slower. It's not comparing a single string that is 100 
>>> times as long, but rather running the same test 100 times for 
>>> either 10⁴ or 10⁶ identical strings.
>>
>> No, he stripped out the newlines with tr.
> 
> He might have, if he used GNU tr. However, in that case, my 
> results are nowhere near as dramatic as his.
> 

I was using GNU tr. The input files were single lines with 10000 or 1000000 
y's, so I was doing 100 matches in each case (from the for loop) on the same 
line. I guess I should have made that more explicit, sorry. I'm interested in 
what results you're getting.
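One quick way to check that the slowdown comes from match() itself, rather
than from reading the file, is to run the same loop against a fixed-length
prefix of the long line. The substr() call here is my addition and not part
of the original test, but I'd expect something like:

$ time awk '{ s = substr($0, 1, 10000); for (i = 1; i < 100; i++) match(s, "y") }' /tmp/bigger

If that runs in roughly the same time as the /tmp/big case, then the per-call
cost of match() grows with the length of the string it's handed, and the file
I/O isn't a factor.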
