On 18/04/2024 07:41, Juraj Linkeš wrote:
The equivalent /\n\*.+?(?=\n\*|$)/gs (but slightly more optimised) takes
approximately 3*input_length steps to run (according to regex101 at
least). If that's reasonable enough, I can do this:
    iter = re.finditer(input, "\n\*.+?(?=\n\*|$)", re.S)
    return [TestPmdPortInfo.parse(match.group(0)) for match in iter]

Another optimization is artificially adding a `\n*` delimiter at the end
before feeding it to the regex, thus removing the alternative case (|$),
and making it 2*len steps:

    input += "\n*"
    iter = re.finditer(input, "\n\*.+?(?=\n\*)", re.S)
    return [TestPmdPortInfo.parse(match.group(0)) for match in iter]

I like this second one a bit more. How does the performance change if
we try to match four asterisks "\n\****.+?(?=\n\****)"? Four asterisks
shouldn't randomly be in the output as that's basically another
delimited.
The difference is negligible as the regex walks every character anyways 
– while there is a match. Either \* or \*{4} will result in a match. The 
problem is that if an attempt of match fails, the regex backtracks, this 
is where the problem is. Of course if we already matched 4 asterisks, 
then the backtracking can skip the whole sequence at once (compared to 
1), but it's only going to be 3 steps less per **** found. It's a bigger 
difference if we attempt to match all the asterisks. The lookahead 
construct also increases backtracking.
In the meantime, still by amending the output, I've got a solution that 
doesn't perform any look aheads:
  \*{21}.+?\n{2}

instead of treating blocks as they are (\n******<snip>\n), we can add an extra \n at the end and treat the blocks as: ******<snip>\n\n. Basically this assumes that an empty line is the end delimiter. This takes input_length/2 steps!
Of course in reality every \n is \r\n, as I've discovered that when 
shells are invoked using paramiko, the stream becomes CRLF for some 
reason I haven't explored. I think this was worth mentioning for 
everybody, in case surprise carriage returns may reveal disruptive.
And we should document this in the docstring - sample output, then
explain the extra characters and the regex itself. We shouldn't forget
this in the other commit as well (show port stats).
Ack.

Reply via email to