On 18/04/2024 07:41, Juraj Linkeš wrote:
The equivalent /\n\*.+?(?=\n\*|$)/gs (but slightly more optimised) takes
approximately 3*input_length steps to run (according to regex101 at
least). If that's reasonable enough, I can do this:

    iter = re.finditer(input, "\n\*.+?(?=\n\*|$)", re.S)
    return [TestPmdPortInfo.parse(match.group(0)) for match in iter]

Another optimization is artificially adding a `\n*` delimiter at the end
before feeding it to the regex, thus removing the alternative case (|$),
and making it 2*len steps:

    input += "\n*"
    iter = re.finditer(input, "\n\*.+?(?=\n\*)", re.S)
    return [TestPmdPortInfo.parse(match.group(0)) for match in iter]


I like this second one a bit more. How does the performance change if
we try to match four asterisks "\n\****.+?(?=\n\****)"? Four asterisks
shouldn't randomly be in the output as that's basically another
delimited.

The difference is negligible as the regex walks every character anyways – while there is a match. Either \* or \*{4} will result in a match. The problem is that if an attempt of match fails, the regex backtracks, this is where the problem is. Of course if we already matched 4 asterisks, then the backtracking can skip the whole sequence at once (compared to 1), but it's only going to be 3 steps less per **** found. It's a bigger difference if we attempt to match all the asterisks. The lookahead construct also increases backtracking.

In the meantime, still by amending the output, I've got a solution that doesn't perform any look aheads:

  \*{21}.+?\n{2}

instead of treating blocks as they are (\n******<snip>\n), we can add an extra \n at the end and treat the blocks as: ******<snip>\n\n. Basically this assumes that an empty line is the end delimiter. This takes input_length/2 steps!

Of course in reality every \n is \r\n, as I've discovered that when shells are invoked using paramiko, the stream becomes CRLF for some reason I haven't explored. I think this was worth mentioning for everybody, in case surprise carriage returns may reveal disruptive.

And we should document this in the docstring - sample output, then
explain the extra characters and the regex itself. We shouldn't forget
this in the other commit as well (show port stats).

Ack.

Reply via email to