Am 09.08.2017 um 18:07 schrieb Junio C Hamano:
> René Scharfe <[email protected]> writes:
>
>>> In the face of unreliable segfaults we need to reverse our strategy,
>>> I think. Searching for something not in the buffer (e.g. "1") and
>>> considering matches and segfaults as confirmation that the bug is
>>> still present should avoid any false positives. Right?
>>
>> And that's not it either. *sigh*
>>
>> If the 4097th byte is NUL or LF then we can only hope its access
>> triggers a segfault -- there is no other way to distinguish the
>> result from a legitimate match when limiting with REG_STARTEND. So
>> we have to accept this flakiness.
>
>> We can check the value of that byte with [^0] and interpret a
>> match as failure, but that adds negation and makes the test more
>> complex.
>>
>> ^0*$ would falsely match if the 4097th byte (and possibly later
>> ones) is 0. We need to make sure we check for end-of-line after
>> the 4096th byte, not later.
>
> I do not have a strong objection against "^0{64}{64}$", but let me
> make sure that I understand your rationale correctly.
>
> We assume that we do not have an unmapped guard page to reliably
> trap us. So "^0{64}{64}$" will report a false success when that
> 4097th byte is either NUL or LF. There is 2/256 probability of that
> happening but we accept it.
>
> A pattern "0$" will give us a false success in the same case, but
> the reason why it is worse is because in addition, we get a false
> success if that second page begins with "0\0" or "0\n". The chance
> of that happening is additional 2/256 * 1/256. Worse yet, the page
> could start with "00\0" or "00\n", which is additional 2/256 *
> 1/65536. We could have yet another "0" at the beginning of that
> second page, which only increases the chance of us getting a false
> sucess.
I demonstrated a lack of logical thinking and now you bring on
probabilities!? ;-)
There could be any characters except NUL and LF between the 4096 zeros
and "0$" for the latter to match wrongly, no? So there are 4095
opportunities for the misleading pattern in a page, with probabilities
like this:
0$ 1/256 * 2/256
.0$ 254/256 * 1/256 * 2/256
..0$ (254/256)^2 * 1/256 * 2/256
.{3}0$ (254/256)^3 * 1/256 * 2/256
.{4094}0$ (254/256)^4094 * 1/256 * 2/256
That sums up to ca. 1/256 (did that numerically). Does that make
sense?
> So we are saying that we accept ~1/100 false success rate, but
> additional ~1/30000 is unacceptable.
>
> I do not know if I buy that argument, but I do think that additional
> false success rate, even if it is miniscule, is unnecessary. So as
> long as everybody's regexp library is happy with "^0{64}{64}$",
> let's use that.
The parentheses are necessary ("^(0{64}){64}$"), at least on OpenBSD.
It doesn't accept consecutive repetition operators without them. We
could use "^(00000000000000000000000000000000){128}$" instead if
there's a problem with that.
René