[issue35859] Capture behavior depends on the order of an alternation

2019-01-30 Thread James Davis


New submission from James Davis :

I have two regexes: /(a|ab)*?b/ and /(ab|a)*?b/.
If I re.search the string "ab" for these regexes, I get inconsistent behavior.
Specifically, /(a|ab)*?b/ matches with capture "a", while /(ab|a)*?b/ matches 
with an empty capture group.

I am not actually sure which behavior is correct.


Interpretation 1: The (ab|a) clause matches the a, satisfying the (ab|a)*? 
once, and the engine proceeds to the b and completes. The capture group ends up 
containing "a".

Interpretation 2: The (ab|a) clause matches the a. Since the clause is marked 
with *, the engine repeats the attempt and finds nothing the second time. It 
proceeds to the b and completes. Because the second match attempt on (ab|a) 
found nothing, the capture group ends up empty.

The behavior depends on both the order of (ab|a) vs. (a|ab), and the use of the 
non-greedy quantifier.

I cannot see why changing the order of the alternation should have this effect.

The change in behavior occurs in the built-in "re" module but not in the 
competing "regex" module.
The behavior is consistent in both Python 2.7 and Python 3.5. I have not tested 
other versions.

I have included the confusing-regex-behavior.py file for troubleshooting.

Below is the behavior for matches on these and many variants.
I find the following lines the most striking:

Regex patternmatched?   matched string captured 
content
   

(ab|a)*?bTrue   ab
('',)
(ab|a)+?bTrue   ab
('',)
(ab|a){0,}?b True   ab
('',)
(ab|a){0,2}?bTrue   ab
('',)
(ab|a){0,1}?bTrue   ab   
('a',)
(ab|a)*b True   ab   
('a',)
(ab|a)+b True   ab   
('a',)
(a|ab)*?bTrue   ab   
('a',)
(a|ab)+?bTrue   ab   
('a',)


(08:58:48) jamie@woody ~ $ python3 /tmp/confusing-regex-behavior.py 


Behavior from re


Regex patternmatched?   matched string captured 
content
   

(ab|a)*?bTrue   ab
('',)
(ab|a)+?bTrue   ab
('',)
(ab|a){0,}?b True   ab
('',)
(ab|a){0,2}?bTrue   ab
('',)
(ab|a)?b True   ab   
('a',)
(ab|a)??bTrue   ab   
('a',)
(ab|a)b  True   ab   
('a',)
(ab|a){0,1}?bTrue   ab   
('a',)
(ab|a)*b True   ab   
('a',)
(ab|a)+b True   ab   
('a',)
(a|ab)*b True   ab   
('a',)
(a|ab)+b True   ab   
('a',)
(a|ab)*?bTrue   ab   
('a',)
(a|ab)+?bTrue   ab   
('a',)
(a|ab)*?bTrue   ab   
('a',)
(a|ab)*?bTrue   ab   
('a',)
(a|ab)*?bTrue   ab   
('a',)
(a|ab)*?bTrue   ab   
('a',)
(bb|a)*?bTrue   ab   
('a',)
((?:ab|a)*?)bTrue   ab   
('a',)
((?:a|ab)*?)bTrue   ab   
('a',)


Behavior from regex


Regex patternmatched?   matched string captured 
content
   

(ab|a)*?bTrue   ab   
('a',)
(ab|a)+?b

[issue35859] Capture behavior depends on the order of an alternation

2019-01-30 Thread James Davis


James Davis  added the comment:

Thanks for your thoughts, Raymond. I understand that the alternation has 
"short-circuit" behavior, but I still find it confusing in this case.

Consider these two:


Regex patternmatched?   matched string captured 
content
   

(ab|a)*?bTrue   ab
('',)
(ab|a)+?bTrue   ab
('',)

In order to satisfy the first "(ab|a)+?" clause the regex engine has to find at 
least one match for (ab|a), and still match the final "b" clause of the pattern.

In this case, although "(ab|a)" will match "ab", doing so would cause the 
overall pattern to mismatch. So it seems like in order to obtain the match 
(which it does, see the second column), the regex engine must proceed past the 
first "ab" into the "a" part of the OR. But then I would expect the capture 
group to contain "a" and it does not.

For what it's worth, I also tried the match /(ab|a)*?b/ in PHP, Perl, Java, 
Ruby, Go, Rust and Node.js. The other 7 regex engines all matched "ab" and 
captured "a". Only Python's re module matches with an empty capture -- and even 
here it disagrees with the Python "regex" module as I noted in my initial post.

--

___
Python tracker 
<https://bugs.python.org/issue35859>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue32981] Catastrophic backtracking in poplib and difflib

2018-03-01 Thread James Davis

New submission from James Davis :

Hi Python security team,

My name is James Davis. I'm a security researcher at Virginia Tech.

The python core (cpython) has 2 regular expressions vulnerable to catastrophic 
backtracking that look like potential DOS vectors.
The vulnerable expressions are listed below.

Each vulnerability has the following keys, explained in more detail below:
 - pattern
 - filesIn (as of December/January; I excluded any appearances in 
irrelevant-looking dirs, and in '.min' files)
 - stringLenFor10Sec
 - nPumpsFor10Sec
 - attackFormat
 - blowupCurve

The attack format describes how to generate an attack string.
On my machine, an attack string generated using nPumpsFor10Sec repetitions 
("pumps") of the pump string(s)
blocks the python regex engine for 10 seconds, though this will vary based on 
your hardware.

Compose an attack string like this:
  'prefix 1' + 'pump 1' X times + 'prefix 2' + 'pump 2' X times + ... + suffix
Example:
  With pumpPairs: [{'prefix': 'a', 'pump': 'b'}], suffix: 'c', an attack string 
with three pumps would be:
abbbc

Catastrophic backtracking blows up at either an exponential rate or a 
super-linear (power law) rate.
The blowupCurve indicates how severe the blow-up is.
The 'type' is either EXP(onential) or POW(er law) in the number of pumps (x).
The 'parms' are the parameters for the two curve types. The second parameter is 
more important, because:
  EXP: f(x) = parms[0] * parms[1]^x
  POW: f(x) = parms[0] * x^parms[1]

JSON formatted:

Vuln 1:

{
   "attackFormat" : {
  "pumpPairs" : [
 {
"pump" : "]+>)",
   "filesIn" : [
  [
 "Lib/poplib.py"
  ]
   ]
}

Vuln 2:

{
   "blowupCurve" : {
  "parms" : [
 1.31911634447601e-08,
 1.89691808610459
  ],
  "r2" : 0.998387790742004,
  "type" : "POWER"
   },
   "stringLenFor10Sec" : 48328,
   "attackFormat" : {
  "pumpPairs" : [
 {
"pump" : "\t",
"prefix" : "\t"
 }
  ],
  "suffix" : "##"
   },
   "pattern" : "\\s*#?\\s*$",
   "filesIn" : [
  [
 "Lib/difflib.py"
  ]
   ],
   "nPumpsFor10Sec" : "48325"
}

--
components: Library (Lib)
messages: 313119
nosy: davisjam
priority: normal
pull_requests: 5723
severity: normal
status: open
title: Catastrophic backtracking in poplib and difflib
type: security

___
Python tracker 
<https://bugs.python.org/issue32981>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue32997] Catastrophic backtracking in fpformat

2018-03-05 Thread James Davis

New submission from James Davis :

The decoder regex used to parse numbers in the fpformat module is vulnerable to 
catastrophic backtracking.

'^([-+]?)0*(\d*)((?:\.\d*)?)(([eE][-+]?\d+)?)$'

The substructure '0*(\d*)' is quadratic.
An attack string like '+0000++' blows up.

There is a risk of DOS (REDOS) if a web app uses this module to format 
untrusted strings.

--
components: Library (Lib)
messages: 313249
nosy: davisjam
priority: normal
severity: normal
status: open
title: Catastrophic backtracking in fpformat
type: security
versions: Python 2.7

___
Python tracker 
<https://bugs.python.org/issue32997>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue32997] Catastrophic backtracking in fpformat

2018-03-05 Thread James Davis

Change by James Davis :


--
keywords: +patch
pull_requests: +5750
stage:  -> patch review

___
Python tracker 
<https://bugs.python.org/issue32997>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue32997] Catastrophic backtracking in fpformat

2018-03-05 Thread James Davis

James Davis  added the comment:

Equivalent, probably cleaner. Comment on the PR if you want a change.

--

___
Python tracker 
<https://bugs.python.org/issue32997>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com