New submission from James Davis <davis...@vt.edu>:

I have two regexes: /(a|ab)*?b/ and /(ab|a)*?b/.
When I run re.search on the string "ab" with each of them, the captures are
inconsistent.
Specifically, /(a|ab)*?b/ matches with capture "a", while /(ab|a)*?b/ matches
with an empty capture group.
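
A minimal reproduction of the comparison above, as a sketch. The helper name
capture_for is hypothetical and is not taken from the attached file. The
captured value for /(ab|a)*?b/ may differ between Python versions, so no
expected output is shown for it.

```python
import re

# Hypothetical helper for illustration; not part of the attached script.
def capture_for(pattern, text="ab"):
    """Return (matched_text, groups) from re.search, or None if no match."""
    m = re.search(pattern, text)
    if m is None:
        return None
    return m.group(0), m.groups()

# Same input string, alternation branches in opposite order.
for pattern in (r"(a|ab)*?b", r"(ab|a)*?b"):
    print(pattern, capture_for(pattern))
```

Both patterns match the whole string "ab"; only the reported content of
group 1 is in question.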

I am not actually sure which behavior is correct.


Interpretation 1: The (ab|a) clause matches the a, satisfying the (ab|a)*? 
once, and the engine proceeds to the b and completes. The capture group ends up 
containing "a".

Interpretation 2: The (ab|a) clause matches the a. Since the clause is marked 
with *, the engine repeats the attempt and finds nothing the second time. It 
proceeds to the b and completes. Because the second match attempt on (ab|a) 
found nothing, the capture group ends up empty.
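
Two general facts about "re" may help when weighing these interpretations: a
group inside a repetition reports its last successful iteration, and a group
that never takes part in the match reports None rather than an empty string.
A small sketch of both; the patterns here are illustrative and are not from
the attached file.

```python
import re

# A group inside a repetition reports the text of its last successful iteration.
last = re.match(r"(a|b)*", "ab").group(1)

# A group that takes part in zero iterations reports None, not "".
skipped = re.match(r"(a)*b", "b").group(1)

print(repr(last), repr(skipped))
```

So an empty string in the capture column is distinct from a group that was
skipped entirely: it means the group is recorded as spanning zero characters.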

The behavior depends on both the order of (ab|a) vs. (a|ab), and the use of the 
non-greedy quantifier.

I cannot see why changing the order of the alternation should have this effect.

The inconsistency occurs in the built-in "re" module but not in the
third-party "regex" module.
It is the same in both Python 2.7 and Python 3.5. I have not tested
other versions.

I have included the confusing-regex-behavior.py file for troubleshooting.

Below is the behavior for matches on these and many variants.
I find the following lines the most striking:

Regex pattern                    matched?       matched string     captured content
-------------------- -------------------- -------------------- --------------------
(ab|a)*?b                            True                   ab                ('',)
(ab|a)+?b                            True                   ab                ('',)
(ab|a){0,}?b                         True                   ab                ('',)
(ab|a){0,2}?b                        True                   ab                ('',)
(ab|a){0,1}?b                        True                   ab               ('a',)
(ab|a)*b                             True                   ab               ('a',)
(ab|a)+b                             True                   ab               ('a',)
(a|ab)*?b                            True                   ab               ('a',)
(a|ab)+?b                            True                   ab               ('a',)
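
For reference, rows in this format can be produced with a short loop over the
patterns. The sketch below is an illustration only, not the attached
confusing-regex-behavior.py; the helper name describe and the column widths
are assumptions based on the layout of the table.

```python
import re

# Hypothetical helper; the attached script may be written differently.
def describe(pattern, text="ab"):
    """Format one table row for re.search(pattern, text)."""
    m = re.search(pattern, text)
    matched = m is not None
    whole = m.group(0) if matched else ""
    groups = m.groups() if matched else ()
    # str() is needed: formatting a bool with a width spec would print 1 or 0.
    return f"{pattern:<20} {str(matched):>20} {whole:>20} {str(groups):>20}"

for pattern in (r"(ab|a)*?b", r"(ab|a)*b", r"(a|ab)*?b"):
    print(describe(pattern))
```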


(08:58:48) jamie@woody ~ $ python3 /tmp/confusing-regex-behavior.py 


Behavior from re


Regex pattern                    matched?       matched string     captured content
-------------------- -------------------- -------------------- --------------------
(ab|a)*?b                            True                   ab                ('',)
(ab|a)+?b                            True                   ab                ('',)
(ab|a){0,}?b                         True                   ab                ('',)
(ab|a){0,2}?b                        True                   ab                ('',)
(ab|a)?b                             True                   ab               ('a',)
(ab|a)??b                            True                   ab               ('a',)
(ab|a)b                              True                   ab               ('a',)
(ab|a){0,1}?b                        True                   ab               ('a',)
(ab|a)*b                             True                   ab               ('a',)
(ab|a)+b                             True                   ab               ('a',)
(a|ab)*b                             True                   ab               ('a',)
(a|ab)+b                             True                   ab               ('a',)
(a|ab)*?b                            True                   ab               ('a',)
(a|ab)+?b                            True                   ab               ('a',)
(a|ab)*?b                            True                   ab               ('a',)
(a|ab)*?b                            True                   ab               ('a',)
(a|ab)*?b                            True                   ab               ('a',)
(a|ab)*?b                            True                   ab               ('a',)
(bb|a)*?b                            True                   ab               ('a',)
((?:ab|a)*?)b                        True                   ab               ('a',)
((?:a|ab)*?)b                        True                   ab               ('a',)


Behavior from regex


Regex pattern                    matched?       matched string     captured content
-------------------- -------------------- -------------------- --------------------
(ab|a)*?b                            True                   ab               ('a',)
(ab|a)+?b                            True                   ab               ('a',)
(ab|a){0,}?b                         True                   ab               ('a',)
(ab|a){0,2}?b                        True                   ab               ('a',)
(ab|a)?b                             True                   ab               ('a',)
(ab|a)??b                            True                   ab               ('a',)
(ab|a)b                              True                   ab               ('a',)
(ab|a){0,1}?b                        True                   ab               ('a',)
(ab|a)*b                             True                   ab               ('a',)
(ab|a)+b                             True                   ab               ('a',)
(a|ab)*b                             True                   ab               ('a',)
(a|ab)+b                             True                   ab               ('a',)
(a|ab)*?b                            True                   ab               ('a',)
(a|ab)+?b                            True                   ab               ('a',)
(a|ab)*?b                            True                   ab               ('a',)
(a|ab)*?b                            True                   ab               ('a',)
(a|ab)*?b                            True                   ab               ('a',)
(a|ab)*?b                            True                   ab               ('a',)
(bb|a)*?b                            True                   ab               ('a',)
((?:ab|a)*?)b                        True                   ab               ('a',)
((?:a|ab)*?)b                        True                   ab               ('a',)

----------
components: Regular Expressions
files: confusing-regex-behavior.py
messages: 334560
nosy: davisjam, ezio.melotti, mrabarnett
priority: normal
severity: normal
status: open
title: Capture behavior depends on the order of an alternation
type: behavior
versions: Python 2.7, Python 3.5
Added file: https://bugs.python.org/file48085/confusing-regex-behavior.py

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue35859>
_______________________________________