[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2009-02-05 Thread Matthew Barnett
Matthew Barnett added the comment: Python 2.6 does (and probably Python 3.x, although I haven't checked): >>> u"\N{LATIN CAPITAL LETTER A}" u'A' If it's good enough for Python's Unicode string literals

[issue5165] os.rename and other raise WindowsError

2009-02-06 Thread Matthew Barnett
Matthew Barnett added the comment: WindowsError is a subclass of OSError, so it's not entirely contradictory, just a little misleading... :-) -- nosy: +mrabarnett

[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2009-02-07 Thread Matthew Barnett
Matthew Barnett added the comment: issue2636-features-2.diff is based on Python 2.6. Bugfix. No new features. Added file: http://bugs.python.org/file12974/issue2636-features-2.diff

[issue1721518] Small case which hangs

2009-02-08 Thread Matthew Barnett
Matthew Barnett added the comment: This problem has been addressed in issue #2636. Although the extra checks certainly aren't foolproof, neither of the examples given are slow. -- nosy: +mrabarnett

[issue1448325] re search infinite loop

2009-02-08 Thread Matthew Barnett
Matthew Barnett added the comment: This problem has been addressed in issue #2636. Although the extra checks certainly aren't foolproof, some regular expressions which were slow won't be any more. -- nosy: +mrabarnett

[issue1566086] RE (regular expression) matching stuck in loop

2009-02-09 Thread Matthew Barnett
Matthew Barnett added the comment: This problem has been addressed in issue #2636. Extra checks have been added to reduce the amount of backtracking. -- nosy: +mrabarnett

[issue1662581] the re module can perform poorly: O(2**n) versus O(n**2)

2009-02-09 Thread Matthew Barnett
Matthew Barnett added the comment: This has been addressed in issue #2636. -- nosy: +mrabarnett

[issue1662581] the re module can perform poorly: O(2**n) versus O(n**2)

2009-02-09 Thread Matthew Barnett
Matthew Barnett added the comment: The new code includes some extra checks which, although not foolproof, certainly reduce the amount of backtracking in a lot of cases.

[issue47023] re.sub shows key error on regex escape chars provided in repl param

2022-03-17 Thread Matthew Barnett
Matthew Barnett added the comment: I'd just like to point out that to a user it could _look_ like a bug, that an error occurred while reporting, because the traceback isn't giving a 'clean' report; the stuff about the KeyError i

[issue47081] Replace "qualifiers" with "quantifiers" in the re module documentation

2022-03-21 Thread Matthew Barnett
Matthew Barnett added the comment: I don't think it's a typo, and you could argue the case for "qualifiers", but I still agree with the proposal as it's a more meaningful term in the context.

[issue47152] Reorganize the re module sources

2022-04-04 Thread Matthew Barnett
Matthew Barnett added the comment: For reference, I also implemented .regs in the regex module for compatibility, but I've never used it myself. I had to do some investigating to find out what it did! It returns a tuple of the spans of the groups. Perhaps I might have used it if it d

[issue37996] 2to3 introduces unwanted extra backslashes for unicode characters in regular expressions

2019-08-31 Thread Matthew Barnett
Matthew Barnett added the comment: You wrote "the u had already been removed by hand". By removing the u in the _Python 2_ code, you changed that string from a Unicode string to a bytestring. In a bytestring, \u is not an escape; b"\u" == b"\\u".

[issue38582] re: backreference number in replace string can't >= 100

2019-10-24 Thread Matthew Barnett
Matthew Barnett added the comment: A numeric escape of 3 digits is an octal (base 8) escape; the octal escape "\100" gives the same character as the hexadecimal escape "\x40". In a replacement template, you can use "\g<100>" if you want group 100 becau
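A short sketch of the octal rule described above, using a one-group pattern chosen purely for illustration (the original report concerned a pattern with 100 or more groups). It compares the octal escape with its hexadecimal equivalent and shows that the `\g<N>` form is not subject to octal interpretation.

```python
import re

# A three-digit numeric escape is octal: "\100" is chr(0o100), i.e. "@".
octal_char = "\100"
hex_char = "\x40"

# The same reading applies inside a replacement template.
octal_result = re.sub(r"(a)", r"\100", "a")

# \g<N> always names a group, so it can be followed by digits safely.
group_result = re.sub(r"(a)", r"\g<1>00", "a")

print(octal_char == hex_char, octal_result, group_result)
```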

[issue38582] re: backreference number in replace string can't >= 100

2019-10-25 Thread Matthew Barnett
Matthew Barnett added the comment: If we did decide to remove it, but there was still a demand for octal escapes, then I'd suggest introducing \oXXX.

[issue23692] Undocumented feature prevents re module from finding certain matches

2019-10-27 Thread Matthew Barnett
Matthew Barnett added the comment: Suppose you had a pattern: .* It would advance one character on each iteration of the * until the . failed to match. The text is finite, so it would stop matching eventually. Now suppose you had a pattern: (?:)* On each iteration of the * it

[issue23692] Undocumented feature prevents re module from finding certain matches

2019-11-04 Thread Matthew Barnett
Matthew Barnett added the comment: It's been many years since I looked at the code, and there have been changes since then, so some of the details might not be correct. As to how it should behave: re.match('(?:()|(?(1)()|z)){1,2}(?(2)a|z)', 'a') Iteration 1. Match

[issue43535] Make str.join auto-convert inputs to strings.

2021-03-19 Thread Matthew Barnett
Matthew Barnett added the comment: I'm also -1, for the same reason as Serhiy gave. However, if it was opt-in, then I'd be OK with it. -- nosy: +mrabarnett

[issue43714] re.split(), re.sub(): '\Z' must consume end of string if it matched

2021-04-03 Thread Matthew Barnett
Matthew Barnett added the comment: Do any other regex implementations behave the way you want? In my experience, there's no single "correct" way for a regex to behave; different implementations might give slightly different results, so if the most common ones behave a ce

[issue28937] str.split(): allow removing empty strings (when sep is not None)

2021-05-18 Thread Matthew Barnett
Matthew Barnett added the comment: The case: ' a b c '.split(maxsplit=1) == ['a', 'b c '] suggests that empty strings don't count towards maxsplit, otherwise it would return [' a b c '] (i.e. the split would give ['', ' a

[issue28937] str.split(): allow removing empty strings (when sep is not None)

2021-05-18 Thread Matthew Barnett
Matthew Barnett added the comment: The best way to think of it is that .split() is like .split(' '), except that it's splitting on any whitespace character instead of just ' ', and keepempty is defaulting to False instead of True. Therefore: ' x y z

[issue28937] str.split(): allow removing empty strings (when sep is not None)

2021-05-18 Thread Matthew Barnett
Matthew Barnett added the comment: We have that already, although it's spelled: ' x y z'.split(maxsplit=1) == ['x', 'y z'] because the keepempty option doesn't exist yet.
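For comparison, a small sketch of how today's `str.split` treats the same string with and without an explicit separator. The `keepempty` option discussed in this thread is a proposal and is not used here.

```python
text = " x y z"

# No separator: runs of whitespace are the delimiter and empty strings are dropped.
no_sep = text.split(maxsplit=1)

# Explicit separator: the leading space produces an empty first field.
space_sep = text.split(" ", 1)

print(no_sep)
print(space_sep)
```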

[issue28937] str.split(): allow removing empty strings (when sep is not None)

2021-05-21 Thread Matthew Barnett
Matthew Barnett added the comment: I've only just realised that the test cases don't cover all eventualities: none of them test what happens with multiple spaces _between_ the letters, such as: ' a b c '.split(maxsplit=1) == ['a', 'b c '] Com

[issue44699] Simple regex appears to take exponential time in length of input

2021-07-21 Thread Matthew Barnett
Matthew Barnett added the comment: It's called "catastrophic backtracking". Think of the number of ways it could match, say, 4 characters: 4, 3+1, 2+2, 2+1+1, 1+3, 1+2+1, 1+1+2, 1+1+1+1. Now try 5 characters...
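The counting argument can be checked with a few lines of code. This sketch enumerates the ordered ways of splitting n characters into non-empty runs; the helper name `compositions` is made up for the illustration and has nothing to do with how the re module works internally.

```python
def compositions(n):
    """Return every ordered way of writing n as a sum of positive integers."""
    if n == 0:
        return [[]]
    result = []
    for first in range(1, n + 1):
        for rest in compositions(n - first):
            result.append([first] + rest)
    return result


# The count doubles with each extra character, which is why a nested
# repeat can become very slow on inputs that fail to match.
for n in range(1, 8):
    print(n, len(compositions(n)))
```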

[issue45155] Add default arguments for int.to_bytes()

2021-09-13 Thread Matthew Barnett
Matthew Barnett added the comment: I'd probably say "In the face of ambiguity, refuse the temptation to guess". As there's disagreement about the 'correct' default, make it None and require either "big" or "little" if lengt

[issue45155] Add default arguments for int.to_bytes()

2021-09-13 Thread Matthew Barnett
Matthew Barnett added the comment: I wonder whether there should be a couple of other endianness values, namely, "native" and "network", for those cases where you want to be explicit about it. If you use "big" it's not clear whether that's because you
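As a sketch of what being explicit looks like with the byte orders that exist today: `"big"` and `"little"` are the accepted values, and `sys.byteorder` reports the native order. The `"native"` and `"network"` names floated above are suggestions only and are not used in the code.

```python
import sys

value = 1024  # 0x0400

big = value.to_bytes(2, "big")        # most significant byte first
little = value.to_bytes(2, "little")  # least significant byte first

# The platform's own order can be requested explicitly via sys.byteorder.
native = value.to_bytes(2, sys.byteorder)

print(big, little, native)
```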

[issue45461] UnicodeDecodeError: 'unicodeescape' codec can't decode byte 0x5c in position 8191: \ at end of string

2021-10-13 Thread Matthew Barnett
Matthew Barnett added the comment: It can be shortened to this: buffer = b"a" * 8191 + b"\\r\\n" with open("bug_csv.csv", "wb") as f: f.write(buffer) with open("bug_csv.csv", encoding="unicode_escape", newline="") as

[issue45539] Negative lookaround assertions sometimes leak capture groups

2021-10-21 Thread Matthew Barnett
Matthew Barnett added the comment: It's definitely a bug. In order for the pattern to match, the negative lookaround must match, which means that its subexpression mustn't match, so none of the groups in that subexpression have captured. -- versions: +P

[issue45869] Unicode and acii regular expressions do not agree on ascii space characters

2021-11-22 Thread Matthew Barnett
Matthew Barnett added the comment: For comparison, the regex module says that 0x1C..0x1F aren't whitespace, and the Unicode property White_Space ("\p{White_Space}" in a pattern, where supported) also says that they ar

[issue45899] NameError on if clause of class-level list comprehension

2021-11-25 Thread Matthew Barnett
Matthew Barnett added the comment: It's not just in the 'if' clause: >>> class Foo: ... a = ['a', 'b'] ... b = ['b', 'c'] ... c = [b for x in a] ... Traceback (most recent call last): File "", line 1, i

[issue38764] Deterministic globbing.

2019-11-11 Thread Matthew Barnett
Matthew Barnett added the comment: I could also add: would sorting be case-sensitive or case-insensitive? Windows is case-insensitive, Linux is case-sensitive. -- nosy: +mrabarnett

[issue38974] using filedialog.askopenfilename() freezes python 3.8

2019-12-04 Thread Matthew Barnett
Matthew Barnett added the comment: I've just tried it on Windows 10 with Python 3.8 64-bit and Python 3.8 32-bit without issue. -- nosy: +mrabarnett

[issue39436] Strange behavior of comparing int and float numbers

2020-01-23 Thread Matthew Barnett
Matthew Barnett added the comment: Python floats have 53 bits of precision, so ints larger than 2**53 will lose their lower bits (assumed to be 0) when converted. -- nosy: +mrabarnett resolution: -> not a bug
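A quick way to see the 53-bit limit in action. The values are chosen for illustration and are not the ones from the original report.

```python
limit = 2**53

# Up to 2**53 every int converts to a float exactly.
exact = float(limit) == limit

# One past the limit, the lowest bit is lost in the conversion.
collapsed = float(limit + 1) == float(limit)

# Comparing an int with a float is done exactly, so the int and its
# own float conversion can compare unequal.
int_vs_float = (limit + 1) == float(limit + 1)

print(exact, collapsed, int_vs_float)
```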

[issue39436] Strange behavior of comparing int and float numbers

2020-01-23 Thread Matthew Barnett
Change by Matthew Barnett : -- stage: -> resolved status: open -> closed

[issue38826] Regular Expression Denial of Service in urllib.request.AbstractBasicAuthHandler

2020-03-03 Thread Matthew Barnett
Matthew Barnett added the comment: A smaller change to the regex would be to replace the "(?:.*,)*" with "(?:[^,]*,)*". I'd also suggest using a raw string instead: rx = re.compile(r'''(?:[^,]*,)*[ \t]*([^ \t]+)[ \t]+realm=(["']?)(

[issue40027] re.sub inconsistency beginning with 3.7

2020-03-20 Thread Matthew Barnett
Matthew Barnett added the comment: Duplicate of Issue39687. See https://docs.python.org/3/library/re.html#re.sub and https://docs.python.org/3/whatsnew/3.7.html#changes-in-the-python-api. -- resolution: -> duplicate stage: -> resolved status: open ->

[issue40043] RegExp Conditional Construct (?(id/name)yes-pattern|no-pattern) Problem

2020-03-22 Thread Matthew Barnett
Matthew Barnett added the comment: The documentation is talking about whether it'll match at the current position in the string. It's not a bug. -- resolution: -> not a bug

[issue40043] RegExp Conditional Construct (?(id/name)yes-pattern|no-pattern) Problem

2020-03-26 Thread Matthew Barnett
Matthew Barnett added the comment: That's what searching does! Does the pattern match here? If not, advance by one character and try again. Repeat until a match is found or you've reached the end.
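The loop described above can be written out by hand. This is only a conceptual sketch (the real engine is implemented in C and has optimisations); the function name `manual_search` is invented for the example.

```python
import re


def manual_search(pattern, text):
    """Try to match at position 0, then 1, then 2, ... until something matches."""
    compiled = re.compile(pattern)
    for pos in range(len(text) + 1):
        m = compiled.match(text, pos)  # does the pattern match *here*?
        if m is not None:
            return m
    return None  # reached the end without a match


found = manual_search(r"\d+", "abc123")
print(found.span() if found else None)
```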

[issue42668] re.escape does not correctly escape newlines

2020-12-17 Thread Matthew Barnett
Matthew Barnett added the comment: In a regex, putting a backslash before any character that's not an ASCII-range letter or digit makes it a literal. re.escape doesn't special-case control characters. Its purpose is to make a string that might contain metacharacters into on
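A minimal check of that purpose: whatever `re.escape` emits for a string, the result should match that exact string and nothing else. The sample text is made up and deliberately contains a newline and some metacharacters.

```python
import re

literal = "a\nb.c*"
pattern = re.escape(literal)

# The escaped form matches the original text exactly...
matches_itself = re.fullmatch(pattern, literal) is not None

# ...and the "." and "*" no longer act as metacharacters.
matches_other = re.fullmatch(pattern, "a\nbxc*") is not None

print(matches_itself, matches_other)
```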

[issue42871] Regex compilation crashed if I change order of alternatives under quantifier

2021-01-08 Thread Matthew Barnett
Matthew Barnett added the comment: It's not a crash. It's complaining that you're referring to group 2 before defining it. The re module doesn't support forward references to groups, but only backward references to them.

[issue42871] Regex compilation crashed if I change order of alternatives under quantifier

2021-01-08 Thread Matthew Barnett
Matthew Barnett added the comment: Example 1: ((a)|b\2)* ^^^ Group 2 ((a)|b\2)* ^^ Reference to group 2 The reference refers backwards to the group. Example 2: (b\2|(a))* ^^^ Group 2 (b\2|(a))* ^^ Reference to group 2
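The two examples can be tried directly. In this sketch the backward reference compiles, while the forward reference is rejected at compile time with `re.error`; the error is caught here only so that the script runs to completion.

```python
import re

# Example 1: group 2 is defined before \2 refers to it.
backward = re.compile(r"((a)|b\2)*")

# Example 2: \2 appears before group 2 has been defined.
try:
    re.compile(r"(b\2|(a))*")
    forward_error = None
except re.error as exc:
    forward_error = exc

print(backward.pattern, "compiled")
print("forward reference rejected:", forward_error)
```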

[issue43156] Python windows installer has a confusing name - add setup to its name

2021-02-07 Thread Matthew Barnett
Matthew Barnett added the comment: Sorry to bikeshed, but I think it would be clearer to keep the version next to the "python" and the "setup" at the end: python-3.10.0a5-win32-setup.exe python-3.10.0a5-win64-setup.exe

[issue41531] Python 3.9 regression: Literal dict with > 65535 items are one item shorter

2020-08-12 Thread Matthew Barnett
Matthew Barnett added the comment: I think what's happening is that in 'compiler_dict' (Python/compile.c), it's checking whether 'elements' has reached a maximum (0x). However, it's not doing this after incrementing; instead, it's checking before i

[issue41664] re.sub does NOT substitute all the matching patterns when re.IGNORECASE is used

2020-08-29 Thread Matthew Barnett
Matthew Barnett added the comment: The 4th argument of re.sub is 'count', not 'flags'. re.IGNORECASE has the numeric value of 2, so: re.sub(r'[aeiou]', '#', 'all is fair in love and war', re.IGNORECASE) is equivalent to: re.sub(r'

[issue41764] sub function would not work without the flags but the search would work fine

2020-09-11 Thread Matthew Barnett
Matthew Barnett added the comment: The arguments are: re.sub(pattern, repl, string, count=0, flags=0). Therefore: re.sub("pattern","replace", txt, re.IGNORECASE | re.DOTALL) is passing re.IGNORECASE | re.DOTALL as the count, not the flags. It's in the document

[issue41885] Unexpected behavior re.sub() with raw f-strings

2020-09-29 Thread Matthew Barnett
Matthew Barnett added the comment: Arguments are evaluated first and then the results are passed to the function. That's true throughout the language. In this instance, you can use \g<1> in the replacement string to refer to group 1: re.sub(r'([a-z]+)', fr"\g<

[issue42473] re.sub ignores flag re.M

2020-11-26 Thread Matthew Barnett
Matthew Barnett added the comment: Not a bug. Argument 4 of re.sub is the count: sub(pattern, repl, string, count=0, flags=0) not the flags. -- nosy: +mrabarnett resolution: -> not a bug stage: -> resolved status: open -> closed
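A sketch of the difference, using a sample string invented for the illustration. Writing `re.sub(pattern, repl, text, re.IGNORECASE)` hands the flag's integer value to `count`; the code spells that out with a keyword so the effect is visible, then shows the intended call.

```python
import re

text = "All Is Fair"

# What the positional call really means: count=2 and no flags at all.
as_count = re.sub(r"[aeiou]", "#", text, count=int(re.IGNORECASE))

# What was intended: unlimited replacements, ignoring case.
as_flags = re.sub(r"[aeiou]", "#", text, flags=re.IGNORECASE)

print(as_count)
print(as_flags)
```

Passing `flags` by keyword avoids the mix-up entirely.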

[issue42475] wrongly cache pattern by re.compile

2020-11-26 Thread Matthew Barnett
Matthew Barnett added the comment: That behaviour has nothing to do with re. This line: samples = filter(lambda sample: not pttn.match(sample), data) creates a generator that, when evaluated, will use the value of 'pttn' _at that time_. However, you then bind 'pttn
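The effect can be reproduced without the reporter's data. In this sketch the sample list and patterns are invented; the lazy `filter` looks up `pttn` only when it is consumed, whereas the list comprehension runs immediately.

```python
import re

data = ["apple", "banana", "cherry"]

# Lazy: nothing is evaluated until the filter object is consumed.
pttn = re.compile("a")
lazy = filter(lambda sample: not pttn.match(sample), data)
pttn = re.compile("b")   # rebinding happens before consumption
late_bound = list(lazy)  # so the lambda sees the *new* pattern

# Eager: the list is built while pttn still refers to the first pattern.
pttn = re.compile("a")
eager = [sample for sample in data if not pttn.match(sample)]
pttn = re.compile("b")

print(late_bound)
print(eager)
```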

[issue13899] re pattern r"[\A]" should work like "A" but matches nothing. Ditto B and Z.

2012-02-03 Thread Matthew Barnett
Matthew Barnett added the comment: This should answer that question: >>> re.findall(r"[\A\C]", r"\AC") ['C'] >>> regex.findall(r"[\A\C]", r"\AC") ['A', 'C

[issue13899] re pattern r"[\A]" should work like "A" but matches nothing. Ditto B and Z.

2012-02-04 Thread Matthew Barnett
Matthew Barnett added the comment: In re, "\A" within a character set should be similar to "\C", but instead it's still interpreted as meaning the start of the string. That's definitely a bug. If it doesn't do what it's supposed to do, then it's a

[issue13998] Lookbehind assertions go behind the start position for the match

2012-02-13 Thread Matthew Barnett
Matthew Barnett added the comment: The documentation says of the 'pos' parameter "This is not completely equivalent to slicing the string" and of the 'endpos' parameter "it will be as if the string is endpos characters long". In other words, it st
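One way to observe that `pos` is not the same as slicing is a lookbehind, as in the issue title. This sketch uses a two-character string chosen for the illustration.

```python
import re

pattern = re.compile(r"(?<=a)b")

# With pos=1 the search starts at "b", but the lookbehind can still
# see the "a" that sits before the start position.
with_pos = pattern.search("ab", 1)

# After slicing, the "a" is gone, so the lookbehind fails.
with_slice = pattern.search("ab"[1:])

print(with_pos, with_slice)
```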

[issue13169] Regular expressions with 0 to 65536 repetitions raises OverflowError

2012-02-29 Thread Matthew Barnett
Matthew Barnett added the comment: Ideally, it should raise an exception (or a warning) because the behaviour is unexpected.

[issue14212] Segfault when using re.finditer over mmap

2012-03-06 Thread Matthew Barnett
Matthew Barnett added the comment: It segfaults because it attempts to access the buffer of an mmap that has been closed. It would certainly be more friendly if it checked whether the mmap was still open and, if not, raised an exception instead. -- nosy: +mrabarnett

[issue14212] Segfault when using re.finditer over mmap

2012-03-07 Thread Matthew Barnett
Matthew Barnett added the comment: In the function "getstring" in _sre.c, the code obtains a pointer to the characters of the buffer and then releases the buffer. There's a comment before the release: /* Release the buffer immediately --- possibly dangerous but

[issue14237] Special sequences \A and \Z don't work in character set []

2012-03-09 Thread Matthew Barnett
Matthew Barnett added the comment: Within a character set \A and \Z should behave like, say, \C; in other words, they should be the literals "A" and "Z".

[issue14237] Special sequences \A and \Z don't work in character set []

2012-03-09 Thread Matthew Barnett
Matthew Barnett added the comment: \s matches a character, whereas \A and \Z don't. Within a character set \s makes sense, but \A and \Z don't, so they should be treated as literals.

[issue14260] re.groupindex available for modification and continues to work, having incorrect data inside it

2012-03-12 Thread Matthew Barnett
Matthew Barnett added the comment: The re module creates the dict purely for the benefit of the user, and as it's a normal dict, it's mutable. An alternative would be to use an immutable dict or dict-like object, but Python doesn't have such a class, and it's probably not wo

[issue14260] re.groupindex available for modification and continues to work, having incorrect data inside it

2012-03-12 Thread Matthew Barnett
Matthew Barnett added the comment: It appears I was wrong. :-( The simplest solution in that case is for it to return a _copy_ of the dict.

[issue1519638] Unmatched Group issue - workaround

2012-03-15 Thread Matthew Barnett
Matthew Barnett added the comment: The replacement can be a callable, so you could do this: re.sub(r'(?:\((?:(\d+)|.*?)\)\s*)+$', lambda m: m.group(1) or '', 'avatar (special edition)')

[issue14342] In re's examples the example with recursion doesn't work

2012-03-16 Thread Matthew Barnett
Matthew Barnett added the comment: As far as I can tell, back in 2003, changes were made to replace the recursive scheme which used stack allocation with a non-recursive scheme which used heap allocation in order to improve the behaviour. To me it looks like an oversight and that the

[issue14343] In re's examples the example with re.split() shadows builtin input()

2012-03-16 Thread Matthew Barnett
Changes by Matthew Barnett : -- title: In re's examples the example with re.split() overlaps builtin input() -> In re's examples the example with re.split() shadows builtin input()

[issue14510] Regular Expression "+" perform wrong repeat

2012-04-05 Thread Matthew Barnett
Matthew Barnett added the comment: If a capture group is repeated, as in r'(\$.)+', only its last match is returned.
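A small sketch of that rule with an invented input string. If every repetition is wanted, one common alternative (an assumption about what the reporter was after) is to drop the outer repeat and use `re.findall`.

```python
import re

text = "$a$b$c"

# The group matched three times, but only the last capture is kept.
m = re.match(r"(\$.)+", text)
last_capture = m.group(1)

# Collect each repetition separately instead.
all_captures = re.findall(r"\$.", text)

print(last_capture, all_captures)
```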

[issue37327] python re bug

2019-06-18 Thread Matthew Barnett
Matthew Barnett added the comment: The problem is the "(?:[^<]+|<(?!/head>))*?". If I simplify it a little I get "(?:[^<]+)*?", which is a repeat within a repeat. There are many ways in which it could match, and if what follows fails to match (it doesn't because there's no "

[issue37687] Invalid regexp should rise exception

2019-07-25 Thread Matthew Barnett
Matthew Barnett added the comment: For historical reasons, if it isn't valid as a repeat then it's a literal. This is true in other regex implementations, and is by no means unique to the re module. -- resolution: -> not a bug stage: -> resolved status
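For illustration, here is a brace sequence that is not a valid repeat next to one that is. The patterns are invented for the example and are not the ones from the report.

```python
import re

# "{x}" is not a valid repeat count, so the braces are taken literally.
literal_braces = re.fullmatch(r"a{x}", "a{x}")

# "{2}" is a valid repeat, so it means "exactly two".
real_repeat = re.fullmatch(r"a{2}", "aa")
not_literal = re.fullmatch(r"a{2}", "a{2}")

print(bool(literal_braces), bool(real_repeat), bool(not_literal))
```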

[issue37723] important performance regression on regular expression parsing

2019-07-31 Thread Matthew Barnett
Matthew Barnett added the comment: I've just had a look at _uniq, and the code surprises me. The obvious way to detect duplicates is with a set, but that requires the items to be hashable. Are they? Well, the first line of the function uses 'set', so they are. Why, then, i

[issue30736] Support Unicode 10.0

2017-06-22 Thread Matthew Barnett
Matthew Barnett added the comment: @Steven: Python 3.6 supports Unicode 9. Python 3.6.1 (v3.6.1:69c0db5, Mar 21 2017, 18:41:36) [MSC v.1900 64 bit (AMD64)] on win32 Type "help", "copyright", "credits" or "license" for more information. >>> i

[issue30772] If I make an attribute "[a unicode version of B]", it gets assigned to "[ascii B]", and so on.

2017-06-26 Thread Matthew Barnett
Matthew Barnett added the comment: See PEP 3131 -- Supporting Non-ASCII Identifiers It says: """All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC.""" >>> import unicodedata >>> un
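The normalisation can be observed from Python itself. This sketch uses SCRIPT CAPITAL B as a stand-in for the "unicode version of B" in the issue title (the exact character in the report is not shown here, so that choice is an assumption).

```python
import unicodedata

script_b = "\N{SCRIPT CAPITAL B}"

# NFKC folds the compatibility character to plain ASCII "B".
normalized = unicodedata.normalize("NFKC", script_b)

# The parser applies the same normalisation to identifiers, so assigning
# to the script letter in source code binds the name "B".
namespace = {}
exec(script_b + " = 1", namespace)

print(normalized, "B" in namespace)
```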

[issue30802] datetime.datetime.strptime('200722', '%Y%U')

2017-06-29 Thread Matthew Barnett
Matthew Barnett added the comment: Expected result is datetime.datetime(2017, 6, 25, 0, 0). -- nosy: +mrabarnett

[issue30838] re \w does not match some valid Unicode characters

2017-07-03 Thread Matthew Barnett
Matthew Barnett added the comment: In Unicode 9.0.0, U+1885 and U+1886 changed from being General_Category=Other_Letter (Lo) to General_Category=Nonspacing_Mark (Mn). U+2118 is General_Category=Math_Symbol (Sm) and U+212E is General_Category=Other_Symbol (So). \w doesn't include Mn, Sm

[issue30838] re \w does not match some valid Unicode characters

2017-07-05 Thread Matthew Barnett
Matthew Barnett added the comment: Python identifiers match the regex: [_\p{XID_Start}]\p{XID_Continue}* The standard re module doesn't support \p{...}, but the third-party "regex" module does.

[issue30927] re.sub() does not work correctly on '.' pattern and \n

2017-07-13 Thread Matthew Barnett
Matthew Barnett added the comment: The 4th parameter is the count, not the flags: sub(pattern, repl, string, count=0, flags=0) >>> re.sub(r'X.', '+', '-X\n-', flags=re.DOTALL) '-+-' -- resolution: ->

[issue30973] Regular expression "hangs" interpreter

2017-07-20 Thread Matthew Barnett
Matthew Barnett added the comment: The regex module is much better in this respect, but it's not foolproof. With this particular example it completes quickly.

[issue30802] datetime.datetime.strptime('200722', '%Y%U')

2017-07-25 Thread Matthew Barnett
Matthew Barnett added the comment: I think the relevant standard is ISO 8601: https://en.wikipedia.org/wiki/ISO_8601 The first day of the week is Monday. Note particularly the examples it gives: Monday 29 December 2008 is written "2009-W01-1" Sunday 3 January 2010 is wri
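The first of those examples can be checked with the standard library's ISO calendar support. This is a sketch of the ISO 8601 numbering only; it does not say anything about how `%U` behaves in `strptime`.

```python
from datetime import date

# Monday 29 December 2008 belongs to ISO week 1 of 2009.
iso_year, iso_week, iso_weekday = date(2008, 12, 29).isocalendar()

label = f"{iso_year}-W{iso_week:02d}-{iso_weekday}"
print(label)
```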

[issue31193] re.IGNORECASE strips combining character from lower case of LATIN CAPITAL LETTER I WITH DOT ABOVE

2017-08-14 Thread Matthew Barnett
Matthew Barnett added the comment: The re module works with codepoints, it doesn't understand canonical equivalence. For example, it doesn't recognise that "\N{LATIN CAPITAL LETTER E}\N{COMBINING ACUTE ACCENT}" is equivalent to "\N{LATIN CAPITAL LETTER E WITH ACUTE}"
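A sketch of the codepoint-level behaviour and of one possible workaround: normalising both pattern and text to the same form before matching. The workaround is a suggestion added for this example, not something proposed in the message.

```python
import re
import unicodedata

decomposed = "E\N{COMBINING ACUTE ACCENT}"            # two codepoints
precomposed = "\N{LATIN CAPITAL LETTER E WITH ACUTE}"  # one codepoint

# Canonically equivalent, but re compares codepoints, so no match.
raw_match = re.fullmatch(precomposed, decomposed)

# After NFC normalisation both sides are the same single codepoint.
nfc_text = unicodedata.normalize("NFC", decomposed)
normalized_match = re.fullmatch(precomposed, nfc_text)

print(raw_match, normalized_match)
```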

[issue35538] splitext does not seems to handle filepath ending in .

2018-12-19 Thread Matthew Barnett
Matthew Barnett added the comment: It always returns the dot. For example: >>> posixpath.splitext('.blah.txt') ('.blah', '.txt') If there's no extension (no dot): >>> posixpath.splitext('blah') ('blah', 

[issue35546] String formatting produces incorrect result with left-aligned zero-padded format

2018-12-20 Thread Matthew Barnett
Matthew Barnett added the comment: A similar issue exists with centring: >>> format(42, '^020') '0420' -- nosy: +mrabarnett ___ Python tracker <ht

[issue35645] Alarm usage

2019-01-03 Thread Matthew Barnett
Matthew Barnett added the comment: @Steven: The complaint is that the BEL character ('\a') doesn't result in a beep when printed. @Siva: These days, you shouldn't be relying on '\a' because it's not always supported. If you want to make a beep, do so with

[issue35653] All regular expression match groups are the empty string

2019-01-03 Thread Matthew Barnett
Matthew Barnett added the comment: Look at the spans of the groups: >>> import re >>> re.search(r'^(?:(\d*)(\D*))*$', "42AZ").span(1) (4, 4) >>> re.search(r'^(?:(\d*)(\D*))*$', "42AZ").span(2) (4, 4) They're telling you

[issue35859] Capture behavior depends on the order of an alternation

2019-01-30 Thread Matthew Barnett
Matthew Barnett added the comment: It looks like a bug in re to me.

[issue35859] Capture behavior depends on the order of an alternation

2019-01-30 Thread Matthew Barnett
Matthew Barnett added the comment: It matches, and the span is (0, 2). The only way that it can match like that is for the capture group to match the 'a', and the final 'b' to match the 'b'. Therefore, re.search(r'(ab|a)*b', 'ab').groups() s

[issue35155] Clarify Protocol Handlers in urllib.request Docs

2019-02-12 Thread Matthew Barnett
Matthew Barnett added the comment: You could italicise the "protocol" part using asterisks, like this: *protocol*_request or this: *protocol*\ _request depending on the implementation of the rst software. -- nosy: +mrabarnett

[issue17441] Do not cache re.compile

2017-03-07 Thread Matthew Barnett
Matthew Barnett added the comment: If we were doing it today, maybe we wouldn't cache them, but, as you say, it's been like that for a long time. (The regex module also caches them, because the re module does.) Unless someone can demonstrate that it's a problem, I'd say ju

[issue29977] re.sub stalls forever on an unmatched non-greedy case

2017-04-04 Thread Matthew Barnett
Matthew Barnett added the comment: A slightly shorter form: /\*(?:(?!\*/).)*\*/ Basically it's: match start while not match end: consume character match end If the "match end" is a single character, you can use a negated character set, for exa
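The pattern above can be exercised on a made-up snippet of C-like source. `re.DOTALL` is added here as an assumption so that comments spanning several lines are matched too; the message itself does not mention flags.

```python
import re

# match start, then consume characters while "*/" does not follow, then match end
COMMENT = re.compile(r"/\*(?:(?!\*/).)*\*/", re.DOTALL)

source = "int a; /* first */ int b; /* second\n   comment */"
comments = COMMENT.findall(source)

print(comments)
```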

[issue30133] Strings that end with properly escaped backslashes cause error to be thrown in re.search/sub/etc. functions.

2017-04-21 Thread Matthew Barnett
Matthew Barnett added the comment: Yes, the second argument is a replacement template, not a literal. This issue does point out a different problem, though: re.escape will add backslashes that will then be treated as literals in the template, for example: >>> re.sub(r'a', r

[issue30133] Strings that end with properly escaped backslashes cause error to be thrown in re.search/sub/etc. functions.

2017-04-21 Thread Matthew Barnett
Matthew Barnett added the comment: The function solution does have a larger overhead than a literal. Could the template be made more accepting of backslashes without breaking anything? (There's also issue29995 "re.escape() escapes too much", wh
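A sketch of the two ways to get a literal replacement that ends in a backslash, using a made-up replacement string. Doubling the backslashes by hand is an alternative added for this example; the thread itself discusses the function form and `re.escape`.

```python
import re

replacement = "x\\"  # the two characters: x, backslash

# As a template, a trailing backslash is an incomplete escape.
try:
    re.sub(r"a", replacement, "a")
    template_error = None
except re.error as exc:
    template_error = exc

# A function's return value is used literally, with no template processing.
via_function = re.sub(r"a", lambda m: replacement, "a")

# Alternatively, double each backslash so the template yields a single one.
via_doubling = re.sub(r"a", replacement.replace("\\", "\\\\"), "a")

print(template_error, via_function, via_doubling)
```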

[issue30148] Pathological regex behaviour

2017-04-23 Thread Matthew Barnett
Matthew Barnett added the comment: If 'ignores' is '', you get this: (?:\b(?:extern|G_INLINE_FUNC|%s)\s*) which can match an empty string, and it's tried repeatedly. That's inadvisable. There's also: (?:\s+|\*)+ which can match whitespace in mul

[issue30157] csv.Sniffer.sniff() regex error

2017-04-25 Thread Matthew Barnett
Matthew Barnett added the comment: There are 4 patterns. They try to determine the delimiter and quote by looking for matches. Each pattern supposedly covers one of 4 cases: 1. Delimiter, quote, value, quote, delimiter. 2. Start of line/text, quote, value, quote, delimiter. 3. Delimiter

[issue30209] some UTF8 symbols

2017-04-29 Thread Matthew Barnett
Matthew Barnett added the comment: IDLE uses tkinter, which wraps tcl/tk. Versions up to tcl/tk 8.6 can't handle 'astral' codepoints. See also: Issue #30019: IDLE freezes when opening a file with astral characters Issue #21084: IDLE can't deal with characters above the

[issue36397] re.split() incorrectly splitting on zero-width pattern

2019-03-21 Thread Matthew Barnett
Matthew Barnett added the comment: >From the docs: """If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.""" The pattern does contain a capture, so that's why

[issue36397] re.split() incorrectly splitting on zero-width pattern

2019-03-23 Thread Matthew Barnett
Matthew Barnett added the comment: The list alternates between substrings (s, between the splits) and captures (c): ['1', '1', '2', '2', '11'] -s- -c- -s- -c- -s-- You can use slicing to extract the substrings: >>> re.split
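The slicing idea can be shown on a simpler split than the one in the report. The pattern and input below are invented for the illustration.

```python
import re

# With one capture group, the result alternates substring, capture, substring, ...
parts = re.split(r"(,)", "a,b,c")

substrings = parts[::2]   # positions 0, 2, 4, ...
captures = parts[1::2]    # positions 1, 3, ...

print(parts, substrings, captures)
```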

[issue32308] Replace empty matches adjacent to a previous non-empty match in re.sub()

2019-04-11 Thread Matthew Barnett
Matthew Barnett added the comment: It's now consistent with Perl, PCRE and .Net (C#), as well as re.split(), re.sub(), re.findall() and re.finditer().

[issue32308] Replace empty matches adjacent to a previous non-empty match in re.sub()

2019-04-12 Thread Matthew Barnett
Matthew Barnett added the comment: Consider re.findall(r'.{0,2}', 'abcde'). It finds 'ab', then continues where it left off to find 'cd', then 'e'. It can also find ''; re.match(r'.*', '') does match, aft
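Running the example confirms the sequence described, including the final empty match at the end of the string.

```python
import re

matches = re.findall(r".{0,2}", "abcde")
print(matches)

# An empty string still yields one (empty) match.
empty_match = re.match(r".*", "")
print(empty_match is not None)
```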

[issue36653] Dictionary Key is without ' ' quotes

2019-04-17 Thread Matthew Barnett
Matthew Barnett added the comment: That should be: def __repr__(self): return repr(self.name) Not a bug. -- resolution: -> not a bug stage: -> resolved status: open -> closed
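A minimal sketch of the suggested fix in context. The class name `Named` and its single attribute are assumptions, since the reporter's class is not shown here.

```python
class Named:
    def __init__(self, name):
        self.name = name

    def __repr__(self):
        # repr() of the string keeps the quotes; returning self.name would not.
        return repr(self.name)


shown = repr({Named("a"): 1})
print(shown)
```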

[issue36468] Treeview: wrong color change

2019-05-16 Thread Matthew Barnett
Matthew Barnett added the comment: I've just come across the same problem. For future reference, adding the following code before using a Treeview widget will fix the problem: def fixed_map(option): # Fix for setting text colour for Tkinter 8.6.9 # From: https://core.tcl.tk/tk

[issue32982] Parse out invisible Unicode characters?

2018-03-02 Thread Matthew Barnett
Matthew Barnett added the comment: For the record, '\u200e' is '\N{LEFT-TO-RIGHT MARK}'. -- nosy: +mrabarnett

[issue31759] re wont recover nor fail on runaway regular expression

2017-10-11 Thread Matthew Barnett
Matthew Barnett added the comment: You shouldn't assume that just because it takes a long time on one implementation that it'll take a long time on all of the others, because it's sometimes possible to include additional checks to reduce the problem. (I doubt you could elimin

[issue31759] re wont recover nor fail on runaway regular expression

2017-10-13 Thread Matthew Barnett
Matthew Barnett added the comment: @Tim: the regex module includes some extra checks to reduce the chance of excessive backtracking. In the case of the OP's example, they seem to be working. However, it's difficult to know when adding such checks will help, and your example is one

[issue31803] Remove not portable time.clock(), replaced by time.perf_counter() and time.process_time()

2017-10-17 Thread Matthew Barnett
Matthew Barnett added the comment: @Victor: True, people often ignore DeprecationWarning anyway, but that's their problem, at least you can say "well, you were warned". They might not have read the documentation on it recently because they have not felt the need to read

[issue31856] Unexpected behavior of re module when VERBOSE flag is set

2017-10-23 Thread Matthew Barnett
Matthew Barnett added the comment: Your verbose examples put the pattern into raw triple-quoted strings, which is OK, but their first character is a backslash, which makes the next character (a newline) an escaped literal whitespace character. Escaped whitespace is significant in a verbose
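The effect of that leading backslash can be reproduced with a tiny verbose pattern. The pattern text `abc` is made up for the example.

```python
import re

# Backslash immediately after the opening quotes: the newline that follows
# becomes an escaped literal and must be present in the text.
escaped_newline = re.compile(r"""\
    abc
""", re.VERBOSE)

# No backslash: the newline is ordinary whitespace and is ignored.
plain = re.compile(r"""
    abc
""", re.VERBOSE)

print(escaped_newline.match("abc"), escaped_newline.match("\nabc"))
print(plain.match("abc"))
```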

[issue31969] re.groups() is not checking the arguments

2017-11-08 Thread Matthew Barnett
Matthew Barnett added the comment: @Narendra: The argument, if provided, is merely a default. Checking whether it _could_ be used would not be straightforward, and raising an exception if it would never be used would have little, if any, benefit. It's not a bug, and it's not wort
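A short sketch of the argument acting as a default, with invented patterns. It is used only for groups that did not take part in the match and is otherwise ignored without complaint.

```python
import re

# Group 2 is optional and does not participate in this match.
partial = re.match(r"(a)(b)?", "a")
without_default = partial.groups()
with_default = partial.groups("x")

# Every group participates here, so the default is never needed.
full = re.match(r"(a)(b)", "ab")
unused_default = full.groups("x")

print(without_default, with_default, unused_default)
```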

[issue25054] Capturing start of line '^'

2017-12-02 Thread Matthew Barnett
Matthew Barnett added the comment: The pattern: \b|:+ will match a word boundary (zero-width) before colons, so if there's a word followed by colons, finditer will find the boundary and then the colons. You _can_ get a zero-width match (ZWM) joined to the start of a nonzero-width
