Dan M wrote:

> I'm getting bogged down with backslash escaping.
>
> I have some text files containing characters with the 8th bit set. These
> characters are encoded one of two ways: either "=hh" or "\xhh", where "h"
> represents a hex digit, and "\x" is a literal backslash followed by a
> lower-case x.
>
> Catching the first case with a regex is simple. But when I try to write a
> regex to catch the second case, I mess up the escaping.
>
> I took at look at http://docs.python.org/howto/regex.html, especially the
> section titled "The Backslash Plague". I started out trying :
>
> d...@dan:~/personal/usenet$ python
> Python 2.7 (r27:82500, Nov 15 2010, 12:10:23)
> [GCC 4.3.2] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import re
> >>> r = re.compile('\\\\x([0-9a-fA-F]{2})')
> >>> a = "This \xef file \xef has \x20 a bunch \xa0 of \xb0 crap \xc0
> characters \xefn \xeft."
> >>> m = r.search(a)
> >>> m
>
> No match.
>
> I then followed the advice of the above-mentioned document, and expressed
> the regex as a raw string:
>
> >>> r = re.compile(r'\\x([0-9a-fA-F]{2})')
> >>> r.search(a)
>
> Still no match.
>
> I'm obviously missing something. I spent a fair bit of time playing with
> this over the weekend, and I got nowhere. Now it's time to ask for help.
> What am I doing wrong here?

What you're missing is that string `a` doesn't actually contain
four-character sequences like '\', 'x', 'a', 'a'. It contains single
characters that you encode in string literals as '\xaa' and so on. You might
do better with

    p1 = r'([\x80-\xff])'
    r1 = re.compile(p1)
    m = r1.search(a)

I get at least an <_sre.SRE_Match object at 0xb749a6e0> when I try this.

	Mel.
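
To make the difference concrete, here is a minimal sketch contrasting the two kinds of string. The names `decoded`, `literal`, `escaped_form` and `high_chars` are made up for illustration, and the sample text is shortened from the original. It assumes the files on disk really hold a backslash, an "x" and two hex digits, which a raw string literal reproduces.

```python
import re

# Non-raw literal: Python turns each \xhh escape into ONE character,
# so no backslash survives in the resulting string.
decoded = "This \xef file has \xa0 odd characters"

# Raw literal: backslash, 'x' and two hex digits stay as FOUR characters,
# which is what text read from the files described above would look like.
literal = r"This \xef file has \xa0 odd characters"

# Matches the literal four-character form; \\ in a raw pattern is one backslash.
escaped_form = re.compile(r'\\x([0-9a-fA-F]{2})')

# Matches actual characters in the range 0x80-0xff.
high_chars = re.compile(r'[\x80-\xff]')

print(len("\xef"), len(r"\xef"))        # 1 4
print(escaped_form.findall(decoded))    # [] (no backslashes to find)
print(escaped_form.findall(literal))    # ['ef', 'a0']
print(len(high_chars.findall(decoded))) # 2
print(high_chars.findall(literal))      # [] (everything is plain ASCII)

# Turning the literal form into real characters (hypothetical follow-up step).
converted = escaped_form.sub(lambda m: chr(int(m.group(1), 16)), literal)
print(converted == decoded)             # True
```

So the raw-string regex from the original post is fine for text read from the files; it only fails against `a` because the interpreter had already collapsed each escape in that literal into a single character.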