On 10/26/2011 03:48 PM, Ross Boylan wrote:
I want to replace every \ and " (the two characters for backslash and
double quotes) with a \ and the same character, i.e.,
\ ->  \\
" ->  \"

I have not been able to figure out how to do that.  The documentation
for re.sub says "repl can be a string or a function; if it is a string,
any backslash escapes in it are processed. That is, \n is converted to a
single newline character, \r is converted to a carriage return, and so
forth. Unknown escapes such as \j are left alone."

\\ is apparently unknown, and so is left as is. So I'm unable to get a
single \.

Here are some tries in Python 2.5.2.  The document suggested the result
of a function might not be subject to the same problem, but it seems to
be.
>>> def f(m):
...    return "\\"+m.group(1)
...
>>> re.sub(r"([\\\"])", f, 'Silly " quote')
'Silly \\" quote'
<SNIP>
>>> re.sub(r"([\\\"])", "\\\\\\1", 'Silly " quote')
'Silly \\" quote'

Or perhaps I'm confused about what the displayed results mean.  If a
string has a literal \, does it get shown as \\?

I'd appreciate it if you cc me on the reply.

Thanks.
Ross Boylan

I can't really help on the regex aspect of your code, but I can tell you a little about backslashes, quote literals, the interpreter, and python.

First, I'd skip the interactive interpreter and write your code to a file, then test it by running that file. The reason is that the interpreter is helpfully trying to reconstruct the literal you'd have to type in order to get that result. So while you may have successfully turned a double backslash into a single one, the interpreter helpfully does the inverse when it displays the result, and you can't see whether you're right or not.

Next, always assign to variables, and test those variables on a separate line with the regex. This is probably what your document meant when it mentioned the result of a function.

Now some details about python.

When Python compiles/interprets a quote literal, the syntax parsing has to decide where the literal stops, so quotes are treated specially. Sometimes you can sidestep the problem of embedding quotes inside literals by using single quotes on the outside and double inside, or vice versa, as you did in the 'Silly " quote' example.

But the more general way to put funny characters into a quote literal is to escape each with a backslash. So there are a bunch of two-character escapes. Backslash-quote is how you can put either kind of quote into a literal, regardless of what's being used to delimit it. Backslash-n gets you a newline, which would likewise be bad syntax if you typed it directly into the literal. Backslash-t and some others are usually less troublesome, but can be surprising. And backslash-backslash represents a single backslash. There are also backslash codes to represent arbitrary characters you might not have on your keyboard, and these may use multiple characters after the backslash.

So write a bunch of lines like
     a = 'this isn\'t a surprise'
     print a

and experiment. Notice that if you use \n in such a string, the print will put it on two lines. Likewise a \t comes out as an actual tab.
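Here is a minimal sketch of that kind of experiment. The variable names are only for illustration, and the parenthesized print is used so the lines run unchanged on Python 2 and Python 3.

```python
# Each escape below becomes ONE character in the resulting string object.
a = 'this isn\'t a surprise'   # backslash-quote inside a single-quoted literal
b = "this isn't a surprise"    # same string, sidestepping the escape
c = 'one\\two'                 # backslash-backslash: a single backslash
d = 'tab\there'                # backslash-t: a tab character
e = 'line1\nline2'             # backslash-n: a newline

print(a)
print(c)
print(e)          # comes out on two lines

print(a == b)     # True: two spellings, one string
print(len(c))     # 7: 'one' + one backslash + 'two'
print(len(d))     # 8: 'tab' + one tab + 'here'
```

Checking len() is a quick way to convince yourself how many characters a literal really produced.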

Now for a digression. The interpreter uses repr() to display strings. You can experiment with that by doing
     print a
     print repr(a)

Notice the latter puts quotes around the string. They are NOT part of the string object in a. And it re-escapes any embedded funny characters, sometimes differently than the way you entered them.
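A small sketch of the difference, assuming a made-up sample string. The lengths show that the extra quotes and the doubled backslash exist only in the repr() text, not in the string itself.

```python
a = 'one\\two "quoted"'

print(a)             # the actual characters: one\two "quoted"
print(repr(a))       # a literal you could type: 'one\\two "quoted"'

# repr() adds two surrounding quotes and doubles the backslash.
print(len(a))        # 16
print(len(repr(a)))  # 19
```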

Now, once you're confident that you can write a literal to express any possible string, try calling your regex.
    print re.sub(a, b, c)

or whatever.

Now, one way to cheat on the string if you know you'll want to put actual backslashes is to use the raw string. That works quite well unless you want the string to end with a backslash. There isn't a way to enter that as a single raw literal. You'd have to do something like
     a = r"strange\literal\with\some\stuff" + "\\"

My understanding is that no valid regex ends with a backslash, so this may not affect you.
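To see the raw-string behaviour for yourself, something like the following sketch should do. The compile() call is just a way to try the bad literal without breaking the whole file; the names are illustrative.

```python
# The raw literal plus a one-backslash tail, and the fully escaped spelling.
a = r"strange\literal\with\some\stuff" + "\\"
b = "strange\\literal\\with\\some\\stuff\\"

print(a)
print(a == b)             # True
print(a.endswith("\\"))   # True: ends in one backslash

# In a raw literal, backslash-n stays two characters.
print(len(r"\n"))         # 2
print(len("\n"))          # 1

# The source text r"abc\" is rejected by the parser.
try:
    compile('r"abc\\"', "<example>", "eval")
    raw_ok = True
except SyntaxError:
    raw_ok = False
print(raw_ok)             # False
```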

Now there are other ways to acquire a string object. If you got it from a raw_input() call, it doesn't need to be escaped, but it can't have an embedded newline, since the enter key is how the input is completed. If you read it from a file, it doesn't need to be escaped.
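As a rough illustration of the file case (the temporary file is only scaffolding for the example): a backslash followed by n in a file stays two characters when read back, because escape processing only applies to literals in source code.

```python
import os
import tempfile

raw_text = r"a\nb"   # four characters: a, backslash, n, b

# Write the text to a temporary file and read it back.
fd, name = tempfile.mkstemp()
os.close(fd)
try:
    with open(name, "w") as f:
        f.write(raw_text)
    with open(name) as f:
        from_file = f.read()
finally:
    os.remove(name)

print(from_file)            # a\nb
print(len(from_file))       # 4
print("\n" in from_file)    # False: no newline was created
```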

Now you're ready to see what additional escaping the regex module requires. You will be escaping stuff for its purposes as well, and sometimes that means your literal might have 4 or even more backslashes in a row. But hopefully now you'll see how to separate the different problems.
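Putting the pieces together, here is one possible sketch for the original question. It is not something the reply above worked out; the sample text and variable names are assumptions. Raw strings keep the Python-level escaping out of the way, so only the regex-level escaping remains. It also suggests the original attempts were already producing a single backslash, and only the interpreter's display made it look doubled.

```python
import re

text = 'Silly " quote with a \\ too'   # contains one quote and one backslash

pattern = r'([\\"])'   # regex: a backslash or a double quote, captured
repl = r'\\\1'         # template: a literal backslash, then group 1

result = re.sub(pattern, repl, text)
print(text)
print(result)
print(len(result) - len(text))   # 2: one backslash added per match

# A function's return value is used as-is, with no template processing.
def add_backslash(m):
    return "\\" + m.group(1)

print(re.sub(pattern, add_backslash, text) == result)   # True

# The attempt from the question, checked with print and len.
attempt = re.sub(r"([\\\"])", "\\\\\\1", 'Silly " quote')
print(attempt)        # Silly \" quote
print(len(attempt))   # 14: the 13 original characters plus one backslash
```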
--

DaveA

