[issue12014] str.format parses replacement field incorrectly

Ben Wolfson Fri, 06 May 2011 13:48:19 -0700

Ben Wolfson <wolf...@gmail.com> added the comment:

Here's my use case.


I'm writing a python version of the ruby library HighLine for CLI interaction, 
to be called, uncreatively, PyLine. One of the moderately neat things about the 
library is that it allows for color information to be embedded in the strings 
one passes to its methods, so, if h is a HighLine object, you could say:

h.say "<%= color('this will be red', :red) %> but this won't be"

So I wanted to be able to provide some kind of similar facility and realized 
that the __getitem__ method supported by format(), along with some 
__getattribute__ trickery, would work: so if p is a PyLine object, you could 
say:

p.say("{colors.red.bold.on_black[this will be bold with red text on a black 
background]} but this will be just be regular text")

Thus:

>>> effectize_string("{colors.red.bold.on_black[this will be bold with red text 
>>> on a black background]} but this will just be regular text")
'\x1b[31m\x1b[1m\x1b[40mthis will be bold with red text on a black 
background\x1b[0m but this will just be regular text\x1b[0m'

Obviously, I'll already have to watch out for stray "]"s in the string passed 
to the object's __getitem__, so you might think, well, it's not much more work 
to also have to watch out for stray ":", "!", "}", and "{" (but, oddly I won't 
need to watch out for *match* "{" and "}"!).

But it's obvious that something here should change. For one thing, as it 
stands, the documentation is wrong; it is not the case that an index_string can 
contain any character except ']'. But the documentation describes the way 
things rationally ought to be; there's a good reason not to allow a ']' in the 
index_string (and one can see why simplicity suggests not allowing for 
*escapes*, though I think that ideally there would be an escaping mechanism). 
But there's no reason not to allow stray "{", "}", ":", and "!" in the 
index_string. The only reason it's true at this point that "it doesn't know 
they're in an index field when it's doing the parsing for ':' or '!'" is that 
(assuming one takes the grammar in the documentation to be accurate) the parser 
is written incorrectly.

It contains, for instance, incorrect comments (in string_format.h:parse_field):
<code>
    /* Search for the field name.  it's terminated by the end of
       the string, or a ':' or '!' */
    field_name->ptr = str->ptr;
    while (str->ptr < str->end) {
        switch (c = *(str->ptr++)) {
        case ':':
        case '!':
            break;
        default:
            continue;
        }
        break;
    }
</code>

(hopefully &lt;code&gt; does the right thing here...)

That's the culprit for the mishandling of ":" and "!", but it is simply not the 
case---again, according to the grammar given in the documentation---that the 
field name can be delimited this way, in two ways.*

And, given that no nested expansion is done in the field_name part of the 
replacement, there's no real reason to retain the present parsing strategy; 
none of !, :, {, or } has any semantic significance in this part of of the 
replacement string, so why should the parsing code treat them specially? 
Surely, even if you think my use case is not so great, there's value in doing 
it right.

The ":" and "!" problem is not super hard to get around. Witness the following 
dirty hack:

<code>
void 
advance_beyond_field(SubString *str)
{
    if (str->ptr > str->end) return;
    switch (*++str->ptr) {
    case '[':
        while(str->ptr < str->end && *(str->ptr) != ']') 
            str->ptr++;
        advance_beyond_field(str);
        break;
    case '.':
        while(str->ptr < str->end)
            switch(*++str->ptr) {
            case ':':
            case '!':
                str->ptr--;
                return;
            case '[':
                advance_beyond_field(str);
                str->ptr--;
                break;
            default:
                continue;
            }
        break;
    default:
        return;
    }       
}
</code>
Followed by replacing the switch statement as above thus:
<code>
        switch (c = *(str->ptr++)) {
        case '.':
        case '[':
            str->ptr -= 2;
            advance_beyond_field(str);
            continue;
        case ':':
        case '!':
            break;
        default:
            continue;
        }
</code>
Of course, there is already in the FieldNameIterator plumbing a more certain 
mechanism for actually getting the fields out.

Then one can do this:

>>> "{0[:]}".format({":":4})
'4'
>>> "{0[{ : ! }]}".format({"{ : ! }":4})
'4'

(One can also pass such formatting-exercising test suites as test_nntplib, 
test_string, and test_collections.)

Though still not this:

>>> "{0[{]}".format({"{":4})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: unmatched '{' in format

Even though the stray "{" in the square brackets has no semantic significance, 
it still gets picked up; the culprit is apparently in MarkupIterator_next, 
whose initial bracket-detecting while loop is not square-bracket aware. One 
couldn't just add a "case '[':" to its switch statement because '[' could be a 
fill character, but since a fill character can only occur after a ':', and a 
field_name *can't* occur after a ':' or a '!' a flag for whether a '[' is 
significant could presumably get around that. Something like (not even remotely 
tested):
<code>
        case '[':
            if (bracket_significant)
                while(self->str.ptr < self->str.end && *self->str.ptr != ']') {
                    self->str.ptr++;
                }
            continue;
        case ':':
        case '!':
            bracket_significant = 0;
</code>
bracket_significant having been initialized to 1.

If something like the above works, then it seems to me that it would take a 
very small benefit to outweigh the effort necessary to do this right.

* the second way I don't really care about: the grammar identifies an 
"attribute_name" as an identifier, but in fact any string will work and will be 
passed to getattr: 
>>> "{0.4}".format(2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'int' object has no attribute '4'
>>> "{0.-}".format(2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'int' object has no attribute '-'
Neither "4" nor "-" is a valid Python identifier, so the actual parsing code 
disagrees with the documented grammar here as well. Here I think it would be 
better just to change the documented grammar.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12014>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue12014] str.format parses replacement field incorrectly

Reply via email to