[ python-Bugs-676346 ] String formatting operation Unicode problem.

SourceForge.net Mon, 10 Jan 2005 19:55:04 -0800

Bugs item #676346, was opened at 2003-01-28 17:59
Message generated for change (Comment added) made by facundobatista
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=676346&group_id=5470


Category: Unicode
Group: Python 2.2.2
Status: Open
Resolution: None
Priority: 3
Submitted By: David M. Grimes (dmgrime)
Assigned to: M.-A. Lemburg (lemburg)
Summary: String formatting operation Unicode problem.

Initial Comment:
When performing a string formatting operation using %s
and a unicode argument, the argument evaluation is
performed more than once.  In certain environments (see
example) this leads to excessive calls.

It seems Python-2.2.2:Objects/stringobject.c:3394 is
where PyObject_GetItem is used (for dictionary-like
formatting args).  Later, at :3509, there is a "goto
unicode" when a string argument is actually unicode.
At this point, everything resets and we do it all over
again in PyUnicode_Format.

There is an underlying assumption that the cost of the
call to PyObject_GetItem is very low (since we're going
to do them all again for unicode).  We've got a
Python-based templating system which uses a very simple
Mix-In class to facilitate flexible page generation. 
At the core is a simple __getitem__ implementation
which maps calls to getattr():

class mixin:
    def __getitem__(self, name):
        print '%r::__getitem__(%s)' % (self, name)
        hook = getattr(self, name)
        if callable(hook):
            return hook()
        else:
            return hook

Obviously, the print is diagnostic.  So, this basic
mechanism allows one to write hierarchical templates
filling in content found in "%(xxxx)s" escapes with
functions returning strings.  It has worked extremely
well for us.

BUT, we recently did some XML-based work which
uncovered this strange unicode behaviour.  Given the
following classes:

class w1u(mixin):
    v1 = u'v1'

class w2u(mixin):
    def v2(self):
        return '%(v1)s' % w1u()

class w3u(mixin):
    def v3(self):
        return '%(v2)s' % w2u()

class w1(mixin):
    v1 = 'v1'

class w2(mixin):
    def v2(self):
        return '%(v1)s' % w1()

class w3(mixin):
    def v3(self):
        return '%(v2)s' % w2()

And test case:

print 'All string:'
print '%(v3)s' % w3()
print

print 'Unicode injected at w1u:'
print '%(v3)s' % w3u()
print


As we can see, the only difference between the w{1,2,3}
and w{1,2,3}u classes is that w1u defines v1 as unicode
where w1 uses a "normal" string.

What we see is the string-based one shows 3 calls, as
expected:

All string:
<__main__.w3 instance at 0x8150524>::__getitem__(v3)
<__main__.w2 instance at 0x814effc>::__getitem__(v2)
<__main__.w1 instance at 0x814f024>::__getitem__(v1)
v1

But the unicode causes a tree-like recursion:

Unicode injected at w1u:
<__main__.w3u instance at 0x8150524>::__getitem__(v3)
<__main__.w2u instance at 0x814effc>::__getitem__(v2)
<__main__.w1u instance at 0x814f024>::__getitem__(v1)
<__main__.w1u instance at 0x814f024>::__getitem__(v1)
<__main__.w2u instance at 0x814effc>::__getitem__(v2)
<__main__.w1u instance at 0x814f024>::__getitem__(v1)
<__main__.w1u instance at 0x814f024>::__getitem__(v1)
<__main__.w3u instance at 0x8150524>::__getitem__(v3)
<__main__.w2u instance at 0x814effc>::__getitem__(v2)
<__main__.w1u instance at 0x814f024>::__getitem__(v1)
<__main__.w1u instance at 0x814f024>::__getitem__(v1)
<__main__.w2u instance at 0x814effc>::__getitem__(v2)
<__main__.w1u instance at 0x814f024>::__getitem__(v1)
<__main__.w1u instance at 0x814f024>::__getitem__(v1)
v1
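
The trace above is consistent with every formatting level looking its
argument up twice once a unicode value is involved. The following is a
minimal sketch of that reading, written as a plain-Python model and not
as the actual C code. The names UnicodeText, lookup and format_level are
hypothetical, and the restart-on-unicode rule is an assumption taken
from the description of the "goto unicode" path.

```python
# Simplified model of the fallback described above; not the real C code.

class UnicodeText(str):
    """Hypothetical stand-in for a Python 2 `unicode` value."""


calls = []  # records which level was looked up, in order


def lookup(level):
    """Model of mixin.__getitem__: level 1 is v1, level 2 is v2, and so on."""
    calls.append(level)
    if level == 1:
        return UnicodeText('v1')      # w1u.v1 is a unicode attribute
    return format_level(level - 1)    # v2 and v3 format the level below


def format_level(level):
    """Model of '%(vN)s' % obj with a byte-string format string."""
    value = lookup(level)             # first lookup, in the str formatter
    if isinstance(value, UnicodeText):
        # Assumed "goto unicode" behaviour: start over in the Unicode
        # formatter, which performs the lookup a second time.
        value = lookup(level)
        return UnicodeText(value)
    return value


result = format_level(3)
print(calls)
```

Under this model a chain of n nested templates costs
2 + 4 + ... + 2**n = 2**(n + 1) - 2 lookups, which gives 14 for the
three levels shown here and matches the order of the calls in the trace.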

I'm sure this isn't a "common" use of the string
formatting mechanism, but it seems that evaluating the
arguments multiple times could be a bad thing.  It
certainly is for us 8^)

We're running this on a RedHat 7.3/8.0 setup, not that
it appears to matter (from looking in stringobject.c).
Also appears to still be a problem in 2.3a1.

Any comments?  Help?  Questions?


----------------------------------------------------------------------

Comment By: Facundo Batista (facundobatista)
Date: 2005-01-11 00:54

Message:
Logged In: YES 
user_id=752496

Please, could you verify if this problem persists in Python 2.3.4
or 2.4?

If yes, in which version? Can you provide a test case?

If the problem is solved, from which version?

Note that if you fail to answer in one month, I'll close this bug
as "Won't fix".

Thank you! 

.    Facundo

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2003-01-28 19:23

Message:
Logged In: YES 
user_id=38388

I don't see how you can avoid fetching the Unicode
argument a second time without restructuring the
formatting code altogether.

If you know that your arguments can be Unicode, you
should start with a Unicode formatting string to begin
with. That's faster and doesn't involve a fallback
solution.
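
A minimal sketch of that suggestion, using capitalised variants of the
reporter's classes (Mixin, W1u, W2u, W3u and the calls list are
hypothetical names for illustration). Every template is a Unicode
format string, so on Python 2 the Unicode formatter should be used from
the start and no restart is needed. On Python 3 the u prefix is accepted
but has no effect, because every str is already Unicode.

```python
calls = []  # records every __getitem__ call as (class name, key)


class Mixin:
    def __getitem__(self, name):
        calls.append((type(self).__name__, name))
        hook = getattr(self, name)
        # Call methods, return plain attributes unchanged.
        return hook() if callable(hook) else hook


class W1u(Mixin):
    v1 = u'v1'


class W2u(Mixin):
    def v2(self):
        return u'%(v1)s' % W1u()   # Unicode format string


class W3u(Mixin):
    def v3(self):
        return u'%(v2)s' % W2u()   # Unicode format string


result = u'%(v3)s' % W3u()
print(result)
print(len(calls))
```

With Unicode format strings at every level, each key is expected to be
fetched once, so the three-level chain needs three lookups.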

If you still want to see this fixed, I'd suggest submitting
a patch.


----------------------------------------------------------------------
