On Sun, Nov 4, 2012 at 9:18 AM, Steven D'Aprano
<steve+comp.lang.pyt...@pearwood.info> wrote:
> On Sat, 03 Nov 2012 22:49:07 +0100, Hans Mulder wrote:
>
> Actually, for many applications, the space "savings" may actually be
> *costs*, since interning forces Python to hold onto strings even after
> they would normally be garbage collected. CPython interns strings that
> look like identifiers. It really wouldn't be a good idea for it to
> automatically intern every string.

I don't know about that.

    /* This dictionary holds all interned unicode strings.  Note that
       references to strings in this dictionary are *not* counted in the
       string's ob_refcnt.  When the interned string reaches a refcnt of 0
       the string deallocation function will delete the reference from
       this dictionary.

       Another way to look at this is that to say that the actual
       reference count of a string is:
       s->ob_refcnt + (s->state ? 2 : 0)
    */
    static PyObject *interned;

Empirical testing (on a Python 3.3a0 build on Linux that I had lying
around) showed the process's memory usage drop, but I closed the
terminal before copying and pasting (oops). Attempting to recreate in
IDLE on 3.2 on Windows.

    >>> a="$"*1024*1024*256 # Make $$$....$$$ fast!
    >>> import sys
    >>> sys.getsizeof(a) # Clearly this is a narrow build
    536870942
    >>> a="$"*1024*1024*256
    --> MemoryError.

Blah. This is what I get for only having a gig and a half in this
laptop. And I was working with 1024*1024*1024 on the other box. Start
over...

    >>> import sys
    >>> a="$"*1024*1024*128
    >>> b="$"*1024*1024*128
    >>> a is b
    False
    >>> a=sys.intern(a)
    >>> b=sys.intern(b)
    >>> c="$"*1024*1024*128
    >>> c=sys.intern(c)

Memory usage (according to Task Mangler) goes up to ~512MB when I
create a new string (like c), then back down to ~256MB when I intern
it. So far so good.

    >>> del a,b,c

Memory usage has dropped to 12MB. Unnecessarily-interned strings don't
cost anything. (The source does refer to immortal interned strings, but
AFAIK you can't create them in user-level code. At least, I didn't find
it in help(sys.intern), which is the obvious place to look.)

> You can make your own intern system with a simple dict:
>
> interned_strings = {}
>
> Then, for every string you care about, do:
>
> s = interned_strings.setdefault(s, s)
>
> to ensure you are always working with a single string object for each
> unique value. In some applications that will save time at the expense of
> space.
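
A minimal sketch of the dict-based scheme quoted above, assuming Python 3 and using dict.setdefault. The helper name intern_manually is hypothetical and not from the thread. It also shows where the retention comes from: the dict keeps a strong reference to every string it has seen, even after all other references are gone.

```python
# Hand-rolled interning table, as in the quoted suggestion.
interned_strings = {}

def intern_manually(s):
    # Hypothetical helper: return the one canonical object for s's value.
    # setdefault stores s if the value is new, else returns the stored one.
    return interned_strings.setdefault(s, s)

# Build two equal strings at run time so they start as distinct objects.
a = intern_manually("".join(["spam", " and ", "eggs"]))
b = intern_manually("".join(["spam", " and ", "eggs"]))
print(a is b)                  # both names now refer to one shared object

# Dropping our own references does not shrink the table: the dict still
# holds a strong reference, so the string stays alive.
del a, b
print(len(interned_strings))   # the entry is still in the table
```

Unlike sys.intern, nothing removes an entry from this table automatically, so a long-running program would have to prune it itself.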

Doing it manually like this _will_ leak like that, though, unless you
periodically check sys.getrefcount and dispose of unreferenced entries.

> And there is no need to write "is" instead of "==", because string
> equality already optimizes the "strings are identical" case. By using ==,
> you don't get into bad habits, you defend against the odd un-interned
> string sneaking in, and you still have high speed equality tests.

This one I haven't checked the source for, but ISTR discussions on this
list about comparison of two unequal interned strings not being
optimized, so they'll end up being compared char-for-char. Using 'is'
guarantees that the check stops with identity. This may or may not be
significant, and as you say, defending against an uninterned string
slipping through is potentially critical.

ChrisA
--
http://mail.python.org/mailman/listinfo/python-list
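
To illustrate the "is" versus "==" trade-off discussed above, here is a small sketch, assuming CPython 3 (object identity of strings is an implementation detail, so other interpreters may differ). The variable names are made up for the example.

```python
import sys

# Build equal strings at run time so CPython gives us distinct objects.
x = "".join(["py", "thon", "-", "list"])
y = "".join(["py", "thon", "-", "list"])
print(x == y, x is y)      # equal values; identity is not guaranteed

# After interning, both names refer to the single canonical object.
xi = sys.intern(x)
yi = sys.intern(y)
print(xi is yi)            # identity check now succeeds

# A string that never went through sys.intern still compares equal
# with ==, but an "is" test against the interned copy fails.
stray = "".join(["py", "thon", "-", "list"])
print(stray == xi, stray is xi)
```

The last line is the case the quoted advice guards against: an identity test silently reports a mismatch if one un-interned string slips in, while == still gives the right answer.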