Ignoring XML Namespaces with cElementTree

2010-04-27 Thread dmtr
Is there any way to configure cElementTree to ignore the XML root
namespace?  Default cElementTree (Python 2.6.4) appears to add the XML
root namespace URI to _every_ single tag.  I know that I can strip
URIs manually, from every tag, but it is a rather idiotic thing to do
(performance wise).
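For reference, the manual stripping I mean is roughly this (just a sketch,
with a placeholder URL):

from xml.etree.cElementTree import iterparse
from cStringIO import StringIO

xml = """<root xmlns="http://www.very_long_url.com"><child/></root>"""
for event, elem in iterparse(StringIO(xml)):
    # drop the '{uri}' prefix that cElementTree prepends to every tag
    if elem.tag.startswith('{'):
        elem.tag = elem.tag.split('}', 1)[1]
    print event, elem.tag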
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Ignoring XML Namespaces with cElementTree

2010-04-29 Thread dmtr
I'm referring to xmlns/URI prefixes. Here's a code example:
 from xml.etree.cElementTree import iterparse
 from cStringIO import StringIO
 xml = """http://www.very_long_url.com";>"""
 for event, elem in iterparse(StringIO(xml)): print event, elem

The output is:
 end <Element '{http://www.very_long_url.com}child' at 0xb7ddfa58>
 end <Element '{http://www.very_long_url.com}root' at 0xb7ddfa40>


I don't want these "{http://www.very_long_url.com}" in front of my
tags.

They create a performance disaster on large files (first cElementTree
adds them, then I have to remove them in Python). Is there any way to
tell cElementTree not to mess with my tags? I need that in the
standard python distribution, not my custom cElementTree build...
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Ignoring XML Namespaces with cElementTree

2010-04-30 Thread dmtr
> I think that's your main mistake: don't remove them. Instead, use the fully
> qualified names when comparing.
>
> Stefan

Yes. That's what I'm forced to do: pre-calculating tags like tagChild
= "{%s}child" % uri and using them instead of "child". As a result, the
code looks ugly and there is extra overhead in concatenating/comparing
these repeating, redundant prefixes. I don't understand why
cElementTree forces users to do that. So far I couldn't find any way
around that without rebuilding cElementTree from source.
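Roughly what this workaround looks like on the toy document from my
earlier message (just a sketch):

from xml.etree.cElementTree import iterparse
from cStringIO import StringIO

uri = "http://www.very_long_url.com"
tagChild = "{%s}child" % uri   # pre-computed fully qualified name

xml = """<root xmlns="%s"><child/></root>""" % uri
for event, elem in iterparse(StringIO(xml)):
    if elem.tag == tagChild:   # compare against the qualified name, not plain 'child'
        print 'found child'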

Apparently somebody hard-coded the namespace_separator parameter in
cElementTree.c (what a dumb thing to do!!! It should have been an
argument to cElementTree.XMLParser()):
===
self->parser = EXPAT(ParserCreate_MM)(encoding, &memory_handler, "}");
===

Simply replacing "}" with NULL gives me the desired tags without the
stinking URIs.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Ignoring XML Namespaces with cElementTree

2010-04-30 Thread dmtr
Here's a link to the patch exposing this parameter: 
http://bugs.python.org/issue8583
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Ignoring XML Namespaces with cElementTree

2010-05-01 Thread dmtr
> Unless you have multiple namespaces or are working with defined schema
> or something, it's useless boilerplate.
>
> It'd be a nice feature if ElementTree could let users optionally
> ignore a namespace, unfortunately it doesn't have it.


Yep. Exactly my point. Here's a link to the patch addressing this:
http://bugs.python.org/issue8583
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Parser

2010-05-02 Thread dmtr
On May 2, 12:54 pm, Andreas Löscher  wrote:
> Hi,
> I am looking for an easy-to-use parser. I want to get an overview
> of parsing and want to try to get some information out of a C header
> file. Which parser would you recommend?

ANTLR
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Parser

2010-05-02 Thread dmtr
>
> > ANTLR
>
> I don't know if it's that easy to get started with though. The
> companion for-pay book is *most excellent*, but it seems to have been
> written to the detriment of the normal online docs.
>
> Cheers,
> Chris
> --http://blog.rebertia.com


IMO ANTLR is much easier to use than any other tool, simply
because it has an excellent GUI (the quality of which is amazing).
-- 
http://mail.python.org/mailman/listinfo/python-list


A python interface to google-sparsehash?

2010-05-04 Thread dmtr
Does anybody know if a Python sparsehash module is out there in the wild?
-- 
http://mail.python.org/mailman/listinfo/python-list


An empty object with dynamic attributes (expando)

2010-06-03 Thread dmtr
How can I create an empty object with dynamic attributes? It should be
something like:

>>> m = object()
>>> m.myattr = 1

But this doesn't work. And I have to resort to:

>>> class expando(object): pass
>>> m = expando()
>>> m.myattr = 1

Is there a one-liner that would do the thing?

-- Cheers, Dmitry
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: getting MemoryError with dicts; suspect memory fragmentation

2010-06-03 Thread dmtr
On Jun 3, 3:43 pm, "Emin.shopper Martinian.shopper" wrote:
> Dear Experts,
>
> I am getting a MemoryError when creating a dict in a long running
> process and suspect this is due to memory fragmentation. Any
> suggestions would be welcome. Full details of the problem are below.
>
> I have a long running processing which eventually dies to a
> MemoryError exception. When it dies, it is using roughly 900 MB on a 4
> GB Windows XP machine running Python 2.5.4. If I do "import pdb;

Are you sure you have enough memory available?
Dict memory usage can jump 2x during re-balancing (resizing).
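A quick way to watch the table grow (a sketch; sys.getsizeof() reports the
container after each resize, while during the resize itself both the old and
the new table are briefly alive, hence the temporary spike):

import sys

d = {}
last = sys.getsizeof(d)
for i in xrange(100000):
    d[i] = None
    size = sys.getsizeof(d)
    if size != last:
        # each jump is a resize; peak memory during the copy is roughly old + new
        print "%d keys: %d -> %d bytes" % (len(d), last, size)
        last = size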

-- Dmitry

P.S. Wish there was a google-sparsehash port for python
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: getting MemoryError with dicts; suspect memory fragmentation

2010-06-03 Thread dmtr
> I have a long running processing which eventually dies to a
> MemoryError exception. When it dies, it is using roughly 900 MB on a 4
> GB Windows XP machine running Python 2.5.4. If I do "import pdb;

BTW have you tried the same code with the Python 2.6.5?

-- Dmitry
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: getting MemoryError with dicts; suspect memory fragmentation

2010-06-03 Thread dmtr
I'm still unconvinced that it is a memory fragmentation problem; that's
very rare.
Can you give a more concrete example that one can actually try to
execute? Like:

python -c "list([list([0]*xxx)+list([1]*xxx)+list([2]*xxx)
+list([3]*xxx) for xxx in range(10)])" &

-- Dmitry
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: An empty object with dynamic attributes (expando)

2010-06-04 Thread dmtr
> Why does it have to be a one-liner? Is the Enter key on your keyboard
> broken?

Nah. I was simply looking for something natural and intuitive, like:
m = object(); m.a = 1
Usually Python is pretty good at providing these natural and intuitive
solutions.


> You have a perfectly good solution: define a class, then instantiate it.
> But if you need a one-liner (perhaps to win a game of code golf), then
> this will work:
>
> >>> m = type('', (), {})()
> >>> m.attribute = 2

Heh. Creating it dynamically. Ace. ;)

-- Cheers, Dmitry
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: An empty object with dynamic attributes (expando)

2010-06-05 Thread dmtr
Right.

>>> m = lambda:expando
>>> m.myattr = 1
>>> print m.myattr
1

-- Cheers, Dmitry
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: An empty object with dynamic attributes (expando)

2010-06-10 Thread dmtr
On Jun 9, 7:31 pm, a...@pythoncraft.com (Aahz) wrote:
> dmtr   wrote:
>
> >>>> m = lambda:expando
> >>>> m.myattr = 1
> >>>> print m.myattr
> >1
>
> That's a *great* technique if your goal is to confuse people.
> --

Yeah. But it is kinda cute. Let's hope it won't get adapted
(adopted ;).

-- Dmitry
-- 
http://mail.python.org/mailman/listinfo/python-list


How to print SRE_Pattern (regexp object) text for debugging purposes?

2010-06-17 Thread dmtr
I need to print the regexp pattern text (SRE_Pattern object) for
debugging purposes. Is there any way to do it gracefully? I've come up
with the following hack, but it is rather crude... Is there an
official way to get the regexp pattern text?

>>> import re, pickle
>>> r = re.compile('^abc$', re.I)
>>> r
<_sre.SRE_Pattern object at 0xb7e6a330>

>>> ds = pickle.dumps(r)
>>> ds
"cre\n_compile\np0\n(S'^abc$'\np1\nI2\ntp2\nRp3\n."

>>> re.search("\n\(S'(.*)'\n", ds).group(1)
'^abc$'
>>>

-- Cheers, Dmitry
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to print SRE_Pattern (regexp object) text for debugging purposes?

2010-06-17 Thread dmtr
On Jun 17, 3:35 pm, MRAB  wrote:
>
>  >>> import re
>  >>> r = re.compile('^abc$', re.I)
>  >>> r.pattern
> '^abc$'
>  >>> r.flags
> 2


Hey, thanks. It works.

Couldn't find it in a reference somehow.
And it's not in inspect.getmembers(r).
Must be doing something wrong.

-- Cheers, Dmitry
-- 
http://mail.python.org/mailman/listinfo/python-list


Is there any way to minimize str()/unicode() objects memory usage [Python 2.6.4] ?

2010-08-06 Thread dmtr
I'm running into some performance / memory bottlenecks on large lists.
Is there any easy way to minimize/optimize memory usage?

Simple str() and unicode() objects [Python 2.6.4/Linux/x86]:
>>> sys.getsizeof('')  24 bytes
>>> sys.getsizeof('0')25 bytes
>>> sys.getsizeof(u'')28 bytes
>>> sys.getsizeof(u'0')  32 bytes

Lists of str() and unicode() objects (see ref. code below):
>>> [str(i) for i in xrange(0, 1000)]   370 Mb (37 bytes/item)
>>> [unicode(i) for i in xrange(0, 1000)]   613 Mb (63 bytes/item)

Well...  63 bytes per item for very short unicode strings... Is there
any way to do better than that? Perhaps some compact unicode objects?

-- Regards, Dmitry


import os, time, re
start = time.time()
l = [unicode(i) for i in xrange(0, 1000)]
dt = time.time() - start
vm = re.findall("(VmPeak.*|VmSize.*)", open('/proc/%d/status' %
os.getpid()).read())
print "%d keys, %s, %f seconds, %f keys per second" % (len(l), vm, dt,
len(l) / dt)
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Is there any way to minimize str()/unicode() objects memory usage [Python 2.6.4] ?

2010-08-06 Thread dmtr
Steven, thank you for answering. See my comments inline. Perhaps I
should have formulated my question a bit differently: Are there any
*compact* high performance containers for unicode()/str() objects in
Python? By *compact* I don't mean compression. Just optimized for
memory usage, rather than performance.

What I'm really looking for is a dict() that maps short unicode
strings into tuples of integers. But just having a *compact* list
container for unicode strings would help a lot (because I could add a
__dict__ and go from there).


> Yes, lots of ways. For example, do you *need* large lists? Often a better
> design is to use generators and iterators to lazily generate data when
> you need it, rather than creating a large list all at once.

Yes. I do need to be able to process large data sets.
No, there is no way I can use an iterator or lazily generate data when
I need it.


> An optimization that sometimes may help is to intern strings, so that
> there's only a single copy of common strings rather than multiple copies
> of the same one.

Unfortunately the strings are unique (think usernames on Facebook or
Wikipedia). And I can't afford to store them in db/memcached/redis/
etc... Too slow.


> Can you compress the data and use that? Without knowing what you are
> trying to do, and why, it's really difficult to advise a better way to do
> it (other than vague suggestions like "use generators instead of lists").

Yes, I've tried. But I was unable to find a good, unobtrusive way to
do that. Every attempt either adds some unnecessary pesky code, or is
slow, or something like that. See more at: http://bugs.python.org/issue9520


> Very often, it is cheaper and faster to just put more memory in the
> machine than to try optimizing memory use. Memory is cheap, your time and
> effort is not.

Well... I'd really prefer to use, say, 16 bytes for 10-char strings
and fit the data into 8 GB,
rather than paying an extra $1k for 32 GB.

> > Well...  63 bytes per item for very short unicode strings... Is there
> > any way to do better than that? Perhaps some compact unicode objects?
>
> If you think that unicode objects are going to be *smaller* than byte
> strings, I think you're badly informed about the nature of unicode.

I don't think that unicode objects are going to be *smaller*!
But AFAIK internally CPython uses UTF-8? No? And 63 bytes per item
seems a bit excessive.
My question was: is there any way to do better than that?


> Python is not a low-level language, and it trades off memory compactness
> for ease of use. Python strings are high-level rich objects, not merely a
> contiguous series of bytes. If all else fails, you might have to use
> something like the array module, or even implement your own data type in
> C.

Are there any *compact* high performance containers (with dict, list
interface) in Python?

-- Regards, Dmitry
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Is there any way to minimize str()/unicode() objects memory usage [Python 2.6.4] ?

2010-08-06 Thread dmtr
> > Well...  63 bytes per item for very short unicode strings... Is there
> > any way to do better than that? Perhaps some compact unicode objects?
>
> There is a certain price you pay for having full-feature Python objects.

Are there any *compact* Python objects? Optimized for compactness?

> What are you trying to accomplish anyway? Maybe the array module can be
> of some help. Or numpy?

Ultimately a dict that can store ~20,000,000 entries: (u'short
string' : (int, int, int, int, int, int, int)).

-- Regards, Dmitry
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Is there any way to minimize str()/unicode() objects memory usage [Python 2.6.4] ?

2010-08-06 Thread dmtr
On Aug 6, 10:56 pm, Michael Torrie  wrote:
> On 08/06/2010 07:56 PM, dmtr wrote:
>
> > Ultimately a dict that can store ~20,000,000 entries: (u'short
> > string' : (int, int, int, int, int, int, int)).
>
> I think you really need a real database engine.  With the proper
> indexes, MySQL could be very fast storing and retrieving this
> information for you.  And it will use your RAM to cache as it sees fit.
>  Don't try to reinvent the wheel here.

No, I've tried. DB solutions are not even close in terms of speed.
Processing would take weeks :(  Memcached or REDIS sort of work, but
they are still a bit too slow to be a pleasure to work with. The
standard dict() container is *a lot* faster. It is also hassle-free
(accepting unicode keys/etc). I just wish there were a more compact
dict container, optimized for large datasets and memory, not for
speed. And with the default dict() I'm also running into some kind of
nonlinear performance degradation, apparently after
10,000,000-13,000,000 keys. But I can't recreate this with a solid
test case (see  http://bugs.python.org/issue9520 ) :(

-- Dmitry
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Is there any way to minimize str()/unicode() objects memory usage [Python 2.6.4] ?

2010-08-07 Thread dmtr
On Aug 6, 11:50 pm, Peter Otten <__pete...@web.de> wrote:
> I don't know to what extent it still applys but switching off cyclic garbage
> collection with
>
> import gc
> gc.disable()


Haven't tried it on the real dataset. On the synthetic test it (and
sys.setcheckinterval(10)) gave ~2% speedup and no change in memory
usage. Not significant. I'll try it on the real dataset though.
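On the real dataset I'll probably wrap the build roughly like this (just a
sketch, sizes made up):

import gc

gc.disable()          # no cyclic GC passes while the big dict is being built
try:
    d = dict()
    for i in xrange(0, 1000000):
        d[unicode(i).encode('utf-8')] = (i, i+1, i+2)
finally:
    gc.enable()       # restore normal collection afterwards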


> while building large datastructures used to speed up things significantly.
> That's what I would try first with your real data.
>
> Encoding your unicode strings as UTF-8 could save some memory.

Yes...  In fact that's what I'm trying now... .encode('utf-8')
definitely creates some clutter in the code, but I guess I can
subclass dict... And it does save memory! A lot of it. Seems to be a
bit faster too.
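Something like this is what I have in mind to hide the encode() clutter (a
rough sketch, covering only the methods I actually use):

class UTF8Dict(dict):
    # store unicode keys as UTF-8 byte strings internally
    def __setitem__(self, key, value):
        if isinstance(key, unicode):
            key = key.encode('utf-8')
        dict.__setitem__(self, key, value)
    def __getitem__(self, key):
        if isinstance(key, unicode):
            key = key.encode('utf-8')
        return dict.__getitem__(self, key)

d = UTF8Dict()
d[u'12345'] = (1, 2, 3)
print d[u'12345']     # looked up via the encoded str key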

> When your integers fit into two bytes, say, you can use an array.array()
> instead of the tuple.

Excellent idea. Thanks!  And it seems to work too, at least for the
test code. Here are some benchmarks (x86 desktop):

Unicode key / tuple:
>>> for i in xrange(0, 100): d[unicode(i)] = (i, i+1, i+2, i+3, i+4, i+5, i+6)
100 keys, ['VmPeak:\t  224704 kB', 'VmSize:\t  224704 kB'],
4.079240 seconds, 245143.698209 keys per second

>>> for i in xrange(0, 100): d[unicode(i).encode('utf-8')] = array.array('i', (i, i+1, i+2, i+3, i+4, i+5, i+6))
100 keys, ['VmPeak:\t  201440 kB', 'VmSize:\t  201440 kB'],
4.985136 seconds, 200596.331486 keys per second

>>> for i in xrange(0, 100): d[unicode(i).encode('utf-8')] = (i, i+1, i+2, i+3, i+4, i+5, i+6)
100 keys, ['VmPeak:\t  125652 kB', 'VmSize:\t  125652 kB'],
3.572301 seconds, 279931.625282 keys per second

Almost halved the memory usage. And faster too. Nice.

-- Dmitry
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Is there any way to minimize str()/unicode() objects memory usage [Python 2.6.4] ?

2010-08-07 Thread dmtr
Correction. I've copy-pasted it wrong! array.array('i', (i, i+1, i+2,
i+3, i+4, i+5, i+6)) was the best.

>>> for i in xrange(0, 100): d[unicode(i)] = (i, i+1, i+2, i+3, i+4, i+5, i+6)
100 keys, ['VmPeak:\t  224704 kB', 'VmSize:\t  224704 kB'],
4.079240 seconds, 245143.698209 keys per second

>>> for i in xrange(0, 100): d[unicode(i).encode('utf-8')] = (i, i+1, i+2, i+3, i+4, i+5, i+6)
100 keys, ['VmPeak:\t  201440 kB', 'VmSize:\t  201440 kB'],
4.985136 seconds, 200596.331486 keys per second

>>> for i in xrange(0, 100): d[unicode(i).encode('utf-8')] = array.array('i', (i, i+1, i+2, i+3, i+4, i+5, i+6))
100 keys, ['VmPeak:\t  125652 kB', 'VmSize:\t  125652 kB'],
3.572301 seconds, 279931.625282 keys per second

-- Dmitry

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Is there any way to minimize str()/unicode() objects memory usage [Python 2.6.4] ?

2010-08-07 Thread dmtr
> Looking at your benchmark, random.choice(letters) has probably less overhead
> than letters[random.randint(...)]. You might even try to inline it as

Right... random.choice()...  I'm a bit new to Python, always something
to learn. But anyway, in that benchmark (from http://bugs.python.org/issue9520
) the code that generates the 'words' takes 90% of the time. And I'm really
looking at deltas between different methods, not the absolute values. I
was also using different code to get the benchmarks for my previous
message... Here's the code:


#!/usr/bin/python
# -*- coding: utf-8  -*-
import os, time, re, array

start = time.time()
d = dict()
for i in xrange(0, 100): d[unicode(i).encode('utf-8')] = array.array('i', (i, i+1, i+2, i+3, i+4, i+5, i+6))
dt = time.time() - start
vm = re.findall("(VmPeak.*|VmSize.*)", open('/proc/%d/status' %
os.getpid()).read())
print "%d keys, %s, %f seconds, %f keys per second" % (len(d), vm, dt,
len(d) / dt)
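
Separately, for reference, the two letter-picking variants compared above
would look something like this as a stand-alone micro-benchmark
(hypothetical, not the issue9520 code):

import random, string, timeit

letters = string.ascii_lowercase
t1 = timeit.timeit(lambda: letters[random.randint(0, len(letters) - 1)], number=1000000)
t2 = timeit.timeit(lambda: random.choice(letters), number=1000000)
print "letters[randint]: %.2fs   random.choice: %.2fs" % (t1, t2)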

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Is there any way to minimize str()/unicode() objects memory usage [Python 2.6.4] ?

2010-08-07 Thread dmtr
I guess with the actual dataset I'll be able to improve the memory
usage a bit, with BioPython::trie. That would probably be enough
optimization to continue working with some comfort. On this test code
BioPython::trie gives a bit of improvement in terms of memory. Not
much though...

>>> d = dict()
>>> for i in xrange(0, 100): d[unicode(i).encode('utf-8')] = array.array('i', (i, i+1, i+2, i+3, i+4, i+5, i+6))

100 keys, ['VmPeak:\t  125656 kB', 'VmSize:\t  125656 kB'],
3.525858 seconds, 283618.896034 keys per second


>>> from Bio import trie
>>> d = trie.trie()
>>> for i in xrange(0, 100): d[unicode(i).encode('utf-8')] = array.array('i', (i, i+1, i+2, i+3, i+4, i+5, i+6))

100 keys, ['VmPeak:\t  108932 kB', 'VmSize:\t  108932 kB'],
4.142797 seconds, 241382.814950 keys per second
-- 
http://mail.python.org/mailman/listinfo/python-list