Fredrik Lundh wrote:
> Stefan Behnel wrote:
>>> My take on the API decision in question was always that a file is
>>> inherently an XML *document*, while a string is inherently an XML
>>> *fragment*.
>>
>> Not inherently, no. I know some people who do web processing with an XML
>> document coming in as a string (from an HTTP request) /.../
>
> in which case you probably want to stream the raw XML through the parser
> *as it arrives*, to reduce latency (to do that, either parse from a
> file-like object, or feed data directly to a parser instance, via the
> consumer protocol).
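For reference, the consumer protocol mentioned here can be sketched roughly like this with the stdlib ElementTree parser (lxml exposes the same feed interface); the chunk boundaries below are arbitrary stand-ins for data arriving over HTTP:

```python
# Incremental parsing via the feed/consumer protocol: hand the parser
# each raw chunk as it arrives instead of joining them into one string.
import xml.etree.ElementTree as etree

parser = etree.XMLParser()
for chunk in ["<root><a>te", "st</a></ro", "ot>"]:
    parser.feed(chunk)        # parse each fragment as soon as we have it

root = parser.close()         # finish parsing and get the root element
print(root.tag, root[0].text) # -> root test
```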
It depends on the abstraction the web framework provides. If it allows you
to do that, especially in an event-driven way, that's obviously the most
efficient implementation (and both ElementTree and lxml support this use
pattern just fine). However, some frameworks just pass the request content
(such as a POSTed document) in a dictionary or as callback parameters, in
which case there's little room for optimisation.

> also, putting large documents in a *single* Python string can be quite
> inefficient. it's often more efficient to use lists of string fragments.

That's a pretty general statement. Do you mean in terms of reading from
that string (which, at least in lxml, is a straightforward extraction of a
char*/len pair that is passed on to libxml2), constructing that string
(possibly from partial strings, which temporarily *is* expensive), or just
keeping the string in memory?

At least lxml doesn't benefit from iterating over a list of strings and
passing it to libxml2 step-by-step, compared to reading from a straight
in-memory string. Here are some numbers:

$ cat listtest.py
from lxml import etree

# a list of strings is more memory expensive than a straight string
doc_list = ["<root>"] + ["<a>test</a>"] * 2000 + ["</root>"]

# document construction temporarily ~doubles the memory size
doc = "".join(doc_list)

def readlist():
    tree = etree.fromstringlist(doc_list)

def readdoc():
    tree = etree.fromstring(doc)

$ python -m timeit -s 'from listtest import readlist,readdoc' 'readdoc()'
1000 loops, best of 3: 1.74 msec per loop

$ python -m timeit -s 'from listtest import readlist,readdoc' 'readlist()'
100 loops, best of 3: 2.46 msec per loop

The performance difference stays somewhere around 20-30% even for larger
documents. So, as expected, there's a trade-off here between temporary
memory size, long-term memory size and parser performance.

Stefan
--
http://mail.python.org/mailman/listinfo/python-list