pyt...@bdurham.com, 16.12.2010 21:03:
Is text processing with dicts a good use case for Python
cross-compilers like Cython/Pyrex or ShedSkin? (I've read the
cross compiler claims about massive increases in pure numeric
performance).

Cython is generally a good choice for string processing, simply because it can translate a lot of code, such as character iteration and comparison, into plain C. Depending on what kind of operations you do, you can get speed-ups of 100x or more for that.

http://docs.cython.org/src/tutorial/strings.html
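
To make "character iteration and comparison" concrete, here is a minimal sketch of the kind of loop that tends to benefit. It is ordinary Python, and the function name `count_delimiters` and its default delimiter set are made up for illustration. Cython can compile code like this unchanged; the large speed-ups mentioned above would presumably need static type declarations on top, which are not shown here.

```python
def count_delimiters(text, delimiters=u",;\t"):
    """Count how many characters of `text` appear in `delimiters`.

    Hypothetical helper, only meant to show a per-character loop.
    """
    count = 0
    for ch in text:              # character iteration
        if ch in delimiters:     # character comparison
            count += 1
    return count


print(count_delimiters(u"a,b;c\td"))
```

Whether a loop like this is actually a bottleneck in your parser is something only a benchmark can tell.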

However, when it comes to dict lookups, Cython uses CPython's own dicts, which are heavily optimised for string lookups already. So the speedup in that area will likely stay below 30%. Similarly, encoding and decoding use Python's codecs, so don't expect a major difference there.
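
As an illustration of dict-heavy code, consider a minimal word-count sketch (the name `count_words` is hypothetical). Nearly all of the work happens inside `str.split` and the dict lookups, which already run as C code inside CPython, so compiling the surrounding loop can only remove the interpreter overhead around them.

```python
def count_words(lines):
    """Count word occurrences with a plain dict (illustrative sketch)."""
    counts = {}
    for line in lines:
        for word in line.split():
            # The lookup and the store both go through CPython's dict
            # implementation, compiled or not.
            counts[word] = counts.get(word, 0) + 1
    return counts


print(count_words(["a b a", "b c"]))
```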


I have 3 use cases I'm considering for Python-to-C++
cross-compilers for generating 32-bit Python extension modules
for Python 2.7 for Windows.

1. Parsing UTF-8 files (basic Python with lots of string
processing and dict lookups)

"Parsing" sounds like something that could easily benefit from Cython compilation.


2. Generating UTF-8 files from nested list/dict structures

That should be much faster in Cython, too, simply because iteration on builtin types is much faster than in Python.
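
A minimal sketch of what such a generator might look like, assuming a made-up indented text format (the real output format is not given in the question). The names `render` and `to_utf8` are hypothetical. The loops over dicts and lists are the part that compilation is expected to speed up; the final `encode` call goes through Python's codecs either way.

```python
def render(node, depth=0):
    """Turn nested dicts/lists into a list of indented text lines."""
    indent = u"  " * depth
    lines = []
    if isinstance(node, dict):
        for key in sorted(node):                 # iteration over a builtin dict
            lines.append(indent + u"%s:" % (key,))
            lines.extend(render(node[key], depth + 1))
    elif isinstance(node, (list, tuple)):
        for item in node:                        # iteration over a builtin list
            lines.extend(render(item, depth))
    else:
        lines.append(indent + u"%s" % (node,))   # leaf value
    return lines


def to_utf8(node):
    """Serialise the structure and encode the result as UTF-8 bytes."""
    return u"\n".join(render(node)).encode("utf-8")


print(to_utf8({u"a": u"x", u"b": [1, 2]}))
```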


3. Parsing large ASCII "CSV-like" files and using dicts to
calculate simple statistics like running totals, min, max, etc.

Again, parsing will be much faster, especially when reading from raw C file streams (which would also allow releasing the GIL, in case you want to use multi-threading). The rest may not win that much.
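
For reference, the statistics part could look roughly like the sketch below. It assumes a hypothetical two-column "key,value" layout, since the actual file format is not described, and the name `summarize` is made up. The line splitting is the parsing part; the dict updates are the part that may not win that much. Reading through C-level file I/O and releasing the GIL are not shown.

```python
def summarize(lines, sep=","):
    """Collect running total, min, max and count per key.

    Assumes each non-empty line looks like "key<sep>number".
    """
    stats = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        key, raw_value = line.split(sep, 1)
        value = float(raw_value)
        entry = stats.get(key)
        if entry is None:
            stats[key] = {"total": value, "min": value,
                          "max": value, "count": 1}
        else:
            entry["total"] += value
            entry["count"] += 1
            if value < entry["min"]:
                entry["min"] = value
            if value > entry["max"]:
                entry["max"] = value
    return stats


print(summarize(["a,1", "b,5", "a,3"]))
```

The same function works on an open file object, since iterating over it yields lines.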

A nice feature of Cython is that you do not have to go low-level right away. You can use all the niceness of Python, and only push the code closer to C level where your benchmarks point you. And if you really have to go all the way down to C, it's just a declaration away.
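
One way to find out where your benchmarks point is the standard library profiler. The following is a minimal sketch using `cProfile` and `pstats`; the functions `tokenize`, `parse` and `profile_call` are stand-ins invented for illustration, not part of any tool mentioned here.

```python
import cProfile
import pstats


def tokenize(text):
    """Stand-in for a real tokenizer."""
    return text.split()


def parse(lines):
    """Stand-in for a real parser: counts tokens over all lines."""
    total = 0
    for line in lines:
        total += len(tokenize(line))
    return total


def profile_call(func, *args):
    """Run func(*args) under cProfile; return its result and the stats."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stats = pstats.Stats(profiler)
    stats.sort_stats("cumulative")
    return result, stats


result, stats = profile_call(parse, ["a b c", "d e"])
stats.print_stats(5)   # show the five most expensive entries
```

The functions at the top of that listing are the candidates for moving closer to C level first.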


Are any of these text processing scenarios good use cases for
tools like Cython, Pyrex, or ShedSkin? Are any of these
specifically bad use cases for these tools?

Pyrex isn't worth trying here, simply because you'd have to invest a lot more work to make it as fast as what Cython gives you anyway. ShedSkin may be worth a try, depending on how well you get your ShedSkin module integrated with CPython. (It seems that it has support for building extension modules by now, but I have no idea how well that is fleshed out.)


We've tried Psyco and it has sped up some of our parsing
utilities by 200%. But Psyco doesn't support Python 2.7 yet and
we're committed to using Python 2.7 moving forward.

If 3x is not enough for you, I strongly suggest you try Cython. The C code that it generates compiles nicely in all major Python versions, currently from 2.3 to 3.2.

Stefan

--
http://mail.python.org/mailman/listinfo/python-list
