On Sun, 05 Oct 2014 16:51:25 -0400, Sven Van Caekenberghe <s...@stfx.eu> wrote:

Working with WideStrings is way slower than working with ByteStrings, there is just no way around that. What is especially slow is the automagic switch from ByteString to WideString, for example in a String>>#streamContents: because a #becomeForward: is involved. If that happens for every line or every token, that would be crazy.

Apart from that, the tokenisation is not very efficient: #lines makes a copy of your whole contents, and so do #split: and #trimmed. The algorithm sounds a bit lazy as well; writing it 'on purpose' with an eye for performance might yield better results.

But I guess this is not really an exercise in optimisation. If it is, you should give us the dataset and code (and maybe runnable Python code as reference), with some comments.

Meh; it's an exercise in optimizing the languages, maybe, but not the code. The Python code I'm comparing against (and the "sane" C++ code I keep around, for that matter) makes these exact same decisions in the interest of readability. Indeed, the equivalent from my Python code for that line is:

    with open(path) as f:
        lines = [[word.strip() for word in line.split(',')] for line in f]

This does avoid loading the whole file into RAM just to split it, but is otherwise identical (i.e., strip() allocates an extra string, split() allocates lots of extra strings, and the list comprehension is analogous to the #collect: call, etc.), so it seems a fair comparison to me.
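
If I cared about trimming the Python side further, one obvious tweak (just a sketch, assuming the same comma-separated input; rows is my name for it, nothing standard) would be to make the whole thing lazy so the parsed rows never pile up in memory at once:

    def rows(path):
        # Lazily yield one row of stripped fields at a time instead of
        # materializing the whole list of lists up front; the per-token
        # allocations from split()/strip() are still there, though.
        with open(path) as f:
            for line in f:
                yield [word.strip() for word in line.split(',')]

That only helps memory, not the allocation churn, so I wouldn't expect it to change the timing comparison much.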

If you're curious, the maximum speed I've been able to get out of a carefully crafted, but not unreadable, C implementation that simply passes around a bunch of (start, stop) tuples rather than strings comes down to about 40ms. I know I can do better than that, but I lost interest in rewriting all the string routines. I figure another 10-20ms could come off, at which point it'd finally be as much I/O-bound as anything else.
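
For what it's worth, the same (start, stop) trick translates to Python; here's a minimal sketch (token_spans is a hypothetical name, not a library function) that scans a line and yields index pairs instead of allocating a trimmed substring per token:

    def token_spans(line, sep=','):
        # Yield (start, stop) index pairs for each sep-delimited token,
        # with surrounding whitespace trimmed, rather than allocating a
        # fresh string per token the way split()/strip() do.
        i, n = 0, len(line)
        while i <= n:
            j = line.find(sep, i)
            if j == -1:
                j = n
            start, stop = i, j
            while start < stop and line[start].isspace():
                start += 1
            while stop > start and line[stop - 1].isspace():
                stop -= 1
            yield start, stop
            i = j + 1

    # Slice only when the text is actually needed:
    # fields = [line[a:b] for a, b in token_spans(line)]

In Python the slices still allocate when you eventually need the text, so the win is much smaller than in C, but it shows the shape of the approach.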

--Benjamin
