On Sun, 05 Oct 2014 16:51:25 -0400, Sven Van Caekenberghe <s...@stfx.eu>
wrote:
Working with WideStrings is way slower than working with ByteStrings;
there is just no way around that. What is especially slow is the
automagic switch from ByteString to WideString, for example in a
String>>#streamContents:, because a #becomeForward: is involved. If that
happens for every line or every token, that would be crazy.
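For illustration, a minimal Pharo sketch of the workaround (untested, and
the preallocation size is arbitrary): back the stream with a collection
that is already a WideString, so writing a wide character never triggers
the ByteString-to-WideString #becomeForward: switch.

| out |
"WriteStream grows the WideString as needed, so no ByteString ->
 WideString conversion ever happens mid-write."
out := WriteStream on: (WideString new: 100).
out nextPutAll: 'hello, '.
out nextPut: $Ω. "a character outside the byte range"
out contents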
Apart from that, the tokenisation is not very efficient: #lines makes a
copy of your whole contents, and so do #split: and #trimmed. The
algorithm sounds a bit lazy as well; writing it 'on purpose' with an eye
for performance might yield better results.
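For concreteness, a rough sketch of what writing it 'on purpose' could
look like in Pharo (untested; 'data.csv' is a made-up file name): read
line by line instead of materialising the whole contents, so the big
#lines copy goes away, even though each token is still copied once.

| rows |
rows := OrderedCollection new.
'data.csv' asFileReference readStreamDo: [ :in |
    "One pass over the stream; no whole-contents copy as with #lines."
    [ in atEnd ] whileFalse: [ | line |
        line := in upTo: Character lf.
        rows add: ((line splitOn: $,) collect: [ :token | token trimmed ]) ] ]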
But I guess this is not really an exercise in optimisation. If it is,
you should give us the dataset and code (and maybe runnable Python code
as reference), with some comments.
Meh, it's an exercise in comparing the languages, maybe, but not in
optimizing the code. The Python code
I'm comparing against (and the "sane" C++ code I keep around, for that
matter) makes these exact same decisions in the interest of readability.
Indeed, the equivalent from my Python code for that line is:
with open(path) as f:
    lines = [[word.strip() for word in line.split(',')] for line in f]
This does avoid loading the whole file into RAM just to split it, but is
otherwise identical (i.e., strip() allocates an extra string, split()
allocates lots of extra strings, and the list comprehension is analogous
to the #collect: call, etc.), so it seems a fair comparison to me.
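For reference, the Pharo rendering of that same pipeline, using the
selectors Sven listed above, would presumably be something like this
(with contents standing for the file's text):

contents lines collect: [ :line |
    (line splitOn: $,) collect: [ :word | word trimmed ] ]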
If you're curious, the maximum speed I've been able to get out of a
carefully crafted, but not unreadable, C version that simply passes
around a bunch of (start, stop) tuples, rather than strings, comes down
to about 40ms. I know I can get better than that, but I lost interest in
rewriting all the string routines. I figure another 10ms-20ms could come
off, at which point it'd finally be as much I/O-bound as anything else.
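Transliterated into Pharo terms just to show the idea (hypothetical;
buffer stands for the raw file contents): record each token as a
start -> stop index pair instead of allocating a substring.

| tokens start |
tokens := OrderedCollection new.
start := 1.
buffer doWithIndex: [ :char :i |
    "A separator ends the current token; store only its bounds."
    (char = $, or: [ char = Character lf ]) ifTrue: [
        tokens add: start -> (i - 1).
        start := i + 1 ] ].
start <= buffer size ifTrue: [ tokens add: start -> buffer size ].
"Materialise a token only when it is actually needed:"
buffer copyFrom: tokens first key to: tokens first value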
--Benjamin