On Sun, 05 Oct 2014 16:51:25 -0400, Sven Van Caekenberghe <s...@stfx.eu>
wrote:
Working with WideStrings is way slower than working with ByteStrings;
there is just no way around that. What is especially slow is the
automagic switch from ByteString to WideString, for example in a
String>>#streamContents:, because a #becomeForward: is involved. If that
happens for every line or every token, that would be crazy.
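For illustration, a minimal Pharo sketch of the workaround (untested, and
the preallocation size is arbitrary): back the stream with a collection
that is already a WideString, so writing a wide character never triggers
the ByteString-to-WideString #becomeForward: switch.

| out |
"WriteStream grows the WideString as needed, so no ByteString ->
 WideString conversion ever happens mid-write."
out := WriteStream on: (WideString new: 100).
out nextPutAll: 'hello, '.
out nextPut: $Ω. "a character outside the byte range"
out contents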
Apart from that, the tokenisation is not very efficient: #lines makes a
copy of your whole contents, and so do #split: and #trimmed. The
algorithm sounds a bit lazy as well; writing it 'on purpose' with an eye
for performance might yield better results.
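For concreteness, a rough sketch of what writing it 'on purpose' could
look like in Pharo (untested; 'data.csv' is a made-up file name): read
line by line instead of materialising the whole contents, so the big
#lines copy goes away, even though each token is still copied once.

| rows |
rows := OrderedCollection new.
'data.csv' asFileReference readStreamDo: [ :in |
    "One pass over the stream; no whole-contents copy as with #lines."
    [ in atEnd ] whileFalse: [ | line |
        line := in upTo: Character lf.
        rows add: ((line splitOn: $,) collect: [ :token | token trimmed ]) ] ]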
But I guess this is not really an exercise in optimisation. If it is,
you should give us the dataset and code (and maybe runnable Python code
as reference), with some comments.
Meh, it's an exercise in comparing the languages, maybe, but not in
optimizing the code. The Python code
I'm comparing against (and the "sane" C++ code I keep around, for that
matter) makes these exact same decisions in the interest of readability.
Indeed, the equivalent from my Python code for that line is:
with open(path) as f:
    lines = [[word.strip() for word in line.split(',')] for line in f]
This does avoid loading the whole file into RAM just to split it, but is
otherwise identical (i.e., strip() allocates an extra string, split()
allocates lots of extra strings, and the list comprehension is analogous
to the #collect: call, etc.), so it seems a fair comparison to me.
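For reference, the Pharo rendering of that same pipeline, using the
selectors Sven listed above, would presumably be something like this
(with contents standing for the file's text):

contents lines collect: [ :line |
    (line splitOn: $,) collect: [ :word | word trimmed ] ]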
If you're curious, the maximum speed I've been able to get out of a
carefully crafted, but not unreadable, C version that simply passes
around a bunch of (start, stop) tuples, rather than strings, comes down
to about 40ms. I know I can get better than that, but I lost interest in
rewriting all the string routines. I figure another 10ms-20ms could come
off, at which point it'd finally be as much I/O-bound as anything else.
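Transliterated into Pharo terms just to show the idea (hypothetical;
buffer stands for the raw file contents): record each token as a
start -> stop index pair instead of allocating a substring.

| tokens start |
tokens := OrderedCollection new.
start := 1.
buffer doWithIndex: [ :char :i |
    "A separator ends the current token; store only its bounds."
    (char = $, or: [ char = Character lf ]) ifTrue: [
        tokens add: start -> (i - 1).
        start := i + 1 ] ].
start <= buffer size ifTrue: [ tokens add: start -> buffer size ].
"Materialise a token only when it is actually needed:"
buffer copyFrom: tokens first key to: tokens first value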
--Benjamin