Re: [racket] help me speed up string split?

Ryan Davis Wed, 18 Jun 2014 14:45:26 -0700

On Jun 18, 2014, at 1:26, Robby Findler <ro...@eecs.northwestern.edu> wrote:


> I think the ruby version is smarter than your Racket version.
> Specifically, if you remove the .size (but don't print the output),
> things slow down by a factor of about 2 for me.
> 
> $  time ruby -e 'p File.read("X_train.txt").split(/\s+/).map(&:to_f)'
>> /dev/null
> real 0m1.127s
> user 0m1.071s
> sys 0m0.054s
> $  time ruby -e 'p File.read("X_train.txt").split(/\s+/).map(&:to_f).size'
> 655360
> 
> real 0m0.561s
> user 0m0.512s
> sys 0m0.047s

It's possible I grabbed the wrong input. Doing the equivalent take(10) results 
in roughly the same time. Either way, ruby isn't using lazy lists or anything 
to cheat in the case of .size. You're probably just seeing a time discrepancy 
for output generation. 

% time ruby -e 'p File.read("X_train.txt").split(/\s+/).map(&:to_f).take(10)'
[0.0, 0.28858451, -0.020294171, -0.13290514, -0.9952786, -0.98311061, 
-0.91352645, -0.99511208, -0.98318457, -0.92352702]

real    0m3.896s
user    0m3.611s
sys     0m0.281s

% time ruby -e 'p File.read("X_train.txt").split(/\s+/).map(&:to_f).size'
4124473

real    0m3.886s
user    0m3.612s
sys     0m0.263s

> But meanwhile, it looks like the regexp matching is slower than in
> Ruby. Below are some variations on the code that explore some speedups
> for the computation you intended to happen, I believe. The middle
> version (fetch-whitespace-delimited-string2) takes the same amount of
> time as the version of ruby that is forced to actually compute the
> list. So maybe there are still more improvements to be done to regexp
> port matching, at least for certain regexps, I'm not sure.

The first one works, but takes ~55k ms. :(

I could never get the second one to work. It comes back instantly w/ 0. I think 
my input file doesn't QUITE conform to what you're looking for. I snuck in an 
extra read-char in before the loop, but still no go.

Either way, it's a really painful way to think about this type of task.

"Read it all in, split on whitespace, convert everything to floats"

vs

"Create a function that takes a function, opens the input, sets up a loop, 
calls the given function, compares the result and if it is truthy, appends the 
string converted to a number to the front of a list (which must be later 
reversed). The passed function needs to set up a loop, read in a character, and 
if that character is not eof, space, or newline it loops concatinating that 
character with a list of previously found characters. If eof, or whitespace is 
discovered, it reverses the collected list and converts it to a string."

That reads more like what you'd write in C rather than a beautiful functional 
language with a very complete library. It took me several times longer to 
convert your code to English as it did to write several of these variants in 
ruby, racket, or mathematica. While it is much more nuanced, I just don't think 
it comes as naturally or gracefully as the task as laid out in my head. Even my 
cleanest (and fastest!) versions took several iterations to come by and are a 
bit of a stretch:

        (define path (path->string (expand-user-path "~/Desktop/X_train.txt")))
        
        (define (gc)
          (collect-garbage)(collect-garbage)(collect-garbage))
        
        (define (go fn)
          (time (take (with-input-from-file path fn) 10)))
        
        (gc)
        (go (lambda ()    ; cpu time: 10428 real time: 10410 gc time:  631 
(consistently ~1/2 the gc)
              (for/list ([s (in-port read)])
                s)))
        
        (gc)
        (go (lambda ()    ; cpu time:  9883 real time:  9872 gc time: 1182
              (for/list ([s (in-port (curry read-string 16))])
                (string->number (string-trim s)))))
        
        (gc)
        (go (lambda ()    ; cpu time: 15006 real time: 14981 gc time: 1154
              (for/list ([s (in-port (curry read-string 16))])
                (read (open-input-string s)))))

I would like to think that the regexp engine can be sped up to the point where 
"read it, split it, convert it" is a valid option. Until then, I'll stick to 
for/list + in-port + read, but remembering that that particular combination is 
the right one is going to be difficult.

P.S. Mathematica only takes ~8 seconds and is no speed demon for this type of 
task:

In[2]:= AbsoluteTiming[
 Take[Flatten[Import["~/Desktop/X_train.txt", "Table"]], 10]]

Out[2]= {8.443284, {0.288585, -0.0202942, -0.132905, -0.995279, \
-0.983111, -0.913526, -0.995112, -0.983185, -0.923527, -0.934724}}

more importantly, I wrote that from memory.


____________________
  Racket Users list:
  http://lists.racket-lang.org/users

Re: [racket] help me speed up string split?

Reply via email to