Sorry, I forgot to paste in the runtime for the APL version. It's approximately *6:30*. That's only 7 times slower than the Lisp version doing the same thing which is not fantastic, but acceptable I suppose.
Here's the source to the Lisp version. As you can see, it tries really hard to replicate the behaviour of the APL version, with all that copying. (defparameter *result* (time (with-open-file (s "apjs492452t1_mrt.txt") (let ((result (make-array '(0 11)))) (loop for line = (read-line s nil) while line do (let ((parts (split-sequence:split-sequence #\Space line :remove-empty-subseqs t)) (new-array (make-array (list (1+ (array-dimension result 0)) 11))) (n (reduce #'* (array-dimensions result)))) (loop for i from 0 below n do (setf (row-major-aref new-array i) (row-major-aref result i))) (loop for i from n repeat 10 for p in parts do (setf (row-major-aref new-array i) (parse-number:parse-number p))) (setf (row-major-aref new-array (+ n 10)) (nth 10 parts)) (setq result new-array))))))) On 18 January 2017 at 03:58, Elias Mårtenson <loke...@gmail.com> wrote: > You are right that doing array allocations in any language is slow, and in > order to get some kind of baseline to compare things to, I implemented the > same algorithm in Lisp, and I did so in the worst possible way: By > completely reallocating the array for each row, and copying the old content > into the new one (code in the end of this message). I did it that way > because that is what the APL code ends up doing, so I didn't want to give > the Lisp program any benefits.This was, as you may imagine, very slow. It > took *51* seconds to load the entire file. > > The APL version that I posted in my previous message takes > > Now, for a solution. Reading the file twice is quite ugly, and I'd rather > avoid doing so. > > I could rewrite the Lisp program to in-place reallocation of the array, > which would speed up the program by several orders of magnitude. The > problem is that there is no mechanism in GNU APL to do the same. I think > adding a new construct to the language to do that would be ugly, but > perhaps there is a way for GNU APL to detect the specific “append to the > end of an array” idiom so that it could use an optimised path that doesn't > require constructing a new array. *Jürgen*, is that even feasible? > > Now, reading a CSV file like this is a very common case, and I am of the > opinion that having a dedicated native function for this purpose would > solve perhaps 75% of use cases where this is an issue. I'll be happy to > implement a flexible function that can do this, which would make GNU APL > much more pleasant to use for data analysis. > > Jürgen, if I were to implement this, would you prefer it to be a > dynamically loaded library, or would you prefer having it as a > quad-function? > > Regards, > Elias > > > On 18 January 2017 at 02:33, Nick Lobachevsky <ibeam2...@gmail.com> wrote: > >> Elias, >> >> In just about any language, iterative string catenation is a real >> performance killer owing to the repeated growing memory allocation >> which just gets bigger with every iteration. >> >> I would restructure the algorithm to work more like this: (sorry, on >> a Windoze box and no APL, so metacode only) >> >> Open the file, read the contents of the entire file, close the file >> Change NL to separator, then partition the whole thing (Z) >> Check to see that the length of the partitioned string divides evenly >> by pattern length >> Create a temporary of only the numeric items. As they should all be >> scalar, you should be able to do it in a single execute and get a >> numeric vector of the correct length. >> Assign the numeric vector back into Z. You may want to do a ravel >> each if you would rather have one element vectors >> Reshape Z to have the same number columns as in pattern >> >> Dyalog and some other APLs (APLX?) have a []CSV function. The problem >> gets a bit harder when you have optional quotes around strings, >> embedded separators, and have to figure out yourself whether something >> is numeric or char. >> >> >> >> On 1/17/17, Elias Mårtenson <loke...@gmail.com> wrote: >> > I wanted to use GNU APL to work on a dataset of star data. The file >> > consists of 34030 lines of the following form: >> > >> > 892376 3813 4.47 0.4699 1.532 0.007 7306.69 0.823 0.4503 0 --- >> > 1026146 4261 4.57 0.6472 14.891 0.12 11742.56 1.405 0.7229 0 --- >> > 1026474 4122 4.56 0.5914 1.569 0.006 30471.8 1.204 0.6061 0 --- >> > 1162635 3760 4.77 0.4497 15.678 0.019 10207.47 0.978 0.5445 1 --- >> > >> > I wrote a generic CSV loader to handle this (source code at the end of >> this >> > email), and loaded the data like so: >> > >> > * z ← 'nnnnnnnnnns' read_csv 'apjs492452t1_mrt.txt'* >> > >> > This took many minutes to load, which in my opinion shouldn't happen. >> > >> > Now, I have a few questions: >> > >> > >> > 1. Is there a way to speed up this code? >> > 2. Is there something that could be done on the GNU APL >> implementation >> > side to make this faster? >> > 3. Shouldn't we have a generic ⎕CSV function or something like that >> > which would be able to load CSV files in milliseconds regardless of >> > size? >> > This should be trivial to do in C++. >> > >> > Here's the code in question: >> > >> > ∇Z ← type convert_entry value >> > →('n'≡type)/numeric >> > →('s'≡type)/string >> > ⎕ES 'Illegal conversion type' >> > numeric: >> > Z←⍎value >> > →end >> > string: >> > Z←value >> > end: >> > ∇ >> > >> > ∇Z ← pattern read_csv filename ;fd;line;separator >> > separator ← ' ' >> > Z ← 0 (↑⍴pattern) ⍴ ⍬ >> > fd ← 'r' FIO∆fopen filename >> > >> > next: >> > line ← FIO∆fgets fd ⍝ Read one line from the file >> > →(⍬≡line)/end >> > →(10≠line[⍴line])/skip_nl ⍝ If the line ends in a newline >> > line ← line[⍳¯1+⍴line] ⍝ Remove the newline >> > skip_nl: >> > line ← ⎕UCS line >> > Z ← Z⍪ pattern convert_entry¨ (line≠separator) ⊂ line >> > →next >> > end: >> > >> > FIO∆fclose fd >> > ∇ >> > >> > >