Thanks for the pointer. About ⎕FIO ¯1: I cannot find it in the displayed help (i.e. using ⎕FIO ''). What is its output resolution? Is it similar to GetTickCount on Windows, or to gettimeofday on Linux?

Regards,
Ala'a
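P.S. In case there is no documented answer, one way to estimate the granularity empirically is to read the clock back-to-back many times and take the smallest non-zero step between neighbouring samples. A sketch (the ⊣ only forces the lambda to be monadic; whatever unit ⎕FIO ¯1 reports in, the result is at most one tick of that unit, though it also includes the call overhead, so it bounds the usable resolution rather than the raw tick size):

      t ← {⎕FIO ¯1 ⊣ ⍵}¨⍳1000    ⍝ 1000 back-to-back clock samples
      d ← (1↓t) - ¯1↓t           ⍝ steps between neighbouring samples
      ⌊/(d≠0)/d                  ⍝ smallest non-zero step observed

If the minimum comes out as 1, the clock advances at its unit resolution, gettimeofday-style; a large minimum (or all-zero steps, in which case increase the sample count) would suggest GetTickCount-style coarse ticks.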
On Mon, Sep 12, 2016 at 2:23 PM, Juergen Sauermann
<juergen.sauerm...@t-online.de> wrote:
> Hi again,
>
> sorry, I meant:
>
> downcase←{ ⎕UCS (32×(⍵≥65)∧⍵≤90)+⍵←⎕UCS ⍵ }
>
> /// Jürgen
>
>
> On 09/12/2016 12:10 PM, Juergen Sauermann wrote:
>
> Hi Ala'a,
>
> you can use ⎕FIO ¯1 to find out where the time is spent, e.g.:
>
> T←⎕FIO ¯1
> file ← 'test.txt'
> 'T1:' ((T←⎕FIO ¯1)-T)
> ⎕ ← ⍴w ← words ftxt file
> 'T2:' ((T←⎕FIO ¯1)-T)
> ⎕ ← ⍴u ← ∪w
> 'T3:' ((T←⎕FIO ¯1)-T)
> desc 39 2 ⍴ (⍪u),{+/(⊂⍵)∘.≡w}¨u
> 'T4:' ((T←⎕FIO ¯1)-T)
>
> Your downcase function fails on my machine:
>
> ⎕ ← ⍴w ← words ftxt file
> INDEX ERROR+
> λ1[1]  λ←(a,⎕AV)[(A,⎕AV)⍳⍵]
>          ^          ^
>
>       )MORE
> ⎕IO=1 offending index=282 max index=282
>
> probably due to a character in my test file that is not contained in
> ⎕AV. You should use ⎕UCS instead of ⎕AV to avoid that:
>
> downcase←{ ⎕UCS (32×(T≥65)∧T≤90)+⍵←⎕UCS ⍵ }
>
> /// Jürgen
>
>
> On 09/11/2016 08:23 PM, Ala'a Mohammad wrote:
>
> Just an update as a reference: I'm now able to parse the big.txt file
> (without WS FULL or a killed process), but it takes around 2 hours and
> 20 minutes, ±10 minutes (around 1M words, of which 30K are unique).
> The process reaches 1GiB (after parsing the words) and tops that with
> another 100MiB during the sequential Each (thus a maximum of 1.1GiB).
>
> The only change is scanning each unique word against the whole words
> vector.
>
> Below is the code with a sample timed run.
>
> Regards,
>
> Ala'a
>
> ⍝ fhist.apl
> a ← 'abcdefghijklmnopqrstuvwxyz' ◊ A ← 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
> downcase ← { (a,⎕AV)[(A,⎕AV)⍳⍵] }
> nl ← ⎕UCS 13 ◊ cr ← ⎕UCS 10 ◊ tab ← ⎕UCS 9
> nonalpha ← nl, cr, tab, ' 0123456789()[]!?%$,.:;/+*=<>-_#"`~@&'
> alphamask ← { ~ ⍵ ∊ nonalpha }
> words ← { (alphamask ⍵) ⊂ downcase ⍵ }
> desc ← {⍵[⍒⍵[;2];]}
> ftxt ← { ⎕FIO[26] ⍵ }
>
> file ← '/misc/big.txt' ⍝ ~ 6.2M
> ⎕ ← ⍴w ← words ftxt file
> ⎕ ← ⍴u ← ∪w
> desc 39 2 ⍴ (⍪u),{+/(⊂⍵)∘.≡w}¨u
> )OFF
>
> : time apl -s -f fhist.apl
> 1098281
> 30377
> the            80003
> of             40025
> to             28760
> in             22048
> for             6936
> by              6736
> be              6154
> or              5349
> all             4141
> this            4058
> are             3627
> other           1488
> before          1363
> should          1297
> over            1282
> your            1276
> any             1204
> our             1065
> holmes           450
> country          417
> world            355
> project          286
> gutenberg        262
> laws             233
> sir              176
> series           128
> sure             123
> sherlock         101
> ebook             85
> copyright         69
> changing          44
> check             38
> arthur            30
> adventures        17
> redistributing     7
> header             7
> doyle              5
> downloading        5
> conan              4
>
> apl -s -f fhist.apl  8901.96s user 5.78s system 99% cpu 2:28:38.61 total
>
>
> On Sat, Sep 10, 2016 at 12:02 PM, Ala'a Mohammad <amal...@gmail.com> wrote:
>
> Thanks to all for the input.
>
> Replacing Find and each-OR with Match helped; I can now parse a 159K
> (~1545-line) text file (a sample chunk from big.txt).
>
> The strange thing I'm trying to understand is that the APL process
> (when fed the 159K text file) keeps allocating memory until it reaches
> 2.7GiB, then, after printing the result, settles down to 50MiB. Why do
> I need 2.7GiB? Are there any memory utilities (e.g. a garbage-collection
> utility) that could be used to mitigate this issue?
> Here is the updated code:
>
> a ← 'abcdefghijklmnopqrstuvwxyz'
> A ← 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
> downcase ← { (a,⎕AV)[(A,⎕AV)⍳⍵] }
> nl ← ⎕UCS 13 ◊ cr ← ⎕UCS 10 ◊ tab ← ⎕UCS 9
> nonalpha ← nl, cr, tab, ' 0123456789()[]!?%$,.:;/+*=<>-_#"`~@&'
> alphamask ← { ~ ⍵ ∊ nonalpha }
> words ← { (alphamask ⍵) ⊂ downcase ⍵ }
> hist ← { (⍪∪⍵),+/(∪⍵)∘.≡⍵ } ⍝ as suggested by Kacper
> desc ← {⍵[⍒⍵[;2];]}
> ftxt ← { ⎕FIO[26] ⍵ }
> fhist ← { hist words ftxt ⍵ }
>
> file ← '/misc/llaa' ⍝ llaa contains 1546 text lines
> ⎕ ← ⍴w ← words ftxt file
> ⎕ ← ⍴u ← ∪w
> desc 39 2 ⍴ fhist file
>
> And here is a sample run:
>
> : apl -s -f fhist.apl
> 30186
> 4155
> the            1560
> to              804
> of              781
> in              493
> for             219
> be              173
> holmes          164
> your            132
> this            114
> all              99
> by               97
> are              97
> or               73
> other            56
> over             51
> our              48
> should           47
> before           43
> sherlock         39
> any              35
> sir              26
> sure             13
> country           9
> project           6
> gutenberg         6
> ebook             5
> adventures        5
> world             5
> arthur            4
> conan             4
> doyle             4
> series            2
> copyright         2
> laws              2
> check             2
> header            2
> changing          1
> downloading       1
> redistributing    1
>
> Also attached is the sample input file.
>
> Regards,
>
>
> On Sat, Sep 10, 2016 at 9:20 AM, Kacper Gutowski <mwgam...@gmail.com> wrote:
>
> On 9 September 2016 at 23:39, Ala'a Mohammad wrote:
>
> the errors happened inside the 'hist' function, and I presume mostly
> due to the jot-dot-find (if I understand correctly, operating on a
> matrix of size equal to unique-length × words-length)
>
> Try (∪⍵)∘.≡⍵ instead of ∨/¨(∪⍵)∘.⍷⍵.
>
> -k
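A back-of-envelope for the 2.7GiB question above: hist's (∪⍵)∘.≡⍵ materialises a (⍴∪⍵)×(⍴⍵) boolean matrix before +/ collapses it. For the 159K sample that is 4155×30186 ≈ 1.25×10⁸ elements, and if the interpreter holds each element in a general-purpose cell of a couple of dozen bytes (an assumption about GNU APL's internals, not something I have verified), that one intermediate is on the order of 3GiB; it is released as soon as the expression finishes, which would explain the drop back to 50MiB.

The 2h20m big.txt run has the same shape of problem: {+/(⊂⍵)∘.≡w}¨u makes ⍴u separate passes over all of w at the APL level. A sketch that avoids both the outer product and the each-loop: map every word to its index in u with one dyadic ⍳, sort so equal indices sit next to each other, and read the counts off the run lengths. Memory stays O(⍴w). The names idx, s, b, st, cnt are mine, and it assumes u⍳w on nested vectors compares items with Match, as the code above already relies on:

      idx ← u ⍳ w              ⍝ every word ↦ its position in u
      s   ← idx[⍋idx]          ⍝ ascending, so equal positions are adjacent
      b   ← 1,(1↓s)≠¯1↓s       ⍝ 1 marks the first element of each run
      st  ← b/⍳⍴s              ⍝ starting position of each run
      cnt ← (1↓st,1+⍴s)-st     ⍝ run lengths = occurrence counts
      desc 39 2 ⍴ (⍪u),cnt     ⍝ cnt[i] counts u[i], since u←∪w makes s cover ⍳⍴u

The dominant cost becomes the single ⍳ and the ⍋ instead of 30377 passes over a million-word vector, so it ought to cut the runtime substantially; only a timed run with the 'T1:'-style probes can confirm by how much.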
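For reference, the reason the ⎕UCS downcase cannot raise an INDEX ERROR: codepoints 65-90 are 'A'-'Z' and each lowercase letter sits exactly 32 higher, so the expression adds 32 inside that range and 0 everywhere else; nothing is ever used as an index into a fixed alphabet. With the corrected definition from the top of the thread:

      downcase 'Hello, Wörld!'
hello, wörld!

Note that only ASCII letters are folded; an uppercase non-ASCII letter such as 'Ö' (codepoint 214) passes through unchanged, which may or may not matter for big.txt.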