Hi Ala'a,

you can use ⎕FIO ¯1 to find out where the time is spent, e.g.:

      T←⎕FIO ¯1
      file ← 'test.txt'
      'T1:' ((T←⎕FIO ¯1)-T)
      ⎕ ← ⍴w ← words ftxt file
      'T2:' ((T←⎕FIO ¯1)-T)
      ⎕ ← ⍴u ← ∪w
      'T3:' ((T←⎕FIO ¯1)-T)
      desc 39 2 ⍴ (⍪u),{+/(⊂⍵)∘.≡w}¨u
      'T4:' ((T←⎕FIO ¯1)-T)

Your downcase function fails on my machine:

      ⎕ ← ⍴w ← words ftxt file
INDEX ERROR+
λ1[1]  λ←(a,⎕AV)[(A,⎕AV)⍳⍵]
       ^        ^
      )MORE
⎕IO=1 offending index=282 max index=282

probably due to a character in my test file that is not contained in ⎕AV. You should use ⎕UCS instead of ⎕AV to avoid that:

      ⍝ C holds the code points of ⍵; a separate name keeps the timer variable T above intact
      downcase←{ ⎕UCS (32×(C≥65)∧C≤90)+C←⎕UCS ⍵ }

/// Jürgen

On 09/11/2016 08:23 PM, Ala'a Mohammad wrote:
Just an update as a reference: I'm now able to parse the big.txt file (without WS FULL or a killed process), but it takes around 2 hours and 20 minutes, ±10 minutes (around 1M words, of which 30K are unique). The process reaches 1 GiB after parsing the words and grows by another 100 MiB during the sequential 'Each', for a maximum of about 1.1 GiB. The only change is scanning each unique word against the whole words vector. Below is the code with a sample timed run.

Regards,
Ala'a

    ⍝ fhist.apl
    a ← 'abcdefghijklmnopqrstuvwxyz' ◊ A ← 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    downcase ← { (a,⎕AV)[(A,⎕AV)⍳⍵] }
    cr ← ⎕UCS 13 ◊ nl ← ⎕UCS 10 ◊ tab ← ⎕UCS 9
    nonalpha ← nl, cr, tab, ' 0123456789()[]!?%$,.:;/+*=<>-_#"`~@&'
    alphamask ← { ~ ⍵ ∊ nonalpha }
    words ← { (alphamask ⍵) ⊂ downcase ⍵ }
    desc ← {⍵[⍒⍵[;2];]}
    ftxt ← { ⎕FIO[26] ⍵ }
    file ← '/misc/big.txt' ⍝ ~ 6.2M
    ⎕ ← ⍴w ← words ftxt file
    ⎕ ← ⍴u ← ∪w
    desc 39 2 ⍴ (⍪u),{+/(⊂⍵)∘.≡w}¨u
    )OFF

: time apl -s -f fhist.apl
1098281
30377
 the             80003
 of              40025
 to              28760
 in              22048
 for              6936
 by               6736
 be               6154
 or               5349
 all              4141
 this             4058
 are              3627
 other            1488
 before           1363
 should           1297
 over             1282
 your             1276
 any              1204
 our              1065
 holmes            450
 country           417
 world             355
 project           286
 gutenberg         262
 laws              233
 sir               176
 series            128
 sure              123
 sherlock          101
 ebook              85
 copyright          69
 changing           44
 check              38
 arthur             30
 adventures         17
 redistributing      7
 header              7
 doyle               5
 downloading         5
 conan               4
apl -s -f fhist.apl  8901.96s user 5.78s system 99% cpu 2:28:38.61 total

On Sat, Sep 10, 2016 at 12:02 PM, Ala'a Mohammad <amal...@gmail.com> wrote:

Thanks to all for the input. Replacing Find and Each OR with Match helped; now I'm parsing a 159K (~1545 lines) text file (a sample chunk from big.txt). The strange thing, which I'm trying to understand, is that the APL process (when fed the 159K text file) starts allocating memory until it reaches 2.7 GiB, then settles down to 50 MiB after printing the result. Why do I need 2.7 GiB? Are there any memory utilities (e.g. a garbage collection utility) that can be used to mitigate this issue?
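A rough back-of-the-envelope check may help with the 2.7 GiB question. The following is only a sketch in Python, using the sizes printed by the 159K sample run further down in this thread (30186 words, 4155 unique) and assuming that the outer product (∪⍵)∘.≡⍵ is fully materialized before the +/ reduction. It says nothing about how GNU APL actually stores values; it only counts result cells.

```python
# Sizes reported in the thread for the 159K sample file.
unique_words = 4155
total_words = 30186
peak_bytes = 2.7 * 2**30  # reported peak of 2.7 GiB

# An outer product (∪w)∘.≡w has one result cell per (unique word, word) pair.
cells = unique_words * total_words
bytes_per_cell = peak_bytes / cells

print(cells)                     # 125422830
print(round(bytes_per_cell, 1))  # 23.1

# Same count for the full big.txt run (30377 unique words, 1098281 words).
full_cells = 30377 * 1098281
print(full_cells)                # 33362481937
```

Under that assumption the reported peak works out to roughly 23 bytes per result cell. The same matrix for the full big.txt would have about 3.3e10 cells, which is consistent with the later change to scan one unique word at a time instead of building the whole matrix.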
Here is the updated code:

    a ← 'abcdefghijklmnopqrstuvwxyz'
    A ← 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    downcase ← { (a,⎕AV)[(A,⎕AV)⍳⍵] }
    cr ← ⎕UCS 13 ◊ nl ← ⎕UCS 10 ◊ tab ← ⎕UCS 9
    nonalpha ← nl, cr, tab, ' 0123456789()[]!?%$,.:;/+*=<>-_#"`~@&'
    alphamask ← { ~ ⍵ ∊ nonalpha }
    words ← { (alphamask ⍵) ⊂ downcase ⍵ }
    hist ← { (⍪∪⍵),+/(∪⍵)∘.≡⍵ } ⍝ as suggested by Kacper
    desc ← {⍵[⍒⍵[;2];]}
    ftxt ← { ⎕FIO[26] ⍵ }
    fhist ← { hist words ftxt ⍵ }
    file ← '/misc/llaa' ⍝ llaa contains 1546 text lines
    ⎕ ← ⍴w ← words ftxt file
    ⎕ ← ⍴u ← ∪w
    desc 39 2 ⍴ fhist file

And here is a sample run:

: apl -s -f fhist.apl
30186
4155
 the              1560
 to                804
 of                781
 in                493
 for               219
 be                173
 holmes            164
 your              132
 this              114
 all                99
 by                 97
 are                97
 or                 73
 other              56
 over               51
 our                48
 should             47
 before             43
 sherlock           39
 any                35
 sir                26
 sure               13
 country             9
 project             6
 gutenberg           6
 ebook               5
 adventures          5
 world               5
 arthur              4
 conan               4
 doyle               4
 series              2
 copyright           2
 laws                2
 check               2
 header              2
 changing            1
 downloading         1
 redistributing      1

The sample input file is also attached.

Regards,

On Sat, Sep 10, 2016 at 9:20 AM, Kacper Gutowski <mwgam...@gmail.com> wrote:

On 9 September 2016 at 23:39, Ala'a Mohammad wrote:

the errors happened inside the 'hist' function, and I presume they are mostly due to the jot-dot find (if I understand correctly, it operates on a matrix of size unique-length × words-length)

Try (∪⍵)∘.≡⍵ instead of ∨/¨(∪⍵)∘.⍷⍵.

-k
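For comparison, here is a minimal Python sketch of the same word histogram that counts in a single pass over the words instead of comparing every unique word against every word. It is not taken from the thread: the function names mirror the ones in fhist.apl, `NONALPHA` copies the separator set from `nonalpha`, the downcase step follows the code-point idea (add 32 to code points 65..90, leave everything else alone), and the hash-based `Counter` is a swapped-in technique rather than what the APL code does.

```python
from collections import Counter

# Same separators as `nonalpha` in fhist.apl: CR, LF, tab, space, digits, punctuation.
NONALPHA = set('\r\n\t 0123456789()[]!?%$,.:;/+*=<>-_#"`~@&')

def downcase(text):
    """Lowercase only ASCII A-Z by adding 32 to the code point; other characters pass through."""
    return ''.join(chr(ord(c) + 32) if 65 <= ord(c) <= 90 else c for c in text)

def words(text):
    """Split the downcased text at every NONALPHA character, dropping empty pieces."""
    out, current = [], []
    for c in downcase(text):
        if c in NONALPHA:
            if current:
                out.append(''.join(current))
                current = []
        else:
            current.append(c)
    if current:
        out.append(''.join(current))
    return out

def hist(word_list):
    """Count every word in one pass; returns (word, count) pairs, largest count first."""
    return Counter(word_list).most_common()

sample = "The Adventures of Sherlock Holmes, by Sir Arthur Conan Doyle. The café, the END"
for word, count in hist(words(sample))[:3]:
    print(word, count)
```

Because the downcase step works on code points rather than on a fixed lookup table, a character such as the é in the sample simply passes through unchanged instead of raising an index error.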