Just an update as a reference: I'm now able to parse the big.txt file (without WS FULL or a killed process), but it takes around 2 hours and 20 minutes, +/- 10 minutes (around 1M words, of which 30K are unique). The process reaches 1GiB (after parsing the words), and tops that with an extra 100MiB during the sequential Each (thus a max of 1.1GiB).
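To put rough numbers on that, here is a back-of-the-envelope model (in Python, as an illustration only; it assumes one byte per boolean element, and GNU APL's real per-cell footprint is larger) of why materialising the full outer product at once exhausts memory while the per-unique-word Each stays modest:

```python
n_words  = 1098281  # total words in big.txt, as reported by the timed run
n_unique = 30377    # unique words

# (∪w)∘.≡w materialises the whole unique-by-words boolean matrix at once:
full_matrix_bytes = n_unique * n_words
print(full_matrix_bytes / 2**30)   # roughly 31 GiB even at one byte per cell

# {+/(⊂⍵)∘.≡w}¨u only needs one row of that matrix per iteration:
one_row_bytes = n_words
print(one_row_bytes / 2**20)       # roughly 1 MiB transient per unique word
```

So the per-unique-word loop trades memory for time: the same number of comparisons is made, but only a single row is live at any moment.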
The only change is scanning each unique word against the whole words vector. Below is the code with a sample timed run.

Regards,

Ala'a

⍝ fhist.apl
a ← 'abcdefghijklmnopqrstuvwxyz' ◊ A ← 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
downcase ← { (a,⎕AV)[(A,⎕AV)⍳⍵] }
cr ← ⎕UCS 13 ◊ nl ← ⎕UCS 10 ◊ tab ← ⎕UCS 9  ⍝ 13 is CR, 10 is LF
nonalpha ← nl, cr, tab, ' 0123456789()[]!?%$,.:;/+*=<>-_#"`~@&'
alphamask ← { ~ ⍵ ∊ nonalpha }
words ← { (alphamask ⍵) ⊂ downcase ⍵ }
desc ← {⍵[⍒⍵[;2];]}
ftxt ← { ⎕FIO[26] ⍵ }

file ← '/misc/big.txt' ⍝ ~ 6.2M
⎕ ← ⍴w ← words ftxt file
⎕ ← ⍴u ← ∪w
desc 39 2 ⍴ (⍪u),{+/(⊂⍵)∘.≡w}¨u
)OFF

: time apl -s -f fhist.apl
1098281
30377
the 80003
of 40025
to 28760
in 22048
for 6936
by 6736
be 6154
or 5349
all 4141
this 4058
are 3627
other 1488
before 1363
should 1297
over 1282
your 1276
any 1204
our 1065
holmes 450
country 417
world 355
project 286
gutenberg 262
laws 233
sir 176
series 128
sure 123
sherlock 101
ebook 85
copyright 69
changing 44
check 38
arthur 30
adventures 17
redistributing 7
header 7
doyle 5
downloading 5
conan 4
apl -s -f fhist.apl  8901.96s user 5.78s system 99% cpu 2:28:38.61 total

On Sat, Sep 10, 2016 at 12:02 PM, Ala'a Mohammad <amal...@gmail.com> wrote:
> Thanks to all for the input,
>
> Replacing Find and Each OR with Match helped; now I'm parsing a 159K
> (~1545 lines) text file (a sample chunk from big.txt).
>
> The strange thing I'm trying to understand is that the APL process
> (when fed the 159K text file) starts allocating memory until it
> reaches 2.7GiB, then after printing the result settles down to 50MiB.
> Why do I need 2.7GiB? Is there any memory utility (i.e. a garbage
> collection utility) which can be used to mitigate this issue?
>
> Here is the updated code:
>
> a ← 'abcdefghijklmnopqrstuvwxyz'
> A ← 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
> downcase ← { (a,⎕AV)[(A,⎕AV)⍳⍵] }
> nl ← ⎕UCS 13 ◊ cr ← ⎕UCS 10 ◊ tab ← ⎕UCS 9
> nonalpha ← nl, cr, tab, ' 0123456789()[]!?%$,.:;/+*=<>-_#"`~@&'
> alphamask ← { ~ ⍵ ∊ nonalpha }
> words ← { (alphamask ⍵) ⊂ downcase ⍵ }
> hist ← { (⍪∪⍵),+/(∪⍵)∘.≡⍵ } ⍝ as suggested by Kacper
> desc ← {⍵[⍒⍵[;2];]}
> ftxt ← { ⎕FIO[26] ⍵ }
> fhist ← { hist words ftxt ⍵ }
>
> file ← '/misc/llaa' ⍝ llaa contains 1546 text lines
> ⎕ ← ⍴w ← words ftxt file
> ⎕ ← ⍴u ← ∪w
> desc 39 2 ⍴ fhist file
>
> And here is a sample run:
>
> : apl -s -f fhist.apl
> 30186
> 4155
> the 1560
> to 804
> of 781
> in 493
> for 219
> be 173
> holmes 164
> your 132
> this 114
> all 99
> by 97
> are 97
> or 73
> other 56
> over 51
> our 48
> should 47
> before 43
> sherlock 39
> any 35
> sir 26
> sure 13
> country 9
> project 6
> gutenberg 6
> ebook 5
> adventures 5
> world 5
> arthur 4
> conan 4
> doyle 4
> series 2
> copyright 2
> laws 2
> check 2
> header 2
> changing 1
> downloading 1
> redistributing 1
>
> Also attached the sample input file.
>
> Regards,
>
> On Sat, Sep 10, 2016 at 9:20 AM, Kacper Gutowski <mwgam...@gmail.com> wrote:
>> On 9 September 2016 at 23:39, Ala'a Mohammad wrote:
>>> the errors happened inside 'hist' function, and I presume mostly due
>>> to the jot dot find (if I understand correctly, operating on a matrix of
>>> length equal to: unique-length * words-length)
>>
>> Try (∪⍵)∘.≡⍵ instead of ∨/¨(∪⍵)∘.⍷⍵.
>>
>> -k
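As a footnote to Kacper's suggestion, here is a rough Python model of the two idioms (an illustration only, on my reading of the APL: ⍷ amounts to a substring search inside each word, ≡ to an exact whole-word match), plus the usual single-pass hash-table alternative for comparison:

```python
from collections import Counter

words = ["the", "then", "he", "the"]
unique = list(dict.fromkeys(words))  # like ∪⍵, keeps first-seen order

# ∨/¨(∪⍵)∘.⍷⍵ -- substring search: 'he' also "hits" inside 'the' and 'then',
# so the counts are not a word histogram at all
find_counts = {u: sum(u in w for w in words) for u in unique}

# (∪⍵)∘.≡⍵ -- exact match: a true histogram, but still
# O(len(unique) * len(words)) comparisons
match_counts = {u: sum(u == w for w in words) for u in unique}

# Single pass over the words with a hash table: O(len(words)) updates
fast_counts = Counter(words)

print(find_counts)   # {'the': 3, 'then': 1, 'he': 4}
print(match_counts)  # {'the': 2, 'then': 1, 'he': 1}
```

This also shows that ≡ is not just faster to reduce over than ∨/¨ of ⍷: the substring-based counts were wrong for a histogram in the first place.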