Since the Linux "free" command knows the difference between how much real
memory vs. virtual memory is available, it may be useful to use a similar
method.

Blake


On Tue, Dec 28, 2021 at 11:34 AM Dr. Jürgen Sauermann <
mail@jürgen-sauermann.de> wrote:

> Hi Russ,
>
> it has turned out to be very difficult to find a reliable way to figure out
> how much memory is available for a process like the GNU APL interpreter.
> One (of several) reasons for that is that GNU/Linux tells you that it has
> more memory than it really has (so-called over-commitment). You can
> malloc() more memory than you have, and malloc() will return virtual memory
> (and no error) when you ask for it. However, if you access that virtual
> memory at a later point in time (i.e. request real memory for it), the
> process may crash badly if the real memory is not available.
>
> GNU APL deals with this by assuming (by default) that the total memory
> available for the GNU APL process is about 2 GB.
>
> You can increase that amount (outside apl) by setting a larger memory
> limit, probably with:
>
> *ulimit --virtual-memory-size* or *ulimit -v* (depending on platform)
>
> in the shell or script before starting apl. However, note that:
>
> * *ulimit -v unlimited* will not work because this is the default and
>   GNU APL will apply its default of 2 GB in this case,
> * the WS FULL behaviour of GNU APL (and ⎕WA) will become unreliable. You
>   must ensure that GNU APL will get as much memory as you promised it
>   with ulimit, and
> * all GNU APL cells have the same size (e.g. 24 bytes on a 64-bit CPU, see
>   *apl -l37*) even if the cells contain only Booleans.
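>
> As a rough illustration of the last point, the cell size gives an
> order-of-magnitude estimate for the result of wl ∘.≡ uniqwords in the two
> runs quoted below. This is only a sketch: it assumes 24 bytes per cell
> (check the value on your own build) and ignores array headers and any
> temporaries, so the real footprint will differ.
>
> ```apl
>       ⍝ rows × columns × assumed bytes per cell, scaled to GB (1E9 bytes)
>       (36564 × 5695 × 24) ÷ 1E9    ⍝ 5000-line document, about 5.0
>       (58666 × 7711 × 24) ÷ 1E9    ⍝ 7500-line document, about 10.9
> ```
>
> By this estimate even the smaller run is above the 2 GB default, so treat
> the numbers as a rough guide to the size of the Boolean matrix rather than
> as a prediction of exactly when WS FULL occurs.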
>
> The workaround for really large Boolean arrays is to pack them into the
> 64 bits of a GNU APL integer and maybe use the bitwise functions
> (⊤∧, ⊤∨, ...) of GNU APL to access them group-wise.
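>
> A minimal sketch of such packing, assuming ⊤∧ is the dyadic bitwise AND
> named above. The example uses 4-bit groups for readability; groups of up
> to 32 bits work the same way and keep 2⊥ well inside the 64-bit integer
> range. The names a, b and both are made up for this example.
>
> ```apl
>       a ← 2⊥1 0 1 1            ⍝ pack four Booleans into one integer (11)
>       b ← 2⊥0 1 1 0            ⍝ a second packed group (6)
>       both ← a ⊤∧ b            ⍝ bitwise AND of the two groups in one cell
>       (4⍴2)⊤both               ⍝ unpack back to Booleans, high bit first
>       +/(32⍴2)⊤both            ⍝ number of 1 bits in a group of up to 32 bits
> ```
>
> Packing several Booleans per cell reduces the storage by that factor, at
> the cost of explicit pack and unpack steps around each access.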
>
> Best Regards,
> Jürgen
>
>
> On 12/28/21 3:53 AM, Russtopia wrote:
>
> Hi, doing some experiments in learning APL I was writing a word frequency
> count program that takes in a document, identifies unique words and then
> outputs the top 'N' occurring words.
>
> The most straightforward solution, to me, seems to be ∘.≡ which works up
> to a certain dataset size. The main limiting statement in my program is
>
> wordcounts←+⌿ (wl ∘.≡ uniqwords)
>
> .. which generates a large boolean array which is then tallied up for each
> unique word.
>
> I seem to run into a limit in GNU APL. I do not see an obvious ⎕SYL
> parameter to increase the limit and could not find any obvious reference in
> the docs either. What are the absolute max rows/columns of a matrix, and
> can the limit be increased? Are they separate or a combined limit?
>
>       5 wcOuterProd 'corpus/135-0-5000.txt'    ⍝⍝ 5000-line document
> Time: 26419 ms
>   the   of   a and  to
>  2646 1348 978 879 858
>       ⍴wl
> 36564
>       ⍴ uniqwords
> 5695
>
>       5 wcOuterProd 'corpus/135-0-7500.txt'   ⍝⍝ 7500-line document
> WS FULL+
> wcOuterProd[8]  wordcounts←+⌿(wl∘.≡uniqwords)
>                               ^             ^
>       ⍴ wl
> 58666
>       ⍴ uniqwords
> 7711
>
>
> I have an iterative solution which doesn't use a boolean matrix to count
> the words, instead looping through using pick/take, and so can handle much
> larger documents, but it takes roughly 2x the execution time.
>
> Relating to this, does GNU APL optimize boolean arrays to minimize storage
> (i.e., using larger bit vectors rather than entire ints per bool) and is
> there any clever technique other experienced APLers could suggest to
> maintain the elegant 'loop-free' style of computing but avoid generating
> such large bool matrices? I thought of perhaps a hybrid approach where I
> iterate through portions of the data and do partial ∘.≡ passes but of
> course that complicates the algorithm.
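>
> One loop-free possibility, offered as a sketch rather than a tuned
> solution: replace the outer product by dyadic ⍳, which needs only one
> integer per word, and derive the counts from the sorted indices. It
> assumes ⎕IO←1 and that uniqwords is ∪wl, so that every index occurs at
> least once; idx, s, first and starts are names made up for this example.
>
> ```apl
>       wl ← 'the' 'cat' 'the' 'dog' 'the' 'cat'
>       uniqwords ← ∪wl                ⍝ 'the' 'cat' 'dog'
>       idx ← uniqwords ⍳ wl           ⍝ 1 2 1 3 1 2
>       s ← idx[⍋idx]                  ⍝ 1 1 1 2 2 3
>       first ← s ≠ ¯1↓0,s             ⍝ 1 0 0 1 0 1, marks the start of each run
>       starts ← first/⍳⍴s             ⍝ 1 4 6
>       wordcounts ← (1↓starts,1+⍴s) - starts   ⍝ run lengths: 3 2 1
> ```
>
> The counts come out in the same order as uniqwords, so the ⍒ and ↑ steps
> of wcOuterProd can stay as they are. Every temporary has only ⍴wl
> elements; whether this is also faster depends on how dyadic ⍳ performs on
> nested vectors, which has not been measured here.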
>
> [my 'outer product' and 'iterative' versions of the code are below]
>
> Thanks,
> -Russ
>
> ---
> #!/usr/local/bin/apl --script
>  ⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝
> ⍝                                                                    ⍝
> ⍝ wordcount.apl                        2021-12-26  20:07:07 (GMT-8)  ⍝
> ⍝                                                                    ⍝
>  ⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝
>
> ⍝ function edif has ufun1 pointer 0!
>
> ∇r ← timeMs; t
>   t ← ⎕TS
>   r ← (((t[3]×86400)+(t[4]×3600)+(t[5]×60)+(t[6]))×1000)+t[7]
> ∇
>
> ∇r ← lowerAndStrip s;stripped;mixedCase
>  ⍝ 15 blanks (for newline, blank and 13 punctuation marks), a-z twice, '*' for anything else
>  stripped ← (15⍴' '),'abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz*'
>  mixedCase ← ⎕av[11],' ,.?!;:"''()[]-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
>  r ← stripped[mixedCase ⍳ s]
> ∇
>
> ∇c ← h wcIterative fname
>   ⍝⍝;D;WL;idx;len;word;wc;wcl;idx
>   ⍝⍝ Return ⍒-sorted count of unique words in string vector D, ignoring case and punctuation
>   ⍝⍝ @param h(⍺) - how many top word counts to return
>   ⍝⍝ @param D(⍵) - vector of words
>   ⍝⍝⍝⍝
>   D ← lowerAndStrip (⎕fio['read_file'] fname)  ⍝ raw text with newlines
>   timeStart ← timeMs
>   D ← (~ D ∊ ' ') ⊂ D ⍝ make into a vector of words
>   WL ← ∪D
>   ⍝⍝#DEBUG# ⎕ ← 'unique words:',WL
>   wcl ← 0⍴0
>   idx ← 1
>   len ← ⍴WL
> count:
>   ⍝⍝#DEBUG# ⎕ ← idx
>   →(idx>len)/done
>   word ← ⊂idx⊃WL
>   ⍝⍝#DEBUG# ⎕ ← word
>   wc ← +/(word≡¨D)
>   wcl ← wcl,wc
>   ⍝⍝#DEBUG# ⎕ ← wcl
>   idx ← 1+idx
>   → count
> done:
>   c ← h↑[2] (WL)[⍒wcl],[0.5]wcl[⍒wcl]
>   timeEnd ← timeMs
>   ⎕ ← 'Time:',(timeEnd-timeStart),'ms'
> ∇
>
> ∇r ← n wcOuterProd fname
>   ⍝⍝ ;D;wl;uniqwords;wordcounts;sortOrder
>   D ← lowerAndStrip (⎕fio['read_file'] fname)  ⍝ raw text with newlines
>   timeStart ← timeMs
>   wl ← (~ D ∊ ' ') ⊂ D
>   ⍝⍝#DEBUG# ⎕ ← '⍴ wl:', ⍴ wl
>   uniqwords ← ∪wl
>   ⍝⍝#DEBUG# ⎕ ← '⍴ uniqwords:', ⍴ uniqwords
>   wordcounts ← +⌿(wl ∘.≡ uniqwords)
>   sortOrder ← ⍒wordcounts
>   r ← n↑[2] uniqwords[sortOrder],[0.5]wordcounts[sortOrder]
>   timeEnd ← timeMs
>   ⎕ ← 'Time:',(timeEnd-timeStart),'ms'
> ∇
>
>
>
>
