Since the Linux "free" command knows the difference between how much real memory vs. virtual memory is available, it may be useful to use a similar method.
Blake On Tue, Dec 28, 2021 at 11:34 AM Dr. Jürgen Sauermann < mail@jürgen-sauermann.de> wrote: > Hi Russ, > > it has turned out to be very difficult to find a reliable way to figure > the memory that is available for > a process like the GNU APL interpreter. One (of several) reasons for that > is that GNU linux tells you > that it has more memory than it really has (so-called over-commitment). > You can malloc() more memory > than you have and malloc will return a virtual memory (and no error) when > you ask for it. However, if you > access the virtual memory at a later point in time (i.e. requesting real > memory for it) then the process > may crash badly if that real memory is not available. > > GNU APL deals with this by assuming (by default) that the total memory > available for the GNU APL process is about 2 GB, > > You can increase that amount (outside apl) be setting a larger memory > limit, probably with: > > *ulimit --virtual-memory-size* or *ulimit -v *(depending on platform). > > in the shell or script before starting apl. However, note that: > > * *ulimit -v* *ulimited* will not work because this is the default and > GNU APL will apply its default of 2 GB in this case, > * the WS FULL behaviour of GNU APL (ans ⎕WA) will become unreliable. You > must ensure that GNU APL will get > as much memory as you promised it with ulimit, and > * all GNU APL cells have the same size (e.g. 24 byte on a 64-bit CPU, see *apl > -l37*) even if the cells contain only > Booleans. > > The workaround for really large Boolean arrays is to pack them into the > 64-bits of a GNU APL integer > and maybe use the bitwise functions (⊤∧, ⊤∨, ...) of GNU APL to access > them group-wise. > > Best Regards, > Jürgen > > > On 12/28/21 3:53 AM, Russtopia wrote: > > Hi, doing some experiments in learning APL I was writing a word frequency > count program that takes in a document, identifies unique words and then > outputs the top 'N' occurring words. > > The most straightforward solution, to me, seems to be ∘.≡ which works up > to a certain dataset size. The main limiting statement in my program is > > wordcounts←+⌿ (wl ∘.≡ uniqwords) > > .. which generates a large boolean array which is then tallied up for each > unique word. > > I seem to run into a limit in GNU APL. I do not see an obvious ⎕SYL > parameter to increase the limit and could not find any obvious reference in > the docs either. What are the absolute max rows/columns of a matrix, and > can the limit be increased? Are they separate or a combined limit? > > 5 wcOuterProd 'corpus/135-0-5000.txt' ⍝⍝ 5000-line document > Time: 26419 ms > the of a and to > 2646 1348 978 879 858 > ⍴wl > 36564 > ⍴ uniqwords > 5695 > > 5 wcOuterProd 'corpus/135-0-7500.txt' ⍝⍝ 7500-line document > WS FULL+ > wcOuterProd[8] wordcounts←+⌿(wl∘.≡uniqwords) > ^ ^ > ⍴ wl > 58666 > ⍴ uniqwords > 7711 > > > I have an iterative solution which doesn't use a boolean matrix to count > the words, rather looping through using pick/take and so can handle much > larger documents, but it takes roughy 2x the execution time. > > Relating to this, does GNU APL optimize boolean arrays to minimize storage > (ie., using larger bit vectors rather than entire ints per bool) and is > there any clever technique other experience APLers could suggest to > maintain the elegant 'loop-free' style of computing but avoid generating > such large bool matrices? I thought of perhaps a hybrid approach where I > iterate through portions of the data and do partial ∘.≡ passes but of > course that complicates the algorithm. > > [my 'outer product' and 'iterative' versions of the code are below] > > Thanks, > -Russ > > --- > #!/usr/local/bin/apl --script > ⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝ > ⍝ ⍝ > ⍝ wordcount.apl 2021-12-26 20:07:07 (GMT-8) ⍝ > ⍝ ⍝ > ⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝ > > ⍝ function edif has ufun1 pointer 0! > > ∇r ← timeMs; t > t ← ⎕TS > r ← (((t[3]×86400)+(t[4]×3600)+(t[5]×60)+(t[6]))×1000)+t[7] > ∇ > > ∇r ← lowerAndStrip s;stripped;mixedCase > stripped ← ' > abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz*' > mixedCase ← ⎕av[11],' > ,.?!;:"''()[]-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz' > r ← stripped[mixedCase ⍳ s] > ∇ > > ∇c ← h wcIterative fname > ⍝⍝;D;WL;idx;len;word;wc;wcl;idx > ⍝⍝ Return ⍒-sorted count of unique words in string vector D, ignoring > case and punctuation > ⍝⍝ @param h(⍺) - how many top word counts to return > ⍝⍝ @param D(⍵) - vector of words > ⍝⍝⍝⍝ > D ← lowerAndStrip (⎕fio['read_file'] fname) ⍝ raw text with newlines > timeStart ← timeMs > D ← (~ D ∊ ' ') ⊂ D ⍝ make into a vector of words > WL ← ∪D > ⍝⍝#DEBUG# ⎕ ← 'unique words:',WL > wcl ← 0⍴0 > idx ← 1 > len ← ⍴WL > count: > ⍝⍝#DEBUG# ⎕ ← idx > →(idx>len)/done > word ← ⊂idx⊃WL > ⍝⍝#DEBUG# ⎕ ← word > wc ← +/(word≡¨D) > wcl ← wcl,wc > ⍝⍝#DEBUG# ⎕ ← wcl > idx ← 1+idx > → count > done: > c ← h↑[2] (WL)[⍒wcl],[0.5]wcl[⍒wcl] > timeEnd ← timeMs > ⎕ ← 'Time:',(timeEnd-timeStart),'ms' > ∇ > > ∇r ← n wcOuterProd fname > ⍝⍝ ;D;wl;uniqwords;wordcounts;sortOrder > D ← lowerAndStrip (⎕fio['read_file'] fname) ⍝ raw text with newlines > timeStart ← timeMs > wl ← (~ D ∊ ' ') ⊂ D > ⍝⍝#DEBUG# ⎕ ← '⍴ wl:', ⍴ wl > uniqwords ← ∪wl > ⍝⍝#DEBUG# ⎕ ← '⍴ uniqwords:', ⍴ uniqwords > wordcounts ← +⌿(wl ∘.≡ uniqwords) > sortOrder ← ⍒wordcounts > r ← n↑[2] uniqwords[sortOrder],[0.5]wordcounts[sortOrder] > timeEnd ← timeMs > ⎕ ← 'Time:',(timeEnd-timeStart),'ms' > ∇ > > > >