Hi Hans, I will try that! Thank you.
By the way... I notice all APL non-ascii symbols seem to be replaced
by just '?' in my mailing list archive emails.. is this is an issue
with my mail provider, or is the UTF-8/APL/whatever encoding of
mailing list posts being lost in the daily Digests?
On Tue, 28 Dec 2021 at 13:15, <bug-apl-requ...@gnu.org> wrote:
Send Bug-apl mailing list submissions to
bug-apl@gnu.org
To subscribe or unsubscribe via the World Wide Web, visit
https://lists.gnu.org/mailman/listinfo/bug-apl
or, via email, send a message with subject or body 'help' to
bug-apl-requ...@gnu.org
You can reach the person managing the list at
bug-apl-ow...@gnu.org
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Bug-apl digest..."
Today's Topics:
1. Re: Absolute limits of rank 2 bool matrix size in GNU APL?
(Hans-Peter Sorge)
2. Re: Absolute limits of rank 2 bool matrix size in GNU APL?
(Dr. J?rgen Sauermann)
----------------------------------------------------------------------
Message: 1
Date: Tue, 28 Dec 2021 20:42:05 +0100
From: Hans-Peter Sorge <hanspeterso...@netscape.net>
To: bug-apl@gnu.org
Subject: Re: Absolute limits of rank 2 bool matrix size in GNU APL?
Message-ID: <98a89ead-6880-9b91-4086-c76f93691...@netscape.net>
Content-Type: text/plain; charset="utf-8"; Format="flowed"
Hi Rus,
looks like the outer product is a not needed - you have the unique
words
and along the line you got the word count too.
you take the sorted word vector
swv ?'aa' 'bb' 'bb' 'cc' 'cc' 'ff' 'gg'
then you create a partition vector from it
pv?+\1,~2?/swv
pv
1 2 2 3 3 4 5
partition for wc
?????pv?pv
1 ?2 2 ?3 3 ?4 ?5
Then the wc is
wc???? pv ? pv
?????wc
and the unique words are
uw?1?? pv ? swv
?????uw
aa bb cc ff gg
finally the listing of occurrences
??????uw,?wc
aa 1
bb 2
cc 2
ff 1
gg 1
Best Regards
Hans-Peter
Am 28.12.21 um 03:53 schrieb Russtopia:
> Hi, doing some experiments in learning APL I was writing a word
> frequency count program that takes in a document, identifies unique
> words and then outputs the top 'N' occurring words.
>
> The most straightforward solution, to me, seems to be ?.? which
works
> up to a certain dataset size. The main limiting statement in my
program is
>
> wordcounts?+? (wl ?.? uniqwords)
>
> .. which generates a large boolean array which is then tallied
up for
> each unique word.
>
> I seem to run into a limit in GNU APL. I do not see an obvious ?SYL
> parameter to increase the limit and could not find any obvious
> reference in the docs either. What are the absolute max
rows/columns
> of a matrix, and can the limit be increased? Are they separate or a
> combined limit?
>
> 5 wcOuterProd 'corpus/135-0-5000.txt'? ? ?? 5000-line document
> Time: 26419 ms
> ? the ? of ? a and ?to
> ?2646 1348 978 879 858
> ? ? ? ?wl
> 36564
> ? ? ? ? uniqwords
> 5695
>
> 5 wcOuterProd 'corpus/135-0-7500.txt'? ??? 7500-line document
> WS FULL+
> wcOuterProd[8] ?wordcounts?+?(wl?.?uniqwords)
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ^ ? ? ? ? ? ? ^
> ? ? ? ? wl
> 58666
> ? ? ? ? uniqwords
> 7711
>
>
> I have an iterative solution which doesn't use a boolean matrix to
> count the words, rather looping through using pick/take and so can
> handle much larger documents, but it takes roughy 2x the
execution time.
>
> Relating to this, does GNU APL optimize boolean arrays to minimize
> storage (ie., using larger bit vectors rather than entire ints per
> bool) and is there any clever technique other experience APLers
could
> suggest to maintain the elegant 'loop-free' style of computing but
> avoid generating such large bool matrices? I thought of perhaps a
> hybrid approach where I iterate through portions of the data and do
> partial ?.? passes but of course that complicates the algorithm.
>
> [my 'outer product' and 'iterative' versions of the code are below]
>
> Thanks,
> -Russ
>
> ---
> #!/usr/local/bin/apl --script
>
?????????????????????????????????????????????????????????????????????
> ? ? ? ? ? ??
> ? wordcount.apl ? ? ? ? ? ? ? ? ? ? ? ?2021-12-26 ?20:07:07
(GMT-8) ??
> ? ? ? ? ? ??
>
?????????????????????????????????????????????????????????????????????
>
> ? function edif has ufun1 pointer 0!
>
> ?r ? timeMs; t
> ? t ? ?TS
> ? r ? (((t[3]?86400)+(t[4]?3600)+(t[5]?60)+(t[6]))?1000)+t[7]
> ?
>
> ?r ? lowerAndStrip s;stripped;mixedCase
> ?stripped ? ' abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz*'
> ?mixedCase ? ?av[11],'
> ,.?!;:"''()[]-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
> ?r ? stripped[mixedCase ? s]
> ?
>
> ?c ? h wcIterative fname
> ? ??;D;WL;idx;len;word;wc;wcl;idx
> ? ?? Return ?-sorted count of unique words in string vector D,
> ignoring case and punctuation
> ? ?? @param h(?) - how many top word counts to return
> ? ?? @param D(?) - vector of words
> ? ????
> ? D ? lowerAndStrip (?fio['read_file'] fname) ?? raw text with
newlines
> ? timeStart ? timeMs
> ? D ? (~ D ? ' ') ? D ? make into a vector of words
> ? WL ? ?D
> ? ??#DEBUG# ? ? 'unique words:',WL
> ? wcl ? 0?0
> ? idx ? 1
> ? len ? ?WL
> count:
> ? ??#DEBUG# ? ? idx
> ? ?(idx>len)/done
> ? word ? ?idx?WL
> ? ??#DEBUG# ? ? word
> ? wc ? +/(word??D)
> ? wcl ? wcl,wc
> ? ??#DEBUG# ? ? wcl
> ? idx ? 1+idx
> ? ? count
> done:
> ? c ? h?[2] (WL)[?wcl],[0.5]wcl[?wcl]
> ? timeEnd ? timeMs
> ? ? ? 'Time:',(timeEnd-timeStart),'ms'
> ?
>
> ?r ? n wcOuterProd fname
> ? ?? ;D;wl;uniqwords;wordcounts;sortOrder
> ? D ? lowerAndStrip (?fio['read_file'] fname) ?? raw text with
newlines
> ? timeStart ? timeMs
> ? wl ? (~ D ? ' ') ? D
> ? ??#DEBUG# ? ? '? wl:', ? wl
> ? uniqwords ? ?wl
> ? ??#DEBUG# ? ? '? uniqwords:', ? uniqwords
> ? wordcounts ? +?(wl ?.? uniqwords)
> ? sortOrder ? ?wordcounts
> ? r ? n?[2] uniqwords[sortOrder],[0.5]wordcounts[sortOrder]
> ? timeEnd ? timeMs
> ? ? ? 'Time:',(timeEnd-timeStart),'ms'
> ?
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<https://lists.gnu.org/archive/html/bug-apl/attachments/20211228/f246993a/attachment.htm>
------------------------------
Message: 2
Date: Tue, 28 Dec 2021 22:15:44 +0100
From: Dr. J?rgen Sauermann <m...@xn--jrgen-sauermann-zvb.de>
To: Blake McBride <blake1...@gmail.com>, Elias M?rtenson
<loke...@gmail.com>
Cc: bug-apl <bug-apl@gnu.org>, Russtopia <rma...@gmail.com>
Subject: Re: Absolute limits of rank 2 bool matrix size in GNU APL?
Message-ID:
<dbe63e06-ae3a-74af-1a87-d90557991...@xn--jrgen-sauermann-zvb.de>
Content-Type: text/plain; charset=utf-8; format=flowed
Hi,
I have spent many hours to do a proper memory handling. The current
solution is not perfect but will work
in many cases. If anybody has suggestions how to improve then I'll be
happy to use them.
The current approach is roughly this (if I remember correctly)
1. if the user specifies a *ulimit -v* other than *unlimited* then
use
it and assume it will be available
at any time. Otherwise assume a limit of 2 GB memory.
2. decrease the limit a little so that C++ exception handlers can
allocate their memory.
3. during apl execution, keep track of the amunt of memory used and
released by *operator new
*(= essentially malloc), which is roughly the physical memory used
by apl.
4. When the limit is reached, raise WS FULL.
Before that, the apl process was sometimes killed (un-catchable)
before
it could raise WS FULL,
which is IMHO worse than raising WS FULL a little too early (i.e.
before
the memory is really
exhausted. The current approach means in Blake's terms, that we
try our
best to stay at Level 1.
You can actually go directly from Level 1 to Level 3 by creating a
large
apl value, so any solution
for only one level does not really help.
As far as I can see, the free utility is just a nicer front-end to
*/proc/meminfo*.
We cannot check */proc/meminfo* before every memory allocation, so
we do
it at start-up and trust
our own memory bookkeeping. There is a little slack caused by malloc,
but our safety-margin 2.
above handles that.
I tried earlier to depend on paging, but I was running into problems
(crashes instead of paging
even though the paging space was huge enough).
Also, conceptually *?WA *becomes useless if it can change without
anything happening in apl
(e.g, when other processes use memory). In theory we could not only
allocate but also
use all memory beforehand, but that is not very practical if it is
not
used. In the old days,
a process would know how much memory it has, but nowadays it does not.
Best Regards,
J?rgen
On 12/28/21 8:25 PM, Blake McBride wrote:
> I think what you're talking about and what I am talking about
are not
> exactly the same thing.? I can see what you are saying, but that
is a
> contrived example.? For example:
>
> Level 1:? you are using the RAM that exists (not over-committed)
>
> Level 2:? you are using more RAM than you have causing
over-commit and
> paging.
>
> Level 3:? you allocate more memory than exists in RAM?+ paging.
>
> I am talking about Levels 1 and 2.? You are talking about Level
3.? At
> least in my world, I have never hit Level 3 because Level 2 becomes
> intolerable slow and I kill the process myself.? For this
reason, in
> my world, Level 3 never occurs.
>
> It doesn't make sense to handle Level 2 problems with Level 3 logic.
>
> Blake
>
>
> On Tue, Dec 28, 2021 at 1:16 PM Elias M?rtenson <loke...@gmail.com
> <mailto:loke...@gmail.com>> wrote:
>
> Actually, the default memory allocator in glibc uses mmap
and not
> sbrk, if I'm not mistaken. However, they both do the same
thing at
> the end of the day, and it's to request more pages from the
kernel.
>
> Unfortunately, Linux malloc never returns NULL. Even if you
try to
> allocate a petabyte in one allocation. You can write to this
> memory and new pages will be created as you write, and at some
> point your write will fail with a SEGV because there are no free
> page left.
>
> What's even worse is that once you start to fail, the kernel
will
> randomly kill processes until it gets more free pages. If that
> sounds idiotic, that's because it is, but that's how it works.
>
> You can configure the kernel to use not do this, but
unfortunately
> you're going to have other problems if you do, primarily because
> all Linux software is written with the default behaviour in
mind.
>
> You can try it right now. Write a C program that allocates a few
> TB of RAM and see what happens. Actually, nothing will happen
> until you start writing to this memory.
>
> Den ons 29 dec. 2021 03:09Blake McBride <blake1...@gmail.com
> <mailto:blake1...@gmail.com>> skrev:
>
> Hi J?rgen,
>
> First, to be sure you know where I'm coming from.? I know
> malloc() and sbrk() pretty well.? I've written
malloc's?before.
>
> It is utter news to me that malloc would return a
pointer that
> you can't use - even in over-commit?situations.?
> Theoretically, in over-commit situations, if you access a
> pointer that's paged out, the system should page it in
first.?
> If that doesn't happen, that's a bug in the OS.? I'd
create a
> simple program that mimics this behavior and submit it as a
> bug report.? (What I almost always find is that during the
> process of creating the sample program, I find the
actual bug
> in my own program.)
>
> What I'd try to do is write my own malloc that does the
following:
>
> 1. Checks for the amount of real memory available using
> something I took from the /free/ command.
>
> 2. If there is enough memory, call the real malloc and
return it.
>
> 3. If there is not enough real memory, rather than calling
> malloc and causing paging you can just return a null.
>
> Of course, I understand that sbrk() adjusts the system
> breakpoint essentially?giving you more heap space. Malloc
> manages that free space that's already been taken from
the OS
> via sbrk().
>
> With regard to item #1, you'd have to have an
intelligent way
> of determining when to calculate the true available space.?
> You can't do it for 16 bytes, and you don't want to do
it for
> every allocation.? On the other hand for large
allocations or
> if you haven't queried the system for some time, it makes
> sense to update your opinion of the available space.
>
> Thanks.
>
> Blake
>
>
>
>
> On Tue, Dec 28, 2021 at 12:50 PM Dr. J?rgen Sauermann
> <mail@j?rgen-sauermann.de <http://rgen-sauermann.de>
> <mailto:mail@j%C3%BCrgen-sauermann.de
<http://sauermann.de>>> wrote:
>
> Hi Blake,
>
> I know. The problem is that other processes may, just as
> GNU APL does, allocate memory
> successfully via malloc(). malloc, on "success"
returns a
> non-zero virtual memory address.
>
> The fact that malloc returns a non-zero address (=
> indicating malloc() success) does unfortunately
> not guarantee that you would be able to access that
> memory. You may or may
> not get a page fault when accessing the memory. The
point
> in time when you call 'free' or
> similar tools (which basically read /proc/memory
tell you
> the situation at that time which can
> be very different from the situation when you fill your
> virtual memory with real data (which
> assigns a physical memory page to the successfully
> allocated virtual memory page) and
> that may fail. Even worse, when this happen then your
> machine has no more memory and
> the C++ exception handlers will also fails due to
lack of
> memory and the process is terminated
> without any chance to raise a WS full or even remain
in apl.
>
> Best Regards,
> J?rgen
>
>
> On 12/28/21 6:52 PM, Blake McBride wrote:
>> Since the Linux "free" command knows the difference
>> between how much real memory vs. virtual memory is
>> available, it may be useful to use a similar method.
>>
>> Blake
>>
>>
>> On Tue, Dec 28, 2021 at 11:34 AM Dr. J?rgen Sauermann
>> <mail@j?rgen-sauermann.de <http://rgen-sauermann.de>
>> <mailto:mail@j%C3%BCrgen-sauermann.de
<http://sauermann.de>>> wrote:
>>
>> Hi Russ,
>>
>> it has turned out to be very difficult to find a
>> reliable way to figure the memory that is
available for
>> a process like the GNU APL interpreter. One (of
>> several) reasons for that is that GNU linux
tells you
>> that it has more memory than it really has
(so-called
>> over-commitment). You can malloc() more memory
>> than you have and malloc will return a virtual
memory
>> (and no error) when you ask for it. However, if you
>> access the virtual memory at a later point in time
>> (i.e. requesting real memory for it) then the
process
>> may crash badly if that real memory is not
available.
>>
>> GNU APL deals with this by assuming (by
default) that
>> the total memory available for the GNU APL
process is
>> about 2 GB,
>>
>> You can increase that amount (outside apl) be
setting
>> a larger memory limit, probably with:
>>
>> *ulimit --virtual-memory-size*or *ulimit -v
>> *(depending on platform).
>>
>> in the shell or script before starting apl.
However,
>> note that:
>>
>> * *ulimit -v* *ulimited* will not work because this
>> is the default and GNU APL will apply its
default of
>> 2 GB in this case,
>> * the WS FULL behaviour of GNU APL (ans ?WA) will
>> become unreliable. You must ensure that GNU APL
will get
>> ? as much memory as you promised it with
ulimit, and
>> * all GNU APL cells have the same size (e.g. 24
byte
>> on a 64-bit CPU, see *apl -l37*) even if the cells
>> contain only
>> ?Booleans.
>>
>> The workaround for really large Boolean arrays
is to
>> pack them into the 64-bits of a GNU APL integer
>> and maybe use the bitwise functions (??, ??,
...) of
>> GNU APL to access them group-wise.
>>
>> Best Regards,
>> J?rgen
>>
>>
>> On 12/28/21 3:53 AM, Russtopia wrote:
>>> Hi, doing some experiments in learning APL I was
>>> writing a word frequency count program that
takes in
>>> a document, identifies unique words and then
outputs
>>> the top 'N' occurring words.
>>>
>>> The most straightforward solution, to me, seems to
>>> be ?.? which works up to a certain dataset
size. The
>>> main limiting statement in my program is
>>>
>>> wordcounts?+? (wl ?.? uniqwords)
>>>
>>> .. which generates a large boolean array which is
>>> then tallied up for each unique word.
>>>
>>> I seem to run into a limit in GNU APL. I do
not see
>>> an obvious ?SYL parameter to increase the
limit and
>>> could not find any obvious reference in the docs
>>> either. What are the absolute max rows/columns
of a
>>> matrix, and can the limit be increased? Are they
>>> separate or a combined limit?
>>>
>>> ? ? ? 5 wcOuterProd 'corpus/135-0-5000.txt'? ? ??
>>> 5000-line document
>>> Time: 26419 ms
>>> ? the ? of ? a and ?to
>>> ?2646 1348 978 879 858
>>> ? ? ? ?wl
>>> 36564
>>> ? ? ? ? uniqwords
>>> 5695
>>>
>>> ? ? ? 5 wcOuterProd 'corpus/135-0-7500.txt'? ???
>>> 7500-line document
>>> WS FULL+
>>> wcOuterProd[8] ?wordcounts?+?(wl?.?uniqwords)
>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ^ ? ? ? ? ? ^
>>> ? ? ? ? wl
>>> 58666
>>> ? ? ? ? uniqwords
>>> 7711
>>>
>>>
>>> I have an iterative solution which doesn't use a
>>> boolean matrix to count the words, rather looping
>>> through using pick/take and so can handle much
>>> larger documents, but it takes roughy 2x the
>>> execution time.
>>>
>>> Relating to this, does GNU APL optimize boolean
>>> arrays to minimize storage (ie., using larger bit
>>> vectors rather than entire ints per bool) and is
>>> there any clever technique other experience APLers
>>> could suggest to maintain the elegant 'loop-free'
>>> style of computing but avoid generating such large
>>> bool matrices? I thought of perhaps a hybrid
>>> approach where I iterate through portions of the
>>> data and do partial ?.? passes but of course that
>>> complicates the algorithm.
>>>
>>> [my 'outer product' and 'iterative' versions
of the
>>> code are below]
>>>
>>> Thanks,
>>> -Russ
>>>
>>> ---
>>> #!/usr/local/bin/apl --script
>>>
?????????????????????????????????????????????????????????????????????
>>> ? ??
>>> ? wordcount.apl ? ? ?2021-12-26 ?20:07:07
(GMT-8) ??
>>> ? ??
>>>
?????????????????????????????????????????????????????????????????????
>>>
>>> ? function edif has ufun1 pointer 0!
>>>
>>> ?r ? timeMs; t
>>> ? t ? ?TS
>>> ? r ?
>>> (((t[3]?86400)+(t[4]?3600)+(t[5]?60)+(t[6]))?1000)+t[7]
>>> ?
>>>
>>> ?r ? lowerAndStrip s;stripped;mixedCase
>>> ?stripped ? '
>>> abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz*'
>>> ?mixedCase ? ?av[11],'
>>>
,.?!;:"''()[]-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
>>> ?r ? stripped[mixedCase ? s]
>>> ?
>>>
>>> ?c ? h wcIterative fname
>>> ? ??;D;WL;idx;len;word;wc;wcl;idx
>>> ? ?? Return ?-sorted count of unique words in
string
>>> vector D, ignoring case and punctuation
>>> ? ?? @param h(?) - how many top word counts to
return
>>> ? ?? @param D(?) - vector of words
>>> ? ????
>>> ? D ? lowerAndStrip (?fio['read_file'] fname)
?? raw
>>> text with newlines
>>> ? timeStart ? timeMs
>>> ? D ? (~ D ? ' ') ? D ? make into a vector of
words
>>> ? WL ? ?D
>>> ? ??#DEBUG# ? ? 'unique words:',WL
>>> ? wcl ? 0?0
>>> ? idx ? 1
>>> ? len ? ?WL
>>> count:
>>> ? ??#DEBUG# ? ? idx
>>> ? ?(idx>len)/done
>>> ? word ? ?idx?WL
>>> ? ??#DEBUG# ? ? word
>>> ? wc ? +/(word??D)
>>> ? wcl ? wcl,wc
>>> ? ??#DEBUG# ? ? wcl
>>> ? idx ? 1+idx
>>> ? ? count
>>> done:
>>> ? c ? h?[2] (WL)[?wcl],[0.5]wcl[?wcl]
>>> ? timeEnd ? timeMs
>>> ? ? ? 'Time:',(timeEnd-timeStart),'ms'
>>> ?
>>>
>>> ?r ? n wcOuterProd fname
>>> ? ?? ;D;wl;uniqwords;wordcounts;sortOrder
>>> ? D ? lowerAndStrip (?fio['read_file'] fname)
?? raw
>>> text with newlines
>>> ? timeStart ? timeMs
>>> ? wl ? (~ D ? ' ') ? D
>>> ? ??#DEBUG# ? ? '? wl:', ? wl
>>> ? uniqwords ? ?wl
>>> ? ??#DEBUG# ? ? '? uniqwords:', ? uniqwords
>>> ? wordcounts ? +?(wl ?.? uniqwords)
>>> ? sortOrder ? ?wordcounts
>>> ? r ? n?[2]
>>> uniqwords[sortOrder],[0.5]wordcounts[sortOrder]
>>> ? timeEnd ? timeMs
>>> ? ? ? 'Time:',(timeEnd-timeStart),'ms'
>>> ?
>>>
>>>
>>
>
------------------------------
Subject: Digest Footer
_______________________________________________
Bug-apl mailing list
Bug-apl@gnu.org
https://lists.gnu.org/mailman/listinfo/bug-apl
------------------------------
End of Bug-apl Digest, Vol 99, Issue 13
***************************************