Juergen
This is useful -- I was looking at LApack.cc already. It is in line with
what I need (as a template).
I am not worried about saving these things, but I have a 3000000x300
array of C float, and multiply each of the 3 million rows by a
300-element vector in a "typical" processing step. I don't want to
convert to C double (that would double the memory from 3.6GB to
7.2GB). I don't really want to copy the data at all! I can generate a
descriptor to the data (memory pointer, dimensions). I think I want to
plant the data into a shared memory region (and, in future, pass it to
a GPU).
I think I want to perform some specific functions on the data -- right
now I pass row sets into GNU APL using the API, and execute APL code
through the API. However, control is exclusively from outside APL,
meaning I cannot analyze experimentally from within APL.
I can work from the model given by LApack.cc, and supply some
functions which (basically) provide a "virtual memory/workspace".
The main problem with these array sizes is saving and loading -- this
array would be around 30GB in GNU APL (as far as I can tell). If ever
saved, it would then take 300GB. I could convert from float to double
and create the Cell structures, but I would want to simply mmap() the
thing into GNU APL (and, of course, never have it participate in
memory management). Again, I was leaning towards partial mapping,
because when I start with tensors, the arrays will be sparse.
So, two real problems -- (1) how to deal with LARGE non-sparse
matrices, and (2) how to deal with LARGE sparse matrices.
I really like the expression afforded by APL.
It may be possible to use the APL parser, and provide new
implementations of primitives -- thanks for that idea.
LApack.cc seems to provide something I can start with -- the actual
LARGE arrays won't change, so this provides a good demarcation point
and a start for something workable.
Thanks!
Fred Weigel



On Sat, 2017-04-29 at 13:04 +0200, Juergen Sauermann wrote:
> Hi Fred,
> 
> I have not fully understood what you want to do exactly, but it looks
> to me as if you want to go for native GNU APL functions. Native
> functions provide the means to bypass the GNU APL interpreter itself
> to the extent desired. For example you can use APL variables but not
> the APL parser, or the APL parser but not the implementation of
> primitives, or whatever else you are up to.
> 
> As to plain double vectors, it is very difficult to introduce them as
> a new built-in data type because that change would affect: every APL
> primitive, every APL operator, )LOAD, )SAVE, )DUMP, and a lot more.
> 
> However, you can have a look at (the top level of) the implementation
> of the matrix divide primitive, which is doing what you are maybe
> after. The implementation of matrix divide expects either a double
> vector or a complex<double> vector as argument(s) and returns such a
> vector as result. Before and after the computation of matrix divide,
> a conversion between APL values and the plain double or complex
> vector is performed. This conversion is very lightweight. If you have
> a homogeneous GNU APL value, say all ravel items being double, then
> that value is almost like a C double *. The difference is a space
> between adjacent ravel elements. In other words (expressed in APL):
> 
> C_vector ←→ 1 0 1 0 ... / APL_vector
> 
> I can provide you with more information if you want to go along this
> path.
> 
> /// Jürgen
> 
> On 04/29/2017 03:19 AM, Fred Weigel wrote:
> 
> > Juergen, and other GNU APL experts.
> > 
> > I am exploring neural nets, word2vec and some other AI-related
> > areas.
> > 
> > Right now, I want to tie in google's word2vec trained models (the
> > billion word one, GoogleNews-vectors-negative300.bin.gz).
> > 
> > This is a binary file containing a lot of floating point data --
> > about 3.5GB of data. These are words, followed by cosine distances.
> > I could attempt to feed this in a slow way, and put it into an APL
> > workspace. But... I also intend on attempting to feed the data to a
> > GPU. So, what I am looking for is a modification to GNU APL (and
> > yes, I am willing to do the work) -- to allow for the complete
> > suppression of normal C++ allocations, etc. and allow the
> > introduction of simple float/double vectors or matrices (helpful to
> > allow "C"-ish or UTF-8-ish strings): the data is (C string
> > containing word name) (fixed number of floating point)... repeated
> > LOTs of times.
> > 
> > The data set(s) may be compressed, so I don't want to read them
> > directly -- possibly from a shared memory region (64-bit system
> > only, of course), or, perhaps, using shared variables... but I
> > don't think that would be fast enough.
> > 
> > Anyway, this begins to allow the push into "big data" and AI
> > applications. Just looking for some input and ideas here.
> > 
> > Many thanks
> > Fred Weigel
