On Thu, 28 Jan 2016 07:56:05 -0800 (PST), Scotty C
<costo...@hotmail.com> wrote:

>> You claim you want filtering to be as fast as possible.  If that were
>> so, you would not pack multiple keys (or features thereof) into a
>> bignum but rather would store the keys individually.
>
>chasing pointers? no, you're thinking about doing some sort of
>byte-append and subbytes type of thing. only way that data in a
>hash would be small in memory and reasonably quick. care to 
>elaborate?

I think you understand perfectly.

You said the keys are 128-bit (16 byte) values.  You can store one key
directly in a byte string of length 16.  However, each individual byte
string has allocator overhead  [though less than a bignum, see below].
So instead of using a vector of pointers to individual byte strings,
you would allocate a single byte string of length 
   {buckets x chain length x 16} 
and index it directly as if it were a 3-dimensional array [using
offset calculations as you would in C].

There might be an easier approach using multi-dimensional arrays.
However, the array library is Typed Racket and there are issues mixing
"typed" code with "untyped" code.  I haven't looked into TR
extensively so I don't know offhand whether these arrays are "packed
type homogeneous" or simply "type homogeneous"  [the difference would
be memory use, see below].


Can you explain why the bignums are important?  For a simple task like
filtering a file, they would seem to be both a waste of memory and a
performance drain relative to storing the key values as byte strings.



>> The 16-byte keys are 1/3 the size of even the _smallest_ bignum, 
>
>where are you getting this? i've been digging all over the documentation
>and can't find a fn thing on how much space is required for any data type
>or its overhead. 
>
>so 2 questions for you all
>1)where is this info about data type memory requirements?

I just know a lot of this from books and lots of experience.

I've been programming for over 30 years, professionally for 23.  I
have an MS in Computer Science (and coursework for a 2nd MS but I left
school before doing that dissertation).  Among other things I have
worked professionally on both compiler internals and GC.  I know
(quite) a bit about how to implement Lisp and Scheme, and Racket [at
least the so-called "untyped" version] is a fairly typical tagged data
system.


You can figure out a lot of it by careful reading, in particular about
the C API and library (FFI) interfaces.  For things like bignums, you
have to look into the code and discover that Racket uses the GMP
library for multiple-precision math (and then look into GMP for the
details).

The size of heap object descriptors (i.e. allocator and type overhead)
is necessarily version, platform and compiler dependent.  You have to
figure it out for the particular implementation.


Some background on tagged data:

At the most basic level, there are only 2 types: "immediate" values
which are ALU register sized - 32 or 64 bit depending - and
"reference" values which are heap allocated objects accessed via
pointers.
[The terms "immediate" and "reference" come originally from Lisp.]

Immediate values include fixnums (small integers), characters and
bytes, booleans, and pointers.  Immediates are type tagged by stealing
a few bits from the (register width) value, so, fixnum integers and
pointers do not quite have their full range [e.g., in Racket fixnums
are 31 or 63 bits depending on platform].

A pointer to a reference value is an immediate value.

Reference values are essentially anything that is not an immediate
value.  Most importantly, floating point values are reference values
because all of their bits are relevant to the FPU and it isn't
possible to tag them by stealing bits as is done with immediates.

In general, heap objects always contain some number of generic "slots"
each of which is an immediate value.  However there is an important
exception:  for certain types, it is possible to create homogeneous
"packed" vectors in which the elements are not generic "slots" but
rather are values of a specified type.  Racket permits creating packed
vectors of bytes, characters, fixnum integers, floats and doubles.

For characters the packed vector is called a "string", and for bytes a
"byte string".  For the others, the packed variants are distinct
vector types.



The GMP library represents bignums as an information struct and a data
array (of C longs).  The array always contains at least 1 element,
grows as needed but never shrinks.  The struct contains a pointer and
2 longs.  

The size of pointers and longs are compiler dependent, so you need to
know which compiler was used.  On Linux Racket is compiled with GCC so
pointers and longs are the same size: either 32-bits or 64-bits.

The GMP documentation doesn't address allocation specifically ... it's
treated as a black box ... but the info struct is defined and its
members are laid out in a way that suggests the allocations might be
separate (i.e. there is a pointer to the data array in the middle of
the struct).

A byte string of length N, by contrast, is just N bytes of data.

[Obviously every heap object has a descriptor and the whole object is
subject to alignment requirements and allocation unit rounding.  I
don't know the details of Racket's allocator, but the minimum
allocation unit in a Lisp/Scheme typically is the size of 2 pointers
[to match a cons cell].

In actual fact the minimum allocation might be based on CPU cache
lines.  On x86-64 that would be 64 bytes.]



That's where my numbers come from.  I'm estimating bignum size as
based on GCC compiled 64-bit Racket running on Linux.  So for the
smallest bignum we get:

    struct: 3 x 8 bytes
    data  : 1 x 8 bytes 
 -------------------------
    =        4 x 8 bytes = 32 bytes + 1 or 2 descriptors

Guessing (again from experience) that each heap object descriptor
contains at least 2 pointer sized fields, I add 16 to get to 48 bytes
as the smallest possible bignum assuming the struct and data array are
together in a single allocation.  

If they are separate allocations, then the minimum size jumps to 80
because both struct and array are odd length (in long words) and would
be rounded upward for the minimum allocation unit.

I don't know the layout of a heap descriptor ... it may, in fact, be
larger than 2 pointers - but that's a popular minimum.



>2)what is a better way of checking actual memory sucked up by my data 
>structures?

(current-memory-use)


George

-- 
You received this message because you are subscribed to the Google Groups 
"Racket Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to racket-users+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
