On Thu, 28 Jan 2016 07:56:05 -0800 (PST), Scotty C <costo...@hotmail.com> wrote:
>> You claim you want filtering to be as fast as possible. If that were >> so, you would not pack multiple keys (or features thereof) into a >> bignum but rather would store the keys individually. > >chasing pointers? no, you're thinking about doing some sort of >byte-append and subbytes type of thing. only way that data in a >hash would be small in memory and reasonably quick. care to >elaborate? I think you understand perfectly. You said the keys are 128-bit (16 byte) values. You can store one key directly in a byte string of length 16. However, each individual byte string has allocator overhead [though less than a bignum, see below]. So instead of using a vector of pointers to individual byte strings, you would allocate a single byte string of length {buckets x chain length x 16} and index it directly as if it were a 3-dimensional array [using offset calculations as you would in C]. There might be an easier approach using multi-dimensional arrays. However, the array library is Typed Racket and there are issues mixing "typed" code with "untyped" code. I haven't looked into TR extensively so I don't know offhand if these arrays are "packed type homogenous" or simply "type homogenous" [the difference would be memory use, see below]. Can you explain why the bignums are important. For a simple task like filtering a file, they would seem to be both a waste of memory and a performance drain wrt storing the key values as byte strings. >> The 16-byte keys are 1/3 the size of even the _smallest_ bignum, > >where are you getting this? i've been digging all over the documentation >and can't find a fn thing on how much space is required for any data type >or it's overhead. > >so 2 questions for you all >1)where is this info about data type memory requirements? I just know a lot of this from books and lots of experience. I've been programming for over 30 years, professionally for 23. I have an MS in Computer Science (and coursework for a 2nd MS but I left school before doing that dissertation). Among other things I have worked professionally on both compiler internals and GC. I know (quite) a bit about how to implement Lisp and Scheme, and Racket [at least the so-called "untyped" version] is a fairly typical tagged data system. You can figure out a lot of it by careful reading, in particular about the C API and library (FFI) interfaces. For things like bignums, you have to look into the code and discover that Racket uses the GMP library for multiple-precision math (and then look into GMP for the details). The size of heap object descriptors (i.e. allocator and type overhead) necessarily is version, platform and compiler dependent. You have to figure it out for the particular implementation. Some background on tagged data: At the most basic level, there are only 2 types: "immediate" values which are ALU register sized - 32 or 64 bit depending - and "reference" values which are heap allocated objects accessed via pointers. [The terms "immediate" and "reference" come originally from Lisp.] Immediate values include fixnums (small integers), characters and bytes, booleans, and pointers. Immediates are type tagged by stealing a few bits from the (register width) value, so, fixnum integers and pointers do not quite have their full range [e.g., in Racket fixnums are 31 or 63 bits depending on platform]. A pointer to a reference value is an immediate value. Reference values are essentially anything that is not an immediate value. Most importantly, floating point values are reference values because all of their bits are relevant to the FPU and it isn't possible to tag them by stealing bits as is done with immediates. In general, heap objects always contain some number of generic "slots" each of which is an immediate value. However there is an important exception: for certain types, it is possible to create homogenous "packed" vectors in which the elements are not generic "slots" but rather are values of a specified type. Racket permits creating packed vectors from bytes, characters, fixnum integers, floats and doubles. For characters and bytes the packed vector is called a "string". For the others, their packed variants are distinct vector types. The GMP library represents bignums as an information struct and a data array (of C longs). The array always contains at least 1 element, grows as needed but never shrinks. The struct contains a pointer and 2 longs. The size of pointers and longs are compiler dependent, so you need to know which compiler was used. On Linux Racket is compiled with GCC so pointers and longs are the same size: either 32-bits or 64-bits. The GMP documentation doesn't address allocation specifically ... it's treated as a black box ... but the info struct is defined and its members are laid out in a way that suggests the allocations might be separate (i.e. there is a pointer to the data array in the middle of the struct) Whereas the byte string of length N is just N bytes. [Obviously every heap object has a descriptor and the whole object is subject to alignment requirements and allocation unit rounding. I don't know the details of Racket's allocator, but the minimum allocation unit in a Lisp/Scheme typically is the size of 2 pointers [to match a cons cell]. In actual fact the minimum allocation might be based on CPU cache lines. On x86-64 that would be 64 bytes.] That's where my numbers come from. I'm estimating bignum size as based on GCC compiled 64-bit Racket running on Linux. So for the smallest bignum we get: struct: 3 x 8 bytes data : 1 x 8 bytes ------------------------- = 4 x 8 bytes = 32 bytes + 1 or 2 descriptors Guessing (again from experience) that each heap object descriptor contains at least 2 pointer sized fields, I add 16 to get to 48 bytes as the smallest possible bignum assuming the struct and data array are together in a single allocation. If they are separate allocations, then the minimum size jumps to 80 because both struct and array are odd length (in long words) and would be rounded upward for the minimum allocation unit. I don't know the layout of a heap descriptor ... it may, in fact, be larger than 2 pointers - but's that's a popular minimum. >2)what is a better way of checking actual memory sucked up by my data >structures? (current-memory-use) George -- You received this message because you are subscribed to the Google Groups "Racket Users" group. To unsubscribe from this group and stop receiving emails from it, send an email to racket-users+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.