Edwin Wiles wrote:

> Without them, the programmer must calculate the required length of the reads
> themselves.

Good point.  I now want them, rather than being ambivalent.

> >    [ 'bar' => 'i', 'baz' => 'i', 'count' => 'i' ]
>
> It is my understanding that "=>" is an indication of a link in a hash
> array between its name and its value.  Given the necessity of
> maintaining the order of elements, this nomenclature would appear to be
> inappropriate.

From the perldata man page:

     It is often more readable to use the => operator between key/value pairs
     (the => operator is actually nothing more than a more visually distinctive
     synonym for a comma):

I've seen it used in function parameter lists and other lists, as well as the list of
initial values for a hash (which is the example that follows the above statement in
the man page).
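A two-line demonstration (names taken from the example above) that => really is only comma sugar, so pair order survives as long as the list is kept as a flat array rather than a hash:

```perl
# => is just a comma, so assigning the pairs to an array (not a hash)
# preserves their order exactly.
my @spec = ( bar => 'i', baz => 'i', count => 'i' );
print "@spec\n";    # prints: bar i baz i count i
```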

> Still, I like the idea of making the relationship more explicit, we
> could do it this way:
>
>         [ ['bar','int32'], ['baz','int32'], ['count','int32'] ]

We could, indeed.  Or, we could use => just as visual sugar to specify the
relationship... it does preserve the order!  I think I'd prefer that to rectangularly
Lispifying things too much.

> Basic Object Type:
>
> 'short' - whatever your computer understands is a short.
> 'long' - ditto.
> 'int' - ditto.
> 'double' - ditto.
> 'float' - ditto.
> 'char' - ditto, with the understanding that Unicode makes these
>          characters larger than you would expect.
> 'byte' - whatever your computer understands is a byte.  (Yes, there
>          are some systems that don't use 8 bits to a byte.  Not many,
>          but they are there.)
> 'octet' - specifically 8 bits, regardless of the byte size of your
> system.

I don't have any problem with having names like these.  However, the data that is
being read may or may not have originated on the computer trying to read it.  While
you seem to recognize that in "byte" vs "octet", you've eliminated the ability to do
that for specifically sized integers (and floats, but cross machine floats are a
bigger problem, and probably require a user extension).  Likewise, we probably need
specifically sized chars in addition to the machine's "normal" char: both one-byte and
two-byte (the latter for Unicode, which brings its own endianness issues).

Theory: The people that want natural sizes "based on the machine" don't deal with
moving data from one machine to another, just moving programs from one machine to
another.  The people that want fixed sizes deal with moving data from one machine to
another, and have faced the problems that entails.

Having both is OK (but redundant) to the second camp; but leaving out fixed size
specifications is not... because then as you move the program from machine to machine
you have to adjust the structure definitions to correspond to the actual sizes of the
"natural" sizes that this machine supports.  Coding the actual sizes in the first
place saves that effort.

The whole point of this feature is to communicate with data in rigid binary formats;
for API calls, "natural" may be better, if it matches the definitions in other
languages, for on-disk data structure interpretation, fixed size is better.  Either
can be used to emulate the other, but the emulation causes problems.  Supporting both
"natural" and "fixed" sizes seems to be necessary to make everyone happy.
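The natural-vs-fixed split is already visible in today's pack/unpack, which is a useful precedent: the plain integer templates are fixed-width, while the !-suffixed ones are whatever this machine uses, so the two can disagree.

```perl
# "l" is always 32 bits in pack/unpack; "l!" is this machine's native
# long, which may be 64 bits.  Same letter, different camps.
my $fixed  = length pack("l",  0);   # always 4 bytes
my $native = length pack("l!", 0);   # 4 or 8, machine dependent
printf "fixed: %d bytes, native long: %d bytes\n", $fixed, $native;
```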

And remember, however the structure is specified, when unpacked, it becomes normal,
easy and quick to manipulate, ordinary Perl variables.

> To which we may append a number of different modifiers:
>
> Endianness:
>
> /d - default, whatever your system uses normally
> /n - network
> /b - big
> /l - little
> /v - vax
>
> (??vax is sufficiently different to require its own??  Or is this to do
> with bit order?)

I don't speak VAX, but it is my understanding that VAX is different from little-endian
in some manner.  So we need "little" for 80x86/Pentium and others; we need "big" for
Motorola 68K and others; we may need "vax" if there is something truly different there;
"network" is redundant with "big", at the present time.
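The big/little distinction is easy to see with templates Perl already has for 32-bit integers ("N" for big/network order, "V" for little):

```perl
# The same four bytes decode to different numbers depending on the
# byte order we assume.
my $bytes  = "\x00\x00\x01\x02";
my $big    = unpack "N", $bytes;   # big-endian / network: 0x00000102 = 258
my $little = unpack "V", $bytes;   # little-endian: 0x02010000 = 33619968
```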

> Signedness:
>
> /u - unsigned
> /s - signed
>
> We may also allow these modifiers in the definition of the structure, so
> that the entire structure can be affected, without having to explicitly
> modify each variable.

Both places, that works for me.

> Non-Standard sizes:
>
> /[0-9]+ - The basic object is this many bits long.
>
> I strongly suggest that we settle on bits as the standard of length,
> since if we start mixing bits versus bytes on the basic elements, we're
> going to confuse the living daylights out of people.

Totally agree here.
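Bit-granular fields are already expressible, if clumsily, with unpack's bit-string template; for instance a 12-bit unsigned field:

```perl
# Take the first 12 bits of the buffer as an unsigned integer
# (high-bit-first order, "B" template).
my $data = "\xAB\xCD";
my $bits = unpack "B12", $data;    # "101010111100"
my $val  = oct "0b$bits";          # 0xABC = 2748
```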

Right now I'm a little concerned how you are suggesting to do arrays of structures
with this or similar syntax.  The previous technique handled that fine, by nesting
arrays of arrays.

> > 4) allow hooks to support non-standard types.
> >   sub Structure::define ( <type name>, <frombinarysub>, <tobinarysub> )
> >
> >   sub from_funny ( <type_params>, <binary_var>, <bit_offset> )
> >   # returns ( <next_bit_offset>, <funny_var> )
> >
> >   sub to_funny ( <type_params>, <binary_var>, <bit_offset>, <funny_var> )
> >   # returns ( <next_bit_offset> )
>
> This combination presumes that the data has already been loaded into an
> internal variable.  Given variable length data, this is not a valid
> assumption.

You are correct: this does assume fixed length data.  That was definitely my mindset.
I intended this for being able to crack floating point numbers of funny formats, or
complex-number-pairs (30+15i), but not for dealing with variable length data: I left
that to arrays.  So I sidestepped your length issues in this interface, by intending
it only for fixed length scalar data.  If we can figure out a generalization that
works, I'm not against implementing the more general interface in place of this one,
but I don't see a desperate need for it.
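Since Structure::define doesn't exist yet, here is only a sketch of the registry and a from-binary converter for one such fixed-length scalar: a pair of big-endian 16-bit signed ints read as a complex number (the 30+15i case).  All names here are invented for illustration; offsets are in bits, as agreed.

```perl
my %converters;    # stand-in for Structure's internal type table

sub define_type {  # stand-in for the proposed Structure::define
    my ($name, $from, $to) = @_;
    $converters{$name} = { from => $from, to => $to };
}

# from_funny: ($type_params, $binary, $bit_offset) -> ($next_bit, $value)
define_type('complex16',
    sub {
        my ($params, $buf, $bit) = @_;
        my ($re, $im) = unpack '@' . ($bit / 8) . ' s>2', $buf;
        return ($bit + 32, [ $re, $im ]);
    },
    # to_funny: ($type_params, $buf_ref, $bit_offset, $value) -> $next_bit
    sub {
        my ($params, $bufref, $bit, $val) = @_;
        substr($$bufref, $bit / 8, 4) = pack 's>2', @$val;
        return $bit + 32;
    },
);

my $buf = pack 's>2', 30, 15;    # the 30+15i pair, packed
my ($next, $pair) = $converters{complex16}{from}->([], $buf, 0);
# $pair is [30, 15]; $next is 32 (bits consumed)
```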

> >   Structure::define ( 'funny', \&from_funny, \&to_funny );
> >
> >   [ 'var1' => 'int32', 'var2' => ['funny', 6, 18, 12]]
> >
> >   <type_params> is the reference to the array defined in the definition, in this
> > example, it would be ['funny_type', 6, 18, 12]  (it appears that the type
> > 'funny' has three parameters to define its storage characteristics).
>
>         The format for 'funny' would turn to this:
>
>         [ ['var1','int/32'], ['var2','funny/6/18/12'] ]

OK.  Or    'var2:funny/6/18/12'

> [['bar','int'],['baz','int'],['count','int'],['myarray',
>         ['array','count',[['length','int'],['offset','int']]]]]

OK, so you are allowing two syntaxes for arrays... one for arrays of scalars, and a
more complex one for arrays of structures.  Which is OK, as long as the more complex
one, when used to specify an array of scalars, actually works correctly too.

> (Is it just me, or is this beginning to look like rectangular lisp?)

It kind of does.  So I like the elimination of some [] in favor of doing a bit more
parsing at "new Structure" time.

> > >         Okay, that looks like it might work, now add in the strings
> > >         referenced by length and offset.  [Ideas anyone?]
> >
> > OK, here's a (Forth or Pascal or BASIC) counted string:
>
> I think you missed my intent.  The number of strings is the same as the
> number of length/offset pairs.  The length of each string is determined
> by the corresponding 'length' value from the pair.  The position of each
> string in the data following the already parsed portion is determined by
> the 'offset' value from the pair.

OK, that's certainly different from what I responded with, but I claim there is a need
for the stuff I suggested in addition to what you intended.  Look at it as an
additional portion of the proposal.
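For the record, the counted-string case I responded with is one that pack/unpack can already handle natively via the "/" template (count, then that many bytes):

```perl
# "C/a*" packs a one-byte count followed by the string itself;
# "C/a" unpacks it back (the Forth/Pascal/BASIC counted string).
my $counted = pack "C/a*", "hello";    # "\x05hello"
my $str     = unpack "C/a", $counted;  # "hello"
```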

> We would have to have some indication of where the 'offset' is starting
> from; the beginning of this structure, the first byte of the area
> _following_ the array of length/offset pairs, etc.
>
> I've chosen not to comment further, as we have enough differences at
> this point to resolve. (And I'd like to get _something_ out the door!)
>
> So, what'cha'think?
>
> E.W.

I think we've agreed that "new Structure" can afford to do a little parsing on its
parameters, so we can avoid as much rectangular Lisp as possible.  The rectangular Lisp
may be useful as a representation of Structure's internal data, but is a poor user
interface.  I'd suggest your syntax for the type related information is fine, but just
jam it after the name, separated by a ":" (or something).  I've given a few examples
above.

I think we need both "natural" and "explicit" sizes for common data types, to solve
two problems: structures passed to local APIs (generally use "natural" sizes), and
structures for the network or disk (generally need explicit sizes).  We strongly agree
that explicit sizes should be in terms of bit lengths.

I don't have a clue how to use 'p' and 'P' for pack/unpack either.  The others are
pretty obvious to me.

Signedness is a good extension.

Maybe the array of structures syntax can be simplified; how does this look?  This
removes one level of []....

  ['bar' => 'int/32', 'baz' => 'int/32', 'count' => 'int/32',
      'struct_2' => ['array', 'count', 'length' => 'int/32', 'offset' => 'int/32']]

or

  ['bar' => 'int/32', 'baz' => 'int/32', 'count' => 'int/32',
      'struct_2' => ['array[count]', 'length' => 'int/32', 'offset' => 'int/32']]

or, with judicious use of bare words:

  [bar => 'int/32', baz => 'int/32', count => 'int/32',
      struct_2 => ['array[count]', length => 'int/32', offset => 'int/32']]

And as a special case for the character type, the array length can be specified as
'null', meaning the string is variable length and null-terminated.
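The null-terminated case also maps onto a template current Perl already has ("Z*"), which suggests the semantics are straightforward to implement:

```perl
# "Z*" reads up to (and consumes) a trailing null, returning the
# string without it; repeated, it walks a sequence of ASCIIZ strings.
my $zbuf = "abc\0def\0";
my ($first, $second) = unpack "Z* Z*", $zbuf;   # "abc", "def"
```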

I think this is enough for a first version (you called your first a 'draft') of an
RFC, to get it into a slightly more official status.

We still need to address your original intention, I'll try my hand at that below.

And if we want to roll in some additional data types to handle Data::Dumper style
stuff, to linearize random Perl values, that would be a nice extension.


Strings defined by offset/length pairs.

The base address for offsets could be any of several obvious things, and probably
several non-obvious things... i.e. we need a hook for those.

Seems like we need to have a name for the string, separate from the name for the
offset/length field that defines it, so that we can reference it.  Let's try a simple
case first: one string.  Maybe the following would be a possibility

[ offset => 'int/32', length => 'int/32', stuff => 'char[length]@offset/begin' ]

@ would claim that the string is not stored here, but at its parameter's offset.  The
following slash would introduce a keyword for common offset bases (/begin might be the
default); here is a possible list:

/begin - offset from the beginning of this total structure
/end   - from the end of the otherwise defined total structure
         [i.e. the beginning of a string table]
/here  - from the location of this offset variable

More items could be added to this list; for other common offset bases; a hook could be
done by specifying a sub name, either as "/&subname" or just "&subname", not sure what
all to pass into the sub, but it needs to return the base for the offset.

For arrays, we would need to find some way to correlate arrays of offset/length pairs
with the array of strings... although a bit complex, it could be done as:

  [bar => 'int/32', baz => 'int/32', count => 'int/32',
      struct_2 => ['array[count]', length => 'int/32', offset => 'int/32'],
      stuff => ['array[count]', str => 'char[length[]]@offset[]/begin' ]]

the empty [] indicates to use the index of stuff as the index of length and offset, as
needed.
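Here is a minimal by-hand sketch of what 'char[length]@offset/begin' would resolve to in the single-string case, with the offset measured from the start of the whole buffer (the /begin base).  The layout is invented for illustration:

```perl
# Layout: 32-bit BE offset, 32-bit BE length, some padding, then the
# string data.  "hello" sits at byte 11 of the total structure.
my $sbuf = pack("N N", 11, 5) . "XYZhello";
my ($off, $len) = unpack "N N", $sbuf;
my $stuff = substr $sbuf, $off, $len;   # "hello"
```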

--
Glenn
=====
There  are two kinds of people, those
who finish  what they start,  and  so
on...                 -- Robert Byrne
