This and other RFCs are available on the web at http://dev.perl.org/rfc/

=head1 TITLE

Internal representation of Pseudo-hashes using attributes.

=head1 VERSION

  Maintainer: Michael Maraist <[EMAIL PROTECTED]>
  Date: 27 Sep 2000
  Mailing List: [EMAIL PROTECTED]
  Number: 273
  Version: 1
  Status: Developing

=head1 ABSTRACT

Pseudo-hashes were an attempt to mix the speed of arrays with the
named fields of hashes.  For compatibility reasons, the new data-type
used [ { field-mapping }, data-list ], which at the very least looks
bizarre.  It carries the required overhead of a dereference, and it
does not behave well as an array, since the field-mapping must occupy
the first position.  An attempt was made to make this the basis for
classes, making use of pragmas such as 'fields' and 'base', and module
functions such as fields::new and fields::phash.  Performance was
really secondary to the ability to perform strict type-checking.  It
was also necessary to use 'my Foo $dog = new Foo' to enable
compile-time optimizations.

There is a growing need to provide structured named parameters.
Pseudo-hashes provide the type-checking, and _can_ provide the
ordering of fields, though not transparently.  Other RFCs have
proposed plausible structured data-types, namely the 'qs' and 'struct'
keywords.  This RFC, however, proposes a modification of the existing
system that maintains procedural compatibility and allows for the
integration of named structures.  It provides enhancements for the OOP
model, and it provides a 100% compatible method of named subroutine
parameters by merely extending the concept of strict hashes through
the use of attributes.

=head1 DESCRIPTION

  sub getStat {
    my %stat : fields( name size ctime mtime inode );

    $stat{name}  = 'Foo';          # array 0 access
    $stat{inode} = 50;             # array 4 access

    my @fields     = keys %stat;   # get raw internal key listing
    my @stat_list  = values %stat; # get raw internal data listing
    my %loose_stat = %stat;        # interlace internal key and data listings

    $stat{nme}       = 'Bar';      # compiler error!
    $loose_stat{nme} = 'Bar';      # OK, regular hash operations

    @stat{qw(name size)};          # compiled to @stat[ 0, 1 ]
    # In general, slices work in all cases except modification of
    # index positions.

    return \%stat;  # references entire structure and meta-data
  } # end getStat

  my $rh_stat = getStat();
  $rh_stat->{size}; # OK, performs internal lookup to find size == 1
  $rh_stat->{nme};  # run-time error! (after a failed internal
                    # 'exists' lookup for nme)

The above describes how a variable attribute can be used to impose
type-checking and compile-time optimizations on a normal hash.  The
modified hash should always look and feel the same, except in the case
of type-checking.  This is very similar to pseudo-hashes, with the
following exceptions:

=over 4

=item *

We define the typed hash with attributes instead of using the first
array position.

=item *

We use an internal (and transparent) representation of the
name-to-index mapping.

=item *

We redundantly store the key-names in an array for speed (essentially
trading space for speed), though there are alternatives (below).

=back
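For contrast, here is a minimal sketch of the same idea in both forms.
The pseudo-hash half runs on Perl 5.005; the ': fields(...)' half is,
of course, only the proposed syntax and does not run today:

  # Perl 5.005 pseudo-hash: the mapping travels in element 0
  my $ph = [ { name => 1, size => 2 }, 'Foo', 50 ];
  $ph->{name};  # 'Foo' -- really $ph->[ $ph->[0]{name} ]
  $ph->[0];     # the mapping hash is exposed to anything that
                # treats $ph as a plain array reference

  # Proposed strict hash: the mapping is internal meta-data
  my %stat : fields( name size );  # proposed syntax, not yet real
  $stat{name} = 'Foo';             # compiled to an index-0 access
  my @keys = keys %stat;           # ( 'name', 'size' ), in field order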
Additionally, this can work transparently with the OO approach, to
allow compiler optimizations for standard returned hash-refs, just
like current pseudo-hashes:

  {
    package Foo;

    use class qw( a b );  # same functionality as 'fields', but internal

    sub new { class::new(shift) }  # again, 'class' works like 'fields'.
    # It does something similar to:
    #   eval "my $class \%this; return \\\%this;";
    # but in an optimized way.

    # a => 0, b => 1
  }

  {
    package Bar;

    use extends qw( Foo Other );  # similar functionality to 'base', but
        # internal.  Also allows multiple inheritance (compiler error if
        # there is a conflicting name-space).  If this turns out to be
        # problematic, single inheritance can be used.

    use class qw( c d );

    # a => 0, b => 1, c => 2, d => 3
    # 'new' is inherited
  }

  my Bar $obj = new Bar;
  $obj->{a};    # compile-time mapping of a => 0
  $obj->{zap};  # compiler error!

  my @keys   = keys %$obj;    # internal fetching of ordered key-list
  my @values = values %$obj;  # internal fetching of ordered values

  my @array_hash = %$obj;  # loses type-checking:
                           # ( a, undef, b, undef, c, undef, d, undef )
  my %hash = %$obj;        # loses type-checking: same as above

  # The following makes use of the above class as a pre-defined version
  # of ": fields(..)".  All that follows is optional, but a logical
  # extension of the above.

  my %h_bar : Bar;  # uses the above class to define a hash
  $h_bar{a};        # OK, same as $h_bar[ 0 ]
  $h_bar{zap};      # compiler error

  sub doFoo : Bar {
    # optionally restricts parameters at compile time (when it knows
    # them ahead of time)

    my %args : Bar = @_;  # takes in a normal set of hashed parameters,
                          # but restricts them.  Essentially performs a
                          # faster hash-assignment.

    $args{a};  # similar to $args[ 0 ];
  } # end doFoo

The above would be similar to:

  my %doFoo_valid_args = qw( a 1 b 1 c 1 d 1 );  # global, for better
                                                 # performance
  sub doFoo {
    my %args = @_;
    for my $key ( keys %args ) {
      die "invalid arg: $key" unless exists $doFoo_valid_args{ $key };
    }
    $args{a};
  }

except that validation is _much_ faster, and it occurs at compile-time
(where possible).  Access to %args is also significantly faster with
strict hashes.

Because of the performance and storage benefits of this approach (see
below), it would be worthwhile to reimplement internal functions such
as stat to work as:

  sub stat {
    my %data : fields( ... ) = get_stat_data();  # pseudo-code
    return wantarray ? values %data : \%data;
  }

  my $file_stat = stat( 'file' );
  my @file_stat = stat( 'file' );
  print "Size: $file_stat->{size}";

=head1 IMPLEMENTATION

Things to note: typed hashes can only be created at compile-time, and
they cannot be altered, much like a normal structure.  This is
desirable for formal definitions and for syntax checking.

High-level representation of the internal structure:

  struct hash:
    misc_info_t       var_info
    perl_hash_table_t data

  struct strict_hash:
    misc_info_t       var_info   # same as above
    perl_array[]      key_names  # used for "keys" and "get-hash-as-array"
    perl_array[]      values     # the actual field-values
    perl_hash_table_t mapping    # optimized mapping of keys to positions

  struct strict_hash2:  # optional representation
    misc_info_t var_info
    string[]    key_names  # only used by C functions, so reduced overhead
    scalar[]    values     # static C array of scalars; no overhead of a
                           # dynamic @array
    c_lookup_t  mapping    # static DS that maps names to indexes

  struct strict_hash3:  # alternative representation
    misc_info_t  var_info
    perl_array[] key_values  # composite array ( @key_names, @values )
    c_lookup_t   mapping

  struct strict_hash4:
    misc_info_t var_info
    scalar[]    values
    c_tree_t    mapping  # makes fetching keys slower, but can maintain
                         # order, and alleviates redundant storage of
                         # key-names

Providing the separate list of key-names greatly enhances performance
for hash-to-array conversion, and it could also allow extensions that
query attribute information, such as:

  my %hash : fields( a b );
  my $key_name = key_at_index( %hash, 1 );  # 'b'
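Purely as an illustration, the dual storage of strict_hash above (an
ordered key-list plus a name-to-index mapping) can be modeled in
present-day Perl.  The helper names below are hypothetical; a real
implementation would live in the C internals:

  # A minimal pure-Perl model of the strict_hash layout.
  my @key_names = qw( a b );                   # ordered key-list
  my %mapping   = map { $key_names[$_] => $_ } 0 .. $#key_names;
  my @values    = ( 'A-value', 'B-value' );    # the field-values

  sub fetch {           # models a strict lookup such as $hash{$name}
      my ($name) = @_;
      die "No such field: $name" unless exists $mapping{$name};
      return $values[ $mapping{$name} ];
  }

  sub key_at_index {    # models the hypothetical extension above
      my ($index) = @_;
      return $key_names[$index];  # key_at_index(1) is 'b'
  }

Note that "keys", "values", and hash-to-array conversion all come
straight off @key_names and @values here, which is exactly the
speed-up the redundant key-list buys.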
If redundant data is undesirable, then there are three alternatives.
The first is to remove the linear list of keys, since it can be
extracted from the hash (at a performance cost).  The second is to
remove the hash mapping; this would seriously hurt non-strict usage
(by an unsuspecting module user), though that can be avoided by the
use of typed variables.  The third is a hybrid (see the sketch after
the option list below): start with the linear list, since it
ultimately uses less space.  As perl compiles, if it never encounters
a non-strict dynamic lookup, the hash mapping never needs to be
generated, and both space and performance are conserved.  The first
time the compiler notices the use of a non-strict dynamic lookup, it
generates a mapping and stores it within the variable.  This way,
run-time performance is not hurt in either direction.

Since the information needed to regenerate the mapping is always
available, we are not threatened by eval statements, which may do
things like:

  1  my %hash : fields( a b );
  2  my $field_name = "a";
  3  $hash{a};
  4  eval "\$hash{$field_name} for 0 .. 10000";
  5  eval '$hash{$field_name} for 0 .. 10000';

(Note that line 4 interpolates $field_name while building the
eval-string, so the eval'd code contains the constant key 'a', whereas
line 5's single-quoted string defers the lookup until the eval runs.)

At line 1, we produce a strict hash and maintain a compiler-based
symbol-mapping table.  By the compilation of line 5, we haven't
noticed any dynamic uses of the fields, so we free up memory by
throwing away the internal mapping.  At the execution of line 4, we
need to compile again, and we find that we have to regenerate the
mapping table so that we can convert the interpolated 'a' into 0.  The
presence of evals is less common than that of hashes, so this
performance hit should be acceptable.  (It might be better not to
regenerate the mapping and simply perform a linear search through all
compiled mappings; greater knowledge of the compiler internals is
needed here.)  Execution of the newly eval'd line 4 is just as fast as
execution of line 3.

At the execution of line 5, compilation reveals a dynamic hash-lookup
on a strict hash that does not have a mapping, and at this point we
generate one.  Execution of the newly eval'd line 5 then maps 'a' to 0
at a slightly higher cost than a normal hash-lookup.  Line 4 is
obviously faster than line 5, and if we had not used line 5 at all,
the above program would have saved considerable space (assuming that
even a C-style hash takes up a fair amount of room).

Assuming that optimized C arrays are used, this strict hash _can_ take
up even less room than @array.  That assumes the overhead of storing
the field-names is less than the overhead of a dynamic perl array,
which probably won't be the case, but it is certain that less room
will be taken up than with perl hashes.  Additionally, since compiler
optimizations would allow direct access to fixed-size C arrays of perl
scalars, strict hashes might even work faster than dynamic perl
arrays.  We get the best of all worlds: we waste no unnecessary
memory, we add only minor compiler time and complexity, we increase
overall performance, and we enhance the readability of perl code.

The main question I have is what to do with the symbol-mapping table.
Do we:

=over 4

=item A

Always store the symbol mapping inside the data-structure (but just
don't provide it as perl scalars)?

=item B

Do as above, but throw away the symbol-mapping after each compilation
/ eval stage?

=item C

Never store a mapping, and instead perform linear searches for
symbols, since the number of symbols is so low that the cost of
check-summing a symbol might outweigh that of an 8-field linear
search?

=item D

Instead of using a hash, use a simple cost-effective binary tree?  The
storage requirements will definitely be greater than those of an
array, but this is essentially a compromise between compile time,
memory storage, and dynamic run-time.  The down side is that dynamic
lookup might be slower with a tree than with a hash, and trees
normally require linked lists instead of static arrays (more mallocs
are required).  A hybrid data structure could be created that uses a
static array for the tree, while also providing optimized searches for
ordered field-names (not typically available from simple tree
lookups).  I'd have to think more on this.  The worst case in this
method is a linked-list lookup.

=back
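As a rough sketch of the hybrid strategy (the third alternative
above), here is how lazy regeneration of the mapping might be modeled
in present-day Perl; every name here is hypothetical, and the real
work would happen inside the compiler:

  # Hypothetical model: keep only the ordered key-list until the first
  # dynamic lookup, then (re)generate the name-to-index mapping on demand.
  my @key_names = qw( a b );
  my $mapping;  # undef until a dynamic lookup forces its creation

  sub dynamic_index {
      my ($name) = @_;
      # First dynamic lookup: rebuild the mapping from the key-list.
      $mapping ||= { map { $key_names[$_] => $_ } 0 .. $#key_names };
      die "No such field: $name" unless exists $mapping->{$name};
      return $mapping->{$name};
  }

Compile-time lookups of constant keys would bypass dynamic_index
entirely, which is why the mapping can be thrown away between
compilations without any run-time cost.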
=head1 REFERENCES

Perl 5.005 pseudo-hashes

RFC 188: Objects : Private keys and methods

RFC 241: Pseudo-hashes must die!

RFC 122: types and structures

RFC 196: More direct syntax for hashes

RFC 75: structures and interface definitions

RFC 268: Keyed arrays

RFC 259: Builtins : Make use of hashref context for garrulous builtins