RFC 147 (v1) Split Scalars and Objects/References into Two Types

Perl6 RFC Librarian Thu, 24 Aug 2000 08:48:39 -0700
This and other RFCs are available on the web at
  http://dev.perl.org/rfc/

=head1 TITLE

Split Scalars and Objects/References into Two Types

=head1 VERSION

   Maintainer: Nathan Wiger <[EMAIL PROTECTED]>
   Date: 23 Aug 2000
   Version: 1
   Mailing List: [EMAIL PROTECTED]
   Number: 147
   Status: Developing

=head1 ABSTRACT

Many RFC's have been proposed dealing with Perl's current prefixing
system. Many have criticized it, even suggesting to drop prefixes
altogether, but the vast majority of these criticisms center around one
thing: Not being able to tell the difference between "true" scalars and
references/objects.  

The answer to this has frequently been that the $scalar type defines not
a quality, but a quantity: one.  While true in one sense, Perl does have
both @arrays and %hashes, notations which do the opposite: define a
quality.  While they are both composed of multiple things, those things
are fundamentally different, and hence the need for two notations.

References and true scalars ("scalars" henceforth) are also
fundamentally different things. Therefore, this RFC proposes that they
be split into two separate types in Perl 6: $scalars and *references.  

Please take the time to read this. It is a thorough document that
explores both the pros AND cons of this approach. 

=head1 DESCRIPTION

Please note that the * is used as the reference prefix in this
document.  Nonetheless, this proposal has NOTHING to do with typeglobs,
and using the * prefix would require typeglobs to disappear. The prefix
is debatable, let's focus on the idea.

=head2 The Problem

Everyone on this list should be familiar with the problem.  You can't
tell scalars and references apart by looking at them. They are
completely ambiguous. Consider:

   $stuff{key}      # hash value
   $stuff[0]        # array value
   $stuff           # scalar or reference?

   %stuff           # hash
   @stuff           # array
   $stuff           # scalar or reference?

You get the idea. In simple examples, this ambiguity goes away because
of the way you address them:

   $stuff           # scalar or reference?
   $stuff->{key}    # hash reference
   $stuff->[0]      # array reference
   $stuff->func     # object reference

Once we use them, we recognize that the last three are references.
However, there are many situations when we can't disambiguate them:

   $stuff = Class->method    # scalar or reference?
   read_data($stuff)         # scalar or reference?
   $$stuff                   # scalar or reference?

Good variable names help, but the fact remains that these are
fundamentally different types. For example, you know what will happen
when you do this:

   print @stuff              # the contents of @stuff   
   while(($k,$v) = each %h)) # each key/value of hash %h

But what about these examples?

   $z = $x + $y              # numeric addition?
   $stuff ||= "Bob"          # is this correct?
   print ref $stuff          # sure about that?

Anyone who has worked on a large project in Perl will relate to the fact
that even this code block is difficult to read unless you know exactly
what the functions are supposed to return:

   $c = Class->new;              # ok (maybe) so far...
   ($u, $g) = $c->get_user;
   ($d, $f, $j, $m) = $c->read_data;

   # ... lots of code passes ...

   print "Thanks $u for your ", $g->purchase;
   $member = $d || $CLASS_DEFAULT;      # scalars or ????
   while (($k,$v) = each %$f) {
       print "Now updating info for $k...\n";
       push @$j, @{ $v->{'personal'} };
       $name = $member->name($$v->{name}->{first}) || "unknown";
   } 
   print "Your confirmation number is $m\n";

Here, we've lost the advantage of prefixes. While there are lots of $'s
in the code, they don't tell us anything useful other than "there's a
single something here". [1]

However, note that the @ and % still do, because they indicate quality
in addition to quantity.

=head2 The Proposed Solution

This RFC proposes that references and scalars be split into two distinct
types. This is not a novel approach.  It is also not a simple one. There
are many issues to address.

However, the dividing line is clear:

   Type              Definition
   ----------------- ----------------------------------
   $scalar           Holds a single simple value, such
                     as a string or number

   *reference        Holds a single complex value, such
                     as a hash reference or object

Like %hashes versus @arrays, these two new types have the advantage that
they can describe both a quantity and a quality.  RFC 23, "Higher order
functions" adds a new variable prefix, ^, because it introduces a
fundamental new type: the placeholder.  The same idea should apply here
as well.

Consider the above sample code now:

   *c = Class->new; 
   ($u, *g) = *c->get_user;
   (*d, *f, *j, $m) = *c->read_data;

   # ... lots of code passes ...

   print "Thanks $u for your ", *g->purchase;
   *member = *d || *CLASS_DEFAULT;     # oh, they're refs!
   while (($k,*v) = each %*f) {
       print "Now updating info for $k...\n";
       push @*j, @{$*v{personal}};
       $name = *member->name($*v{name}{first}) || "unknown";
   } 
   print "Your confirmation number is $m\n";

We now have a much better idea of what's going on. For one thing, we can
instantly tell the difference between our $scalar assignments and our
*reference assignments.  We can now be 100% sure that $name really holds
a scalar, and *c really holds a reference of some type. [2]

Moreover, with the complete syntax described below, we can actually tell
our objects and references apart more easily than with the current "all
refs use ->" approach.

=head2 Syntax

Basically, all references and objects will be prefixed with a *. The
single * will take the place of the current $-> combination syntax
required for references. All scalars
will continue to be prefixed with a $.

This table should help illustrate the new syntax:

  Perl 5                   Perl 6
  ------------------------ --------------------------
  $x = 5;                  $x = 5;
  $name = "Nate";          $name = "Nate";

  $_ = 1;                  $_ = 1;
  $_->func;                *_->func;     # new *_ anon ref

  # References
  $n = \$name;             *n = $name;   # \ redundant
  print $$n;               print $*n; 

  $hash{key} = $val;       $hash{key} = $val;
  $hash = \%hash;          *hash = %hash;
  $hash->{key}             $*hash{key}   # -> redundant
  
  $array[3] = "Nate";     $array[3] = "Nate";
  $aref = \@array;        *aref = @array;
  $aref->[0]              $*aref[0]    # -> redundant
  $aref->[0][1]           $*aref[0][1]
  @{$aref->[0][1]}        @{*aref[0][1]}

  # Objects -- keep -> to make objects stand out
  $c = Class->func;        *c = Class->func;
  [no equivalent]          $c = Class->func;
  $r = $c || $DEFAULT;     *r = *c || *DEFAULT;
  print $c->name;          print *c->name;   # $*c->name implied
  print @{$c->name};       print @{*c->name};

  # Combinations
  $val = $c->func || $D->{VAL};
                           $val = *c->func || $*D{VAL};

  $name = $h->{name} || $a->[3][1];
                           *name = *h{name} || *a[3][1]

Notice this buys us another key advantage. Since the * delimits
references, we no longer are forced to use -> in all references. As
such, -> can now be reserved solely for objects, allowing us to easily
locate our objects now.

The exact syntax is still open for discussion. There are many more
clarifying examples below as well.

=head2 The Advantages

There are many advantages to this approach. There are also potential
problems, which will be addressed second.

=head3 Ability to differentiate types

As mentioned above, this is the key advantage to this approach. Not only
is the code clearer, but we can now actually tell what we want from a
function without complex casting or syntax rules:

   *object  =  date;     # object with accessor functions
   $scalar  =  date;     # scalar ctime date string

Normally, the date() function would either have to choose what to
return, or would require specific casting in order to return the correct
thing. And, once it's returned, later in the code you still can't easily
tell what you got. Differentiating between *references and $scalars
allows us to do both.

Plus, as the table above shows, many references (such as
multidimensional array references) become much easier to read.

=head3 Integration with want() to return refs

Currently, the new want() proposed by RFC 21 relies on odd syntax to
differentiate between hashrefs and arrayrefs:

   $arrayref = @{func()};
   $hashref  = %{func()};

While this works for simple situations, it requires some contortions in
complex situations:

   $hashref  = %{func() || $def || $r->getdata()};

By differentiating our references with this new syntax, we can move this
to the left side of the assignment, which is where it belongs:

   @*arrayref = func();
   %*hashref  = func();
   %*hashref  = func() || *def || *r->getdata();

The code is cleaner and easier to read, and can more more easily span
multiple lines. [4] When integrated with the previous point, we can now
easily tell apart objects and refs even in assignment contexts.

=head3 Leftmost $ tells true context

Currently, if the leftmost prefix is a $, the actual context is still
ambiguous:

   print $$$$name;    # ref or scalar?
   $d =  $$$$name;    # ref or scalar?

However, with our new syntax, $ is only used to delimit an actualy
scalar. So:

   print $***name;    # scalar
   *d =  ****name;    # ref

This is also extended to arrays and hashes:

   @***name     # array
   %***name     # hash

Note that this syntax is less confusing than this:

   @$$$name
   %$$$name

Where you have to I<infer> that $$$name is a ref. The *, by contrast,
tells us this explicitly.

=head3 No Left/Right cross-scanning to determine type

Currently, if you're reading a complex type in Perl you have to scan
left to right to figure out it's a ref:

   @{ $r->{$key} }

You first see the @, which tells you it's an array. However, the next
thing you see is a $, which leads you to believe it's a scalar, until
you see the ->. In this simple isolated example, it's not too bad.  But
bury this in a chunk of code, and it's very hard to read and use.

Contrast this with the new syntax:

   @{*r{$key}}

Not only is this more compact, but by simply reading left to right you
can determine it's an array of a ref to a hash.

=head3 Implicit type checking

Currently, you can get to an individual element of an array or hash with
the syntax:

   $array[0]       # first scalar or ref of @array
   $hash{key}      # scalar or ref indexed by "key"

Differentiating the two would require that the prefix also reflect the
return value:

   $array[0]       # first element (scalar) of @array
   *array[0]       # first element (ref) of @array
   $hash{key}      # "key"-indexed element (scalar) of %hash
   *hash{key}      # "key"-indexed element (ref) of %hash

The structure of @array and %hash would remain identical, in that each
slot could only hold one element (so $array[0] and *array[0] would point
to the same data, only "casting" them differently). However, you'd have
to make sure that you called them with the correct prefix:

   my @array;               # ok
   *array[0] = new Class;   # ok
   $array[0]->func();       # wrong
   *array[0]->func();       # ok
   $stuff = *array[0];      # wrong
   $stuff = $array[0];      # ok (but empty)

This actually gives us a key advantage. Here, Perl can now return the
value if it's the correct type, or undef otherwise. As such:

   $array[3] = "Nate";      # ok
   print $array[3];         # ok
   *array[3]->func();       # *array[3] returns undef

So, * is acting a little like an implicit ref call. This means that
we'll be guaranteed to be getting what we want without nasty ref checks
all over the place. It also means that our error messages can be
clearer:

   array[3] is a scalar value, not a reference at line 3.

Note that this approach is analagous from a user level to how hashes and
arrays work as well:

   $array[3] = "Nate";      # ok
   print $array[3];         # ok
   print $array{3};         # undef

Just like [] and {} are used to get at different types of values, so too
are $ and * now used to get to different types of values.

=head3 Ability to avoid returning objects 

RFC 49, "Objects should have builtin stringifying STRING method", and
RFC 73, "All Perl core functions should return objects", together
describe an interface which could allow easily embedded core objects
with true polymorphic capabilities. 

For example, using these methods the date() function would not have to
choose what to return in a scalar context. Instead, an object would be
returned which could morph into a "true" scalar value as needed. This
approach is quite powerful in embedding objects at a low-level.

However, there are some problems. For one thing, sometimes you really
want an actual scalar:

   $x = Math->calc;   # create $x object
   $z = $x * $y;      # cast via $x->NUMBER to scalar
   $m = $x;           # object reassignment (oops?)

If $x is a polymorphic object, there are a couple of disadvantages:

  1. There's a lot of overhead in maintaining objects.

  2. Lines 2 and 3 work completely differently, which
     is counterintuitive.

  3. You still can't tell what's an object and what's
     a scalar by looking at the code.

By differentiating the types, however, we can explictly ask for what we
want to get back:

   *x = Math->calc;   # same function overloaded to
   $x = Math->calc;   # return both
   $z = $x * $y;      # true scalar math
   *m = *x;           # true reference assignment

Now it's obvious what's going on. We know what we're getting, we know
what we are passing around, and there's no wasted overhead for passing
objects around if we don't need to.

=head3 Better cognitive properties

Perl prides itself on being a human-friendly language.  Indeed, it is.
And fun! Mostly.

However, trying to figure out what's a scalar and what's an object just
by looking at a whole bunch of $'s is not fun. And let's face it, most
people regard single values and references to be totally different
things. And they are, just like arrays and hashes are totally different.

Many Perl programmers I've taught are quite surprised the first (and
second, and third, and fourth) time they see something like this:

   $cgi = new CGI || $oldcgi;
   $user = $cgi->param($name) || $ENV{REMOTE_USER};
   print "Welcome $user, to " . $cgi->server_name;

This is confusing. Only $scalars can play such double-duty.  %hashes and
@arrays can only hold hashes and arrays. But Perl 5 $scalars can
currently hold values or references, giving Perl 5 OO a "bolted on the
side" appearance. [3]

While we have become accustomed to looking at it, the simple fact is
it's confusing. The same "type" is being used for things with different
qualities, and it's just as confusing as if arrays and hashes had the
same prefix.

Now, consider this code:

   *cgi = new CGI || *oldcgi;
   $user = *cgi->param($name) || $ENV{REMOTE_USER};
   print "Welcome $user, to " . *cgi->server_name;

This is less confusing. It is now apparent where our *objects are and
where our $scalars are. The code is more easily readable and can also be
more specific in what types it is requesting. You can instantly pick out
your *objects and $scalars at a glance.

=head3 Increased namespace

This is a simple one, but should not be overlooked. With a separate *
type, we now have to ability to have both $name and *name, where $name
could be the actual value and *name could be a reference to $name.
Increases in namespace are basically always beneficial.

=head2 Potential Problems

This approach is not without its problems. These need to be carefully
addressed before this can be implemented.

=head3 Iterating through a mixed array

Consider an array with mixed references and scalars:

   @mixed = qw($x $y *cgi $z *apache);

A Perl 5 loop would do something like this (remember all the prefixes
will be $ unlike what's shown above):

   for ( @mixed ) {
      if ( ref $_ ) {
         $name .= $_->func;
      } else {
         $sum += $_;
      }
   }

Now consider the Perl 6 for loop over it, where the prefixes are mixed

   for ( @mixed ) {
       if ( $_ ) {   # uh-oh
       # ...
   }

The problem is that the builtin anonymous scalar $_ can now only handle
scalars, since we've separated the types. Luckily, we have already added
a new anonymous reference, *_. As such, our for loop can now obey the
following rules:

   1. $_ and *_ are initialized to undef each loop

   2. only the appropriate one is set depending on
      whether the element is a scalar or reference

So, our for loop becomes:

   for ( @mixed ) {
      if ( *_ ) {
         $name .= *_->func;
      } else {
         $sum += $_;
      }
   }

Note that the length of the code is exactly the same, and that the test
no longer requires a ref() call.  In this specific example, I would
argue this could be seen as a benefit, because you can use it to only
address certain elements of the array:

   for ( @mixed ) {
      $sum += $_ if $_;    # only address true scalars
   }

This will implicitly only get true scalars, since all the references
will be placed in *_ (and the loop ignores them).

However, the situation gets ugly when you want to name your loop
parameters:

   for my $line ( @mixed ) {
      # uh-oh
   }

Now, we've explicitly told the for loop to use $line instead of $_. But
this can only hold scalars, not references. There are two possibile
solutions:

   1. Put the scalars in $line, but the references
      in *_ still (and vice versa if they had said
      "my *line").

   2. Automatically convert the name of $line to
      *line if necessary.

   3. Do #1, but require them to use a HOFN if they
      want automatic conversion (i.e., "^line")

None of these is a perfect solution. This is an issue that needs further
examination. Note that an analagous problem exists with mixed hashes.

=head3 Conclusions

Differentiating scalars and references gives us many key advantages:

   1. Ability to differentiate types in all contexts

   2. More compact, readable, and flexible syntax

   3. Leftmost $ tells true context

   4. Implicit type checking

   5. Better cognitive properties

   6. Increased namespace

Initially, I started writing this RFC as an idea that should be at least
considered, but not necessarily one that I thought would really benefit
Perl 6.

However, I have since completely changed my mind. As I went through the
process, I found that there were many, many deficiencies with the
current "stick everything in $" philosophy of Perl 5. I believe that
this proposal could benefit Perl 6 tremendously.

=head1 IMPLEMENTATION

If this gets accepted, it will require a massive rewrite of Perl's
internals. Thank heavens we're doing that anyways. :-) 

=head1 NOTES

[1] The "single" part isn't true, either, since $arrayrefs and $hashrefs
in fact point to multiple values.  Some will counter with the idea that
$arrayrefs and $hashrefs actually point to "single" arrays and hashes.
However, this is a stretch in the author's opinion.

[2] Opponents might poke at this, claiming that you only know *
represents a reference "of some type". What next, individual types for
all the different numbers, filehandles, and so forth? However, this is a
slippery slope fallacy.  Numbers and strings are types of scalars, just
like objects and filehandles are types of references.

[3] Andy Wardley came up with this one. :-)

[4] Note that this does not have to replace the syntax in RFC 21, but
can serve as an alternative form.

=head1 MIGRATION

The p52p6 translator would have to convert all Perl 5 reference and
object notations into the new Perl 6 syntax. This is not actually as
hard as it sounds since a few regexp's would do the trick for
references. Objects are trickier and would require some backtracking
through the code.

=head1 REFERENCES

RFC 49: Objects should have builtin stringifying STRING method 

RFC 73: All Perl core functions should return objects 

RFC 9: Highlander Variable Types 

RFC 109: Less line noise - let's get rid of @% 

RFC 95: Object Classes 

Upcoming RFC by brian d foy and myself on polymorphic objects

RFC 21: Replace C<wantarray> with a generic C<want> function 

RFC 23: Higher order functions
RFC 147 (v1) Split Scalars and Objects/References into Two Types

Reply via email to