RFC 207 (v3) Arrays: Efficient Array Loops

Perl6 RFC Librarian Sat, 30 Sep 2000 23:10:18 -0700
This and other RFCs are available on the web at
  http://dev.perl.org/rfc/

=head1 TITLE

Arrays: Efficient Array Loops

=head1 VERSION

  Maintainer: Buddha Buck <[EMAIL PROTECTED]>
  Date: 8 Sep 2000
  Last Modified: 30 Sep 2000
  Mailing List: [EMAIL PROTECTED]
  Number: 207
  Version: 3
  Status: Frozen

=head1 ABSTRACT

This RFC proposes a notation for creating efficient implicit loops over
multidimensional arrays. It introduces the notation |i for an index
iterator for arrays, allowing temporary multidimensional arrays to be
created on the fly.

=head1 CHANGES

Version 3 freezes this RFC.  Except for some requests for
clarification from Jeremy Howard, there was no additional discussion
since Version 2.

Version 3 clarifies the use of looping indices in "list context", and
has a rewritten section concerning looping indices and RFC205
Cartesian products.  It also introduces |@ and |@foo notation for
flexibly dealing with looping indices for arrays of unknown dimension. 

Version 2 is an almost complete rewrite, based on concerns and
enhancements that arose in discussion.  I was also unhappy with the
language in the original.  The primary semantic changes are:

=over 4

=item * 

The syntax formerly known as "iterators" is now known as "looping
indices"

=item * 

|i is interpreted as "scalar-like", rather than "list-like". The
examples are now (to the best of my knowledge) consistant with that
interpretation.

=item *

The scoping mechanism is enhanced.  It is now clearer what the
iterators mean in the various contexts within perl.

=back

I believe the current version is much more understandable.

=head1 DESCRIPTION

Consider the problem of multiplying together two 2-dimensional tensors.  In
standard notation, this would be symbolized by

    Cijkl = Aij * Bkl

where the letters i, j, k and l are written as subscripts and represent
the indices of their respective tensors. To accomplish that same
multiplication in Perl, one needs to write (using RFC 204 notation):

    for my $i = (0..2) {    #assuming 3x3 tensors
      for my $j = (0..2) {
        for my $k = (0..2) {
          for my $l = (0..2) {
              $c[[$i,$j,$k,$l]] = $a[[$i,$j]] * $b[[$k,$l]] ;
          }
        }
      }  
   }

While this is not particularly difficult, it is clumsy, and slow. It could
be hidden in a subroutine, but then one lacks the flexibility to write
similar equations on the fly. For instance, transposition is easily written
as Tij = Aji. Such notation is assumed to be true for all legal values of 
i and j -- in effect, it loops over all legal values of i and j to compute
the results. Furthermore, such notation along with appropriate restrictions
on use could allow Perl to create optimised loops.

This RFC proposes a similar notation, using |name as a notation for
looping indices like i and j above.

=head2 Details

This RFC proposes that the entire The prefix | signifies that |i is an
"looping index" within the statement |i appears in.  More than one looping
index can appear within one statement.

The "scope" of a set of looping indices is the smallest subexpression in
the statement that contains all the looping indices in the statement.  If
the scope is not in list or void context, the scope is expanded until it is
in list or void context.  The scope of a set of looping indices will never
exceed one statement.

A set of looping indices generates an implicit loop (or nested loops)
surrounding the scope of the indices.

=head3 List Context

If used in list context, the implied loop creates a temporary array
whose dimensions and bounds are determined by the number and range of
the looping indices.

    # In list context
    $dotproduct = reduce ^_+^_, 0, $a[|i] * $b[|i];
    @tensorproduct = $a[[|i,|j]] * $b[[|k,|l]];

When there are multiple looping indices in an expression in list context,
the order of the indices in the temporary array is determined by scanning
the expression left-to-right.

As a specific example, the expression 

    @tensorproduct = $a[[|i,|j]] * $b[[|k,|l]];

would generate code somewhat equivilant to:

    { my sub loop {
             my @result :bounds(@#a,@#b);
             for my $i (0..$#a[0]) {
               for my $j (0..$#a[1]) {
                 for my $k (0..$#b[0]) {
                   for my $l (0..$#b[1] {
                     $result[[$i,$j,$k,$l]] = $a[[$i,$j]]*$b[[$k,$l]];
                   }
                 }
               }
             }
      }
      @tensorproduct = loop();
    }

If @a is a 2x3 array and @b is a 3x4 array, then @tensorproduct would
become a 2x3x3x4 array.  If the expression had been
C<$b[[|i,|j]]*$c[[|k,|l]]> instead, then @tensorproduct would have
become a 3x4x2x3 array.

=head3 Void Context

If used in a void context, the implied loop does not create a
temporary array, but rather expects the side effects (if any) to do
the real work.

    # In void context
    $c[[|i,|j,|k,|l]] = $a[[|i,|j]]*$b[[|k,|l]]; # compare with loop 
above

    $t[[|i,|j]] = $a[[|j,|i]];
    $product[[|i, |j]] = $a[[|i,|k]] * $b [[|k,|j]];
    $dotproduct += $a[|i] * $b[|i];  # tmtowdi
    $stack = $a[|i]; # $stack->STORE is TIEd to $stack->push($)

As the two example shows, assignment to a scalar is assumed to be in
scalar context, not list, and so the scope of the looping indices is
expanded to include  the LHS of the assignment.

=head3 User-defined ranges for looping indices

Looping indices can also have a user-defined range, using the syntax
"|i=@list".  This would create a bounding loop of the form "for |i
(@list) {...}" around the expression.  When no list explicitly given,
|i acts as if (0..) was the specified list.  The range for a looping
index can be specified anywhere in the statement, but only one range
can be given per looping index.

    # take the upper-triangle of a square array
    $uppertri[[|i,|j]] = 0;  # clear array to begin with
    $uppertri[[|i,|j]] = $a[[|i,|j=(0..|i)]];

Strictly speaking, |i does not take on all the values in @list (or 
(0..)),
but rather solely those values which lead to valid (in bounds) array
indices.

    # "unriffle" two arrays.  There are probably better ways to do this
    ($a[|i], $b[|i]) = ($c[ 2*|i ], $c[ 2*|i + 1 ]);
    $average[|i] = ($a[|i-1] + $a[|i] + $a[|i+1])/3;

In the first example, |i will never take on values that would cause
2*|i+1 to be out of bounds for $c.

=head3 Order of loop nesting

As mentioned above, using multiple looping indices  will cause a
nested loop.  The order of nesting the loops is not specified here,
but any interdependencies among the indices must be satisfied.

In most of the examples above, the loops caused by the multiple
iterators are independent.  However, in the "upper triangle" example,
since the range of |j depends on the current value of |i, |i must be
the "outer loop".

=head3 RFC205 Cartesian Products with iterators

RFC 205 proposes the use of ';' as an operator to generate Cartesian
products of list as a useful notation for generating complex array
slices.  An obvious question is how this notation works with
iterators.

If an array index contains a Cartesian product that uses one or more
looping indices, the Cartesian product is converted into an array
reference containing a looping index for every non-singleton list in
the Cartesian product before the loop scope is determined.

Example:  The expression

    $a[|i;(0..5)]

is initially converted to the form:

    $a[[|i,|_j=(0..5)]]

where C<|_j> is a newly created anonymous looping index.

This conversion is based on the "canonical" form of the RFC205
notation (i.e., without empty lists or leading or trailing "*" lists:

    $a[|i;]

is canonically

    $a[|i;(0..)]

which is converted to

    $a[[|i,|_j=(0..)]]

which is simplified to

    $a[[|i,|_j]]

The use of a leading or trailing * proceeds similarly:

    # Generalized tensor multiplication:
    @product = $a[|a0;*] * $b[|b0;*];

is canonically (assuming 4-dimensional tensors on the RHS):

    #generalized tensor multiplication (4-dim tensors)
    @product = $a[|a0;(0..);(0..);(0..)] * $b[|b0;(0..);(0..);(0..)];

which is converted to

    # Generalized tensor multiplication (4-dim example)
    @product = $a[[|a0,|_a1,|_a2,|_a3]] * $b[[|b0;|b1;|b2;|b3];

The use of ; also makes it easier to express long lists of looping
indices.  $array[|i;|j;|k] is equivilant to $array[[|i,|j,|k]], but
doesn't use as much punctuation.

=head3 |@ and |@foo as groups of iterators.

The RFC205 notation above is useful, but it does have some
limitations.  For instance, it is impossible to use it (or at least,
it requires more clever bits than I possess) to sum along the first
dimension of an arbitrary dimensioned array:

    # Do something completely bizzare
    $reduced[|j;*]  += $a[|i;|j;*];
    # which might be equivilant to:
    #
    # $reduced[|j;|_k;|_l;|_m] += $a[|i;|j;|_n];

In order to do this properly, there needs to be a way to state that
the iterators on the LHS are the same as the iterators on the RHS.
This is possible if the dimensions are known in advance:

    # sum along first dimension of 3-dim array, yielding 2-dim array
    $reduced[|j;|k] += $a[|i;|j;|k];

The notation |@ and |@foo can be used to get around that problem.  |@
is equivilant to a leading or trailing * in RFC205 notation, in that
it generates a group of anonymous looping indices.  |@foo names the 
group of
generated anonymous looping so that they can (as a group) be used in
multiple places:

    # sum along first dimension of n dim array, yielding n-1 dim array
    $reduced[|@f] += $a[|i;|@f];

    # perform tensor addition
    @sum = $a[|@a] + $b[|@a];
 
    # perform tensor multiplication
    @product = $a[|@a] * $b[|@b];
    # or
    @product = $a[|@] * $b[|@];

Each use of |@ is independent, so each factor in C<$a[|@] * $b[|@]> is
using a different set of anonymous looping indices.

|@ and |@foo are allowed only within array indices.
Because |@ and |@foo are of variable length, it must be possible to
determine from the expressions how large they are, knowing the shape
of the arrays.  Practically, this means that |@ can be used only once
in an index, because it is impossible to tell which dimension C<|i> is
in an expression like C<$a[|@;|i;|@]>.  An index may use multiple
named index groups, as long as other parts of the statement provide
enough context to provide the bounds of the index groups.

As yet another way to take a tensor product:

    $product[|@a;|@b] = $factor1[|@a] * $factor2[|@b];

demonstrates this.  Because the shape of @factor1 and @factor2
determined the number and bounds of the looping index groups |@a and
|@b, both can be used within the index for @product.

=head3 Looping indices outside of array indices.

Looping indices aren't restricted to being used solely as array
indices, as the "unriffle" example showed.  But each looping index has
to be used in an array index for at least one array.

    # find $nth triangular number
    my $triangle = 0;
    $triangle += |i=(0..$n);   # compile-time error: |i not used as index

    # Fill a multiplication table
    my @multtable : shape(12,12);
    $multtable[|i;|j] = |i*|j; # OK

=head Lazy Evaluation

Assuming that lazy evaluation is used in other parts of Perl6, it
would be nice if these loops could also be evaluated lazily.

In list context, this could be done by creating an anonymous function
to evaluate the looped expression at the desired indices:

    $a[|i]*$b[|j]  # in list context
    # becomes

    sub { my ($i,$j) = @_; $a[$i]*$b[$j]; }

This anonymous function can be TIEd to the resulting anonymous array,
so all array lookups would invoke this function.  Since TIEing is
supposed to be improved in Perl6, this would be a reasonable way to do
it.

If other lazy evaluation mechanisms work in Perl6, they could be used
instead.

I am uncertain if lazy evaluation makes sense in void context.

=head2 Examples:

    $t[[|i,|j]] = $a[[|j,|i]];  # transpose 2-d @a

would be equivilant to:

    {
      my $i; my $j
      for $i (0..) { # last if out-of-bounds
        for $j (0..) { # last if out-of-bounds
          $t[[$i,$j]] = $a[[$j,$i]];
        }
      }
    }

This notation also allows (as a specific use) an alternative notation
to the RFC 82 element-wise syntax.

   #compute pairwise sum, pairwise product, pairwise difference...
   @sum = @a[[|i,|j,|k,|l]] + @b[[|i;|j;|k;|l]];  # RFC82: @sum  = @a + @b
   @prod= @a[[|i,|j,|k,|l]] * @b[[|i;|j;|k;|l]];  #        @prod = @a * @b
   @diff= @a[[|i,|j,|k,|l]] - @b[[|i;|j;|k;|l]];  #        @diff = @a - @b

RFC 82 syntax is simpler, but this is perl, so There Is More Than One
Way To Do It, as tensor multiplication demonstrates.

Note that if the "Lazy Evaluation" schema mentioned above is adopted,
then these sums, products, and differences could be automagically lazy
as well.

=head1 IMPLEMENTATION

The simplest implementation would be to convert at compile-time (or
parse time) void-context looped iterator scopes to loops analogous to
the above examples, and convert list-context looped iterator scopes to
valued do-blocks or invoked anonymous subroutines:

    $dotproduct = reduce {^_+^_},0,$a[|i]*$b[|i];

    # would be transformed into

    $dotproduct = reduce {^_+^_},0,
       sub { my $i; my @r;
             for $i (0..min($#a,$#b)) {
               $r[$i] = $a[$i] * $b[$i];
             }
             return @r;
           }->();

A more sophisticated, preferred, implementation would take advantage
of the static, known nature of the data to create a highly optimized
version of the loop.

Possible optimizations include: Common sub-expression elimintation,
encoding internally to some non-interpreted looping construct, etc. If
special 'numeric functions' are provided in Perl, then expressions
with just unoverloaded operators and numeric functions could be
optimised into tight compiled loops, as occurs for example with
fromfunction() and ufuncs in Numeric Python:

    http://starship.python.net/~da/numtut/array.html#SEC8
    http://starship.python.net/~da/numtut/array.html#SEC13

For lazy evaluation, the value of the expression at any given set of
indices is easy to calculate. However the lazy evaluation mechanism 
works,
it can use this property to calculate the appropriate values.

=head1 REFERENCES

RFC 203: Notation for declaring and creating arrays

RFC 204: Notation for indexing arrays with an LOL as an index

RFC 205: New operator ';' for creating array slices
RFC 207 (v3) Arrays: Efficient Array Loops

Reply via email to