This and other RFCs are available on the web at http://dev.perl.org/rfc/ =head1 TITLE Arrays: Efficient Array Loops =head1 VERSION Maintainer: Buddha Buck <[EMAIL PROTECTED]> Date: 8 Sep 2000 Last Modified: 30 Sep 2000 Mailing List: [EMAIL PROTECTED] Number: 207 Version: 3 Status: Frozen =head1 ABSTRACT This RFC proposes a notation for creating efficient implicit loops over multidimensional arrays. It introduces the notation |i for an index iterator for arrays, allowing temporary multidimensional arrays to be created on the fly. =head1 CHANGES Version 3 freezes this RFC. Except for some requests for clarification from Jeremy Howard, there was no additional discussion since Version 2. Version 3 clarifies the use of looping indices in "list context", and has a rewritten section concerning looping indices and RFC205 Cartesian products. It also introduces |@ and |@foo notation for flexibly dealing with looping indices for arrays of unknown dimension. Version 2 is an almost complete rewrite, based on concerns and enhancements that arose in discussion. I was also unhappy with the language in the original. The primary semantic changes are: =over 4 =item * The syntax formerly known as "iterators" is now known as "looping indices" =item * |i is interpreted as "scalar-like", rather than "list-like". The examples are now (to the best of my knowledge) consistant with that interpretation. =item * The scoping mechanism is enhanced. It is now clearer what the iterators mean in the various contexts within perl. =back I believe the current version is much more understandable. =head1 DESCRIPTION Consider the problem of multiplying together two 2-dimensional tensors. In standard notation, this would be symbolized by Cijkl = Aij * Bkl where the letters i, j, k and l are written as subscripts and represent the indices of their respective tensors. To accomplish that same multiplication in Perl, one needs to write (using RFC 204 notation): for my $i = (0..2) { #assuming 3x3 tensors for my $j = (0..2) { for my $k = (0..2) { for my $l = (0..2) { $c[[$i,$j,$k,$l]] = $a[[$i,$j]] * $b[[$k,$l]] ; } } } } While this is not particularly difficult, it is clumsy, and slow. It could be hidden in a subroutine, but then one lacks the flexibility to write similar equations on the fly. For instance, transposition is easily written as Tij = Aji. Such notation is assumed to be true for all legal values of i and j -- in effect, it loops over all legal values of i and j to compute the results. Furthermore, such notation along with appropriate restrictions on use could allow Perl to create optimised loops. This RFC proposes a similar notation, using |name as a notation for looping indices like i and j above. =head2 Details This RFC proposes that the entire The prefix | signifies that |i is an "looping index" within the statement |i appears in. More than one looping index can appear within one statement. The "scope" of a set of looping indices is the smallest subexpression in the statement that contains all the looping indices in the statement. If the scope is not in list or void context, the scope is expanded until it is in list or void context. The scope of a set of looping indices will never exceed one statement. A set of looping indices generates an implicit loop (or nested loops) surrounding the scope of the indices. =head3 List Context If used in list context, the implied loop creates a temporary array whose dimensions and bounds are determined by the number and range of the looping indices. # In list context $dotproduct = reduce ^_+^_, 0, $a[|i] * $b[|i]; @tensorproduct = $a[[|i,|j]] * $b[[|k,|l]]; When there are multiple looping indices in an expression in list context, the order of the indices in the temporary array is determined by scanning the expression left-to-right. As a specific example, the expression @tensorproduct = $a[[|i,|j]] * $b[[|k,|l]]; would generate code somewhat equivilant to: { my sub loop { my @result :bounds(@#a,@#b); for my $i (0..$#a[0]) { for my $j (0..$#a[1]) { for my $k (0..$#b[0]) { for my $l (0..$#b[1] { $result[[$i,$j,$k,$l]] = $a[[$i,$j]]*$b[[$k,$l]]; } } } } } @tensorproduct = loop(); } If @a is a 2x3 array and @b is a 3x4 array, then @tensorproduct would become a 2x3x3x4 array. If the expression had been C<$b[[|i,|j]]*$c[[|k,|l]]> instead, then @tensorproduct would have become a 3x4x2x3 array. =head3 Void Context If used in a void context, the implied loop does not create a temporary array, but rather expects the side effects (if any) to do the real work. # In void context $c[[|i,|j,|k,|l]] = $a[[|i,|j]]*$b[[|k,|l]]; # compare with loop above $t[[|i,|j]] = $a[[|j,|i]]; $product[[|i, |j]] = $a[[|i,|k]] * $b [[|k,|j]]; $dotproduct += $a[|i] * $b[|i]; # tmtowdi $stack = $a[|i]; # $stack->STORE is TIEd to $stack->push($) As the two example shows, assignment to a scalar is assumed to be in scalar context, not list, and so the scope of the looping indices is expanded to include the LHS of the assignment. =head3 User-defined ranges for looping indices Looping indices can also have a user-defined range, using the syntax "|i=@list". This would create a bounding loop of the form "for |i (@list) {...}" around the expression. When no list explicitly given, |i acts as if (0..) was the specified list. The range for a looping index can be specified anywhere in the statement, but only one range can be given per looping index. # take the upper-triangle of a square array $uppertri[[|i,|j]] = 0; # clear array to begin with $uppertri[[|i,|j]] = $a[[|i,|j=(0..|i)]]; Strictly speaking, |i does not take on all the values in @list (or (0..)), but rather solely those values which lead to valid (in bounds) array indices. # "unriffle" two arrays. There are probably better ways to do this ($a[|i], $b[|i]) = ($c[ 2*|i ], $c[ 2*|i + 1 ]); $average[|i] = ($a[|i-1] + $a[|i] + $a[|i+1])/3; In the first example, |i will never take on values that would cause 2*|i+1 to be out of bounds for $c. =head3 Order of loop nesting As mentioned above, using multiple looping indices will cause a nested loop. The order of nesting the loops is not specified here, but any interdependencies among the indices must be satisfied. In most of the examples above, the loops caused by the multiple iterators are independent. However, in the "upper triangle" example, since the range of |j depends on the current value of |i, |i must be the "outer loop". =head3 RFC205 Cartesian Products with iterators RFC 205 proposes the use of ';' as an operator to generate Cartesian products of list as a useful notation for generating complex array slices. An obvious question is how this notation works with iterators. If an array index contains a Cartesian product that uses one or more looping indices, the Cartesian product is converted into an array reference containing a looping index for every non-singleton list in the Cartesian product before the loop scope is determined. Example: The expression $a[|i;(0..5)] is initially converted to the form: $a[[|i,|_j=(0..5)]] where C<|_j> is a newly created anonymous looping index. This conversion is based on the "canonical" form of the RFC205 notation (i.e., without empty lists or leading or trailing "*" lists: $a[|i;] is canonically $a[|i;(0..)] which is converted to $a[[|i,|_j=(0..)]] which is simplified to $a[[|i,|_j]] The use of a leading or trailing * proceeds similarly: # Generalized tensor multiplication: @product = $a[|a0;*] * $b[|b0;*]; is canonically (assuming 4-dimensional tensors on the RHS): #generalized tensor multiplication (4-dim tensors) @product = $a[|a0;(0..);(0..);(0..)] * $b[|b0;(0..);(0..);(0..)]; which is converted to # Generalized tensor multiplication (4-dim example) @product = $a[[|a0,|_a1,|_a2,|_a3]] * $b[[|b0;|b1;|b2;|b3]; The use of ; also makes it easier to express long lists of looping indices. $array[|i;|j;|k] is equivilant to $array[[|i,|j,|k]], but doesn't use as much punctuation. =head3 |@ and |@foo as groups of iterators. The RFC205 notation above is useful, but it does have some limitations. For instance, it is impossible to use it (or at least, it requires more clever bits than I possess) to sum along the first dimension of an arbitrary dimensioned array: # Do something completely bizzare $reduced[|j;*] += $a[|i;|j;*]; # which might be equivilant to: # # $reduced[|j;|_k;|_l;|_m] += $a[|i;|j;|_n]; In order to do this properly, there needs to be a way to state that the iterators on the LHS are the same as the iterators on the RHS. This is possible if the dimensions are known in advance: # sum along first dimension of 3-dim array, yielding 2-dim array $reduced[|j;|k] += $a[|i;|j;|k]; The notation |@ and |@foo can be used to get around that problem. |@ is equivilant to a leading or trailing * in RFC205 notation, in that it generates a group of anonymous looping indices. |@foo names the group of generated anonymous looping so that they can (as a group) be used in multiple places: # sum along first dimension of n dim array, yielding n-1 dim array $reduced[|@f] += $a[|i;|@f]; # perform tensor addition @sum = $a[|@a] + $b[|@a]; # perform tensor multiplication @product = $a[|@a] * $b[|@b]; # or @product = $a[|@] * $b[|@]; Each use of |@ is independent, so each factor in C<$a[|@] * $b[|@]> is using a different set of anonymous looping indices. |@ and |@foo are allowed only within array indices. Because |@ and |@foo are of variable length, it must be possible to determine from the expressions how large they are, knowing the shape of the arrays. Practically, this means that |@ can be used only once in an index, because it is impossible to tell which dimension C<|i> is in an expression like C<$a[|@;|i;|@]>. An index may use multiple named index groups, as long as other parts of the statement provide enough context to provide the bounds of the index groups. As yet another way to take a tensor product: $product[|@a;|@b] = $factor1[|@a] * $factor2[|@b]; demonstrates this. Because the shape of @factor1 and @factor2 determined the number and bounds of the looping index groups |@a and |@b, both can be used within the index for @product. =head3 Looping indices outside of array indices. Looping indices aren't restricted to being used solely as array indices, as the "unriffle" example showed. But each looping index has to be used in an array index for at least one array. # find $nth triangular number my $triangle = 0; $triangle += |i=(0..$n); # compile-time error: |i not used as index # Fill a multiplication table my @multtable : shape(12,12); $multtable[|i;|j] = |i*|j; # OK =head Lazy Evaluation Assuming that lazy evaluation is used in other parts of Perl6, it would be nice if these loops could also be evaluated lazily. In list context, this could be done by creating an anonymous function to evaluate the looped expression at the desired indices: $a[|i]*$b[|j] # in list context # becomes sub { my ($i,$j) = @_; $a[$i]*$b[$j]; } This anonymous function can be TIEd to the resulting anonymous array, so all array lookups would invoke this function. Since TIEing is supposed to be improved in Perl6, this would be a reasonable way to do it. If other lazy evaluation mechanisms work in Perl6, they could be used instead. I am uncertain if lazy evaluation makes sense in void context. =head2 Examples: $t[[|i,|j]] = $a[[|j,|i]]; # transpose 2-d @a would be equivilant to: { my $i; my $j for $i (0..) { # last if out-of-bounds for $j (0..) { # last if out-of-bounds $t[[$i,$j]] = $a[[$j,$i]]; } } } This notation also allows (as a specific use) an alternative notation to the RFC 82 element-wise syntax. #compute pairwise sum, pairwise product, pairwise difference... @sum = @a[[|i,|j,|k,|l]] + @b[[|i;|j;|k;|l]]; # RFC82: @sum = @a + @b @prod= @a[[|i,|j,|k,|l]] * @b[[|i;|j;|k;|l]]; # @prod = @a * @b @diff= @a[[|i,|j,|k,|l]] - @b[[|i;|j;|k;|l]]; # @diff = @a - @b RFC 82 syntax is simpler, but this is perl, so There Is More Than One Way To Do It, as tensor multiplication demonstrates. Note that if the "Lazy Evaluation" schema mentioned above is adopted, then these sums, products, and differences could be automagically lazy as well. =head1 IMPLEMENTATION The simplest implementation would be to convert at compile-time (or parse time) void-context looped iterator scopes to loops analogous to the above examples, and convert list-context looped iterator scopes to valued do-blocks or invoked anonymous subroutines: $dotproduct = reduce {^_+^_},0,$a[|i]*$b[|i]; # would be transformed into $dotproduct = reduce {^_+^_},0, sub { my $i; my @r; for $i (0..min($#a,$#b)) { $r[$i] = $a[$i] * $b[$i]; } return @r; }->(); A more sophisticated, preferred, implementation would take advantage of the static, known nature of the data to create a highly optimized version of the loop. Possible optimizations include: Common sub-expression elimintation, encoding internally to some non-interpreted looping construct, etc. If special 'numeric functions' are provided in Perl, then expressions with just unoverloaded operators and numeric functions could be optimised into tight compiled loops, as occurs for example with fromfunction() and ufuncs in Numeric Python: http://starship.python.net/~da/numtut/array.html#SEC8 http://starship.python.net/~da/numtut/array.html#SEC13 For lazy evaluation, the value of the expression at any given set of indices is easy to calculate. However the lazy evaluation mechanism works, it can use this property to calculate the appropriate values. =head1 REFERENCES RFC 203: Notation for declaring and creating arrays RFC 204: Notation for indexing arrays with an LOL as an index RFC 205: New operator ';' for creating array slices