This and other RFCs are available on the web at http://dev.perl.org/rfc/ =head1 TITLE Efficient numerics with perl =head1 VERSION Maintainer: pdl-porters team <[EMAIL PROTECTED]> Date: 16 August 2000 Version: 1 Mailing List: [EMAIL PROTECTED] Number: 116 =head1 ABSTRACT This RFC describes basic requirements of a numerical object/variable type for perl. A reference implementation (that could almost certainly do with some syntactical sweetening) is provided. It is currently available as the PDL distribution. A few key concepts of the reference implementation will be described. Any efforts to get numerical support into the perl core should provide means to either easily hook a numerical core into perl to achieve this or supply the features at the core design level. Current implementation aspects are sketched. It is also the pdl-porters' contribution to I<overstuff everybodies brains> [1]. =head1 DESCRIPTION PDL provides an object class to compactly store and speedily manipulate the large N-dimensional data arrays which are the bread and butter of scientific computing. Apart from the object representation itself PDL implements three key concepts which its users have found critical for efficient and enjoyable numerical computing with perl: =over 3 =item * slices or I<smart references> =item * PDL threading -- automated optimized looping over elementary operations =item * Code generation and/or algorithmic abstraction =back =head2 Compact typed multi-dimensional arrays The basic data object of PDL is a zero- to multidimensional array of numbers, all of the same data type (for efficiency reasons, it is desirable to store the numbers in the native binary format and consecutively in memory instead of having a Perl SV for each one). Perl6 could come up with a representation for such types as part of its core (or be sufficiently flexible in its SV design). It is also critical to be able to specify a range of types, for example PDL offers support for single and double precision (float/double) as well as a variety of integer types ('byte', 'ushort', 'short', 'long'). A future implementation should support a complex type at the core level. In the current implemenetation this is difficult due to parser limitations of the code generator (C<PDL::PP>) and C's lack of an intrinsic complex type. =head2 Smart references and transformations: slicing and dicing During data analysis it is often desirable to be able to refer to parts of the data, for example I<every other number in this vector> without allocating new storage: if the original vector is 100Mb long, we don't want to allocate and copy the data for another 50Mb vector. For this purpose PDL supports "virtual piddles", which are pointers to parts of other piddles (the implementation is transparent, virtual piddles appearing just as any other piddle object to the user). $a = zeroes(1000,1000,100); # more than 100MB worth of data $b = $a->slice('0:-1:2'); # virtual piddle of every 2nd element $b *= 2; # changes the elements in $a C<slice> is just one example of this type of operation. C<$b> contains only information how to access C<$a>'s data rather than a copy. Similarly, a transposed object with the first two dimensions exchanged is easily obtained: $c = $a->xchg(0,1); # exchange dimension 0 and 1 Again, C<$c> only contains information how to access C<$a>'s data. Similarly we can create an index: $i = pdl(1,9,27); # a number of indices $c = $a->index($i); Changing C<$c> changes C<$a> (and vice versa). However in this case PDL makes no nice optimisation and manipulating C<$c> uses memory for temps even though the changes are propagated back. The crucial point is that C<$c> and C<$a> are tightly linked objects that propagate changes in either to the other (in a way similar to dataflow). Find out more about this in the implementation section. An optimisation is in principle possible, and would be useful for all array types not just PDL ones. Similar optimizations for sparse and non-rectangular arrays would be desirable, this would be especially useful in a perl for numerics. =head2 Signatures: threading over elementary operations In a large number of practical situations encountered in scientific numerical programming fundamental (simple) operations are iterated over many (small) chunks of (multi-dimensional) data. A very simple examples is addition $c = $a + $b of two multi-dimensional arrays (PDL objects). The fundamental operation is the addition of two scalar elements. This fundamental operation is just iterated over all pairs of elements of C<$a> and C<$b> and results placed element by element into C<$c>. Based on this observation PDL introduced the notion of I<PDL threading> to implement such functionality in a flexible and efficient way: it provides tools (C<PDL::PP>) to define those optimized innermost loops (the fundamental operations) and establishes a framework to efficiently execute loops over these fundamental operations automatically as prescribed by the dimensionality of the input data. Most importantly, those loops over the innermost fundamental operations can generally be run in parallel (i.e. run in separate I<threads>). PDL currently has experimental support to exploit this observation. Two examples using the inner product as the fundamental operation may serve to illustrate the idea. To convert an RGB image into a greyscale image an inner product with a triple of weights with each RGB triple is performed: $grey = inner $rgb, pdl(0.301,0.586,0.113); The key is here that regardless of the number of trailing dimensions of the RGB object C<$rgb> (which has a shape [3,.....]) the fundamental operation (the inner product) is speedily iterated over all triples of the data. So C<$rgb> could be a single RGB pixel (shape [3]), a line of RGB data (shape [3,x]), an RGB image (shape [3,x,y]), a stack of RGB images ([3,x,y,z]) and so on. Each function that can be used in such a way in PDL has a I<signature> that symbolically states dimensions consumed and generated in a fundamental operation. Illustrating this by the two functions we have used above the signatures of addition and inner product are a(); b(); [o] c() # addition, we use ';'s as separators a(n); b(n); [o] c() # inner product In this symbolic notation C<[o]> flags arguments which are created (output arguments). The symbols in parentheses name any dimensions of the variables involved. Empty parentheses refer to scalars (0-dim). So addition takes two scalar input arguments and generates one output scalar. The inner product consumes, as expected, two one-dimensional input args (symbolized by C<a(n)> and C<b(n)>) to produce a scalar output (C<[o] c()>). To complete the picture PDL has rules how these fundamental operations consume chunks of data of the input args to produce a muti-dimensional result of chunks of output data (the possibly multi-dimensional result piddle). The user decides exactly how those operations are iterated over his data by manipulating the dimensions of the input data appropriately. That is illustrated in the promised second example. Those familar with linear algebra will know that a matrix product consists of the result of inner products between all pairs of row and column vectors of two input matrices. Not surprisingly we can therefore use the inner product to compute the matrix product of two PDL objects C<$a> and C<$b>: $c = inner $a->dummy(1), $b->xchg(0,1)->dummy(2); C<dummy> and C<xchg> are dimension manipulation functions (of the smart reference type) that are used to change the dimensionality of the input piddles so that the inner product is iterated over the desired pairs of sub-vectors and placed into the resulting piddle C<$c>. We do not expect the casual reader to follow the details of this example but to appreciate the underlying idea. The essence of I<PDL threading> is here that (1) it is a useful feature in I<many> practical situations (2) can be very efficiently implemented and (3) exposes parallelism in the underlying algorithm. One could go as far as saying that the ideas underlying I<threading> are PDL's main programming paradigm and the user is expected to rethink his approach accordingly for optimal performance. =head2 Defining new PDL functions -- Glue code generation One of the foremost design criteria was to implement a small set of core structures, routines and code generators that would depend heavily on each other. The rest of the package (implementing the real numerical functionality) should only depend on the external interface of these core functions. Encapsulation can be achieved by consciously looking for code that would be repeated in subroutines and either making a new subroutine or a code generator do the job for the programmer. This way, when the repeated code suddenly needs to be different, there is no painstaking need of changing all instances of it - simply changing the subroutine or code generator will do it. We emphasize this because this kind of thinking seems to be forgotten all too often. PDL introduced the code generator PDL::PP to create subroutines from concise descriptions to allow the programmer to focus on the algorithmic side of things. In many ways PDL::PP can be thought of as the XS of PDL: it is complex to learn, powerful, allows easy interfacing at the C-level and is, once mastered, straightforward to use. Beyond that using PDL::PP ensures that new routines follow uniform calling conventions and support PDL threading, slicing, etc. In the current implementation it suffers from similar problems as XS in perl5: the syntax is strange and hard to learn, it is also feature incomplete (i.e. certain things can not be done right now). However, it implements a number of ideas that we find are absolutely crucial to pertain and develop further in a numerically friendly perl: algorithmic abstraction, encapsulation and provision of a tool for easy interfacing to the hundreds of existing numerical, graphics and IO libraries that are to be tapped into. As an illustration for the abstraction that is achieved, here is the definition for an inner product function. This definition tells the code generator (C<PDL::PP>) to generate the appropriate XS code that will then be compiled as part of a new loadable PDL module: pp_def( 'inner', Pars => 'a(n); b(n); [o]c(); ', # the signature, see above Code => 'double tmp = 0; loop(n) %{ tmp += $a() * $b(); %} $c() = tmp;' ); The above code specifies firstly that the two first vectors must have the same first dimension and the third vector will not have that dimension. Inside the loop it is possible to use C<$a()> without specifying the index because we have specified that we are looping over the dimension called C<n>. Notice that we do not do any checking that C<$a> and C<$b> have the same sized first dimension - we just tell PDL::PP that we expect this to be so and it checks whether any data passed to the C<inner> routine obeys this assumption and C<croak>s if it does not. After installation the user can then use $c = inner $a,$b; to compute the inner product of two piddles. The code clearly illustrates the abstraction achieved, making no reference to an actual data type or how the vectors are stored. This kind of abstraction is similar to the ideas used in the C++ Standard Template Library. As a bonus, the use of PP means a lot of the tedious housekeeping of programming can be automated. Almost all index troubles vanish when writing and using these subroutines: all the necessary index checks are done automatically by the code generator. The code generator also ensures that calling the function with arguments that have more dimensions than the function expects works as well - the function is simply repeatedly called in a loop over all these dimensions (I<threaded> over the dimensions in PDL parlance), providing a fast way of performing many similar operations in a row without going back to the Perl level (with the latest versions of PDL, it is possible to use multiple OS-level threads for these loops, enabling an easy way of gaining from a multiprocessor computer). This I<threading> allows many algorithms to be expressed concisely and PDL scripts rarely have cause for loops over elements on the Perl level. In a future perl this functionality might be integrated with a perl (JIT?) compiler. Code could be specified in perl(-like?) syntax and compiled into an optimized C level transformation as required. If this sounds rather vague it is not by accident -- we still lack a clear idea how to optimally achieve this goal. A nice perlish syntax to replace the PP gibberish should certainly be the starting point. =head1 IMPLEMENTATION The PDL core is currently suffering from similar problems as perl 5: it works but is poorly documented and many hacks make it increasingly difficult to modify and maintain. But it works C<;)>. In any case, some of the perl6 guys have asked for a rundown of the principal guts of the implementation. Here it goes... =head2 Compact, typed multi-dimensional arrays On the Perl side, an opaque scalar reference turned out to be the most practical representation: Perl5 allows overloading of operators so C<$c=$a+$b> can mean "add the objects (here representing matrices) C<$a> and C<$b> and store the result in the scalar C<$c> with whatever type that the result has". As with any perl object, new objects can be easily derived from piddles for specialised purposes. In Perl6 it may be desirable to use C<@syntax> for all arrays - whether compact, implementing smart references, or not. The use of C<$a> for PDL arrays and C<@a> for normal perl arrays B<is> confusing, we would like to see some unification. Ideally perl arrays would support some, or all, of the PDL features whether they are multi-dimensional or not. =over 5 =item What is in a Piddle. The PDL structure is defined in C<pdl.h> struct pdl { unsigned long magicno; /* Always stores PDL_MAGICNO as a sanity check */ /* This is first so most pointer accesses to wrong type are caught */ int state; /* What's in this pdl */ pdl_trans *trans; /* Opaque pointer to internals of transformation from parent */ pdl_vaffine *vafftrans; void* sv; /* (optional) pointer back to original sv. ALWAYS check for non-null before use. We cannot inc refcnt on this one or we'd never get destroyed */ void *datasv; /* Pointer to SV containing data. Refcnt inced */ void *data; /* Null: no data alloced for this one */ int nvals; /* How many values allocated */ int datatype; PDL_Long *dims; /* Array of data dimensions */ PDL_Long *dimincs; /* Array of data default increments */ short ndims; /* Number of data dimensions */ unsigned char *threadids; /* Starting index of the thread index set n */ unsigned char nthreadids; pdl *progenitor; /* I'm in a mutated family. make_physical_now must copy me to the new generation. */ pdl *future_me; /* I'm the "then" pdl and this is my "now" (or more modern version, anyway */ pdl_children children; short living_for; /* perl side not referenced; delete me when */ PDL_Long def_dims[PDL_NDIMS]; /* Preallocated space for efficiency */ PDL_Long def_dimincs[PDL_NDIMS]; /* Preallocated space for efficiency */ unsigned char def_threadids[PDL_NTHREADIDS]; struct pdl_magic *magic; void *hdrsv; /* "header", settable from outside */ }; This is quite a structure for just storing some data in - what is going on? =item Data storage We are going to start with some of the simpler members: first of all, there is the member void *datasv; which is really a pointer to a Perl SV structure (C<SV *>). The SV is expected to be representing a string, in which the data of the piddle is stored in a tightly packed form. This pointer counts as a reference to the SV so the reference count has been incremented when the C<SV *> was placed here. This pointer is allowed to have the value C<NULL> which means that there is no Perl SV for this data - for instance, the data might be allocated by a C<mmap> operation. Note the use of an SV* was purely for convenience, it allows easy transformation of packed data from files into piddles. Other implementations are not excluded. The actual pointer to data is stored in the member void *data; which contains a pointer to a memory area with space for int nvals; data items of the data type of this piddle. The data type of the data is stored in the variable int datatype; the values for this member are given in the enum C<pdl_datatypes>, discussed above. =item Dimensions The number of dimensions in the piddle is given by the member int ndims; which shows how many entries there are in the arrays PDL_Long *dims; PDL_Long *dimincs; These arrays are intimately related: C<dims> gives the sizes of the dimensions and C<dimincs> is always calculated by the code int inc = 1; for(i=0; i<it->ndims; i++) { it->dimincs[i] = inc; inc *= it->dims[i]; } in the routine C<pdl_resize_defaultincs> in C<pdlapi.c>. What this means is that the dimincs can be used to calculate the offset by code like int offs = 0; for(i=0; i<it->ndims; i++) { offs += it->dimincs[i] * index[i]; } but this is not always the right thing to do, at least without checking for certain things first. =item Default storage Since the vast majority of piddles don't have more than 6 dimensions, it is more efficient to have default storage for the dimensions and dimincs inside the PDL struct. PDL_Long def_dims[PDL_NDIMS]; PDL_Long def_dimincs[PDL_NDIMS]; The C<dims> and C<dimincs> may be set to point to the beginning of these arrays if C<ndims> is smaller than or equal to the compile-time constant C<PDL_NDIMS>. This is important to note when freeing a piddle struct. The same applies for the threadids: unsigned char def_threadids[PDL_NTHREADIDS]; =item Magic It is possible to attach magic to piddles, much like Perl's own magic mechanism. If the member pointer struct pdl_magic *magic; is nonzero, the PDL has some magic attached to it. The implementation of magic is discussed below. =item State One of the first members of the structure is int state; The possible flags and their meanings are given in C<pdl.h>. =item Transformations and virtual affine transformations As you should already know, piddles often carry information about where they come from. For example, the code $b = $a->slice("2:5"); $b .= 1; will alter $a. This information is stored in the members pdl_trans *trans; pdl_vaffine *vafftrans; C<pdl_trans> and C<pdl_vaffine> are structures that we will look at in more detail below. =item The Perl SVs When piddles are referred to through Perl SVs, we store an additional reference to it in the member void* sv; in order to be able to return a reference to the user when he wants to inspect the transformation structure on the Perl side. Also, we store an opaque void *hdrsv; which is just for use by the user to hook up arbitrary data with this sv. =back =head2 Smart references and transformations: slicing and dicing Smart references and most other fundamental functions operating on piddles are implemented via I<transformations> which are represented by the type C<pdl_trans> in PDL. A transformation links input and output piddles and contains all the infrastructure that defines how =over 4 =item * output piddles are obtained from input piddles =item * changes in smartly linked output piddles (e.g. the I<child> of a sliced I<parent> piddle) are flown back to the input piddle in transformations where this is supported (the most often used example being C<slice> here). =item * datatype and size of output piddles that need to be created are obtained =back In general, executing a PDL function on a group of piddles results in creation of a transformation of the requested type that links all input and output arguments (at least those that are piddles). In PDL functions that support data flow between input and output args (e.g. C<slice>, C<index>) this transformation links I<parent> (input) and I<child> (output) piddles permanently until either the link is explicitly broken by user request (C<sever> at the perl level) or all parents and childen have been destroyed. In those cases the transformation is lazy-evaluated, e.g. only executed when piddle values are actually accessed. In I<non-flowing> functions, for example addition (C<+>) and inner products (C<inner>), the transformation is installed just as in flowing functions but then the transformation is immediately executed and destroyed (breaking the link between input and output args) before the function returns. It should be noted that the close link between input and output args of a flowing function requires that piddle objects that are linked in such a way be kept alive beyond the point where they have gone out of scope from the point of view of perl: $a = zeroes(20); $b = $a->slice('2:4'); undef $a; # last reference to $a is now destroyed Although $a should now be destroyed according to perl's rules the underlying C<*pdl> must actually only be freed when C<$b> also goes out of scope (since it still references internally some of C<$a>'s data). This example demonstrates that such a dataflow paradigm between PDL objects necessitates a special destruction algorithm that takes the links between piddles into account and couples the lifespan of those objects. The non-trivial algorithm is implemented in the function C<pdl_destroy> in F<pdlapi.c>. =head2 What's in a C<pdl_trans> All transformations are implemented as structures struct XXX_trans { int magicno; /* to detect memory overwrites */ short flags; /* state of the trans */ pdl_transvtable *vtable; /* the all important vtable */ void (*freeproc)(struct pdl_trans *); /* Call to free this trans (in case we had to malloc some stuff dor this trans) */ pdl *pdls[NP]; /* The pdls involved in the transformation */ int __datatype; /* the type of the transformation */ /* in general more members /* depending on the actual transformation (slice, add, etc) */ }; The transformation identifies all C<pdl>s involved in the trans pdl *pdls[NP]; with C<NP> depending on the number of piddle args of the particular trans. It records a state short flags; and the datatype int __datatype; of the trans (to which all piddles must be converted unless they are explicitly typed). Most important is the pointer to the vtable that contains the actual functionality pdl_transvtable *vtable; The vtable structure in turn looks something like (slightly simplified from F<pdl.h> for clarity) typedef struct pdl_transvtable { pdl_transtype transtype; int flags; int nparents; /* number of parent pdls (input) */ int npdls; /* number of child pdls (output) */ char *per_pdl_flags; /* optimization flags */ void (*redodims)(pdl_trans *tr); /* figure out dims of children */ void (*readdata)(pdl_trans *tr); /* flow parents to children */ void (*writebackdata)(pdl_trans *tr); /* flow backwards */ void (*freetrans)(pdl_trans *tr); /* Free both the contents and it of the trans member */ pdl_trans *(*copy)(pdl_trans *tr); /* Full copy */ int structsize; char *name; /* For debuggers, mostly */ } pdl_transvtable; We focus on the callback functions: void (*redodims)(pdl_trans *tr); C<redodims> will work out the dimensions of piddles that need to be created and is called from within the API function that should be called to ensure that the dimensions of a piddle are accessible (F<pdlapi.c>): void pdl_make_physdims(pdl *it) C<readdata> and C<writebackdata> are responsible for the actual computations of the child data from the parents or parent data from those of the children, respectively (the dataflow aspect). The PDL core makes sure that these are called as needed when piddle data is accessed (lazy-evaluation). The general API function to ensure that a piddle is up-to-date is void pdl_make_physvaffine(pdl *it) which should be called before accessing piddle data from XS/C (see F<Core.xs> for some examples). C<freetrans> frees dynamically allocated memory associated with the trans as needed and C<copy> can copy the transformation. The transformation and vtable code is hardly ever written by hand but rather generated by C<PDL::PP> from concise descriptions (see the example of the definition of C<inner> above). Certain types of transformations can be optimized very efficiently obviating the need for explicit C<readdata> and C<writebackdata> methods. Those transformations are called I<pdl_vaffine>. Most dimension manipulating functions (e.g., C<slice>, C<xchg>) belong to this class. The basic trick is that parent and child of such a transformation work on the same (shared) block of data which they just choose to interpret differently (by dusing different C<dims>, C<dimincs> and C<offs> on the same data, compare the C<pdl> structure above). Each operation on a piddle sharing data with another one in this way is therefore automatically flown from child to parent and back -- they are reading and writing the same block of memory. This is currently not perl thread safe -- no big loss since the whole PDL core is not reentrant (perl threading C<!=> PDL threading). A more detailed description of I<vaffines> will be added in a later version of this document. =head2 Signatures: threading over elementary operations Most of that functionality is implemented in the file F<pdlthread.c>. The C<PDL::PP> generated functions (in particular the C<readdata> and C<writebackdata> callbacks) use this infrastructure to make sure that the fundamental operation implemented by the trans is performed in agreement with PDL's threading semantics. =head2 Defining new PDL functions -- Glue code generation Please, see L<PDL::PP> and examples in the PDL distribution. Implementation and syntax are currently far from perfect. As the above example on the definition of C<inner> suggests, the principal functionality is implemented through the function C<pp_def> which takes a hash of directives as a concise specification of the desired functionality. The C<Code> and C<RedoDimsCode> fields contain the algorithmic specification in a home grown pseudo code syntax that has some faint similarity to a mixture of perl, XS and C. Here we could do with some good ideas for improvement. From this description C<PDL::PP> produces C and XS code (which will go away in a future perl?) by processing the hash through a translation table. The ability to write the fundamental algorithm in a perlish way (or even in perl) that is optimized by (ideally JIT) compilation is what we are really after. =head1 SUMMARY We have explained the key ideas which make PDL work, and work well. It remains to be determined which of these might belong in the core of perl6, in order to make PDL work more seamlessly, and which belong in the module. Our take on this: perl6 should support a compact, numerical array type which is overloadable and can be manipulated with nice syntax. A key point is to abolish the visual distinction between perl arrays (C<@x>) and PDL arrays (C<$x>) which is confusing but forced upon us by the current object paradigm. There is also considerable scope for improving syntax to make numerical Perl more user-friendly - this is explained in more detail by the RFCs on Ranges and Default Methods. Compact arrays should support virtual slices when being manipulated. It seems to us that the more detailed/generalised PDL threading paradigm is beyond the core and belongs in a module. Some syntactical hooks would be appreciated though so one could at least say code like this at the Perl level: sub mysub : PDL { my($a, $b) = @_; signature('a(n), b(n)'); loop(n) { # n a bareword, loop has prototype # Threaded code } } =head1 REFERENCES L<PDL> (http://pdl.sourceforge.net/PDLdocs) http://pdl.perl.org http://pdl.sourceforge.net/FAQ RFC 117: Perl syntax support for ranges Matlab IDL (the one from rsinc) Yorick Mathematica NumPython http://starship.python.net/~da/numtut/ Hords of other array-oriented languages, try http://SAL.KachinaTech.COM/A/2/index.shtml =head1 NOTES [1] LW: Let me just add that I don't mind the brainstorming at all. To be a good language designer, you have to stuff your brain with what you *could* do before you can reasonably choose what you *will* do. At the moment, I'm not only trying to follow along here; I'm also reading all the books on computer languaes I can get my hands on--not just to look for ideas to steal, but also to remind myself of the mindset Perl was designed to escape. In fact, I'd go as far as to say that it's imperative that you overstuff your brain to the point where you no longer feel tempted to overstuff the language. Hypotheticality is your friend.