[sage-devel] Re: [Jmol-developers] [sage-devel] Fwd: [sage-devel] Re: Jmol and Mathematics Visualization

Fernando Perez Sun, 30 Dec 2007 00:22:18 -0800

Howdy,

On Dec 29, 2007 10:59 PM, Robert Bradshaw <[EMAIL PROTECTED]> wrote:
>
> On Dec 29, 2007, at 9:15 PM, Robert Hanson wrote:
>
> > I'm a bit lost on this thread, but I wanted to respond to the
> > binary/multiple file issue.
> >
> > First, it's a fine idea to create a binary Pmesh file format. If we do
> > that, though, let's not rush into it and just "create a binary
> > equivalent of a Pmesh file." If this is really useful, then let's
> > create a format that
> >
> > 1) allows for multiple pmesh objects
>
> Sure, though if the zip file thing is working I think it is fine to
> have one object per file too (as we also want to specify color, etc.
> and will probably have spheres, labels, etc. too, so we'll be dealing
> with multiple files anyway and it probably isn't worth trying to
> figure out a way to encode this as scripts work so nice).
>
> > 2) includes a header that clearly distinguishes the file format
> > within the first 4 bytes -- the "magic number" idea.


[...]

Just as an FYI: as of the last few days, numpy has developed a binary
format for arbitrary arrays.  The current plan is to have the base
file format (default extension .npy, but there's a magic string header
for extension-less identification) contain single arrays, and to use
zip files for multi-array files with a dict-like interface.

It's in a branch right now, here's the format spec:

http://projects.scipy.org/scipy/numpy/browser/branches/lib_for_io/format.py

That branch has the rest of the i/o utilities.

I have no idea if this format might suit your needs, but if it does,
it might be a useful way to share data with the rest of the python
world (for example, arrays in this format can be automatically used in
VTK with the Enthought TVTK library).

Below I'll paste the full PEP that Robert Kern wrote for the file
format when the discussion was taking place.

Sorry for the noise if this proves to be non-useful.

f

############ PEP-style document for the array format.

Title: A Simple File Format for NumPy Arrays
Discussions-To: [EMAIL PROTECTED]
Version: $Revision$
Last-Modified: $Date$
Author: Robert Kern <[EMAIL PROTECTED]>
Status: Draft
Type: Standards Track
Content-Type: text/plain
Created: 20-Dec-2007


Abstract

   We propose a standard binary file format (NPY) for persisting
   a single arbitrary NumPy array on disk.  The format stores all of
   the shape and dtype information necessary to reconstruct the array
   correctly even on another machine with a different architecture.
   The format is designed to be as simple as possible while achieving
   its limited goals.  The implementation is intended to be pure
   Python and distributed as part of the main numpy package.


Rationale

   A lightweight, omnipresent system for saving NumPy arrays to disk
   is a frequent need.  Python in general has pickle [1] for saving
   most Python objects to disk.  This often works well enough with
   NumPy arrays for many purposes, but it has a few drawbacks:

   - Dumping or loading a pickle file requires the duplication of the
     data in memory.  For large arrays, this can be a showstopper.

   - The array data is not directly accessible through
     memory-mapping.  Now that numpy has that capability, it has
     proved very useful for loading large amounts of data (or more to
     the point: avoiding loading large amounts of data when you only
     need a small part).

   Both of these problems can be addressed by dumping the raw bytes
   to disk using ndarray.tofile() and numpy.fromfile().  However,
   these have their own problems:

   - The data which is written has no information about the shape or
     dtype of the array.

   - It is incapable of handling object arrays.
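
   To make the first drawback concrete, here is a small sketch (not
   part of the PEP) that round-trips an array through tofile() and
   fromfile().  The array contents and the temporary file name are
   made up for illustration; only the two numpy calls come from the
   text above.

```python
import os
import tempfile

import numpy as np

original = np.arange(6, dtype=np.int32).reshape(2, 3)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "raw.bin")      # hypothetical file name
    original.tofile(path)                    # writes the raw bytes only

    # Without extra knowledge the reader falls back to float64 and a
    # flat shape: 24 bytes are reinterpreted as 3 doubles.
    naive = np.fromfile(path)

    # The dtype and shape have to be supplied out of band.
    restored = np.fromfile(path, dtype=np.int32).reshape(2, 3)
```

   The reader only recovers the array because it already knows the
   dtype and shape, which is exactly the metadata the NPY header is
   meant to carry.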

   The NPY file format is an evolutionary advance over these two
   approaches.  Its design is mostly limited to solving the problems
   with pickles and tofile()/fromfile().  It does not intend to solve
   more complicated problems for which more complicated formats like
   HDF5 [2] are a better solution.


Use Cases

   - Neville Newbie has just started to pick up Python and NumPy.  He
     has not installed many packages, yet, nor learned the standard
     library, but he has been playing with NumPy at the interactive
     prompt to do small tasks.  He gets a result that he wants to
     save.

   - Annie Analyst has been using large nested record arrays to
     represent her statistical data.  She wants to convince her
     R-using colleague, David Doubter, that Python and NumPy are
     awesome by sending him her analysis code and data.  She needs
     the data to load at interactive speeds.  Since David does not
     use Python usually, needing to install large packages would turn
     him off.

   - Simon Seismologist is developing new seismic processing tools.
     One of his algorithms requires large amounts of intermediate
     data to be written to disk.  The data does not really fit into
     the industry-standard SEG-Y schema, but he already has a nice
     record-array dtype for using it internally.

   - Polly Parallel wants to split up a computation on her multicore
     machine as simply as possible.  Parts of the computation can be
     split up among different processes without any communication
     between processes; they just need to fill in the appropriate
     portion of a large array with their results.  Having several
     child processes memory-mapping a common array is a good way to
     achieve this.
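
   A rough sketch of Polly's use case follows.  It assumes the
   numpy.save() and numpy.load(..., mmap_mode=...) functions of
   present-day numpy, which are not described in this draft, and it
   runs in a single process: the "worker" stands in for one of her
   child processes.  The file name and slice bounds are illustrative.

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "shared.npy")   # hypothetical file name

    # The parent creates the common array on disk.
    np.save(path, np.zeros(8, dtype=np.float64))

    # Each worker would map the same file and fill in only its own slice.
    worker_view = np.load(path, mmap_mode="r+")
    worker_view[4:8] = 1.0
    worker_view.flush()
    del worker_view                          # release the mapping

    # An ordinary load shows what actually landed on disk.
    result = np.load(path)
```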


Requirements

   The format MUST be able to:

   - Represent all NumPy arrays including nested record
     arrays and object arrays.

   - Represent the data in its native binary form.

   - Be contained in a single file.

   - Support Fortran-contiguous arrays directly.

   - Store all of the necessary information to reconstruct the array
     including shape and dtype on a machine of a different
     architecture.  Both little-endian and big-endian arrays must be
     supported and a file with little-endian numbers will yield
     a little-endian array on any machine reading the file.  The
     types must be described in terms of their actual sizes.  For
     example, if a machine with a 64-bit C "long int" writes out an
     array with "long ints", a reading machine with 32-bit C "long
     ints" will yield an array with 64-bit integers.

   - Be reverse engineered.  Datasets often live longer than the
     programs that created them.  A competent developer should be
     able to create a solution in his preferred programming language
     to read most NPY files that he has been given without much
     documentation.

   - Allow memory-mapping of the data.

   - Be read from a filelike stream object instead of an actual file.
     This allows the implementation to be tested easily and makes the
     system more flexible.  NPY files can be stored in ZIP files and
     easily read from a ZipFile object.

   - Store object arrays.  Since general Python objects are
     complicated and can only be reliably serialized by pickle (if at
     all), many of the other requirements are waived for files
     containing object arrays.  Files with object arrays do not have
     to be mmapable since that would be technically impossible.  We
     cannot expect the pickle format to be reverse engineered without
     knowledge of pickle.  However, one should at least be able to
     read and write object arrays with the same generic interface as
     other arrays.

   - Be read and written using APIs provided in the numpy package
     itself without any other libraries.  The implementation inside
     numpy may be in C if necessary.
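
   The requirement that types be described by their actual sizes can
   be illustrated with the standard struct module alone.  This is
   only an analogy, not the NPY implementation: struct's "l" code
   plays the role of the C "long int", and "<q" / ">q" play the role
   of an explicit 64-bit description with a fixed byte order.

```python
import struct

# Size of a native C long: depends on the platform (commonly 4 or 8).
native_long_size = struct.calcsize("l")

# An explicit byte-order prefix selects standard sizes: always 8 bytes.
fixed_int64_size = struct.calcsize("<q")

value = 2**40                               # does not fit in 32 bits
little = struct.pack("<q", value)           # little-endian bytes
big = struct.pack(">q", value)              # big-endian bytes

# Any reader that knows the description recovers the same number,
# whatever its own native long size or byte order is.
decoded = struct.unpack("<q", little)[0]
```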

   The format explicitly *does not* need to:

   - Support multiple arrays in a file.  Since we require filelike
     objects to be supported, one could use the API to build an ad
     hoc format that supported multiple arrays.  However, solving the
     general problem and use cases is beyond the scope of the format
     and the API for numpy.

   - Fully handle arbitrary subclasses of numpy.ndarray.  Subclasses
     will be accepted for writing, but only the array data will be
     written out.  A regular numpy.ndarray object will be created
     upon reading the file.  The API can be used to build a format
     for a particular subclass, but that is out of scope for the
     general NPY format.
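
   As an example of such an ad hoc multi-array container, the sketch
   below stores one NPY stream per member of a ZIP archive.  It
   assumes the numpy.save() and numpy.load() functions of present-day
   numpy, which accept filelike objects; the member names and array
   contents are invented for illustration.

```python
import io
import zipfile

import numpy as np

arrays = {
    "vertices": np.arange(12.0).reshape(4, 3),
    "faces": np.array([[0, 1, 2], [0, 2, 3]], dtype=np.int32),
}

# Write: serialize each array to an in-memory NPY stream, then store it.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as zf:
    for name, arr in arrays.items():
        member = io.BytesIO()
        np.save(member, arr)
        zf.writestr(name + ".npy", member.getvalue())

# Read: rebuild a dict-like mapping from member name to array.
loaded = {}
with zipfile.ZipFile(archive) as zf:
    for member_name in zf.namelist():
        data = io.BytesIO(zf.read(member_name))
        loaded[member_name[:-len(".npy")]] = np.load(data)
```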


Format Specification: Version 1.0

   The first 6 bytes are a magic string: exactly "\x93NUMPY".

   The next 1 byte is an unsigned byte: the major version number of
   the file format, e.g. \x01.

   The next 1 byte is an unsigned byte: the minor version number of
   the file format, e.g. \x00.  Note: the version of the file format
   is not tied to the version of the numpy package.

   The next 2 bytes form a little-endian unsigned short int: the
   length of the header data HEADER_LEN.

   The next HEADER_LEN bytes form the header data describing the
   array's format.  It is an ASCII string which contains a Python
   literal expression of a dictionary.  It is terminated by a newline
   ('\n') and padded with spaces ('\x20') to make the total length of
   the magic string + 4 + HEADER_LEN be evenly divisible by 16 for
   alignment purposes.

   The dictionary contains three keys:

       "descr" : dtype.descr
           An object that can be passed as an argument to the
           numpy.dtype() constructor to create the array's dtype.

       "fortran_order" : bool
           Whether the array data is Fortran-contiguous or not.
           Since Fortran-contiguous arrays are a common form of
           non-C-contiguity, we allow them to be written directly to
           disk for efficiency.

       "shape" : tuple of int
           The shape of the array.

   For repeatability and readability, this dictionary is formatted
   using pprint.pformat() so the keys are in alphabetic order.
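
   The layout described so far can be sketched with the standard
   library only.  This is not the code in format.py: the function
   name build_header is hypothetical, and placing the padding spaces
   before the terminating newline is an assumption, since the text
   above does not fix their order.

```python
import pprint
import struct

MAGIC = b"\x93NUMPY"


def build_header(descr, fortran_order, shape, version=(1, 0)):
    """Return the bytes that precede the array data in an NPY 1.0 file."""
    meta = {"shape": shape, "fortran_order": fortran_order, "descr": descr}
    text = pprint.pformat(meta)             # keys come out in alphabetic order
    prefix_len = len(MAGIC) + 4             # magic + 2 version bytes + 2 length bytes
    unpadded = prefix_len + len(text) + 1   # +1 for the terminating newline
    padding = -unpadded % 16                # spaces needed to reach a multiple of 16
    body = (text + " " * padding + "\n").encode("ascii")
    # HEADER_LEN is a little-endian unsigned short counting the padded dict.
    return MAGIC + bytes(version) + struct.pack("<H", len(body)) + body


header = build_header("<f8", False, (3, 4))
```

   With this layout the array data that follows the header starts on
   a 16-byte boundary.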

   Following the header comes the array data.  If the dtype contains
   Python objects (i.e. dtype.hasobject is True), then the data is
   a Python pickle of the array.  Otherwise the data is the
   contiguous (either C- or Fortran-, depending on fortran_order)
   bytes of the array.  Consumers can figure out the number of bytes
   by multiplying the number of elements given by the shape (noting
   that shape=() means there is 1 element) by dtype.itemsize.
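
   In the spirit of the reverse-engineering requirement, a reader
   for the header can be sketched without numpy at all.  The function
   name read_npy_header is hypothetical, and the sketch makes a
   simplifying assumption: "descr" is a plain numeric type string
   such as '<f8' or '<i4', whose trailing number is the item size.
   Record dtypes, string dtypes and object arrays are not handled.

```python
import ast
import io
import math
import struct


def read_npy_header(stream):
    """Parse an NPY 1.0 header; return (meta dict, number of data bytes)."""
    if stream.read(6) != b"\x93NUMPY":
        raise ValueError("not an NPY file")
    major, minor = stream.read(2)           # two unsigned version bytes
    (header_len,) = struct.unpack("<H", stream.read(2))
    meta = ast.literal_eval(stream.read(header_len).decode("ascii"))
    descr = meta["descr"]
    if not isinstance(descr, str):
        raise NotImplementedError("record dtypes are outside this sketch")
    itemsize = int(descr[2:])               # assumption: e.g. '<f8' -> 8 bytes
    n_items = math.prod(meta["shape"])      # the product over () is 1
    return meta, n_items * itemsize


# Build a small sample file in memory: header plus 96 zero bytes of data.
meta_text = "{'descr': '<f8', 'fortran_order': False, 'shape': (3, 4)}"
padding = -(10 + len(meta_text) + 1) % 16
body = (meta_text + " " * padding + "\n").encode("ascii")
sample = b"\x93NUMPY\x01\x00" + struct.pack("<H", len(body)) + body + bytes(96)

meta, n_bytes = read_npy_header(io.BytesIO(sample))
```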


Alternatives

   The author believes that this system (or one along these lines) is
   about the simplest system that satisfies all of the requirements.
   However, one must always be wary of introducing a new binary
   format to the world.

   HDF5 [2] is a very flexible format that should be able to
   represent all of NumPy's arrays in some fashion.  It is probably
   the only widely-used format that can faithfully represent all of
   NumPy's array features.  It has seen substantial adoption by the
   scientific community in general and the NumPy community in
   particular.  It is an excellent solution for a wide variety of
   array storage problems with or without NumPy.

   HDF5 is a complicated format that more or less implements
   a hierarchical filesystem-in-a-file.  This fact makes satisfying
   some of the Requirements difficult.  To the author's knowledge, as
   of this writing, there is no application or library that reads or
   writes even a subset of HDF5 files that does not use the canonical
   libhdf5 implementation.  This implementation is a large library
   that is not always easy to build.  It would be infeasible to
   include it in numpy.

   It might be feasible to target an extremely limited subset of
   HDF5.  Namely, there would be only one object in it: the array.
   Using contiguous storage for the data, one should be able to
   implement just enough of the format to provide the same metadata
   that the proposed format does.  One could still meet all of the
   technical requirements like mmapability.

   We would accrue a substantial benefit by being able to generate
   files that could be read by other HDF5 software.  Furthermore, by
   providing the first non-libhdf5 implementation of HDF5, we would
   be able to encourage more adoption of simple HDF5 in applications
   where it was previously infeasible because of the size of the
   library.  The basic work may encourage similar dead-simple
   implementations in other languages and further expand the
   community.

   The remaining concern is about reverse engineerability of the
   format.  Even the simple subset of HDF5 would be very difficult to
   reverse engineer given just a file by itself.  However, given the
   prominence of HDF5, this might not be a substantial concern.


Implementation

   The current implementation is in a branch of the numpy SVN
   repository.

       http://svn.scipy.org/svn/numpy/branches/lib_for_io

   This is just a branch of the numpy/lib/ directory, so one can
   graft it onto a trunk checkout like so::

       $ pwd
       /Users/rkern/svn/numpy
       $ cd numpy/lib
       $ svn switch http://svn.scipy.org/svn/numpy/branches/lib_for_io

   Specifically, the file format.py in this directory implements the
   format as described here.


References

   [1] http://docs.python.org/lib/module-pickle.html

   [2] http://hdf.ncsa.uiuc.edu/products/hdf5/index.html


Copyright

   This document has been placed in the public domain.



Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
coding: utf-8
End:
