Re: [Numpy-discussion] proposal: smaller representation of string arrays
2017-04-27 3:34 GMT+02:00 Stephan Hoyer:

> On Wed, Apr 26, 2017 at 4:49 PM, Nathaniel Smith wrote:
>
>> It's worthwhile enough that both major HDF5 bindings don't support
>> Unicode arrays, despite user requests for years. The sticking point seems
>> to be the difference between HDF5's view of a Unicode string array
>> (defined in size by the bytes of UTF-8 data) and numpy's current view of
>> a Unicode string array (because of UCS-4, defined by the number of
>> characters/codepoints/whatever). So there are HDF5 files out there that
>> none of our HDF5 bindings can read, and it is impossible to write certain
>> data efficiently.
>>
>> I would really like to hear more from the authors of these libraries
>> about what exactly it is they feel they're missing. Is it that they want
>> numpy to enforce the length limit early, to catch errors when the array
>> is modified instead of when they go to write it to the file? Is it that
>> they really want an O(1) way to look at an array and know the maximum
>> number of bytes needed to represent it in utf-8? Is it that
>> utf-8 <-> utf-32 conversion is really annoying, and files that need it
>> are rare, so they haven't had the motivation to implement it? My
>> impression is similar to Julian's: you *could* implement HDF5
>> fixed-length utf-8 <-> numpy U arrays with a few dozen lines of code,
>> which is nothing compared to all the other hoops these libraries are
>> already jumping through, so if this is really the roadblock then I must
>> be missing something.
>
> I actually agree with you. I think it's mostly a matter of convenience
> that h5py matched up HDF5 dtypes with numpy dtypes:
>
>   fixed width ASCII     -> np.string_/bytes
>   variable length ASCII -> object arrays of np.string_/bytes
>   variable length UTF-8 -> object arrays of unicode
>
> This was tenable in a Python 2 world, but on Python 3 it's broken and
> there's no easy fix.
>
> We absolutely could fix h5py by mapping everything to object arrays of
> Python unicode strings, as has been discussed
> (https://github.com/h5py/h5py/pull/871). For fixed width UTF-8, this
> would be a fine but non-ideal solution, since there is currently no
> fixed-width UTF-8 support.
>
> For fixed width ASCII arrays, this would mean increased convenience for
> Python 3 users, at the price of decreased convenience for Python 2 users
> (arrays now contain boxed Python objects), unless we made the h5py
> behavior dependent on the version of Python. Hence, we're back here,
> waiting for better dtypes for encoded strings.
>
> So for HDF5, I see good use cases for ASCII-with-surrogateescape (for
> handling ASCII arrays as strings) and UTF-8 with length equal to the
> number of bytes.

Well, I'll say upfront that I have not read this discussion in full, but
apparently some opinions from developers of HDF5 Python packages would be
welcome here, so here I go :)

As a long-time developer of one of the Python HDF5 packages (PyTables), I
have always been of the opinion that plain ASCII (for byte strings) and
UCS-4 (for Unicode) would be the appropriate encodings for storing large
amounts of data, especially for disk storage (but also for compressed
in-memory containers). My rationale is that, although UCS-4 may require
far too much space, compression would reduce that to basically the space
required by compressed UTF-8 (I won't go into detail, but this is possible
mainly thanks to the shuffle filter).
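To make that concrete, here is a minimal sketch of the idea (not PyTables
code: zlib stands in for HDF5's shuffle + deflate filter pair, and the
sample data is invented):

    import zlib
    import numpy as np

    # 10000 short words in a fixed-width '<U11' array (UCS-4 in memory)
    words = np.array(['numpy', 'hdf5', 'unicode', 'compression'] * 2500)
    ucs4 = words.tobytes()

    # Byte shuffle: group byte 0 of every code point together, then byte 1,
    # and so on -- for mostly-ASCII text this clusters the zero bytes.
    shuffled = np.frombuffer(ucs4, dtype=np.uint8).reshape(-1, 4).T.tobytes()

    utf8 = ''.join(words).encode('utf-8')

    print(len(zlib.compress(ucs4)))      # raw UCS-4, typically the largest
    print(len(zlib.compress(shuffled)))  # shuffled UCS-4, close to ...
    print(len(zlib.compress(utf8)))      # ... compressed UTF-8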
I remember advocating for UCS-4 adoption in the HDF5 library many years ago
(2007?), but I had no success and UTF-8 was chosen as the best candidate.
So the boat with HDF5 using UTF-8 sailed many years ago, and I don't think
there is a way back (not even by adding UCS-4 support to it, although I
continue to think that would be a good idea). So I suppose that if HDF5 is
an important format for NumPy users (and I think this is the case), a
solution for representing Unicode characters using UTF-8 in NumPy would be
desirable (at the risk of making the implementation more complex).

Francesc

--
Francesc Alted
Re: [Numpy-discussion] proposal: smaller representation of string arrays
So while compression+ucs-4 might be OK for out-of-core representation, what
about in-core? blosc+ucs-4? I don't think that works for mmap, does it?

On Thu, Apr 27, 2017 at 7:11 AM Francesc Alted wrote:

> [...]
> My rationale is that, although UCS-4 may require far too much space,
> compression would reduce that to basically the space required by
> compressed UTF-8 (I won't go into detail, but this is possible mainly
> thanks to the shuffle filter).
> [...]
Re: [Numpy-discussion] proposal: smaller representation of string arrays
2017-04-27 13:27 GMT+02:00 Neal Becker:

> So while compression+ucs-4 might be OK for out-of-core representation,
> what about in-core? blosc+ucs-4? I don't think that works for mmap, does
> it?

Correct, the real problem is mmap for an out-of-core HDF5 representation, I
presume. For in-memory use, there are several compressed data containers,
like:

https://github.com/alimanfoo/zarr (meant mainly for multidimensional data
containers)

https://github.com/Blosc/bcolz (meant mainly for tabular data containers)

(there might be others)
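As a hedged sketch of what the in-memory route looks like (zarr 2.x API;
the chunking and compressor settings here are just illustrative
assumptions):

    import numpy as np
    import zarr
    from numcodecs import Blosc

    words = np.array(['numpy', 'hdf5', 'unicode', 'compression'] * 2500)

    # Blosc's SHUFFLE does the byte regrouping discussed above before
    # compressing each chunk.
    z = zarr.array(words, chunks=1000,
                   compressor=Blosc(cname='lz4', shuffle=Blosc.SHUFFLE))

    print(z[:4])                      # slicing decompresses transparently
    print(z.nbytes, z.nbytes_stored)  # logical UCS-4 size vs. stored size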
Re: [Numpy-discussion] proposal: smaller representation of string arrays
On Thu, Apr 27, 2017 at 4:10 AM, Francesc Alted wrote:

> I remember advocating for UCS-4 adoption in the HDF5 library many years
> ago (2007?), but I had no success and UTF-8 was chosen as the best
> candidate. So the boat with HDF5 using UTF-8 sailed many years ago, and I
> don't think there is a way back.

This is the key point -- we can argue all we want about the best encoding
for fixed-length unicode-supporting strings (I think numpy and HDF5 have
very similar requirements), but that is not our decision to make -- many
other systems have chosen utf-8, so it's a really good idea for numpy to
be able to deal with that cleanly, easily, and consistently.

I have made many anti-utf-8 points in this thread because, while we need
to deal with utf-8 for interplay with other systems, I am very sure that
it is not the best format for a default, naive-user-of-numpy
unicode-supporting dtype. Nor is it the best encoding for a mostly-ascii
compact in-memory format.

So I think numpy needs to support at least:

  utf-8
  latin-1
  UCS-4

And it maybe should support a one-byte encoding suitable for non-European
languages, and maybe utf-16 for Java and Windows compatibility, and so on.

That seems to point to "support as many encodings as possible", and Python
has the machinery to do so -- so why not? (I'm taking Julian's word for it
that having a parameterized dtype would not have a major impact on current
code.)

If we go with a string dtype parameterized by encoding, then we can pick
sensible defaults and let users use what best fits their use cases.

As for python2 -- it is on the way out. I think we should keep the 'U' and
'S' dtypes as they are for backward compatibility and move forward with
the new one(s) in a way that is optimized for py3. The new dtype would map
to a py2 unicode type.

The only catch I see in that is what to do with bytes -- we should have a
numpy dtype that matches the bytes model: fixed-length bytes that map to
python bytes objects. (This is almost what the void type is, yes?) But
then, under py2, would a bytes object (a py2 string) map to numpy 'S' or
to the numpy bytes objects?

@Francesc -- one more question for you:

How important is it for pytables to match the numpy storage to the HDF5
storage byte for byte? I.e., would it be a killer if encoding/decoding
happened every time at the boundary? I'm guessing yes, as this would have
been solved long ago if not.

-CHB

--
Christopher Barker, Ph.D.
Oceanographer
NOAA/NOS/OR&R
chris.bar...@noaa.gov
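For concreteness, a rough sketch of the boundary conversion being
discussed -- numpy 'U' (UCS-4) arrays encoded into fixed-length UTF-8
buffers and back. The helper names are hypothetical, and a real binding
would also need a policy for truncation and encoding errors:

    import numpy as np

    def ucs4_to_fixed_utf8(arr):
        # Encode a 'U' array into an 'S' array sized to the longest element.
        encoded = [s.encode('utf-8') for s in arr.ravel()]
        width = max(len(b) for b in encoded) or 1
        return np.array(encoded, dtype='S%d' % width).reshape(arr.shape)

    def fixed_utf8_to_ucs4(arr):
        # Decode an 'S' array of UTF-8 bytes back into a 'U' array.
        decoded = [b.decode('utf-8') for b in arr.ravel()]
        width = max(len(s) for s in decoded) or 1
        return np.array(decoded, dtype='U%d' % width).reshape(arr.shape)

    a = np.array(['numpy', 'caf\xe9', '\u4e2d\u6587'])
    assert (fixed_utf8_to_ucs4(ucs4_to_fixed_utf8(a)) == a).all()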
Re: [Numpy-discussion] proposal: smaller representation of string arrays
2017-04-27 18:18 GMT+02:00 Chris Barker:

> This is the key point -- we can argue all we want about the best encoding
> for fixed-length unicode-supporting strings (I think numpy and HDF5 have
> very similar requirements), but that is not our decision to make -- many
> other systems have chosen utf-8, so it's a really good idea for numpy to
> be able to deal with that cleanly, easily, and consistently.

Agreed. But it would also be a good idea to spread the word that simple
UCS-4 encoding in combination with compression can be a perfectly good
system for storing large amounts of unicode data too.

> I have made many anti-utf-8 points in this thread because, while we need
> to deal with utf-8 for interplay with other systems, I am very sure that
> it is not the best format for a default, naive-user-of-numpy
> unicode-supporting dtype. Nor is it the best encoding for a mostly-ascii
> compact in-memory format.

I resonate a lot with this feeling too :)

> @Francesc -- one more question for you:
>
> How important is it for pytables to match the numpy storage to the HDF5
> storage byte for byte? I.e., would it be a killer if encoding/decoding
> happened every time at the boundary? I'm guessing yes, as this would have
> been solved long ago if not.

The PyTables team decided some time ago that it was a waste of time and
resources to maintain the internal HDF5 interface, and that it would be
better to switch to h5py for the low-level I/O communication with HDF5
(btw, we just received a small NumFOCUS grant to continue the ongoing work
on this; thanks guys!). This means that PyTables will be basically
agnostic about this sort of encoding issue, and that the important package
to consider for interfacing NumPy and HDF5 is just h5py.

--
Francesc Alted
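To make the h5py pain point concrete, a small illustration (assuming the
h5py API as it stood around this discussion; the file and dataset names
are just examples) of why numpy's 'U' dtype needs translation at the HDF5
boundary:

    import numpy as np
    import h5py

    a = np.array(['numpy', 'caf\xe9'])

    with h5py.File('strings.h5', 'w') as f:
        # f.create_dataset('broken', data=a)  # fails: no conversion path
        #                                     # for the UCS-4 'U' dtype

        # Workaround 1: encode to fixed-width bytes ('S' dtype) by hand.
        f.create_dataset('fixed_bytes', data=np.char.encode(a, 'utf-8'))

        # Workaround 2: variable-length strings via object arrays.
        dt = h5py.special_dtype(vlen=str)
        f.create_dataset('vlen', data=a.astype(object), dtype=dt)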