Re: [Numpy-discussion] proposal: smaller representation of string arrays
2017-04-27 3:34 GMT+02:00 Stephan Hoyer:

> On Wed, Apr 26, 2017 at 4:49 PM, Nathaniel Smith wrote:
>
>> It's worthwhile enough that both major HDF5 bindings don't support
>> Unicode arrays, despite user requests for years. The sticking point seems
>> to be the difference between HDF5's view of a Unicode string array
>> (defined in size by the bytes of UTF-8 data) and numpy's current view of
>> a Unicode string array (because of UCS-4, defined by the number of
>> characters/codepoints/whatever). So there are HDF5 files out there that
>> none of our HDF5 bindings can read, and it is impossible to write certain
>> data efficiently.
>>
>> I would really like to hear more from the authors of these libraries
>> about what exactly it is they feel they're missing. Is it that they want
>> numpy to enforce the length limit early, to catch errors when the array
>> is modified instead of when they go to write it to the file? Is it that
>> they really want an O(1) way to look at an array and know the maximum
>> number of bytes needed to represent it in utf-8? Is it that
>> utf-8 <-> utf-32 conversion is really annoying, and files that need it
>> are rare, so they haven't had the motivation to implement it? My
>> impression is similar to Julian's: you *could* implement HDF5
>> fixed-length utf-8 <-> numpy U arrays with a few dozen lines of code,
>> which is nothing compared to all the other hoops these libraries are
>> already jumping through, so if this is really the roadblock then I must
>> be missing something.
>
> I actually agree with you. I think it's mostly a matter of convenience
> that h5py matched up HDF5 dtypes with numpy dtypes:
>
>   fixed width ASCII     -> np.string_/bytes
>   variable length ASCII -> object arrays of np.string_/bytes
>   variable length UTF-8 -> object arrays of unicode
>
> This was tenable in a Python 2 world, but on Python 3 it's broken and
> there's no easy fix.
>
> We absolutely could fix h5py by mapping everything to object arrays of
> Python unicode strings, as has been discussed
> (https://github.com/h5py/h5py/pull/871). For fixed width UTF-8, this
> would be a fine but non-ideal solution, since there is currently no
> fixed-width UTF-8 support.
>
> For fixed width ASCII arrays, this would mean increased convenience for
> Python 3 users, at the price of decreased convenience for Python 2 users
> (arrays now contain boxed Python objects), unless we made the h5py
> behavior dependent on the version of Python. Hence, we're back here,
> waiting for better dtypes for encoded strings.
>
> So for HDF5, I see good use cases for ASCII-with-surrogateescape (for
> handling ASCII arrays as strings) and UTF-8 with length equal to the
> number of bytes.

Well, I'll say upfront that I have not read this discussion in full, but
apparently some opinions from developers of HDF5 Python packages would be
welcome here, so here I go :)

As a long-time developer of one of the Python HDF5 packages (PyTables), I
have always been of the opinion that plain ASCII (for byte strings) and
UCS-4 (for Unicode) would be the appropriate encodings for storing large
amounts of data, especially for disk storage (but also for compressed
in-memory containers). My rationale is that, although UCS-4 may require
far too much space, compression would reduce that to basically the space
required by compressed UTF-8 (I won't go into detail, but this is possible
mainly thanks to the shuffle filter).
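To make that concrete, here is a minimal sketch of the idea (not PyTables
code: zlib stands in for HDF5's shuffle + deflate filter pair, and the
sample data is invented):

    import zlib
    import numpy as np

    # 10000 short words in a fixed-width '<U11' array (UCS-4 in memory)
    words = np.array(['numpy', 'hdf5', 'unicode', 'compression'] * 2500)
    ucs4 = words.tobytes()

    # Byte shuffle: group byte 0 of every code point together, then byte 1,
    # and so on -- for mostly-ASCII text this clusters the zero bytes.
    shuffled = np.frombuffer(ucs4, dtype=np.uint8).reshape(-1, 4).T.tobytes()

    utf8 = ''.join(words).encode('utf-8')

    print(len(zlib.compress(ucs4)))      # raw UCS-4, typically the largest
    print(len(zlib.compress(shuffled)))  # shuffled UCS-4, close to ...
    print(len(zlib.compress(utf8)))      # ... compressed UTF-8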
I remember advocating for UCS-4 adoption in the HDF5 library many years ago
(2007?), but I had no success and UTF-8 was chosen as the best candidate.
So the boat with HDF5 using UTF-8 sailed many years ago, and I don't think
there is a way back (not even by adding UCS-4 support to it, although I
continue to think that would be a good idea). So I suppose that if HDF5 is
an important format for NumPy users (and I think this is the case), a
solution for representing Unicode characters using UTF-8 in NumPy would be
desirable (at the risk of making the implementation more complex).

Francesc

--
Francesc Alted
Re: [Numpy-discussion] proposal: smaller representation of string arrays
So while compression+ucs-4 might be OK for out-of-core representation, what
about in-core? blosc+ucs-4? I don't think that works for mmap, does it?

On Thu, Apr 27, 2017 at 7:11 AM Francesc Alted wrote:

> [...]
> My rationale is that, although UCS-4 may require far too much space,
> compression would reduce that to basically the space required by
> compressed UTF-8 (I won't go into detail, but this is possible mainly
> thanks to the shuffle filter).
> [...]
Re: [Numpy-discussion] proposal: smaller representation of string arrays
2017-04-27 13:27 GMT+02:00 Neal Becker:

> So while compression+ucs-4 might be OK for out-of-core representation,
> what about in-core? blosc+ucs-4? I don't think that works for mmap, does
> it?

Correct, the real problem is mmap for an out-of-core HDF5 representation, I
presume. For in-memory use, there are several compressed data containers,
like:

https://github.com/alimanfoo/zarr (meant mainly for multidimensional data
containers)

https://github.com/Blosc/bcolz (meant mainly for tabular data containers)

(there might be others)
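As a hedged sketch of what the in-memory route looks like (zarr 2.x API;
the chunking and compressor settings here are just illustrative
assumptions):

    import numpy as np
    import zarr
    from numcodecs import Blosc

    words = np.array(['numpy', 'hdf5', 'unicode', 'compression'] * 2500)

    # Blosc's SHUFFLE does the byte regrouping discussed above before
    # compressing each chunk.
    z = zarr.array(words, chunks=1000,
                   compressor=Blosc(cname='lz4', shuffle=Blosc.SHUFFLE))

    print(z[:4])                      # slicing decompresses transparently
    print(z.nbytes, z.nbytes_stored)  # logical UCS-4 size vs. stored size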
Re: [Numpy-discussion] proposal: smaller representation of string arrays
On Thu, Apr 27, 2017 at 4:10 AM, Francesc Alted wrote:

> I remember advocating for UCS-4 adoption in the HDF5 library many years
> ago (2007?), but I had no success and UTF-8 was chosen as the best
> candidate. So the boat with HDF5 using UTF-8 sailed many years ago, and I
> don't think there is a way back.

This is the key point -- we can argue all we want about the best encoding
for fixed-length unicode-supporting strings (I think numpy and HDF5 have
very similar requirements), but that is not our decision to make -- many
other systems have chosen utf-8, so it's a really good idea for numpy to
be able to deal with that cleanly, easily, and consistently.

I have made many anti-utf-8 points in this thread because, while we need
to deal with utf-8 for interplay with other systems, I am very sure that
it is not the best format for a default, naive-user-of-numpy
unicode-supporting dtype. Nor is it the best encoding for a mostly-ascii
compact in-memory format.

So I think numpy needs to support at least:

  utf-8
  latin-1
  UCS-4

And it maybe should support a one-byte encoding suitable for non-European
languages, and maybe utf-16 for Java and Windows compatibility, and so on.

That seems to point to "support as many encodings as possible", and Python
has the machinery to do so -- so why not? (I'm taking Julian's word for it
that having a parameterized dtype would not have a major impact on current
code.)

If we go with a string dtype parameterized by encoding, then we can pick
sensible defaults and let users use what best fits their use cases.

As for python2 -- it is on the way out. I think we should keep the 'U' and
'S' dtypes as they are for backward compatibility and move forward with
the new one(s) in a way that is optimized for py3. The new dtype would map
to a py2 unicode type.

The only catch I see in that is what to do with bytes -- we should have a
numpy dtype that matches the bytes model: fixed-length bytes that map to
python bytes objects. (This is almost what the void type is, yes?) But
then, under py2, would a bytes object (a py2 string) map to numpy 'S' or
to the numpy bytes objects?

@Francesc -- one more question for you:

How important is it for pytables to match the numpy storage to the HDF5
storage byte for byte? I.e., would it be a killer if encoding/decoding
happened every time at the boundary? I'm guessing yes, as this would have
been solved long ago if not.

-CHB

--
Christopher Barker, Ph.D.
Oceanographer
NOAA/NOS/OR&R
chris.bar...@noaa.gov
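For concreteness, a rough sketch of the boundary conversion being
discussed -- numpy 'U' (UCS-4) arrays encoded into fixed-length UTF-8
buffers and back. The helper names are hypothetical, and a real binding
would also need a policy for truncation and encoding errors:

    import numpy as np

    def ucs4_to_fixed_utf8(arr):
        # Encode a 'U' array into an 'S' array sized to the longest element.
        encoded = [s.encode('utf-8') for s in arr.ravel()]
        width = max(len(b) for b in encoded) or 1
        return np.array(encoded, dtype='S%d' % width).reshape(arr.shape)

    def fixed_utf8_to_ucs4(arr):
        # Decode an 'S' array of UTF-8 bytes back into a 'U' array.
        decoded = [b.decode('utf-8') for b in arr.ravel()]
        width = max(len(s) for s in decoded) or 1
        return np.array(decoded, dtype='U%d' % width).reshape(arr.shape)

    a = np.array(['numpy', 'caf\xe9', '\u4e2d\u6587'])
    assert (fixed_utf8_to_ucs4(ucs4_to_fixed_utf8(a)) == a).all()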
Re: [Numpy-discussion] proposal: smaller representation of string arrays
2017-04-27 18:18 GMT+02:00 Chris Barker:

> This is the key point -- we can argue all we want about the best encoding
> for fixed-length unicode-supporting strings (I think numpy and HDF5 have
> very similar requirements), but that is not our decision to make -- many
> other systems have chosen utf-8, so it's a really good idea for numpy to
> be able to deal with that cleanly, easily, and consistently.

Agreed. But it would also be a good idea to spread the word that simple
UCS-4 encoding in combination with compression can be a perfectly good
system for storing large amounts of unicode data too.

> I have made many anti-utf-8 points in this thread because, while we need
> to deal with utf-8 for interplay with other systems, I am very sure that
> it is not the best format for a default, naive-user-of-numpy
> unicode-supporting dtype. Nor is it the best encoding for a mostly-ascii
> compact in-memory format.

I resonate a lot with this feeling too :)

> @Francesc -- one more question for you:
>
> How important is it for pytables to match the numpy storage to the HDF5
> storage byte for byte? I.e., would it be a killer if encoding/decoding
> happened every time at the boundary? I'm guessing yes, as this would have
> been solved long ago if not.

The PyTables team decided some time ago that it was a waste of time and
resources to maintain the internal HDF5 interface, and that it would be
better to switch to h5py for the low-level I/O communication with HDF5
(btw, we just received a small NumFOCUS grant to continue the ongoing work
on this; thanks guys!). This means that PyTables will be basically
agnostic about this sort of encoding issue, and that the important package
to consider for interfacing NumPy and HDF5 is just h5py.

--
Francesc Alted
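To make the h5py pain point concrete, a small illustration (assuming the
h5py API as it stood around this discussion; the file and dataset names
are just examples) of why numpy's 'U' dtype needs translation at the HDF5
boundary:

    import numpy as np
    import h5py

    a = np.array(['numpy', 'caf\xe9'])

    with h5py.File('strings.h5', 'w') as f:
        # f.create_dataset('broken', data=a)  # fails: no conversion path
        #                                     # for the UCS-4 'U' dtype

        # Workaround 1: encode to fixed-width bytes ('S' dtype) by hand.
        f.create_dataset('fixed_bytes', data=np.char.encode(a, 'utf-8'))

        # Workaround 2: variable-length strings via object arrays.
        dt = h5py.special_dtype(vlen=str)
        f.create_dataset('vlen', data=a.astype(object), dtype=dt)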