On Jun 5, 9:14 am, Nathaniel Rook <nr...@wesleyan.edu> wrote:
> Hello, all!
>
> I've recently encountered a bug in NumPy's string arrays, where the 00
> ASCII character ('\x00') is not stored properly when put at the end of a
> string.
>
> For example:
>
> Python 2.5.2 (r252:60911, Jul 31 2008, 17:28:52)
> [GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import numpy
> >>> print numpy.version.version
> 1.3.0
> >>> arr = numpy.empty(1, 'S2')
> >>> arr[0] = 'ab'
> >>> arr
> array(['ab'],
>       dtype='|S2')
> >>> arr[0] = 'c\x00'
> >>> arr
> array(['c'],
>       dtype='|S2')
>
> It seems that the string array is using the 00 character to pad strings
> smaller than the maximum size, and thus is treating any 00 characters at
> the end of a string as padding. Obviously, as long as I don't use
> smaller strings, there is no information lost here, but I don't want to
> have to re-add my 00s each time I ask the array what it is holding.

I am going to guess that it is done this way for the sake of
interoperability with Fortran, and that it is deliberate behavior.
Also, if it were accidental behavior, then it would probably happen
for internal nul bytes, but it doesn't.

The workaround I recommend is to add a superfluous character on the
end:

>>> numpy.array(['a\0x'],'S3')
array(['a\x00x'],
      dtype='|S3')

Then chop off the last character. (However it might turn out that
padding as necessary performs better.)

> Is this a well-known bug already? I couldn't find it on the NumPy bug
> tracker, but I could have easily missed it, or it could be triaged,
> deemed acceptable because there's no better way to deal with
> arbitrary-length strings. Is there an easy way to avoid this problem?
> Pretty much any performance-intensive part of my program is going to be
> dealing with these arrays, so I don't want to just replace them with a
> slower dictionary instead.
>
> I can't imagine this issue hasn't come up before; I encountered it by
> using NumPy arrays to store Python structs, something I can imagine is
> done fairly often. As such, I apologize for bringing it up again!

I doubt a very high percentage of people who use numpy do character
manipulation, so I could see it as something that hasn't come up
before.

Carl Banks
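
For anyone who wants to try the guard-character workaround, here is a minimal sketch. It is not taken from the thread: it assumes a current Python 3 with NumPy, where 'S' arrays hold bytes, rather than the Python 2.5 / NumPy 1.3.0 session quoted above. The names SENTINEL, store and load are hypothetical helpers made up for illustration, and the field is declared one byte wider than the payload to leave room for the guard byte.

```python
import numpy as np

# Hypothetical guard byte (an assumption, not from the thread);
# any single non-NUL byte should do.
SENTINEL = b"x"


def store(arr, index, payload):
    """Store payload followed by the guard byte, so a trailing NUL in
    the payload is no longer the last byte of the field."""
    arr[index] = payload + SENTINEL


def load(arr, index):
    """Read the element back and chop off the guard byte."""
    return bytes(arr[index])[:-len(SENTINEL)]


# Without the guard byte: the trailing NUL is dropped when read back.
plain = np.empty(1, dtype="S2")
plain[0] = b"c\x00"
plain_value = bytes(plain[0])

# With the guard byte: two payload bytes plus one guard byte, so 'S3'.
guarded = np.empty(1, dtype="S3")
store(guarded, 0, b"c\x00")
restored = load(guarded, 0)

print(repr(plain_value))
print(repr(restored))
```

The cost is one extra byte per element plus a slice on every read, so it is worth timing against re-padding the values by hand, as mentioned above.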