[issue26175] Fully implement IOBase abstract on SpooledTemporaryFile
Daniel Jewell added the comment:

To add something additional here: the current documentation for tempfile.SpooledTemporaryFile indicates "This function operates exactly as TemporaryFile() does, except that data is spooled in memory until the file size exceeds max_size [...]" (see https://docs.python.org/3/library/tempfile.html). Except that SpooledTemporaryFile *doesn't* act _exactly_ like TemporaryFile(), as documented in this issue. TemporaryFile() returns an "_io.BufferedRandom" which implements all of the expected "file-like" goodies - .readable, .seekable, etc. SpooledTemporaryFile does not.

Comparing with the 2.x docs, the text for SpooledTemporaryFile() appears to be identical or nearly identical to the current 3.8.x docs. This is in line with what has already been discussed here. At a very minimum, the documentation should be updated to reflect the current differences between TemporaryFile() and SpooledTemporaryFile().

Perhaps an easier change would be to extend TemporaryFile() with a parameter that enables functionality similar to SpooledTemporaryFile - namely, *memory-only* storage up to a max_size? Or perhaps there is an alternate solution that already exists?

Ultimately, the functionality that appears to be missing is the ability to easily create a file-like object backed primarily by memory for reading/writing data - i.e. one 100% compatible with "the usual" file objects returned by open().

--
nosy: +danieljewell

___
Python tracker <https://bugs.python.org/issue26175>
___
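A minimal sketch of the difference described in the comment above, assuming Python 3.8 on POSIX (where TemporaryFile() returns the raw buffered object); the expected output noted in the comments reflects the behavior reported here, not a guarantee for other versions:

    import tempfile

    # TemporaryFile() hands back an _io.BufferedRandom (on POSIX), which
    # implements the full IOBase interface.
    with tempfile.TemporaryFile() as tf:
        print(type(tf))                      # <class '_io.BufferedRandom'>
        print(tf.readable(), tf.seekable())  # True True

    # SpooledTemporaryFile is a plain wrapper class; per this report it does
    # not expose readable()/seekable()/writable() on 3.8, so code expecting
    # "exactly TemporaryFile()" behavior can break here.
    with tempfile.SpooledTemporaryFile(max_size=1024) as stf:
        print(type(stf))                     # <class 'tempfile.SpooledTemporaryFile'>
        for name in ("readable", "seekable", "writable"):
            print(name, hasattr(stf, name))  # expected: False (per this report)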
[issue27580] CSV Null Byte Error
Daniel Jewell added the comment:

Forgive my frustration, but @Skip I really don't see how the definition of CSV relating to Excel (or Gnumeric or LibreOffice) has any relevance to whether or not the module (and perhaps Python more generally) supports chr(0x00) as a delimiter. (Neither you nor I get to decide how someone else might write output data...)

While the module is called CSV, it's really not just *Comma* Separated Values - rather, it's a rough approximation of a database table with an optional header row, where rows/records are separated by one character (or sequence) and fields are separated by another. Sometimes that field separator is chr(0x2c) (a comma), sometimes it's chr(0x09) (a tab - "Horizontal Tab/HT" in ASCII parlance) ... or maybe even the actual ASCII "Record Separator" character, chr(0x1e) ... or maybe NUL, chr(0x00).

(1) The module should be 100% agnostic about the separator. The current (3.8.3) error text when trying to use csv.reader(..., delimiter=chr(0x00)) is 'TypeError: "delimiter" must be a 1-character string' ... well, chr(0x00) *is* a 1-character string. It's not a 1-character *printable* string... but then again neither is chr(0x1e) (ASCII "RS" Record Separator), and csv.reader(..., delimiter=chr(0x1e)) appears to work (I haven't tried actual data yet).

(1a) chr(0x00) or '\0' is used quite often in the *NIX world as a convenient record separator that doesn't have escaping problems, because by its very nature it's non-printable - e.g. find . -iname "*something*" -print0 | xargs -0 ...

As to the difficulty of handling 0x00 characters, I dunno ... GNU find, xargs, and gawk manage it; same with FreeBSD. FreeBSD writes the output for "-print0" like this: https://github.com/freebsd/freebsd/blob/508f3673dec94b03f89b9ce9569390d6d9b86a89/usr.bin/find/function.c#L1383 ... and BSD xargs handles it too. I haven't looked at the CPython source to see what's going on - it might be tricky to modify the code to support this... (but then again, IMHO, this sort of thing should have been a consideration in the first place)

I suppose in many ways, the very existence of this specific issue is just one example of what seems to be a larger issue with Python's overall development: it's a great language for *many* things and in many ways, but I've run into so many little fringe "gotchas" where something doesn't work or is limited in some way because, seemingly, functionality is designed around (and defined by) a practical example use case rather than what is or might be *possible* (e.g. the CSV-as-only-a-spreadsheet-interface example -- and I really *don't* mean that as a personal attack @Skip - I am very appreciative of the time and effort you and everyone else have poured into the project...). Is it possible to write a NUL (0x00) character to a file? Through a *NIX pipe? You bet.

(I got a little rant-y .. sorry... I'm sure there's a _lot_ more going on underneath the covers and there are a lot of factors - not limited to just the csv module - as you mentioned. I just really feel like something is "off". Maybe it's my brain - ha. :))

--
nosy: +danieljewell
type: enhancement -> behavior
versions: +Python 3.7, Python 3.8

___
Python tracker <https://bugs.python.org/issue27580>
___
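A short sketch of the asymmetry described in the comment above, assuming Python 3.8 (the sample data is made up; the error text is the one quoted in the report):

    import csv
    import io

    # The ASCII Record Separator (0x1e) is accepted as a delimiter...
    data = "a\x1eb\x1ec\n1\x1e2\x1e3\n"
    for row in csv.reader(io.StringIO(data), delimiter="\x1e"):
        print(row)  # ['a', 'b', 'c'] then ['1', '2', '3']

    # ...but NUL (0x00) is rejected, even though it is also a 1-character string.
    try:
        csv.reader(io.StringIO("a\x00b\n"), delimiter="\x00")
    except TypeError as exc:
        print(exc)  # "delimiter" must be a 1-character string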
[issue19081] zipimport behaves badly when the zip file changes while the process is running
Daniel Jewell added the comment:

In playing with Lib/zipfile.py and Lib/zipimport.py, I noticed that zipfile has supported opportunistic loading of bz2/lzma for ~9 years. However, zipimport assumes only zlib will be used. (Yet zipfile.PyZipFile will happily create zlib/bz2/lzma ZIP archives - zipfile.PyZipFile('mod.zip', 'w', compression=zipfile.ZIP_LZMA), for example.)

At first I wondered why zipimport essentially duplicates a lot from zipfile, but then realized (after reading some of the commit messages around the pure-Python rewrite of zipimport a few years ago) that since zipimport is called as part of startup, there's a need to avoid importing certain modules.

I'm wondering if this specific issue with zipimport is possibly more of an indicator of a larger issue? Specifically:

* The duplication of code between zipfile and zipimport seems like a potential source of bugs. I get the rationale, but perhaps the "base" ZIP functionality ought to be refactored out of both zipimport and zipfile so they can share... and I mean the low-level stuff (compressor, checksum, etc.). zipfile definitely imports more than zipimport, but I haven't looked at what those imports are doing extensively.

Ultimately, the behavior of the new zipimport appears to be, essentially, the same as zipimport.c: per PEP 302 [https://www.python.org/dev/peps/pep-0302/], zipimport.zipimporter gets registered into sys.path_hooks. When you import anything in a zip file, all of the paths get cached into sys.path_importer_cache as zipimport.zipimporter objects. The zipimporter objects, when instantiated, run zipimport._read_directory(), which returns a low-level dict with each key being a filename (module) and each value being a tuple of low-level metadata about that file, including the byte offset into the zip file, time last modified, CRC, etc. (see zipimport.py:330 or so). This is then stored in zipimporter._files.

Critically, the contents of the zip file are not decompressed at this stage: only the metadata of what is in the zip file and (most importantly) where it is, is stored in memory. Only when a module is actually called for loading is the data read, utilizing the cached metadata. There appears to be no provision for (a) verifying that the zip file itself hasn't changed or (b) refreshing the metadata.

So it's no surprise really that this error is happening: the cached metadata instructs zipimporter to decompress a specific byte offset in the zip file *when an import is called*. If the zip file changes on disk between the metadata scan (i.e. the first read of the zip file) and actual loading, bam: error. (A rough reproduction is sketched below.)

There appear to be several ways to fix this ... I'm not sure which is best:

* Possibly lock the ZIP file on first import so it doesn't change (this presents many new issues)
* Rescan the ZIP before each import - but the point of caching the contents appears to be the avoidance of this
* Hash the entire file and compare (expensive CPU-wise)
* Rely on modified time? e.g. cache the whole zip's modification time at first read and then, if it no longer matches, invalidate the cache and rescan
* Cache the entire zip file into memory at first load - this has some advantages (the ZIP data can stay compressed; it would make the import all-or-nothing; faster?), but then there would need to be some kind of variable to limit the size/total size - it becomes a memory hog...
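A rough reproduction sketch of the staleness described above (the archive and module names are made up for illustration; the exact failure message may vary): the zipimporter caches the archive's directory on first import, so rewriting the archive afterwards leaves later imports consulting stale metadata:

    import os
    import sys
    import zipfile

    ZIP = "demo_pkg.zip"  # throwaway archive name used only for this sketch

    # Build a zip containing one module and import from it. At this point the
    # zipimporter reads and caches the archive directory (name -> offset/CRC/...).
    with zipfile.ZipFile(ZIP, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("mod_a.py", "VALUE = 1\n")
    sys.path.insert(0, ZIP)
    import mod_a
    print(mod_a.VALUE)  # 1

    # Rewrite the archive on disk. The zipimporter cached in
    # sys.path_importer_cache still holds the *old* directory: it knows nothing
    # about mod_b and still points at the old offsets for mod_a.
    with zipfile.ZipFile(ZIP, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("mod_a.py", "VALUE = 2\n")
        zf.writestr("mod_b.py", "VALUE = 3\n")

    try:
        import mod_b  # absent from the cached directory -> ImportError
    except ImportError as exc:
        print("stale cache:", exc)

    os.remove(ZIP)  # clean up the throwaway archive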
--
nosy: +danieljewell

___
Python tracker <https://bugs.python.org/issue19081>
___