[issue26175] Fully implement IOBase abstract on SpooledTemporaryFile

2020-05-13 Thread Daniel Jewell

Daniel Jewell added the comment:

To add something additional here:

The current documentation for tempfile.SpooledTemporaryFile indicates "This 
function operates exactly as TemporaryFile() does, except that data is spooled 
in memory until the file size exceeds max_size[...]" (see 
https://docs.python.org/3/library/tempfile.html)

Except that SpooledTemporaryFile *doesn't* act _exactly_ like TemporaryFile(), as 
documented in this issue. TemporaryFile() returns an "_io.BufferedRandom", which 
implements all of the expected "file-like" goodies (.readable, .seekable, etc.); 
SpooledTemporaryFile does not.
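
A quick sketch of the difference (observed behavior on 3.8; it may vary by 
version, and the TemporaryFile type differs by platform):

    import tempfile

    with tempfile.TemporaryFile() as tf:
        print(type(tf))                        # _io.BufferedRandom on most platforms
        print(tf.readable(), tf.seekable())    # True True

    with tempfile.SpooledTemporaryFile(max_size=1024) as stf:
        print(type(stf))                       # tempfile.SpooledTemporaryFile
        # The IOBase introspection methods simply aren't defined on the wrapper:
        for name in ("readable", "writable", "seekable"):
            print(name, hasattr(stf, name))    # all False on 3.8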

Comparing with the 2.x docs, the text for SpooledTemporaryFile() appears to be 
identical or nearly identical to the current 3.8.x docs. This is in line with 
what has already been discussed here.

At a _very minimum_, the documentation should be updated to reflect the current 
differences between TemporaryFile() and SpooledTemporaryFile().

Perhaps an easier change would be to extend TemporaryFile() with a parameter 
that enables functionality similar to SpooledTemporaryFile - namely, 
*memory-only* storage up to a max_size? Or perhaps there is an alternate 
solution that already exists?

Ultimately, the functionality that appears to be missing is an easy way to 
create a file-like object, backed primarily by memory, for reading/writing data 
- i.e. one 100% compatible with 'the usual' file objects returned by open().
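
For what it's worth, a minimal workaround sketch (illustrative only - it leans 
on the private _file attribute, so it may break on other versions):

    import tempfile

    class CompatSpooledTemporaryFile(tempfile.SpooledTemporaryFile):
        # Forward the IOBase introspection methods the 3.8 wrapper is missing
        # to the underlying buffer (BytesIO/StringIO, or the rolled-over file).
        def readable(self):
            return self._file.readable()

        def writable(self):
            return self._file.writable()

        def seekable(self):
            return self._file.seekable()

    with CompatSpooledTemporaryFile(max_size=1024) as f:
        f.write(b"hello")
        f.seek(0)
        print(f.readable(), f.writable(), f.seekable(), f.read())

That is roughly the shape a proper fix inside tempfile could take: delegate 
everything IOBase expects to whatever self._file currently is.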

--
nosy: +danieljewell

___
Python tracker <https://bugs.python.org/issue26175>
___

[issue27580] CSV Null Byte Error

2020-05-29 Thread Daniel Jewell


Daniel Jewell added the comment:

Forgive my frustration, but @Skip, I really don't see how the definition of CSV 
in relation to Excel (or Gnumeric or LibreOffice) has any relevance to whether 
or not the module (and perhaps Python more generally) supports chr(0x00) as a 
delimiter. (Neither you nor I get to decide how someone else might write output 
data...)

While the module is called CSV, it's really not just *Comma* Separated Values - 
rather, it's a rough approximation of a database table with an optional header 
row, where rows/records are separated by some record separator and fields are 
separated by some delimiter. Sometimes that delimiter is chr(0x2c) (a comma), 
sometimes it's chr(0x09) (a tab - in ASCII parlance "Horizontal Tab/HT") ... or 
maybe even the actual ASCII "Record Separator" character (chr(0x1e)) ... or 
maybe NUL, chr(0x00).

(1) The module should be 100% agnostic about the separator - the current 
(3.8.3) error text when trying to use csv.reader(..., delimiter=chr(0x00)) is 
'TypeError: "delimiter" must be a 1-character string' ... well, chr(0x00) *is* 
a 1-character string. It's not a 1-character *printable* string... but then 
again neither is chr(0x1e) (ASCII "RS" Record Separator), and csv.reader(..., 
delimiter=chr(0x1e)) appears to work (I haven't tried actual data yet).
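
To make that concrete (a sketch; the comments show what I'd expect on 3.8.3):

    import csv, io

    rows = io.StringIO("a\x1eb\x1ec\r\nd\x1ee\x1ef\r\n")
    print(list(csv.reader(rows, delimiter=chr(0x1e))))
    # [['a', 'b', 'c'], ['d', 'e', 'f']] - RS is accepted as a delimiter

    try:
        csv.reader(io.StringIO("a\x00b\x00c\r\n"), delimiter=chr(0x00))
    except TypeError as exc:
        print(exc)  # "delimiter" must be a 1-character string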


(1a) chr(0x00) or '\0' is used quite often in the *NIX world as a convenient 
record separator that doesn't have escaping problems because, by its very 
nature, it's non-printable - e.g. find . -iname "*something*" -print0 | 
xargs -0 ...
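
Consuming that kind of output from Python is trivial, which is why the csv 
limitation stands out (sketch; assumes a *NIX box with find(1) on PATH):

    import subprocess

    # Python itself has no trouble with NUL-separated records; the limitation
    # above is specific to the csv dialect validation.
    out = subprocess.run(
        ["find", ".", "-name", "*.py", "-print0"],
        capture_output=True, check=True,
    ).stdout
    paths = [p.decode() for p in out.split(b"\x00") if p]
    print(len(paths), "paths")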

As to the difficulty in handling 0x00 characters, I dunno ... GNU find, xargs, 
and gawk all seem to manage it, and the same goes for FreeBSD. FreeBSD writes 
the output for "-print0" like this: 
https://github.com/freebsd/freebsd/blob/508f3673dec94b03f89b9ce9569390d6d9b86a89/usr.bin/find/function.c#L1383
 ... and BSD xargs handles it too. I haven't looked at the CPython source to 
see what's going on - it might be tricky to modify the code to support this... 
(but then again, IMHO, this sort of thing should have been a consideration in 
the first place)

I suppose in many ways, the very existence of this specific issue is just one 
example of what seems to be a larger issue with Python's overall development: 
it's a great language for *many* things and in many ways, but I've run into so 
many little fringe "gotchas" where something doesn't work or is limited in some 
way because, seemingly, functionality is designed around/defined by a practical 
example use case rather than by what is or might be *possible* (e.g. the 
CSV-as-only-a-spreadsheet-interface example -- and I really *don't* mean that 
as a personal attack @Skip - I am very appreciative of the time and effort you 
and everyone else have poured into the project...). Is it possible to write a 
NUL (0x00) character to a file? Through a *NIX pipe? You bet.

(I got a little rant-y .. sorry... I'm sure there's a _lot_ more going on 
underneath the covers and there are a lot of factors - not limited to just the 
csv module - as you mentioned. I just really feel like something is "off". 
Maybe it's my brain - ha. :))

--
nosy: +danieljewell
type: enhancement -> behavior
versions: +Python 3.7, Python 3.8

___
Python tracker <https://bugs.python.org/issue27580>
___

[issue19081] zipimport behaves badly when the zip file changes while the process is running

2020-09-13 Thread Daniel Jewell


Daniel Jewell added the comment:

In playing with Lib/zipfile.py and Lib/zipimport.py, I noticed that zipfile has 
supported opportunistic loading of bz2/lzma for ~9 years. However, zipimport 
assumes only zlib will be used. (Yet zipfile.PyZipFile will happily create 
zlib/bz2/lzma ZIP archives - e.g. zipfile.PyZipFile('mod.zip', 'w', 
compression=zipfile.ZIP_LZMA).)
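
A quick sketch of the mismatch (file names made up; the exact exception from 
the import will vary, since zipimport only knows how to inflate with zlib):

    import sys, zipfile

    with open("mod.py", "w") as fh:
        fh.write("VALUE = 42\n")

    # PyZipFile happily writes an LZMA-compressed archive...
    with zipfile.PyZipFile("mod.zip", "w", compression=zipfile.ZIP_LZMA) as zf:
        zf.writepy("mod.py")      # stores mod.pyc, LZMA-compressed

    # ...but importing from it is expected to blow up.
    sys.path.insert(0, "mod.zip")
    try:
        import mod
    except Exception as exc:
        print("import failed:", type(exc).__name__, exc)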

At first I wondered why zipimport essentially duplicates a lot from zipfile, 
but then I realized (after reading some of the commit messages around the 
pure-Python rewrite of zipimport a few years ago) that, since zipimport is 
called as part of startup, there's a need to avoid importing certain modules.

I'm wondering if this specific issue with zipimport is possibly more of an 
indicator of a larger issue? 

Specifically:

* The duplication of code between zipfile and zipimport seems like a potential 
source of bugs - I get the rationale, but perhaps the "base" ZIP functionality 
ought to be refactored out of both zipimport and zipfile so they can share 
it... And I mean the low-level stuff (compressor, checksum, etc.). zipfile 
definitely imports more than zipimport, but I haven't looked extensively at 
what those imports are doing.

Ultimately, the behavior of the new pure-Python zipimport appears to be 
essentially the same as the old zipimport.c:

Per PEP-302 [https://www.python.org/dev/peps/pep-0302/], zipimport.zipimporter 
gets registered into sys.path_hooks. When you import anything from a zip file, 
all of the paths get cached into sys.path_importer_cache as 
zipimport.zipimporter objects.

The zipimporter objects, when instantiated, run zipimport._read_directory(), 
which returns a low-level dict with each key being a filename (module) and each 
value being a tuple of low-level metadata about that file, including the byte 
offset into the zip file, last-modified time, CRC, etc. (see zipimport.py:330 
or so). This is then stored in zipimporter._files.
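
An illustrative peek at that cache (it relies on the private _files attribute, 
so treat it as a debugging aid only; the archive/module names are made up):

    import sys, zipfile, zipimport

    with zipfile.ZipFile("pkg.zip", "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("m.py", "X = 1\n")

    imp = zipimport.zipimporter("pkg.zip")
    for name, entry in imp._files.items():
        # entry is (path, compress, data_size, file_size, file_offset,
        #           time, date, crc) per zipimport._read_directory()
        print(name, entry)

    sys.path.insert(0, "pkg.zip")
    import m
    print(sys.path_importer_cache["pkg.zip"])   # the cached zipimporter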

Critically, the contents of the zip file are not decompressed at this stage: 
only the metadata of what is in the zip file and (most importantly) where it is 
gets stored in memory. The data itself is read, using the cached metadata, only 
when a module is actually loaded. There appears to be no provision for (a) 
verifying that the zip file itself hasn't changed or (b) refreshing the 
metadata. So it's really no surprise that this error is happening: the cached 
metadata instructs zipimporter to decompress from a specific byte offset in the 
zip file *when an import is called*. If the zip file changes on disk between 
the metadata scan (i.e. the first read of the zip file) and the actual load, 
bam: error.
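
A rough reproduction of that failure mode (archive/module names made up; 
whether you get a ZipImportError, a decompression error, or silently stale data 
depends on exactly how the offsets move):

    import importlib, sys, zipfile

    def build(payload_size):
        with zipfile.ZipFile("live.zip", "w", zipfile.ZIP_DEFLATED) as zf:
            zf.writestr("filler.dat", "x" * payload_size)  # shifts later offsets
            zf.writestr("live.py", f"SIZE = {payload_size}\n")

    build(10)
    sys.path.insert(0, "live.zip")
    import live
    print(live.SIZE)              # 10

    build(200_000)                # rewrite on disk; cached offsets go stale
    try:
        importlib.reload(live)    # still served by the stale zipimporter
    except Exception as exc:
        print("reload failed:", type(exc).__name__, exc)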

There appear to be several ways to fix this ... I'm not sure which is best:

* Possibly lock the ZIP file on first import so it doesn't change (this 
presents many new issues)
* Rescan the ZIP before each import; but the point of caching the contents 
appears to be the avoidance of this
* Hash the entire file and compare (expensive CPU-wise)
* Rely on the modification time? e.g. cache the archive's mtime at first read 
and, if it later differs, invalidate the cache and rescan (see the sketch after 
this list)
* Cache the entire zip file into memory at first load - this has some 
advantages (the ZIP data can stay compressed; it would make the import 
all-or-nothing; faster?), but then there would need to be some kind of limit on 
the size/total size - otherwise it becomes a memory hog...
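
A conceptual sketch of the mtime idea (hypothetical subclass, not a proposed 
API; it assumes the pure-Python zipimport in 3.8+ and pokes at its private 
_read_directory()/_zip_directory_cache internals):

    import os
    import zipimport

    class MtimeCheckingImporter(zipimport.zipimporter):
        def __init__(self, path):
            super().__init__(path)
            self._dir_mtime = os.stat(self.archive).st_mtime

        def _refresh_if_stale(self):
            mtime = os.stat(self.archive).st_mtime
            if mtime != self._dir_mtime:
                # Archive changed on disk: re-read the table of contents and
                # update both the instance cache and the module-level cache.
                self._files = zipimport._read_directory(self.archive)
                zipimport._zip_directory_cache[self.archive] = self._files
                self._dir_mtime = mtime

        def get_data(self, pathname):
            self._refresh_if_stale()
            return super().get_data(pathname)

        def get_code(self, fullname):
            self._refresh_if_stale()
            return super().get_code(fullname)

A real fix would presumably live inside zipimporter itself (the cached objects 
in sys.path_importer_cache also have to see the refresh), but the 
stat-and-compare step is cheap next to rescanning the directory on every 
import.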

--
nosy: +danieljewell

___
Python tracker <https://bugs.python.org/issue19081>
___