* Fay Stegerman <[email protected]> [2024-04-11 04:28]:
> * Holger Levsen <[email protected]> [2024-04-11 02:14]:
> > > unzip does seem to extract all the files, though it errors out. Not sure
> > > what
> > > diffoscope should do here. This is definitely a broken ZIP file. That
> > > bug
> > > should probably be reported against libscout or whatever tooling it used
> > > to
> > > create that JAR.
> >
> > I agree it's more complicated, but fundamentally, diffoscope should *not*
> > crash
> > here! (but rather report the broken zip file.)
>
> I think we all agree it shouldn't crash :)
>
> What I meant is that I'm not sure it should simply catch the error, report the
> file as broken, and not attempt extraction, or if it makes sense to attempt to
> work around this issue, at least in cases like this specific one where the
> entries are exact duplicates and the files can presumably be safely extracted.
> I think my workaround (which could be implemented slightly differently as
> well,
> without modifying the ZipFile, but processing it differently in diffoscope)
> would accomplish that for this JAR at least. I could make an MR for that.
> Though as I said I will also report this upstream to cpython, probably
> tomorrow.
>
> - Fay
The attached patch avoids the crash in this case, FWIW. I would still recommend
catching the error for other cases.
- Fay
diff --git a/diffoscope/comparators/zip.py b/diffoscope/comparators/zip.py
index 2a27042a..4bfb1527 100644
--- a/diffoscope/comparators/zip.py
+++ b/diffoscope/comparators/zip.py
@@ -182,7 +182,12 @@ class ZipDirectory(Directory, ArchiveMember):
class ZipContainer(Archive):
def open_archive(self):
- return zipfile.ZipFile(self.source.path, "r")
+ zf = zipfile.ZipFile(self.source.path, "r")
+ self.name_to_info = {}
+ for info in zf.infolist():
+ if info.filename not in self.name_to_info:
+ self.name_to_info[info.filename] = info
+ return zf
def close_archive(self):
self.archive.close()
@@ -199,7 +204,8 @@ class ZipContainer(Archive):
).encode(sys.getfilesystemencoding(), errors="replace")
try:
- with self.archive.open(member_name) as source, open(
+ info = self.name_to_info[member_name]
+ with self.archive.open(info) as source, open(
targetpath, "wb"
) as target:
shutil.copyfileobj(source, target)