Package: diffoscope
Version: 224
Severity: wishlist

It would be nice if diffoscope could highlight whether HTML files differ in their text output or only in the non-text HTML bytes, such as the page title or the stylesheet.
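As a rough illustration of the distinction between text and non-text differences, here is a minimal Python sketch. It is not diffoscope code: it assumes the standard library's html.parser as a stand-in for an external converter such as w3m or html2text, and the names TextExtractor, html_to_text and classify_html_diff are made up for this example.

```python
import difflib
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring <title>, <style> and <script> content."""

    SKIPPED = {"title", "style", "script"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside the skipped elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Hypothetical stand-in for a user-selected HTML-to-text tool."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def classify_html_diff(old_html, new_html):
    """Return (label, unified diff lines), preferring a diff of the text."""
    old_text, new_text = html_to_text(old_html), html_to_text(new_html)
    if old_text != new_text:
        label, a, b = "text differs", old_text, new_text
    elif old_html != new_html:
        label, a, b = "text identical, markup differs", old_html, new_html
    else:
        return "identical", []
    diff = difflib.unified_diff(
        a.splitlines(), b.splitlines(), "old", "new", lineterm=""
    )
    return label, list(diff)
```

The label plays the role of the comment that would tell the reader which kind of difference is being shown. A real implementation would presumably call whichever converter the user selected rather than a built-in parser.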
The proposal is that, by default, diffoscope would convert HTML files to text and diff the result. If there are text differences, it would display them with a comment saying that these are differences in the text. If the text does not differ, diffoscope would do a line diff of the HTML files themselves, with a comment saying that the text of the two files is the same.

This is useful in situations such as comparing old versions of a document with newer versions. In particular, it would have been useful when preparing this mail to debian-mentors:

https://lists.debian.org/msgid-search/197a4671e7694c24424b91b4d7288867c0c85d9b.ca...@debian.org

Since there are many different tools for converting HTML to text, and each of them has different bugs and features, this feature should probably allow the user to choose the tool they want to use.

$ head -vn-0 *.html
==> bar.html <==
<html>
<head>
<title>bar</title>
<style>
<!--
BODY {
BACKGROUND: #FFFFFF;
COLOR: #000000;
-->
</style>
</head>
<body>
<p>
bar
</p>
</body>
</html>

==> foo.html <==
<html>
<head>
<title>foo</title>
<style>
<!--
BODY {
BACKGROUND: #000000;
COLOR: #FFFFFF;
-->
</style>
</head>
<body>
<p>
foo
</p>
</body>
</html>

$ diffoscope foo.html bar.html
--- foo.html
+++ bar.html
@@ -1,17 +1,17 @@
 <html>
 <head>
-<title>foo</title>
+<title>bar</title>
 <style>
 <!--
 BODY {
-BACKGROUND: #000000;
-COLOR: #FFFFFF;
+BACKGROUND: #FFFFFF;
+COLOR: #000000;
 -->
 </style>
 </head>
 <body>
 <p>
-foo
+bar
 </p>
 </body>
 </html>

$ diff -u <(w3m -dump foo.html) <(w3m -dump bar.html)
--- /dev/fd/63  2022-10-22 08:52:33.581676470 +0800
+++ /dev/fd/62  2022-10-22 08:52:33.585676477 +0800
@@ -1,2 +1,2 @@
-foo
+bar
 

$ diff -u <(html2text foo.html) <(html2text bar.html)
--- /dev/fd/63  2022-10-22 08:54:43.793859066 +0800
+++ /dev/fd/62  2022-10-22 08:54:43.781859049 +0800
@@ -1 +1 @@
-foo
+bar

-- System Information:
Debian Release: bookworm/sid
  APT prefers testing-debug
  APT policy: (900, 'testing-debug'), (900, 'testing'), (800, 'unstable-debug'), (800, 'unstable'), (790, 'buildd-unstable'), (700, 'experimental-debug'), (700, 'experimental'), (690, 'buildd-experimental')
merged-usr: no
Architecture: amd64 (x86_64)

Kernel: Linux 6.0.0-1-amd64 (SMP w/8 CPU threads; PREEMPT)
Kernel taint flags: TAINT_OOT_MODULE, TAINT_UNSIGNED_MODULE
Locale: LANG=en_AU.utf8, LC_CTYPE=en_AU.utf8 (charmap=UTF-8), LANGUAGE=en_AU:en
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages diffoscope depends on:
ii  diffoscope-minimal  224

Versions of packages diffoscope recommends:
ii  abootimg  0.6-1+b2
ii  acl  2.3.1-1
ii  androguard  3.4.0~a1-5
ii  apksigner  31.0.2-1
ii  apktool  2.6.1+dfsg.1-2
ii  binutils-multiarch  2.39-8
ii  bzip2  1.0.8-5+b1
ii  caca-utils  0.99.beta20-3
ii  colord  1.4.6-1
ii  coreboot-utils  4.15~dfsg-2
ii  db-util  5.3.1+nmu1
ii  default-jdk [java-sdk]  2:1.11-72
ii  default-jdk-headless  2:1.11-72
pn  device-tree-compiler  <none>
pn  docx2txt  <none>
ii  e2fsprogs  1.46.6~rc1-1+b1
ii  enjarify  1:1.0.3-5
ii  ffmpeg  7:5.1.2-1
ii  fontforge-extras  1:20220308~dfsg-1
pn  fp-utils  <none>
ii  genisoimage  9:1.1.11-3.4
ii  gettext  0.21-9
ii  ghc  9.0.2-4
ii  ghostscript  9.56.1~dfsg-1
ii  giflib-tools  5.2.1-2.5
ii  gnumeric  1.12.52-1
ii  gnupg  2.2.39-1
ii  gnupg-utils  2.2.39-1+b1
pn  hdf5-tools  <none>
ii  imagemagick  8:6.9.11.60+dfsg-1.3+b3
ii  imagemagick-6.q16 [imagemagick]  8:6.9.11.60+dfsg-1.3+b3
ii  jsbeautifier  1.14.4-1
ii  libarchive-tools  3.6.0-1
pn  libxmlb-dev  <none>
ii  llvm  1:14.0-55.2+b1
ii  lz4 [liblz4-tool]  1.9.4-1
pn  mono-utils  <none>
ii  ocaml-nox  4.13.1-3
pn  odt2txt  <none>
pn  oggvideotools  <none>
ii  openjdk-11-jdk [java-sdk]  11.0.17+8-2
ii  openssh-client  1:9.0p1-1+b2
ii  openssl  3.0.5-4
ii  pgpdump  0.34-1
ii  poppler-utils  22.08.0-2.1
pn  procyon-decompiler  <none>
ii  python3-argcomplete  2.0.0-1
ii  python3-binwalk  2.3.3+dfsg1-2
ii  python3-debian  0.1.48
ii  python3-defusedxml  0.7.1-2
ii  python3-guestfs  1:1.48.4-2+b1
hi  python3-jsondiff  1.1.1-4
ii  python3-pdfminer  20220319+dfsg-1
ii  python3-progressbar  2.5-3
ii  python3-pypdf2  2.11.0-1
ii  python3-pyxattr  0.7.2-2+b1
ii  python3-rpm  4.17.1.1+dfsg-1
ii  python3-tlsh  3.4.4+20151206-1.4+b2
pn  r-base-core  <none>
pn  radare2  <none>
ii  rpm2cpio  4.17.1.1+dfsg-1
ii  sng  1.1.0-4
ii  sqlite3  3.39.4-1
ii  squashfs-tools  1:4.5.1-1
ii  tcpdump  4.99.1-4+b1
ii  u-boot-tools  2022.10+dfsg-1
ii  unzip  6.0-27
pn  wabt  <none>
pn  xmlbeans  <none>
ii  xxd  2:9.0.0626-1
ii  xz-utils  5.2.5-2.1
ii  zip  3.0-12
ii  zstd  1.5.2+dfsg-1

Versions of packages diffoscope suggests:
ii  libjs-jquery  3.6.1+dfsg+~3.5.14-1

-- no debconf information

--
bye,
pabs

https://wiki.debian.org/PaulWise
_______________________________________________
Reproducible-builds mailing list
Reproducible-builds@alioth-lists.debian.net
https://alioth-lists.debian.net/cgi-bin/mailman/listinfo/reproducible-builds