Hi Ludo, On Tue, 22 Dec 2020 at 23:01, Ludovic Courtès <l...@gnu.org> wrote:
> I wanted to evaluate that by looking at store items corresponding to > subsequent revisions of a package (be it different versions or rebuilds > induced by dependencies), and this is what the program below does. > > Here are preliminary results for a few packages: [...] > The reason I’m looking at this is to understand how much would be gained > in terms of bandwidth usage if we were able to avoid downloading > individual files already in the store. It would seem to be rather > encouraging. This is really interesting. Especially through normal distribution and co. lens. The mean needs to be weighed: 10% of 180MB is not the same than 55% of 1MB (icecat vs diffoscope, from “guix size” and I could misread the numbers). The variance between revisions is highly variable (from 0 to significant). The one between packages says how fat or fit the normal distribution is. A smarter distribution could lead to some bandwith saving. The question is: predict how many? Is it worth to add a bit of complexity to save corner cases or for all. By corner cases, I mean that for example I am not updating my Emacs at each revisions. <https://data.guix.gnu.org/repository/1/branch/master/package/emacs/output-history> > Below is the program that does that. It grabs revision history from > data.guix.gnu.org, fetches nars from ci.guix.gnu.org, computes a > “digest” (list of files along with their hash and size), compares > package digests pairwise, and plots the result with Guile-Charting. > Example REPL session: > > --8<---------------cut here---------------start------------->8--- > scheme@(similarities)> (pairwise-similarities (package-archive-contents > "icecat" #:max 4)) > updating substitutes from 'https://ci.guix.gnu.org'... 100.0% > $86 = (17363915/196387883 11380615/98152193) > scheme@(similarities)> (map exact->inexact $86) > $87 = (0.08841642740249916 0.11594865740799087) > > […] > > scheme@(similarities)> ,pp (at-most 10 (package-instances "r-minimal") ) > $100 = (#<<package-instance> version: "4.0.3" output: > "/gnu/store/vv3ca1r5zw5y35xgkix4r80hdnncx52b-r-minimal-4.0.3"> > #<<package-instance> version: "4.0.3" output: > "/gnu/store/5dzad7nhhv3dvmap60d6gsj9ppflgzrd-r-minimal-4.0.3"> > #<<package-instance> version: "4.0.3" output: > "/gnu/store/01xi3sig314wgwa1j9sxk37vl816mj74-r-minimal-4.0.3"> > #<<package-instance> version: "4.0.2" output: > "/gnu/store/nv7lqhnm0mncqwdpkkhnlsgb562lcwff-r-minimal-4.0.2"> > #<<package-instance> version: "4.0.2" output: > "/gnu/store/w0izbm8q26dmyndhv46xr7dgz1irai1z-r-minimal-4.0.2"> > #<<package-instance> version: "4.0.2" output: > "/gnu/store/yd83ibzxjrb7cgcc6d4smx4pqcdl8r3p-r-minimal-4.0.2"> > #<<package-instance> version: "4.0.1" output: > "/gnu/store/kpdh0lwlwcwfmmfzqhwbi6j7m4zzxlmn-r-minimal-4.0.1"> > #<<package-instance> version: "4.0.1" output: > "/gnu/store/9x9nzzsiyn1q7g5myhgwjh0yx9js3nrj-r-minimal-4.0.1"> > #<<package-instance> version: "4.0.0" output: > "/gnu/store/ahbm2gsqc3420a23pcwrxd4pdhl7rdpp-r-minimal-4.0.0"> > #<<package-instance> version: "4.0.0" output: > "/gnu/store/0sgqhj2628js419wvw1vc1cw07wil7gr-r-minimal-4.0.0">) > $101 = (#<<package-instance> version: "3.6.3" output: > "/gnu/store/gmx6p0wk3xbc9ylv83zfj855azgjxr0p-r-minimal-3.6.3"> > #<<package-instance> version: "3.6.2" output: > "/gnu/store/dnb6fzp5295fcda66dnjk2y51mcva20f-r-minimal-3.6.2"> > #<<package-instance> version: "3.6.1" output: > "/gnu/store/gd6sm42b6fr1qgyp6p1zp3z4aavxwyk2-r-minimal-3.6.1"> > #<<package-instance> version: "3.6.1" output: > "/gnu/store/lpmfhxys3vsaqmqvj85r344ygfmlmlbg-r-minimal-3.6.1"> > #<<package-instance> version: "3.6.1" output: > "/gnu/store/4413h13v7zrb7rp9lvyp2gfx2laj60wm-r-minimal-3.6.1"> > #<<package-instance> version: "3.6.1" output: > "/gnu/store/zm5pl1y0wmh3c845498pbjvrzrm6sb07-r-minimal-3.6.1"> > #<<package-instance> version: "3.6.1" output: > "/gnu/store/xrv7y4xgrdrsx5qba7144cpw69qc5f0j-r-minimal-3.6.1"> > #<<package-instance> version: "3.6.0" output: > "/gnu/store/cbwhhqv69xi3k3g25dcfhwjkkf2427rp-r-minimal-3.6.0"> > #<<package-instance> version: "3.6.0" output: > "/gnu/store/69k46yr70zkzzz9v18wl7sxasha5m0a5-r-minimal-3.6.0"> > #<<package-instance> version: "3.6.0" output: > "/gnu/store/71w0383x0hay6ng1zaddz59by3g0gxaf-r-minimal-3.6.0"> > #<<package-instance> version: "3.6.0" output: > "/gnu/store/m317mg8faxp9qn949dnv1xgsxyw57s3x-r-minimal-3.6.0"> > #<<package-instance> version: "3.5.3" output: > "/gnu/store/33qdfplk4riig77vqg758m1zll16n6f0-r-minimal-3.5.3"> > #<<package-instance> version: "3.5.3" output: > "/gnu/store/g8gkrcxn49yj8zjr59l7y4k7wgar5brq-r-minimal-3.5.3"> > #<<package-instance> version: "3.5.1" output: > "/gnu/store/vrgbyvnx7b1sva2ij5a1gwrkbfwmg3lm-r-minimal-3.5.1"> > #<<package-instance> version: "3.5.1" output: > "/gnu/store/4fzw7s0cv2zbixw4wb58zq6lkd97ghnf-r-minimal-3.5.1"> > #<<package-instance> version: "3.5.1" output: > "/gnu/store/yb5048dr40nbmyfa1ar4hfiy3kd06v4c-r-minimal-3.5.1">) > scheme@(similarities)> (similarity-chart '("icecat" "gimp" "openmpi" "emacs" > "diffoscope" "r-minimal") "/tmp/t.png" #:max 8) > updating substitutes from 'https://ci.guix.gnu.org'... 100.0% > $102 = #<cairo-surface 7f502c7adb10> > --8<---------------cut here---------------end--------------->8--- > > Thoughts? :-) > > Ludo’. > > ;;; Copyright © 2020 Ludovic Courtès <l...@gnu.org> > ;;; Hereby placed under the GNU General Public License, version 3 or later. > > (define-module (similarities) > #:use-module (json) > #:use-module ((gcrypt hash) #:select (open-sha256-port)) > #:use-module ((guix store) #:select (%default-substitute-urls)) > #:use-module ((guix progress) #:hide (dump-port*)) > #:use-module (guix http-client) > #:use-module (guix serialization) > #:use-module (guix scripts substitute) > #:use-module (srfi srfi-1) > #:use-module (srfi srfi-9) > #:use-module (srfi srfi-11) > #:use-module (rnrs bytevectors) > #:use-module (charting) > #:use-module (ice-9 match)) > > > ;;; > ;;; Data Service client. > ;;; > > (define-json-mapping <package-instance> make-package-instance > package-instance? > json->package-instance > (version package-instance-version) > (output package-instance-output "derivation")) > > (define %data-service-base-url > (make-parameter "https://data.guix.gnu.org")) > > (define* (package-instances package #:key (branch "master")) > "Return a list of <package-instance> representing instances of PACKAGE over > time known to the Data Service." > (match (assoc-ref (json->scm > (http-fetch (string-append (%data-service-base-url) > "/repository/1/branch/" > branch "/package/" package > "/output-history.json"))) > "derivations") > (#(lst ...) > (map json->package-instance lst)) > (#f > #f))) > > > ;;; > ;;; Similarity measurement. > ;;; > > (define (port-sha256* port size) ;from (guix scripts challenge) > ;; Like 'port-sha256', but limited to SIZE bytes. > (let-values (((out get) (open-sha256-port))) > (dump-port* port out size) > (close-port out) > (get))) > > (define-record-type <file-info> > (file-info name type size sha256) > file-info? > (name file-info-name) > (type file-info-type) > (size file-info-size) > (sha256 file-info-sha256)) > > (define (archive-contents port) > "Return a list of <file-info> records from the nar read from PORT." > ;; As in (guix scripts challenge), but return <file-info> records that > ;; include file size and ignore symlinks. > (fold-archive (lambda (file type contents result) > (match type > ((or 'regular 'executable) > (match contents > ((port . size) > (cons (file-info file type size > (port-sha256* port size)) > result)))) > ('directory result) > ('directory-complete result) > ('symlink result))) > '() > port > "")) > > (define (narinfo-contents narinfo) ;from (guix scripts challenge) > "Fetch the nar described by NARINFO and return a list representing the file > it contains." > ((@@ (guix scripts challenge) call-with-nar) narinfo archive-contents)) > > (define (at-most max-length lst) ;from (guix scripts substitute) > "If LST is shorter than MAX-LENGTH, return it and the empty list; otherwise > return its MAX-LENGTH first elements and its tail." > (let loop ((len 0) > (lst lst) > (result '())) > (match lst > (() > (values (reverse result) '())) > ((head . tail) > (if (>= len max-length) > (values (reverse result) lst) > (loop (+ 1 len) tail (cons head result))))))) > > (define* (package-archive-contents package > #:key (max 10) > (substitute-urls %default-substitute-urls)) > "Look at the MAX latest instances of PACKAGE, fetch them, and return a > summary of their contents as returned by 'narinfo-contents'." > (let ((instances (at-most max (package-instances package)))) > (map narinfo-contents > (lookup-narinfos (first substitute-urls) > (map package-instance-output instances))))) > > (define (similarity contents1 contents2) > "Return two values: the ratio of identical bytes between CONTENTS2 and > CONTENTS2, and the ratio of identical files." > (define (matches name) > (lambda (info) > (string=? (file-info-name info) name))) > > (let ((files (delete-duplicates > (append (map file-info-name contents1) > (map file-info-name contents2))))) > (let loop ((files files) > (seen 0) > (identical 0) > (seen-bytes 0) > (identical-bytes 0)) > (match files > (() > (values (/ identical-bytes seen-bytes) > (/ identical seen))) > ((head . tail) > (let ((file1 (find (matches head) contents1)) > (file2 (find (matches head) contents2))) > (cond ((not file1) > (loop tail > (+ seen 1) identical > (+ seen-bytes (file-info-size file2)) > identical-bytes)) > ((not file2) > (loop tail > (+ seen 1) identical > (+ seen-bytes (file-info-size file1)) > identical-bytes)) > (else > (let ((identical? > (and (= (file-info-size file1) > (file-info-size file2)) > (bytevector=? (file-info-sha256 file1) > (file-info-sha256 file2))))) > (loop tail > (+ seen 1) > (if identical? > (+ identical 1) > identical) > (+ seen-bytes > (max (file-info-size file1) > (file-info-size file2))) > (if identical? > (+ identical-bytes (file-info-size file1)) > identical-bytes))))))))))) > > (define (pairwise-similarities contents) > (let loop ((contents contents) > (similarities '())) > (match contents > ((or () (_)) > (reverse similarities)) > ((a b . tail) > (loop (cons b tail) > (cons (similarity a b) similarities)))))) > > (define* (similarity-chart packages file > #:key (max 20)) > (make-bar-chart > "Similarity across subsequent package revisions" > (map (lambda (package) > (let* ((contents (package-archive-contents package > #:max max)) > (similarities (pairwise-similarities contents)) > (labels (iota (length similarities)))) > `(,package > ,@(map (lambda (label ratio) > `(,(* ratio 100.) ,(number->string label))) > labels similarities)))) > packages) > #:chart-params '(#:x-axis-label "package" > #:y-axis-label "similarity ratio (%)") > #:write-to-png file))