Hello Guix of the hackathon! :-) Every time core packages like libc or Bash are changed, we end up rebuilding (or re-downloading) the world, and storing multiple copies of mostly-identical packages. The daemon implements a simple deduplication strategy, where identical files found in the store are hard-linked together (via the /gnu/store/.links directory), so as to achieve single-instance storage.
I’ve upgraded my system to current master (with the Bash fix), and I still have a bunch of old profile generations, so I wanted to see how efficient deduplication is. Some characteristics of my system: --8<---------------cut here---------------start------------->8--- $ ls -ld /gnu/store/*-bash-4.3 /gnu/store/*-bash-4.3.25 | wc -l 9 $ ls -ld /gnu/store/*-glibc-2.{20,19} | wc -l 4 $ guix package -I | wc -l 178 $ guix package --list-generations | grep ^Gen | wc -l 31 --8<---------------cut here---------------end--------------->8--- The attached script has allowed me to see how much deduplication is taking place in my store. The first question is: how much disk space is actually saved by deduplication? --8<---------------cut here---------------start------------->8--- scheme@(guile-user)> (saved-space (links)) $43 = 15128046966 scheme@(guile-user)> (/ $43 GiB 1.) $44 = 14.089091649278998 --8<---------------cut here---------------end--------------->8--- That’s 14 GiB saved by deduplication (!). We can plot the cumulative distribution function (CDF) of the number of hard links or each deduplicated file:
It’s a bit hard to read, but we see that most than half of the files under /gnu/store/.links have 1 or 2 hard links, and 80% have 4 hard links or more. The CDF of the size of deduplicated files is like this:
Half of the deduplicated files are 4 KiB or less, and 90% of the deduplicated files are between 0 and 64 KiB. The code for that is attached below; it uses Andy’s cool Guile-Charting. Happy hacking! Ludo’.
(use-modules (charting) ((guix store) #:select (%store-prefix)) (ice-9 ftw) (ice-9 match) (srfi srfi-1) (srfi srfi-9)) (define-record-type <deduplicated-file> (deduplicated-file name size links) deduplicated-file? (name deduplicated-file-name) (size deduplicated-file-size) (links deduplicated-file-link-count)) (define %links-directory (string-append (%store-prefix) "/.links")) (define (links) "Return a list of <deduplicated-file>." (file-system-fold (const #t) (lambda (file stat result) ;leaf (cons (deduplicated-file file (stat:size stat) (stat:nlink stat)) result)) (lambda (directory stat result) ;down result) (lambda (directory stat result) ;up result) (const #f) ;skip (lambda (file stat errno result) (error "i/o error" file errno)) '() %links-directory lstat)) (define KiB (expt 2 10)) (define MiB (* KiB KiB)) (define GiB (* KiB MiB)) (define (saved-space files) "Return the total amount of saved space given FILES, a list of <deduplicated-file>." (fold (lambda (df result) (match df (($ <deduplicated-file> name size links) (+ result (* size (- links 1)))))) 0 files)) (define (cumulative-distribution files property) "Return a list of (VALUE . COUNT) pairs representing the number of FILES whose PROPERTY is VALUE or less." (define (file<? df1 df2) (< (property df1) (property df2))) (fold (lambda (df result) (match result (((value . count) . rest) (let ((current (property df))) (if (= value current) (alist-cons value (+ count 1) rest) (alist-cons current (+ count 1) result)))) (_ (alist-cons (property df) 1 result)))) '() (sort files file<?))) (define* (plot-distribution distribution output #:key subtitle max-x group-name x-axis-label) "Plot DISTRIBUTION, and produce file OUTPUT." (define (log2 n) (let loop ((result 1)) (if (zero? (ash n (- result))) (- result 1) (loop (+ 1 result))))) (define (format-log2-tick tick) ;; (string-append "2^" ;; (number->string (log2 (inexact->exact tick)))) (number->string (inexact->exact tick))) (define (adjust-items total) (lambda (x) (match x ;; XXX: Filter out the two cases that would give us a numerical ;; overflow. ((0 . _) #f) ((1 . _) #f) ((value . count) (and (or (not max-x) (< value max-x)) (cons value (* 100. (/ count total)))))))) (match distribution (((_ . total) . rest) (let ((percent (filter-map (adjust-items total) distribution))) (make-scatter-plot #:title (string-append "cumulative distribution by " subtitle) #:data `((,group-name ,@percent)) #:x-axis-label x-axis-label #:y-axis-label "%" #:tick-label-formatter format-log2-tick #:log-x-base 2 #:min-x 1 #:max-y 101 #:write-to-png output)))))
signature.asc
Description: PGP signature