Hi! l...@gnu.org (Ludovic Courtès) skribis:
> ‘LocalStore::removeUnusedLinks’ traverses all the entries in > /gnu/store/.links and calls lstat(2) on each one of them and checks > ‘st_nlink’ to determine whether they can be deleted. > > There are two problems: lstat(2) can be slow on spinning disks as found > on hydra.gnu.org, and the algorithm is proportional in the number of > entries in /gnu/store/.links, which is a lot on hydra.gnu.org. Taking a step back, we could perhaps mitigate this with heuristics to reduce the number of entries in .links: 1. Do not deduplicate files with a size lower than some threshold; 2. Delete links with st_nlink <= 3 (instead of <= 2); that would prevent *further* deduplication of those files, but they’d already have two instances sharing the same inode; 3. Stop deduplicating once the number of entries in .links has reached a certain threshold. For #1, a key insight is that about 30% of the files actually deduplicated (in my store, where /gnu/store/.links has 2.2M entries) are smaller than 1 KiB:
… but 85% of them have at most 4 links (thus, saving up to 2 KiB by deduplicating):
On my laptop, we’re talking about space savings of 325 MiB, a tiny fraction of my store: --8<---------------cut here---------------start------------->8--- scheme@(guile-user)> (saved-space (filter (lambda (file) (< (deduplicated-file-size file) 1024)) l)) $40 = 325914739 --8<---------------cut here---------------end--------------->8--- Files smaller than 1 KiB represent 35% of the entries in .links:
By not deduplicating files smaller than 1 KiB, we’d reduce the number of entries by 35%, which should already have a tangible impact on performance. It’d be a “mitigation” more than a “fix”, but it has a good work/reward ratio. We could conduct a similar analysis for #2. #3 is more difficult to implement because you cannot know the number of entries in .links until you’ve traversed it (note that currently deduplication stops when link(2) returns ENOSPC in .links). I’m attaching the script I’ve used for that, derived from an earlier experiment¹. You’re welcome to give it a spin! Thoughts? Ludo’. ¹ https://lists.gnu.org/archive/html/guix-devel/2014-09/msg00422.html
(use-modules (charting) ((guix store) #:select (%store-prefix)) (ice-9 ftw) (ice-9 match) (srfi srfi-1) (srfi srfi-9)) (define-record-type <deduplicated-file> (deduplicated-file name size links) deduplicated-file? (name deduplicated-file-name) (size deduplicated-file-size) (links deduplicated-file-link-count)) (define %links-directory (string-append (%store-prefix) "/.links")) (define (links) "Return a list of <deduplicated-file>." (file-system-fold (const #t) (lambda (file stat result) ;leaf (cons (deduplicated-file file (stat:size stat) (stat:nlink stat)) result)) (lambda (directory stat result) ;down result) (lambda (directory stat result) ;up result) (const #f) ;skip (lambda (file stat errno result) (error "i/o error" file errno)) '() %links-directory lstat)) (define KiB (expt 2 10)) (define MiB (* KiB KiB)) (define GiB (* KiB MiB)) (define (saved-space files) "Return the total amount of saved space given FILES, a list of <deduplicated-file>." (fold (lambda (df result) (match df (($ <deduplicated-file> name size links) (when (< links 2) (error "too few links" name links)) (+ result (* size (- links 2)))))) 0 files)) (define (cumulative-distribution files property) "Return a list of (VALUE . COUNT) pairs representing the number of FILES whose PROPERTY is VALUE or less." (define (file<? df1 df2) (< (property df1) (property df2))) (fold (lambda (df result) (match result (((value . count) . rest) (let ((current (property df))) (if (= value current) (alist-cons value (+ count 1) rest) (alist-cons current (+ count 1) result)))) (_ (alist-cons (property df) 1 result)))) '() (sort files file<?))) (define* (plot-distribution distribution output #:key (subtitle "SUBTITLE") max-x (group-name "GROUP") x-axis-label) "Plot DISTRIBUTION, and produce file OUTPUT." (define (log2 n) (let loop ((result 1)) (if (zero? (ash n (- result))) (- result 1) (loop (+ 1 result))))) (define (format-log2-tick tick) ;; (string-append "2^" ;; (number->string (log2 (inexact->exact tick)))) (number->string (inexact->exact tick))) (define (adjust-items total) (lambda (x) (match x ;; XXX: Filter out the two cases that would give us a numerical ;; overflow. ((0 . _) #f) ((1 . _) #f) ((value . count) (and (or (not max-x) (< value max-x)) (cons value (* 100. (/ count total)))))))) (match distribution (((_ . total) . rest) (let ((percent (filter-map (adjust-items total) distribution))) (make-scatter-plot #:title (string-append "Cumulative distribution by " subtitle) #:data `((,group-name ,@percent)) #:x-axis-label x-axis-label #:y-axis-label "%" #:tick-label-formatter format-log2-tick #:log-x-base 2 #:min-x 1 #:max-y 101 #:write-to-png output))))) #! Examples (define l (links)) ;this is the expensive part (plot-distribution (cumulative-distribution l deduplicated-file-link-count) "/tmp/nlink.png" #:x-axis-label "number of hard links" #:subtitle "hard link count" #:max-x 2048 #:group-name "nlinks") (plot-distribution (cumulative-distribution (filter (lambda (file) (< (deduplicated-file-size file) 1024)) l) deduplicated-file-link-count) "/tmp/nlink-small.png" #:x-axis-label "number of hard links" #:subtitle "hard link count for files < 1KiB" #:max-x 2048 #:group-name "nlinks") (plot-distribution (cumulative-distribution l deduplicated-file-size) "/tmp/size.png" #:x-axis-label "file size" #:subtitle "file size" #:max-x 32768 #:group-name "size (B)") (plot-distribution (cumulative-distribution (filter (lambda (f) (> (deduplicated-file-link-count f) 2)) l) deduplicated-file-size) "/tmp/size-deduplicated.png" #:x-axis-label "file size" #:subtitle "size for files actually deduplicated" #:max-x 32768 #:group-name "size (B)") (plot-distribution (cumulative-distribution (filter (lambda (file) (< (deduplicated-file-size file) 1024)) l) (lambda (file) (* (deduplicated-file-size file) (- (deduplicated-file-link-count file) 2)))) "/tmp/size-savings.png" #:x-axis-label "savings" #:subtitle "savings for files < 1KiB" #:max-x 32768 #:group-name "savings (B)") !#