Hi Andrew,

> I want to find duplicates in that 3 TB of files, plus another location
> on another computer, so I might write a little program to do that by
> collecting filenames, checksums, file sizes, etc. in each location and
> then processing the results afterwards.

If you're initially interested only in files containing the same bytes,
and can afford to throw CPU at it, then run b2sum(1) with -z on every
file, sort(1) to bring identical digests together, and then uniq(1)
with -D and -w to keep only the duplicate-digest lines.
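
Something along these lines should do it; it's an untested sketch, and
/some/dir is just a stand-in for wherever the files live:

    # NUL-terminated records mean awkward filenames survive sort and uniq.
    # b2sum's default BLAKE2b digest is 128 hex characters, hence -w 128.
    find /some/dir -type f -print0 |
        xargs -0 b2sum -z |
        sort -z |
        uniq -z -D -w 128 |
        tr '\0' '\n'    # back to newlines just for reading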

An alternative digester is xxHash.
https://github.com/Cyan4973/xxHash?tab=readme-ov-file#readme

> I had 1 TB and 2 TB hard disks and used btrfs to combine them.

btrfs supports having one disk block be part of multiple files, and
there are programs which will incrementally de-duplicate using this.
You could re-run one of them after adding each tranche of files.
https://btrfs.readthedocs.io/en/latest/Deduplication.html
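
duperemove is one of the programs that page lists.  From memory,
something like the below de-dupes a tree in place and keeps its hashes
on disk so a later run only digests the new files, but the paths here
are made up and it's worth checking the man page rather than trusting
my recollection of the options:

    # Untested sketch: -d submits the dedupe requests rather than just
    # reporting, -r recurses, and --hashfile keeps state between runs.
    duperemove -dr --hashfile=/var/tmp/dedupe.hash /mnt/pool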

-- 
Cheers, Ralph.

-- 
  Next meeting: Online, Jitsi, Tuesday, 2024-06-04 20:00
  Check to whom you are replying
  Meetings, mailing list, IRC, ...  http://dorset.lug.org.uk
  New thread, don't hijack:  mailto:dorset@mailman.lug.org.uk
