Hi Andrew,

> I want to find duplicates in that 3 TB of files, plus another location
> on another computer, so I might write a little program to do that by
> collecting filenames, checksums, file sizes, etc. in each location and
> then processing the results afterwards.

If you're only initially interested in files containing the same bytes
and can afford to throw CPU at it then run b2sum(1) with -z on every
file, sort(1) to bring identical digests together, and then uniq(1) to
keep only the duplicate-digest lines with -Dw.
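Untested, but strung together that could look something like this,
assuming GNU coreutils throughout and b2sum's default 512-bit digest,
which prints as 128 hex digits; adjust the starting directory to suit.

    find . -type f -print0 |   # one NUL-terminated record per file
        xargs -0 b2sum -z |    # digest each file, keep NUL terminators
        sort -z |              # identical digests become adjacent
        uniq -z -D -w 128 |    # keep only records whose digest repeats
        tr '\0' '\n'           # back to newlines for reading

Sticking with NUL termination through -print0/-0/-z means filenames
containing newlines or other oddities survive the round trip.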
An alternative digester is xxHash.

    https://github.com/Cyan4973/xxHash?tab=readme-ov-file#readme

> I had 1 TB and 2 TB hard disks and used btrfs to combine them.

btrfs supports having one disk block be part of multiple files. There
are programs which will incrementally de-duplicate using this. You
could re-run them each time you add another tranche of files.

    https://btrfs.readthedocs.io/en/latest/Deduplication.html

-- 
Cheers, Ralph.

-- 
  Next meeting: Online, Jitsi, Tuesday, 2024-06-04 20:00
  Check to whom you are replying
  Meetings, mailing list, IRC, ...  http://dorset.lug.org.uk
  New thread, don't hijack:  mailto:dorset@mailman.lug.org.uk