Does anyone know of a tool that can look over a dataset and report duplication statistics? I'm not looking for something incredibly efficient, but I'd like to know how much deduplication would actually benefit our dataset: HiRISE has a large set of spacecraft data (images) that may or may not contain a lot of redundancy. Other upcoming missions also have large data volumes with plenty of duplicate image info and small budgets; with deduplication ("d11p") in OpenSolaris there would be a good business case for investing in Sun/OpenSolaris rather than buying cheaper storage (+ Linux?) that simply holds everything as is.
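
To make that concrete, here is a rough, untested sketch of the sort of thing I mean. It assumes a fixed 128 KiB block size (the ZFS default recordsize) and SHA-256 checksums, so it only approximates what ZFS dedup, which picks its own block boundaries, would actually save; the names and paths are just illustrative.

#!/usr/bin/env python
# dup_stats.py - rough block-level duplication statistics for a directory tree.
# Assumptions (mine, not anything ZFS guarantees): fixed 128 KiB blocks aligned
# to file offsets, SHA-256 as the checksum.  Real ZFS dedup chooses its own
# block boundaries, so treat the numbers as an estimate only.

import hashlib
import os
import sys
from collections import Counter

BLOCK_SIZE = 128 * 1024  # assumed block size (ZFS default recordsize)

def block_checksums(root):
    """Yield a SHA-256 hex digest for every BLOCK_SIZE chunk under root."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                with open(path, 'rb') as f:
                    while True:
                        chunk = f.read(BLOCK_SIZE)
                        if not chunk:
                            break
                        yield hashlib.sha256(chunk).hexdigest()
            except (IOError, OSError):
                pass  # unreadable file: skip it

def main(root):
    counts = Counter(block_checksums(root))   # checksum -> number of hits
    total = sum(counts.values())
    unique = len(counts)
    dupes = total - unique
    print("blocks scanned   : %d" % total)
    print("unique blocks    : %d" % unique)
    print("duplicate blocks : %d (%.1f%%)" %
          (dupes, 100.0 * dupes / total if total else 0.0))
    # Rough upper bound on reclaimable space; tail blocks are shorter than
    # BLOCK_SIZE, so this slightly overstates the savings.
    print("approx. dup bytes: %d" % (dupes * BLOCK_SIZE))

if __name__ == '__main__':
    main(sys.argv[1] if len(sys.argv) > 1 else '.')

You would run it as, e.g., "python dup_stats.py /path/to/dataset". Note that it keeps every checksum in memory, so for a full HiRISE-sized archive you would want to point it at a representative subset, or dump the digests to a file and sort/uniq them instead.
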
If someone feels like coding up a tool that basically writes out a file of checksums and counts how many times each checksum gets hit over a dataset, I would be willing to run it and provide feedback. :)

-Tim

Charles Soto wrote:
> Oh, I agree. Much of the duplication described is clearly the result of
> "bad design" in many of our systems. After all, most of an OS can be served
> off the network (diskless systems etc.). But much of the dupe I'm talking
> about is less about not using the most efficient system administration
> tricks. Rather, it's about the fact that software (e.g. Samba) is used by
> people, and people don't always do things efficiently.
>
> Case in point: students in one of our courses were hitting their quota by
> growing around 8GB per day. Rather than simply agree that "these kids need
> more space," we had a look at the files. Turns out just about every student
> copied a 600MB file into their own directories, as it was created by another
> student to be used as a "template" for many of their projects. Nobody
> understood that they could use the file right where it sat. Nope. 7GB of
> dupe data. And these students are even familiar with our practice of
> putting "class media" on a read-only share (these files serve as similar
> "templates" for their own projects - you can create a full video project
> with just a few MB in your "project file" this way).
>
> So, while much of the situation is caused by "bad data management," there
> aren't always systems we can employ that prevent it. Done right, dedup can
> certainly be "worth it" for my operations. Yes, teaching the user the
> "right thing" is useful, but that user isn't there to know how to "manage
> data" for my benefit. They're there to learn how to be filmmakers,
> journalists, speech pathologists, etc.
>
> Charles
>
>
> On 7/7/08 9:24 PM, "Bob Friesenhahn" <[EMAIL PROTECTED]> wrote:
>
>> On Mon, 7 Jul 2008, Mike Gerdts wrote:
>>
>>> As I have considered deduplication for application data I see several
>>> things happen in various areas.
>>>
>> You have provided an excellent description of gross inefficiencies in
>> the way systems and software are deployed today, resulting in massive
>> duplication. Massive duplication is used to ease service deployment
>> and management. Most of this massive duplication is not technically
>> necessary.
>>

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss