On 11/12/2013 04:02 PM, Loic Tortay wrote:
> On 13/11/2013 00:24, berg...@merctech.com wrote:
>>
>> I've got about 45TB (wasn't that once considered 'large'?) of data on
>> a GPFS filesystem.
>>
>> I'm looking for efficient ways to find files, based on metadata.
>>
>> Running "find / -ls" is not a good option anymore. :)
>>
>> I'd like to be able to query some kind of stored index of name,
>> path, owner, size, modification timestamp, and ideally a checksum.
>>
>> I don't want to run desktop-oriented tools like updatedb or
>> Nepomuk&Strigi, due to concerns about overhead.
>>
>> Real-time indexing is not a requirement; it's fine if metadata
>> is scanned at fairly long intervals (weekly?) for updates, to
>> keep the impact on the filesystem lower.
>>
>> Regex queries would be great but not required.
>>
>> Statistical queries (a histogram of file sizes, etc.) would be
>> great, but not required.
>>
>> I would like the ability to restrict some search paths (i.e.,
>> don't index /human_resources/employee_complaints_by_name/)
>>
>> Has anyone seen or used an enterprise-level tool for this kind of search?
>>
> Hello,
> Since this is GPFS, you can use the "fast (discardable) snapshot
> metadata gathering" features.
> There are examples in the standard GPFS installation directory
> "/usr/lpp/mmfs/samples/util" (mostly in C using the GPFS API).
> The interesting part of the snapshot metadata gathering is that it's
> really fast.
>
> We use tools partly based on one of the examples ("tsreaddir.c") to
> gather, every day, almost all metadata for ~500 million objects/~1
> PiByte, producing about 120 GiBytes of raw data.
> This output is (also daily) processed to produce histograms (file size
> distribution, file types, access times, etc.), tables (how much each
> user/project is using) and such (now flashy dynamic colorful stuff with
> "d3.js").
>
> If your number of files is not too large, you can feed the output of the
> gathering into a database to query it.
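The "feed the output into a database" step can be sketched like this. This is a minimal illustration using Python's stdlib sqlite3 module; the pipe-delimited dump format and the column names are hypothetical stand-ins, not the actual output format of the GPFS sample tools, so the parser would need to be adapted to whatever your gathering step emits.

```python
import sqlite3

# Hypothetical dump format: path|owner|size_bytes|mtime_epoch
# (a real tsreaddir-style dump will differ; adjust the parser to match).
dump_lines = [
    "/gpfs/proj/a/data.bin|alice|1073741824|1384300000",
    "/gpfs/proj/a/notes.txt|alice|2048|1384300100",
    "/gpfs/proj/b/sim.out|bob|536870912|1384200000",
]

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE files
              (path TEXT, owner TEXT, size INTEGER, mtime INTEGER)""")
db.executemany(
    "INSERT INTO files VALUES (?, ?, ?, ?)",
    (line.split("|") for line in dump_lines),
)

# "Who's using how much" -- total bytes per owner, largest first.
usage = db.execute(
    "SELECT owner, SUM(size) FROM files GROUP BY owner ORDER BY 2 DESC"
).fetchall()
for owner, total in usage:
    print(f"{owner}\t{total}")
```

Once the table is loaded, the regex and histogram queries from the original question become ordinary SQL (SQLite's LIKE/GLOB, or GROUP BY on a bucketed size column).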
>
> There is an open-source project <http://robinhood.sourceforge.net> that
> does that (and more). It's designed with Lustre in mind, but it also
> has generic POSIX FS support.
>
> As far as I know, there is no checksum readily available in the GPFS
> metadata; you have to compute your own.
> One "elegant" solution is to store a checksum/hash in a POSIX Extended
> Attribute, and GPFS 3.5 has features to efficiently access POSIX EAs
> from snapshots.
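The checksum-in-an-extended-attribute idea can be sketched as follows. This is a generic POSIX-xattr version, not GPFS-specific; `os.setxattr`/`os.getxattr` are Linux-only in Python, the attribute name `user.sha256` is my own choice, and the write is best-effort because not every filesystem supports `user.*` xattrs.

```python
import hashlib
import os
import tempfile

def store_checksum(path):
    """Hash the file contents and record the hex digest in a user.*
    extended attribute. Returns the digest; the xattr write is
    best-effort since some filesystems lack user xattr support."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    digest = h.hexdigest()
    try:
        os.setxattr(path, b"user.sha256", digest.encode())
    except OSError:
        pass  # filesystem without user xattr support
    return digest

# Demo on a throwaway file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello world\n")
    name = tmp.name

digest = store_checksum(name)
try:
    stored = os.getxattr(name, b"user.sha256").decode()
except OSError:
    stored = None  # xattrs unsupported on this filesystem
print(digest)
os.unlink(name)
```

A weekly indexing pass could then read the stored digest back (and the file's mtime) to decide whether a re-hash is needed, rather than re-reading every file's data.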
I'll second the use of the GPFS API. We've used that in the past to quickly find files based on arbitrary attributes. Now that we have other storage, though, we've had to come up with a more generic solution.

Most of what we care about is answering "who's hogging all the space?". For a variety of reasons, quotas are not viable across all our storage platforms. What I've come up with is composed of three techniques:

1. A hybrid of find and du that uses the same file handle both for pattern
   matching and for getting size (and other inode) information. The problem
   with using find+du is that multiple stat(2) calls are needed: at least
   one for find and at least one for du. The hybrid uses ftw(3) (as find
   itself does) and passes the handle it gets directly to fstat(2). The
   filters I've added so far are for atime, mtime, and a regex match on the
   path. "find -ls -printf" is an option as well, but it is not available
   on all platforms.

2. A shell script that traverses a filesystem root at an arbitrary depth,
   generates a randomized list of directories, and creates a pool of files
   suitable for input to Grid Engine array jobs (we use Grid Engine on all
   of our clusters).

3. A Grid Engine submit script that creates an array job, each element of
   which runs a hybrid find/du using the directory list as input. Each
   element writes its standard output to a separate file.

The final process is a simple awk script that reads in the output files.

I've also been meaning to look at Robin Hood[1] in my free time (ha!) - it was posted on the Beowulf list a while back and looks like a really versatile tool.

[1] http://sourceforge.net/apps/trac/robinhood

Skylar

_______________________________________________
Tech mailing list
Tech@lists.lopsa.org
https://lists.lopsa.org/cgi-bin/mailman/listinfo/tech
This list provided by the League of Professional System Administrators
http://lopsa.org/
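The hybrid find/du idea (technique 1 above) can be illustrated roughly as follows. The original is C built on ftw(3)/fstat(2); this is a portable Python analogue where os.scandir plays the same role, giving each entry's stat result from the same traversal so find-style filtering and du-style size accounting share one pass. The function and parameter names are my own, not Skylar's.

```python
import os
import re
import tempfile

def walk_usage(root, pattern=None, min_mtime=None):
    """Single-pass find+du: one traversal yields both the matching
    paths and their total size, instead of separate find and du runs."""
    rx = re.compile(pattern) if pattern else None
    total, matches = 0, []
    stack = [root]
    while stack:
        d = stack.pop()
        with os.scandir(d) as entries:
            for e in entries:
                if e.is_dir(follow_symlinks=False):
                    stack.append(e.path)
                    continue
                # One stat per entry; DirEntry caches the result.
                st = e.stat(follow_symlinks=False)
                if rx and not rx.search(e.path):
                    continue
                if min_mtime and st.st_mtime < min_mtime:
                    continue
                total += st.st_size
                matches.append(e.path)
    return total, matches

# Tiny demo tree: one matching file, one non-matching file.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "sub"))
with open(os.path.join(root, "a.log"), "w") as f:
    f.write("x" * 100)
with open(os.path.join(root, "sub", "b.dat"), "w") as f:
    f.write("y" * 50)

total, matches = walk_usage(root, pattern=r"\.log$")
print(total, len(matches))
```

Each array-job element would run something like this over its assigned directories and print per-directory totals for the final awk aggregation step.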
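The directory-list step (technique 2 above) can be sketched too: collect the directories at a fixed depth under the root, shuffle them so large and small subtrees are mixed across tasks, and split the list into one input file per array-job element. The original is a shell script; this Python version, including the `task.N` file naming and the round-robin split, is an illustrative guess at the shape, not the actual script.

```python
import os
import random
import tempfile

def make_task_lists(root, depth, ntasks, outdir):
    """Gather directories at `depth` levels below root, shuffle them,
    and write ntasks input files (one per array-job element)."""
    dirs = []
    base_depth = root.rstrip(os.sep).count(os.sep)
    for cur, subdirs, _files in os.walk(root):
        if cur.rstrip(os.sep).count(os.sep) - base_depth == depth:
            dirs.append(cur)
            subdirs[:] = []  # stop descending; this is our work unit
    random.shuffle(dirs)  # spread big subtrees across tasks
    paths = []
    for i in range(ntasks):
        chunk = dirs[i::ntasks]  # round-robin split
        out = os.path.join(outdir, f"task.{i + 1}")
        with open(out, "w") as f:
            f.writelines(d + "\n" for d in chunk)
        paths.append(out)
    return paths

# Demo: four depth-1 directories split across two task files.
root = tempfile.mkdtemp()
for name in ("a", "b", "c", "d"):
    os.makedirs(os.path.join(root, name, "deeper"))
outdir = tempfile.mkdtemp()
tasks = make_task_lists(root, depth=1, ntasks=2, outdir=outdir)
counts = [sum(1 for _ in open(t)) for t in tasks]
print(counts)
```

A Grid Engine array job could then have each element read the file matching its task ID (e.g. `task.$SGE_TASK_ID`) as its directory list.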