On 13/11/2013 00:24, berg...@merctech.com wrote:
I've got about 45TB (wasn't that once considered 'large'?) of data on
a GPFS filesystem.
I'm looking for efficient ways to find files, based on metadata.
Running "find / -ls" is not a good option anymore. :)
I'd like to be able to query some kind of stored index of name,
path, owner, size, modification timestamp, and ideally a checksum.
I don't want to run desktop-oriented tools like updatedb or
Nepomuk&Strigi, due to concerns about overhead.
Real-time indexing is not a requirement; it's fine if metadata
was scanned at fairly long intervals (weekly?) for updates to
keep the impact on the filesystem lower.
Regex queries would be great but not required.
Statistical queries (a histogram of filesizes, etc.) would be
great, but not required.
I would like the ability to exclude some paths from the search (e.g.,
don't index /human_resources/employee_complaints_by_name/).
Has anyone seen or used an enterprise-level tool for this kind of search?
Hello,
Since this is GPFS, you can use the "fast (discardable) snapshot
metadata gathering" features.
There are examples in the standard GPFS installation directory
"/usr/lpp/mmfs/samples/util" (mostly in C using the GPFS API).
The interesting part of the snapshot metadata gathering is that it's
really fast.
We use tools partly based on one of the examples ("tsreaddir.c") to
gather, every day, almost all metadata for ~500 million objects
(~1 PiByte), producing about 120 GiBytes of "raw data".
This output is (also daily) processed to produce histograms (file size
distribution, file types, access times, etc.), tables (how much each
user/project is using) and such (now rendered as flashy, dynamic,
colorful stuff with "d3.js").
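To make the histogram/table step concrete, here is a minimal Python
sketch. It is not our actual tooling: it assumes the gathered metadata
has already been reduced to hypothetical (owner, size) pairs, and the
function names are made up for illustration.

```python
from collections import Counter, defaultdict


def size_bucket(size):
    """Return the power-of-two bucket number for a file size in bytes."""
    # Bucket 0 holds empty files; bucket k (k >= 1) holds sizes
    # in the range [2**(k-1), 2**k).
    return size.bit_length()


def summarize(records):
    """Build a size histogram and per-owner usage from (owner, size) pairs."""
    histogram = Counter()
    usage = defaultdict(int)
    for owner, size in records:
        histogram[size_bucket(size)] += 1
        usage[owner] += size
    return histogram, dict(usage)


if __name__ == "__main__":
    # Hypothetical sample records, not real scan output.
    sample = [("alice", 0), ("alice", 1500), ("bob", 4096), ("bob", 5000)]
    hist, usage = summarize(sample)
    for bucket in sorted(hist):
        print(f"bucket {bucket}: {hist[bucket]} file(s)")
    for owner in sorted(usage):
        print(f"{owner}: {usage[owner]} bytes")
```

Power-of-two buckets keep the histogram small even when sizes range
from a few bytes to terabytes.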
If your number of files is not too large, you can feed the output of the
gathering into a database and query it.
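A rough sketch of the database idea, using Python and SQLite. The
(path, owner, size, mtime) record layout, the table name and the helper
names are assumptions for illustration only; the real scan output
would need its own parser. SQLite has no built-in REGEXP function, so
the sketch registers one backed by Python's "re" module.

```python
import re
import sqlite3


def build_index(records):
    """Load (path, owner, size, mtime) tuples into an in-memory SQLite index."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE files ("
        "path TEXT PRIMARY KEY, owner TEXT, size INTEGER, mtime INTEGER)"
    )
    db.executemany("INSERT INTO files VALUES (?, ?, ?, ?)", records)
    db.execute("CREATE INDEX files_owner ON files (owner)")
    db.commit()
    # "X REGEXP Y" is evaluated by SQLite as regexp(Y, X), so the
    # pattern arrives as the first argument.
    db.create_function(
        "REGEXP", 2, lambda pattern, value: re.search(pattern, value) is not None
    )
    return db


def find_by_regex(db, pattern):
    """Return all indexed paths matching a regular expression, sorted."""
    rows = db.execute(
        "SELECT path FROM files WHERE path REGEXP ? ORDER BY path", (pattern,)
    )
    return [row[0] for row in rows]


def usage_by_owner(db):
    """Return {owner: total bytes} for all indexed files."""
    rows = db.execute("SELECT owner, SUM(size) FROM files GROUP BY owner")
    return dict(rows.fetchall())


if __name__ == "__main__":
    # Hypothetical sample records, not real scan output.
    sample = [
        ("/proj/a/run1.dat", "alice", 1500, 1384300000),
        ("/proj/a/notes.txt", "alice", 200, 1384300100),
        ("/proj/b/run2.dat", "bob", 4096, 1384300200),
    ]
    db = build_index(sample)
    print(find_by_regex(db, r"\.dat$"))
    print(usage_by_owner(db))
```

For a weekly refresh it is probably simplest to rebuild the table from
scratch (on disk rather than ":memory:") instead of updating rows in
place.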
There is an open-source project <http://robinhood.sourceforge.net> that
does that (and more). It's designed with Lustre in mind, but it also
has generic POSIX filesystem support.
As far as I know, there is no checksum readily available in the GPFS
metadata; you have to compute your own.
One "elegant" solution is to store a checksum/hash in a POSIX Extended
Attribute (EA); GPFS 3.5 has features to efficiently access POSIX EAs
from snapshots.
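As an illustration of the extended-attribute idea, here is a minimal
Python sketch. It uses the generic Linux xattr calls from the standard
library ("os.setxattr"/"os.getxattr", Linux only), not the GPFS API,
and the attribute name "user.checksum.sha256" is just an assumed
convention. The filesystem must support user extended attributes.

```python
import hashlib
import os

# Hypothetical attribute name; any name in the "user." namespace would do.
XATTR_NAME = "user.checksum.sha256"


def file_sha256(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_checksum(path):
    """Compute the file's checksum and store it in an extended attribute."""
    checksum = file_sha256(path)
    os.setxattr(path, XATTR_NAME, checksum.encode("ascii"))
    return checksum


def read_checksum(path):
    """Read back the checksum previously stored by store_checksum()."""
    return os.getxattr(path, XATTR_NAME).decode("ascii")
```

Note that a stored checksum goes stale when the file changes, so it
may be worth storing the modification time alongside it and
recomputing when the two disagree.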
Loïc.
--
| Loïc Tortay <tor...@cc.in2p3.fr> - IN2P3 Computing Centre |
_______________________________________________
Tech mailing list
Tech@lists.lopsa.org
https://lists.lopsa.org/cgi-bin/mailman/listinfo/tech
This list provided by the League of Professional System Administrators
http://lopsa.org/