On 11/12/2013 04:02 PM, Loic Tortay wrote:
> On 13/11/2013 00:24, berg...@merctech.com wrote:
>>
>> I've got about 45TB (wasn't that once considered 'large'?) of data on
>> a GPFS filesystem.
>>
>> I'm looking for efficient ways to find files, based on metadata.
>>
>> Running "find / -ls" is not a good option anymore. :)
>>
>> I'd like to be able to query some kind of stored index of name,
>> path, owner, size, modification timestamp, and ideally a checksum.
>>
>> I don't want to run desktop-oriented tools like updatedb or
>> Nepomuk&Strigi, due to concerns about overhead.
>>
>> Real-time indexing is not a requirement; it's fine if metadata
>> is scanned at fairly long intervals (weekly?) for updates, to
>> keep the impact on the filesystem lower.
>>
>> Regex queries would be great but not required.
>>
>> Statistical queries (a histogram of file sizes, etc.) would be
>> great, but not required.
>>
>> I would like the ability to restrict some search paths (i.e.,
>> don't index /human_resources/employee_complaints_by_name/)
>>
>> Has anyone seen or used an enterprise-level tool for this kind of search?
>>
> Hello,
> Since this is GPFS, you can use the "fast (discardable) snapshot
> metadata gathering" features.
> There are examples in the standard GPFS installation directory
> "/usr/lpp/mmfs/samples/util" (mostly in C using the GPFS API).
> The interesting part of the snapshot metadata gathering is that it's
> really fast.
>
> We use tools partly based on one of the examples ("tsreaddir.c") to
> gather, every day, almost all metadata for ~500 million objects/~1
> PiByte, producing about 120 GiBytes of raw data.
> This output is (also daily) processed to produce histograms (file size
> distribution, file types, access times, etc.), tables (how much each
> user/project is using) and such (now flashy dynamic colorful stuff with
> "d3.js").
>
> If your number of files is not too large, you can feed the output of the
> gathering into a database to query it.
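The "feed the output into a database" step can be sketched like this. This is a minimal illustration using Python's stdlib sqlite3 module; the pipe-delimited dump format and the column names are hypothetical stand-ins, not the actual output format of the GPFS sample tools, so the parser would need to be adapted to whatever your gathering step emits.

```python
import sqlite3

# Hypothetical dump format: path|owner|size_bytes|mtime_epoch
# (a real tsreaddir-style dump will differ; adjust the parser to match).
dump_lines = [
    "/gpfs/proj/a/data.bin|alice|1073741824|1384300000",
    "/gpfs/proj/a/notes.txt|alice|2048|1384300100",
    "/gpfs/proj/b/sim.out|bob|536870912|1384200000",
]

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE files
              (path TEXT, owner TEXT, size INTEGER, mtime INTEGER)""")
db.executemany(
    "INSERT INTO files VALUES (?, ?, ?, ?)",
    (line.split("|") for line in dump_lines),
)

# "Who's using how much" -- total bytes per owner, largest first.
usage = db.execute(
    "SELECT owner, SUM(size) FROM files GROUP BY owner ORDER BY 2 DESC"
).fetchall()
for owner, total in usage:
    print(f"{owner}\t{total}")
```

Once the table is loaded, the regex and histogram queries from the original question become ordinary SQL (SQLite's LIKE/GLOB, or GROUP BY on a bucketed size column).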
>
> There is an open-source project <http://robinhood.sourceforge.net> that
> does that (and more). It's designed with Lustre in mind, but it also
> has generic POSIX FS support.
>
> As far as I know, there is no checksum readily available in the GPFS
> metadata; you have to compute your own.
> One "elegant" solution is to store a checksum/hash in a POSIX Extended
> Attribute, and GPFS 3.5 has features to efficiently access POSIX EAs
> from snapshots.
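The checksum-in-an-extended-attribute idea can be sketched as follows. This is a generic POSIX-xattr version, not GPFS-specific; `os.setxattr`/`os.getxattr` are Linux-only in Python, the attribute name `user.sha256` is my own choice, and the write is best-effort because not every filesystem supports `user.*` xattrs.

```python
import hashlib
import os
import tempfile

def store_checksum(path):
    """Hash the file contents and record the hex digest in a user.*
    extended attribute. Returns the digest; the xattr write is
    best-effort since some filesystems lack user xattr support."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    digest = h.hexdigest()
    try:
        os.setxattr(path, b"user.sha256", digest.encode())
    except OSError:
        pass  # filesystem without user xattr support
    return digest

# Demo on a throwaway file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello world\n")
    name = tmp.name

digest = store_checksum(name)
try:
    stored = os.getxattr(name, b"user.sha256").decode()
except OSError:
    stored = None  # xattrs unsupported on this filesystem
print(digest)
os.unlink(name)
```

A weekly indexing pass could then read the stored digest back (and the file's mtime) to decide whether a re-hash is needed, rather than re-reading every file's data.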
I'll second the use of the GPFS API. We've used that in the past to quickly find files based on arbitrary attributes. Now that we have other storage, though, we've had to come up with a more generic solution.

Most of what we care about is answering "who's hogging all the space?". For a variety of reasons, quotas are not viable across all our storage platforms. What I've come up with is composed of three techniques:

1. A hybrid of find and du that uses the same file handle both for pattern
   matching and for getting size (and other inode) information. The problem
   with using find+du is that multiple stat(2) calls are needed: at least
   one for find and at least one for du. The hybrid uses ftw(3) (as find
   itself does) and passes the handle it gets directly to fstat(2). The
   filters I've added so far are for atime, mtime, and a regex match on the
   path. "find -ls -printf" is an option as well, but it is not available
   on all platforms.

2. A shell script that traverses a filesystem root at an arbitrary depth,
   generates a randomized list of directories, and creates a pool of files
   suitable for input to Grid Engine array jobs (we use Grid Engine on all
   of our clusters).

3. A Grid Engine submit script that creates an array job, each element of
   which runs a hybrid find/du using the directory list as input. Each
   element writes its standard output to a separate file.

The final process is a simple awk script that reads in the output files.

I've also been meaning to look at Robin Hood[1] in my free time (ha!) - it was posted on the Beowulf list a while back and looks like a really versatile tool.

[1] http://sourceforge.net/apps/trac/robinhood

Skylar

_______________________________________________
Tech mailing list
Tech@lists.lopsa.org
https://lists.lopsa.org/cgi-bin/mailman/listinfo/tech
This list provided by the League of Professional System Administrators
http://lopsa.org/
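The hybrid find/du idea (technique 1 above) can be illustrated roughly as follows. The original is C built on ftw(3)/fstat(2); this is a portable Python analogue where os.scandir plays the same role, giving each entry's stat result from the same traversal so find-style filtering and du-style size accounting share one pass. The function and parameter names are my own, not Skylar's.

```python
import os
import re
import tempfile

def walk_usage(root, pattern=None, min_mtime=None):
    """Single-pass find+du: one traversal yields both the matching
    paths and their total size, instead of separate find and du runs."""
    rx = re.compile(pattern) if pattern else None
    total, matches = 0, []
    stack = [root]
    while stack:
        d = stack.pop()
        with os.scandir(d) as entries:
            for e in entries:
                if e.is_dir(follow_symlinks=False):
                    stack.append(e.path)
                    continue
                # One stat per entry; DirEntry caches the result.
                st = e.stat(follow_symlinks=False)
                if rx and not rx.search(e.path):
                    continue
                if min_mtime and st.st_mtime < min_mtime:
                    continue
                total += st.st_size
                matches.append(e.path)
    return total, matches

# Tiny demo tree: one matching file, one non-matching file.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "sub"))
with open(os.path.join(root, "a.log"), "w") as f:
    f.write("x" * 100)
with open(os.path.join(root, "sub", "b.dat"), "w") as f:
    f.write("y" * 50)

total, matches = walk_usage(root, pattern=r"\.log$")
print(total, len(matches))
```

Each array-job element would run something like this over its assigned directories and print per-directory totals for the final awk aggregation step.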
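The directory-list step (technique 2 above) can be sketched too: collect the directories at a fixed depth under the root, shuffle them so large and small subtrees are mixed across tasks, and split the list into one input file per array-job element. The original is a shell script; this Python version, including the `task.N` file naming and the round-robin split, is an illustrative guess at the shape, not the actual script.

```python
import os
import random
import tempfile

def make_task_lists(root, depth, ntasks, outdir):
    """Gather directories at `depth` levels below root, shuffle them,
    and write ntasks input files (one per array-job element)."""
    dirs = []
    base_depth = root.rstrip(os.sep).count(os.sep)
    for cur, subdirs, _files in os.walk(root):
        if cur.rstrip(os.sep).count(os.sep) - base_depth == depth:
            dirs.append(cur)
            subdirs[:] = []  # stop descending; this is our work unit
    random.shuffle(dirs)  # spread big subtrees across tasks
    paths = []
    for i in range(ntasks):
        chunk = dirs[i::ntasks]  # round-robin split
        out = os.path.join(outdir, f"task.{i + 1}")
        with open(out, "w") as f:
            f.writelines(d + "\n" for d in chunk)
        paths.append(out)
    return paths

# Demo: four depth-1 directories split across two task files.
root = tempfile.mkdtemp()
for name in ("a", "b", "c", "d"):
    os.makedirs(os.path.join(root, name, "deeper"))
outdir = tempfile.mkdtemp()
tasks = make_task_lists(root, depth=1, ntasks=2, outdir=outdir)
counts = [sum(1 for _ in open(t)) for t in tasks]
print(counts)
```

A Grid Engine array job could then have each element read the file matching its task ID (e.g. `task.$SGE_TASK_ID`) as its directory list.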