On 11/12/2013 6:24 PM, berg...@merctech.com wrote:

I've got about 45TB (wasn't that once considered 'large'?) of data on
a GPFS filesystem.

I'm looking for efficient ways to find files, based on metadata.

Running "find / -ls" is not a good option anymore. :)

        I'd like to be able to query some kind of stored index of name,
        path, owner, size, modification timestamp, and ideally a checksum.

        I don't want to run desktop-oriented tools like updatedb or
        Nepomuk&Strigi, due to concerns about overhead.

        Real-time indexing is not a requirement; it's fine if metadata
        is scanned at fairly long intervals (weekly?) for updates to
        keep the impact on the filesystem lower.

        Regex queries would be great but not required.

        Statistical queries (a histogram of filesizes, etc.) would be
        great, but not required.

        I would like the ability to restrict some search paths (i.e.,
        don't index /human_resources/employee_complaints_by_name/)

Has anyone seen or used an enterprise-level tool for this kind of search?

I read a paper about Spyglass[1] which looked great, but I can't
find the software or evidence that it's actually in use.

Thanks,


There's one option that I haven't seen anybody cover, and it's one of the best ones. GPFS has a very fast and efficient policy engine. We use this all the time to do backups. Check the GPFS Advanced Administration Guide, Chapter 2: Lifecycle Management.

It is ridiculously faster than find. You can also run it in parallel across all of your NSD servers (make sure you store partial results in a shared filesystem). It uses a callback mechanism to collect the results, and your policy can filter on any available metadata attribute.

We routinely scan 10TB in about 30 seconds. We can scan 700TB and 700 million files in about 24 minutes. Performance scales with metadata storage speed (ours is on a TMS RamSan 820), the number of NSDs, and the number of CPU threads you allow, along with many other parameters that you can specify with mmapplypolicy.

Here's an example policy:

RULE EXTERNAL LIST 'find' EXEC '/usr/local/gpfs/pooladm/find_list'
RULE 'Findnew' LIST 'find' DIRECTORIES_PLUS
  WHERE (PATH_NAME LIKE '%/gc/objs/common'
         OR PATH_NAME LIKE '%gc/objs/src'
         OR PATH_NAME LIKE '%gc/objs/csrc')
    AND MISC_ATTRIBUTES LIKE '%D%'

find_list is a simple shell script.
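
For anyone who has not written one of these, here is a rough sketch of what a script like find_list could look like. It is not our actual script. It assumes the policy engine calls the script with an operation name (TEST or LIST) followed by the path to a file list, and that each record looks like "inode generation snapid -- /full/path" (true when the rule has no SHOW clause). FIND_RESULTS_DIR and the output file naming are made up for the example; point the directory at a shared filesystem so every node's partial results end up in one place.

```shell
#!/bin/sh
# Hypothetical sketch of an EXTERNAL LIST script (not the poster's find_list).
# Assumed calling convention: <script> <operation> <filelist> [options]
# Assumed record format:      inode generation snapid -- /full/path

# Hypothetical output directory; should live on a shared filesystem.
: "${FIND_RESULTS_DIR:=/tmp}"

find_list() {
    op=$1
    filelist=$2
    case "$op" in
        TEST)
            # Sanity check: succeed only if the results directory is usable.
            [ -d "$FIND_RESULTS_DIR" ] && [ -w "$FIND_RESULTS_DIR" ]
            ;;
        LIST)
            # One output file per node and process, so parallel callers
            # never write to the same file.
            out="$FIND_RESULTS_DIR/find.$(uname -n).$$"
            # Strip the leading numeric fields and the first " -- ",
            # keeping only the path.
            sed 's/^[0-9 ]* -- //' "$filelist" >> "$out"
            ;;
        *)
            # Ignore operations this sketch does not handle.
            return 0
            ;;
    esac
}

# Run only when invoked with arguments, as the policy engine would do.
if [ "$#" -gt 0 ]; then
    find_list "$@"
fi
```

After the scan, concatenating the find.* files in FIND_RESULTS_DIR gives the full list of matching paths.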

Let me know if you need any other pointers. As you can see, policies are written in an SQL-like language. You can also pass in variables.
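
To illustrate passing in variables, here is a small sketch, not taken from our setup. It writes a policy file that uses a placeholder named MINSIZE and prints an mmapplypolicy command line that supplies a value for it with -M. The rule name 'big', the MINSIZE placeholder, the filesystem path, the node names, and the work directory are all invented for the example; check the options against the mmapplypolicy documentation for your GPFS release before running anything.

```shell
#!/bin/sh
# Hypothetical sketch: a policy with a substitution variable, plus the
# command line that would supply its value. Names and paths are examples.

write_policy() {
    # $1 = policy file to create
    cat > "$1" <<'EOF'
RULE EXTERNAL LIST 'find' EXEC '/usr/local/gpfs/pooladm/find_list'
RULE 'big' LIST 'find' WHERE FILE_SIZE > MINSIZE
EOF
}

policy_cmd() {
    # $1 = filesystem, $2 = policy file, $3 = comma-separated node list,
    # $4 = shared work directory, $5 = value substituted for MINSIZE
    # Prints the command instead of running it, so it can be reviewed first.
    printf 'mmapplypolicy %s -P %s -N %s -g %s -M MINSIZE=%s\n' \
        "$1" "$2" "$3" "$4" "$5"
}
```

For example, after write_policy /gpfs/fs1/tmp/find.policy, calling policy_cmd /gpfs/fs1 /gpfs/fs1/tmp/find.policy nsd1,nsd2 /gpfs/fs1/tmp 1073741824 prints a command that would list files larger than 1 GiB, scanning from two nodes and keeping work files on the shared filesystem.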






_______________________________________________
Tech mailing list
Tech@lists.lopsa.org
https://lists.lopsa.org/cgi-bin/mailman/listinfo/tech
This list provided by the League of Professional System Administrators
http://lopsa.org/
