On Tue, 12 Nov 2013, berg...@merctech.com wrote:
> I've got about 45TB (wasn't that once considered 'large'?) of data on
> a GPFS filesystem.
> I'm looking for efficient ways to find files, based on metadata.
> Running "find / -ls" is not a good option anymore. :)
> I'd like to be able to query some kind of stored index of name,
> path, owner, size, modification timestamp, and ideally a checksum.
> I don't want to run desktop-oriented tools like updatedb or
> Nepomuk&Strigi, due to concerns about overhead.
> Real-time indexing is not a requirement; it's fine if metadata
> is scanned at fairly long intervals (weekly?) for updates, to
> keep the impact on the filesystem lower.
> Regex queries would be great but not required.
> Statistical queries (a histogram of filesizes, etc.) would be
> great, but not required.
> I would like the ability to restrict some search paths (i.e.,
> don't index /human_resources/employee_complaints_by_name/)
Just thinking about the problem here. You are going to have to either run
something like find periodically to update your indexes, or hook into the
filesystem code (*notify) to detect changes as they happen.
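
For the *notify route, here is a minimal sketch of what the hook looks like
(this assumes the third-party inotify_simple Python package and a made-up
/gpfs/data mount point; inotify isn't recursive, so every directory needs its
own watch, you may have to raise fs.inotify.max_user_watches, and on a
clustered filesystem like GPFS the watcher likely only sees changes made
through its own node):

    # Sketch of the *notify approach: watch directories for changes and print
    # the events you would feed into your index.  Uses the third-party
    # inotify_simple package; the watch root is a placeholder.
    import os
    from inotify_simple import INotify, flags

    WATCH_ROOT = '/gpfs/data'          # hypothetical mount point
    EVENTS = (flags.CREATE | flags.DELETE | flags.MODIFY |
              flags.MOVED_TO | flags.MOVED_FROM)

    ino = INotify()
    wd_to_path = {}

    # inotify is not recursive: every directory needs its own watch,
    # which is part of the continual overhead on a large tree.
    for dirpath, dirnames, filenames in os.walk(WATCH_ROOT):
        wd = ino.add_watch(dirpath, EVENTS)
        wd_to_path[wd] = dirpath

    while True:
        for event in ino.read():       # blocks until something changes
            path = os.path.join(wd_to_path.get(event.wd, '?'), event.name)
            print(flags.from_mask(event.mask), path)  # update the index here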
The *notify approach adds overhead continually, while the periodic scan adds
overhead only when it runs.
If you can do the periodic scan, then something like updatedb is at least
worth trying (you may find it doesn't work for you, but you will at least have
a baseline to compare everything else against).
Once updatedb has done its scan, I don't know what other overhead you would
run into when using it (unless it's particularly inefficient in storing the
metadata).
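
For comparison, the roll-your-own version of the index described above isn't
much code. A rough sketch, assuming a weekly cron job, SQLite for the index,
and hypothetical paths (the mount point, excluded subtree, database location,
and checksum cutoff are all placeholders, not recommendations):

    # Periodic metadata scan into a SQLite index: name, path, owner, size,
    # mtime, optional checksum.
    import hashlib, os, sqlite3, stat

    ROOT = '/gpfs/data'                          # hypothetical mount point
    EXCLUDE = ('/gpfs/data/human_resources',)    # subtrees not to index
    DB = '/var/lib/fileindex/files.db'
    CHECKSUM_MAX = 64 * 1024 * 1024              # only hash files under 64MB

    def checksum(path):
        h = hashlib.sha1()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(1 << 20), b''):
                h.update(chunk)
        return h.hexdigest()

    db = sqlite3.connect(DB)
    db.execute('''CREATE TABLE IF NOT EXISTS files
                  (path TEXT PRIMARY KEY, name TEXT, uid INTEGER,
                   size INTEGER, mtime REAL, sha1 TEXT)''')

    for dirpath, dirnames, filenames in os.walk(ROOT):
        # prune excluded subtrees in place so os.walk never descends into them
        dirnames[:] = [d for d in dirnames
                       if not os.path.join(dirpath, d).startswith(EXCLUDE)]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue                         # file vanished mid-scan
            digest = None
            if stat.S_ISREG(st.st_mode) and st.st_size <= CHECKSUM_MAX:
                try:
                    digest = checksum(path)
                except OSError:
                    pass
            db.execute('INSERT OR REPLACE INTO files VALUES (?,?,?,?,?,?)',
                       (path, name, st.st_uid, st.st_size, st.st_mtime, digest))
    db.commit()

Querying that index for the regex and histogram cases could then look
something like:

    import re, sqlite3
    from collections import Counter

    db = sqlite3.connect('/var/lib/fileindex/files.db')
    db.create_function('REGEXP', 2,
                       lambda pat, s: s is not None
                                      and re.search(pat, s) is not None)

    # regex search over full paths
    hits = db.execute("SELECT path FROM files WHERE path REGEXP ?",
                      [r'\.core$']).fetchall()

    # crude filesize histogram: number of files per power-of-two size bucket
    hist = Counter(size.bit_length()
                   for (size,) in db.execute('SELECT size FROM files'))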
David Lang