On Tue, 12 Nov 2013, berg...@merctech.com wrote:

I've got about 45TB (wasn't that once considered 'large'?) of data on
a GPFS filesystem.

I'm looking for efficient ways to find files, based on metadata.

Running "find / -ls" is not a good option anymore. :)

        I'd like to be able to query some kind of stored index of name,
        path, owner, size, modification timestamp, and ideally a checksum.

        I don't want to run desktop-oriented tools like updatedb or
        Nepomuk&Strigi, due to concerns about overhead.

        Real-time indexing is not a requirement; it's fine if metadata
        is scanned at fairly long intervals (weekly?) for updates, to
        keep the impact on the filesystem lower.

        Regex queries would be great but not required.

        Statistical queries (a histogram of file sizes, etc.) would be
        great, but not required.

        I would like the ability to restrict some search paths (i.e.,
        don't index /human_resources/employee_complaints_by_name/).

Just thinking about the problem here: you are either going to have to run something like find periodically to update your index, or you are going to have to hook into the filesystem code (*notify) to detect changes as they happen.
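
To make the periodic scan concrete, something along these lines is roughly what I have in mind (untested, and the mount point, excluded path, database location and table layout are all made up for the example): walk the tree and dump name, path, owner, size and mtime into SQLite. I've left checksums out, since hashing the full 45TB on every pass would dwarf the cost of the scan itself.

#!/usr/bin/env python3
"""Periodic metadata scan sketch: walk a tree and record file metadata
in SQLite. The paths and table layout below are placeholders."""
import os
import sqlite3

ROOT = "/gpfs/data"                            # made-up mount point
EXCLUDE = {"/gpfs/data/human_resources"}       # subtrees to leave unindexed
DB = "/var/tmp/file-index.sqlite"              # made-up index location

conn = sqlite3.connect(DB)
conn.execute("""CREATE TABLE IF NOT EXISTS files (
                    path  TEXT PRIMARY KEY,
                    name  TEXT,
                    uid   INTEGER,  -- owner stored as uid; pwd.getpwuid() maps it to a name
                    size  INTEGER,
                    mtime REAL)""")

def scan(root):
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune excluded subtrees in place so os.walk never descends into them.
        dirnames[:] = [d for d in dirnames
                       if os.path.join(dirpath, d) not in EXCLUDE]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue                       # vanished or unreadable; skip it
            yield path, name, st.st_uid, st.st_size, st.st_mtime

with conn:                                     # one transaction for the whole load
    conn.execute("DELETE FROM files")          # simple full rebuild on each run
    conn.executemany(
        "INSERT INTO files (path, name, uid, size, mtime) VALUES (?, ?, ?, ?, ?)",
        scan(ROOT))
conn.close()

Restricting search paths like /human_resources/... is then just a matter of pruning them from the walk, which is what the dirnames[:] line does.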

The *notify approach adds overhead continually, while the periodic scan only adds overhead when it runs.
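
For contrast, the *notify route looks something like the sketch below. It leans on the third-party watchdog package (my assumption, nothing you mentioned), and the recursive watch is exactly where the continual overhead comes from: every directory under the root gets watched, and every change wakes the indexer.

#!/usr/bin/env python3
"""Continuous *notify-style indexing sketch using the third-party 'watchdog'
package (pip install watchdog). The watched path is a placeholder."""
import time

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

class IndexUpdater(FileSystemEventHandler):
    def on_any_event(self, event):
        # A real indexer would update the database row for event.src_path
        # here; printing keeps the sketch self-contained.
        print(event.event_type, event.src_path)

observer = Observer()
observer.schedule(IndexUpdater(), "/gpfs/data", recursive=True)   # made-up path
observer.start()
try:
    while True:
        time.sleep(60)        # events are handled on the observer's own thread
except KeyboardInterrupt:
    observer.stop()
observer.join()

I'd also double-check whether inotify even sees changes made from other nodes on a clustered filesystem like GPFS; events are generated by the kernel on the node where the change happens.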

If you can live with a periodic scan, then something like updatedb is at least worth trying (you may find it doesn't work for you, but you will have a baseline to compare everything else against).

Once updatedb has done its scan, I don't know what other overhead you would run into when using it (unless it's particularly inefficient in storing the metadata).
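
As far as I know, locate's database only covers pathnames, so owner/size/date and statistical queries would still need something like the SQLite index sketched above. Against that kind of index, the regex and histogram queries you asked about come out roughly like this (same made-up database path and table as before):

#!/usr/bin/env python3
"""Example queries against the index built by the earlier sketch; the
database path and table name are the same placeholders as before."""
import math
import re
import sqlite3
from collections import Counter

conn = sqlite3.connect("/var/tmp/file-index.sqlite")

# SQLite understands the REGEXP operator but ships no implementation for it;
# supply one from Python's re module so "path REGEXP ?" works.
conn.create_function(
    "REGEXP", 2,
    lambda pattern, value: value is not None and re.search(pattern, value) is not None)

# Regex query: files owned by a given uid whose path matches a pattern.
for (path,) in conn.execute(
        "SELECT path FROM files WHERE uid = ? AND path REGEXP ?",
        (1000, r"\.log(\.gz)?$")):        # uid and pattern are examples only
    print(path)

# Statistical query: a rough histogram of file sizes by power of two,
# bucketed in Python rather than relying on SQLite's optional math functions.
buckets = Counter()
for (size,) in conn.execute("SELECT size FROM files"):
    buckets[int(math.log2(size)) if size else 0] += 1
for bucket in sorted(buckets):
    print(f"~2^{bucket} bytes: {buckets[bucket]} files")

conn.close()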

David Lang

