What is the anticipated cause of corruption? Malicious? Hardware fault? This somewhat reminds of discussions in the list about encrypting the index. See LUCENE-737 and a discussion pointed by it. One of the opinions there was that encryption should be handled at a lower level (OS/FS). Wouldn't that hold here as well?
Joe R wrote: > > Hello, > > I've been asked to devise some way to discover and correct datain Lucene > indexes that have been "corrupted." The word "corrupt", in > this case, has a > few different meanings, some of which strike me as exceedingly > difficult to > grok. What concerns me are the cases where we don't know that > an index has > been changed: A bit error in a stored field, for instance, is a form of > corruption that we (ideally) should be able to identify, at the > very least, and > hopefully correct. This case in particular seems particularly > onerous, since > this isn't going to throw an exception of any sort, any time. > > We have a fairly good handle on how to remedy problems that > throw exceptions, > so we should be alright with corruption where (say) an operator > logs in and > overwrites a file. > > I'm wondering how other Lucene users have tackled this problem > in the past. > Calculating checksums on the documents seems like one way to goabout it: > compute a checksum on the document and, in a background thread, > compare the > checksum to the data. Unfortunately we're building a large, > federated system > and it would take months to exhaustively check every document this way. > Checksumming the files themselves might be too much: We're > storing gigabytes of > data per index and there is some churn to the data; in other words, the > overhead for this method might be too high. > > Thanks for any help you might have. > > > -Joseph Rose --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]