On Tue, Jan 19, 2010 at 10:45 PM, Babak Farhang <farh...@gmail.com> wrote:
>> I see -- so your file format allows you to append to the same file
>> without affecting prior readers? We never do that in Lucene today
>> (all files are "write once").
>
> Yes. For the most part it only appends. The exception is when the
> log's entry count is updated (when the appends actually "commit").
> That count is written into one of many (>= 2 and <= 257) slots in round
> robin fashion and finally a byte-size "keystone" is updated to determine
> the active slot. The idea is that by flushing the file just before and just
> after the keystone byte update we ensure persistence (assumes writing
> one byte is always all-or-nothing). Increasing the number of slots
> the count is written to narrows the chance of a bad read.

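To make the quoted commit scheme concrete, here is a minimal Python sketch of a keystone-guarded entry count. It is not Babak's actual file format: the layout (one keystone byte followed by the slots), the 8-byte slot width, the choice of 4 slots, and the function names `create_log`, `read_count`, and `commit_count` are all assumptions made for illustration.

```python
import os
import struct
import tempfile

NUM_SLOTS = 4   # assumed; the post allows anywhere from 2 to 257 slots
SLOT_SIZE = 8   # assumed: each slot holds a 64-bit big-endian entry count
HEADER_SIZE = 1 + NUM_SLOTS * SLOT_SIZE  # keystone byte followed by the slots


def create_log(path):
    """Create an empty header: keystone 0 and every slot set to zero."""
    with open(path, "wb") as f:
        f.write(bytes(HEADER_SIZE))
        f.flush()
        os.fsync(f.fileno())


def read_count(path):
    """Return the committed entry count from the slot the keystone selects."""
    with open(path, "rb") as f:
        keystone = f.read(1)[0]
        f.seek(1 + keystone * SLOT_SIZE)
        return struct.unpack(">Q", f.read(SLOT_SIZE))[0]


def commit_count(path, new_count):
    """Write the count to the next slot, then flip the keystone to it."""
    with open(path, "r+b") as f:
        active = f.read(1)[0]
        next_slot = (active + 1) % NUM_SLOTS  # round-robin slot choice
        f.seek(1 + next_slot * SLOT_SIZE)
        f.write(struct.pack(">Q", new_count))
        f.flush()
        os.fsync(f.fileno())  # flush just before the keystone update
        f.seek(0)
        f.write(bytes([next_slot]))  # single-byte write, assumed atomic
        f.flush()
        os.fsync(f.fileno())  # flush just after the keystone update
```

A reader that opens the file mid-commit sees either the old keystone (and the old, untouched slot) or the new keystone (and a fully flushed new slot). The remaining risk is a slow reader that fetched the keystone long ago while the writer wrapped all the way around the slots, which is why more slots narrow the chance of a bad read.
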
OK; this approach (modifying a file that is already written and possibly
in use by an IndexReader) would be problematic for Lucene...

>> Good questions (terms dict & stored fields) -- I think what we'd do is
>> simply write a new segment, so stored fields, term vectors, postings,
>> terms dict, etc., are all private to that new segment. But, we'd mark
>> the segment as being a delta segment, referencing the original
>> segment, and we'd remap the docIDs when flushing that segment.
>
> The .fdx and .tvx files use fixed-width tables, so it seems if there are
> large gaps in the updated doc-ids, we'd have to fill in "null" rows for
> those gaps. Or do you have another data-structure in mind for these
> .*x files?

Yeah, good point; we'd have to allow for sparse index storage for
fdx/tvx.

> One solution may be to combine the approaches we described: maintain
> doc-id mappings for the .fdx and .tvx files for per-document
> data at search time, and index-time mapped doc-ids for the posting
> lists.

True, we could do both...

Mike
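
As a rough illustration of the sparse fdx/tvx storage discussed above, the Python sketch below keeps only the doc IDs that a delta segment actually contains and finds them by binary search, instead of padding a fixed-width table with "null" rows. It is not Lucene code; the class name `SparseFieldIndex`, its `lookup` method, and the example pointers are hypothetical.

```python
import bisect


class SparseFieldIndex:
    """Maps a sparse set of doc IDs to stored-field file pointers."""

    def __init__(self, entries):
        # entries: iterable of (doc_id, file_pointer) pairs, in any order
        pairs = sorted(entries)
        self.doc_ids = [doc_id for doc_id, _ in pairs]
        self.pointers = [pointer for _, pointer in pairs]

    def lookup(self, doc_id):
        """Return the pointer for doc_id, or None if this segment lacks it."""
        i = bisect.bisect_left(self.doc_ids, doc_id)
        if i < len(self.doc_ids) and self.doc_ids[i] == doc_id:
            return self.pointers[i]
        return None  # caller would fall back to the original segment


# Three updated documents spread over a million-document ID space.
index = SparseFieldIndex([(7, 0), (1_000_000, 128), (42, 64)])
```

Here the index holds three rows, whereas a fixed-width table addressed directly by doc ID would need 1,000,001 rows to cover the same range. The trade-off is an O(log n) lookup in place of a direct offset computation.
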