> OK; this approach (modifying an already written & possible in-use (by > an IndexReader) file) would be problematic for Lucene...
If you have N slots, there would have to be N-1 commits + an Nth commit in progress while reading the "entry-count block" for there to be the possibility of a bad read. Make N large enough (max 256), and that should close the window, I think. Any way, just want to thank you Mike for sharing your thoughts and ideas. Time to try some of them.. Cheers, -Babak On Wed, Jan 20, 2010 at 3:41 AM, Michael McCandless <luc...@mikemccandless.com> wrote: > On Tue, Jan 19, 2010 at 10:45 PM, Babak Farhang <farh...@gmail.com> wrote: >>> I see -- so your file format allows you to append to the same file >>> without affecting prior readers? We never do that in Lucene today >>> (all files are "write once"). >> >> Yes. For the most part it only appends. The exception is when the >> log's entry count is updated (when the appends actually "commit"). >> That count is written into one of many (>= 2 and <= 257) slots in round >> robin fashion and finally a byte-size "keystone" is updated to determine >> the active slot. The idea is that by flushing the file just before and just >> after the keystone byte update we ensure persistence (assumes writing >> one byte is always all-or-nothing). Increasing the number of slots >> the count is written to narrows the chance of a bad read.. > > OK; this approach (modifying an already written & possible in-use (by > an IndexReader) file) would be problematic for Lucene... > >>> Good questions (terms dict & stored fields) -- I think what we'd do is >>> simply write a new segment, so stored fields, term vectors, postings, >>> terms dict, etc., are all private to that new segment. But, we'd mark >>> the segment as being a delta segment, referencing the original >>> segment, and we'd remap the docIDs when flushing that segment. >> >> The .fdx and .tvx files use fixed-width tables, so it seems if there are >> large gaps in the updated doc-ids, we'd have to fill in "null" rows for >> those gaps. Or do you have another data-structure in mind for these >> .*x files? > > Yeah, good point; we'd have to allow for a sparse index storage for fdx/tvx. > >> One solution may be to combine the approaches we described: maintain >> doc-id mappings for the .fdx and .tvx files for per-document >> data at search time, and index-time mapped doc-ids for the posting >> lists. > > True, we could do both... > > Mike > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org