On 10/13/2013 1:52 PM, Adrien Grand wrote:
Hi Michael,

I'm not aware enough of operating system internals to know what
exactly happens when a file is open, but it sounds to me like having
separate files per document or field adds levels of indirection when
loading stored fields, so I would be surprised if it actually proved
to be more efficient than storing everything in a single file.

That's true, Adrien, there's definitely a cost to using files. There are some gnarly challenges in here, mostly to do with the large number of files, as you say, and with cleaning up after deletes (deletion is always hard). I'm not sure it's going to be possible to both clean up and maintain files for stale commits; this will become problematic in the same way that having index files on NFS mounts is problematic.

I think the hope is that there will be countervailing savings during writes and merges (mostly) because we may be able to cleverly avoid copying the contents of stored fields being merged. There may also be savings when querying due to reduced RAM requirements since the large stored fields won't be paged in while performing queries. As I said, some simple tests do show improvements under at least some circumstances, so I'm pursuing this a bit further. I have a preliminary implementation as a codec now, and I'm learning a bit about Lucene's index internals. BTW SimpleTextCodec is a great tool for learning and debugging.
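
To make the "avoid copying during merges" idea concrete, here is a minimal sketch. It is not the codec itself and uses no Lucene APIs. It assumes each large stored field value lives in its own file inside a per-segment directory, and that a merge can carry a value over by hard-linking the file into the merged segment's directory instead of rewriting its bytes. The class name ExternalFieldMerge, the method carryOver, and the directory layout are hypothetical and exist only for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of Lucene. Assumes one file per large stored field value.
final class ExternalFieldMerge {

    /**
     * Makes a field file from an old segment directory available in the
     * merged segment directory. A hard link is tried first, so no field
     * bytes are copied; if linking fails (unsupported file system,
     * cross-device link, existing target), the bytes are copied instead.
     *
     * @return true if the file was hard-linked, false if it was copied
     */
    static boolean carryOver(Path oldSegmentDir, Path mergedSegmentDir, String fieldFile)
            throws IOException {
        Path source = oldSegmentDir.resolve(fieldFile);
        Path target = mergedSegmentDir.resolve(fieldFile);
        try {
            Files.createLink(target, source);
            return true;
        } catch (UnsupportedOperationException | IOException e) {
            // Fallback path: this is the copy the scheme is trying to avoid.
            Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
            return false;
        }
    }
}
```

With hard links, removing the old segment's name for the file does not free the data until the last link is gone. That might help with keeping files alive for stale commits, but it depends on file-system behavior that a network mount may not provide.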

The background for this is a document store with large files (think PDFs, but lots of formats) that have to be tracked, and have associated metadata. We've been storing these externally, but it would be beneficial to have a single data management layer: i.e. to push this down into Lucene, for a variety of reasons. For one, we could rely on Solr to do our replication for us.

I'll post back when I have some measurements.

-Mike
