On 10/13/13 8:09 PM, Michael Sokolov wrote:
On 10/13/2013 1:52 PM, Adrien Grand wrote:
Hi Michael,

I'm not aware enough of operating system internals to know what
exactly happens when a file is open, but it sounds to me like having
separate files per document or field adds levels of indirection when
loading stored fields, so I would be surprised if it actually proved
to be more efficient than storing everything in a single file.

That's true, Adrien, there's definitely a cost to using files. There are some gnarly challenges here (mostly to do with the large number of files, as you say, and with cleaning up after deletes - deletion is always hard). I'm not sure it's going to be possible both to clean up files and to maintain those still needed by stale commits; this becomes problematic in the same way that keeping index files on NFS mounts is problematic.

I think the hope is that there will be countervailing savings during writes and (mostly) merges, because we may be able to avoid copying the contents of stored fields being merged. There may also be savings at query time due to reduced RAM pressure, since the large stored fields won't be paged in while queries are running. As I said, some simple tests do show improvements under at least some circumstances, so I'm pursuing this a bit further. I have a preliminary implementation as a codec now, and I'm learning a bit about Lucene's index internals along the way. BTW, SimpleTextCodec is a great tool for learning and debugging.
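In case it's useful to anyone else poking at codec internals, enabling it for a scratch index is just a matter of setting it on the IndexWriterConfig - something like the sketch below (written against the Lucene 4.x API; the path and analyzer are just placeholders, and SimpleTextCodec lives in the lucene-codecs module):

    import java.io.File;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.codecs.simpletext.SimpleTextCodec;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class SimpleTextDebug {
      public static void main(String[] args) throws Exception {
        // Write a throwaway index with SimpleTextCodec so every index file is
        // plain text and can be inspected in an editor.
        Directory dir = FSDirectory.open(new File("/tmp/simpletext-index"));
        IndexWriterConfig iwc =
            new IndexWriterConfig(Version.LUCENE_45, new StandardAnalyzer(Version.LUCENE_45));
        iwc.setCodec(new SimpleTextCodec());
        try (IndexWriter writer = new IndexWriter(dir, iwc)) {
          Document doc = new Document();
          doc.add(new StringField("id", "1", Field.Store.YES));
          writer.addDocument(doc);
        }
      }
    }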

The background for this is a document store with large files (think PDFs, but lots of formats) that have to be tracked, and have associated metadata. We've been storing these externally, but it would be beneficial to have a single data management layer: i.e. to push this down into Lucene, for a variety of reasons. For one, we could rely on Solr to do our replication for us.

I'll post back when I have some measurements.

-Mike
This idea actually does seem to be working out pretty nicely. I compared time to write and then to read documents that included a couple of small indexed fields and a binary stored field that varied in size. Writing to external files, via the FSFieldCodec, was 3-20 times faster than writing to the index in the normal way (using MMapDirectory). Reading was sometimes faster and sometimes slower. I also measured time for a forceMerge(1) at the end of each test: this was almost always nearly zero when binaries were external, and grew larger with more data in the normal case. I believe the improvements we're seeing here result largely from removing the bulk of the data from the merge I/O path.
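For what it's worth, each run boiled down to indexing documents shaped roughly like the sketch below and timing the add/commit loop, the reads, and the final forceMerge(1). This is just an illustrative sketch, not the actual harness - the field names, blob sizes, and commit cadence are made up:

    import java.util.Random;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StoredField;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.IndexWriter;

    public class StoredBlobTestSketch {
      // A couple of small indexed fields plus one large binary stored field per
      // document; commit periodically, then forceMerge(1) to measure merge cost.
      static void indexAndMerge(IndexWriter writer, int numDocs, int blobSize) throws Exception {
        Random random = new Random(42);
        for (int i = 0; i < numDocs; i++) {
          byte[] blob = new byte[blobSize];
          random.nextBytes(blob);
          Document doc = new Document();
          doc.add(new StringField("id", Integer.toString(i), Field.Store.YES));
          doc.add(new StringField("type", "pdf", Field.Store.YES));
          doc.add(new StoredField("content", blob)); // the large binary payload
          writer.addDocument(doc);
          if (i % 1000 == 999) {
            writer.commit();
          }
        }
        writer.commit();
        long start = System.nanoTime();
        writer.forceMerge(1);
        System.out.println("forceMerge(1): " + (System.nanoTime() - start) / 1000000 + " ms");
      }
    }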

As with any performance testing, a lot of factors can affect the numbers, but the effect seems pretty robust across the conditions I measured (different file sizes, numbers of files, and commit frequencies, with lots of repetition). One oddity is a large difference between the Mac SSD filesystem (15-20x faster writing, 0.6x reading via FSFieldCodec) and a Linux ext4 HD filesystem (3-4x writing, 1.5x reading).

The codec works as a wrapper around another codec (like the compressing codecs), intercepting binary and string stored fields larger than a configurable threshold and storing a file number as a reference in the main index, which then functions kind of like a symlink. It also intercepts merges in order to clean up files that are no longer referenced, taking special care to preserve the wrapped codec's ability to perform bulk merges. The codec passes all the Lucene unit tests in the o.a.l.index package.
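For anyone curious where such a wrapper plugs in, the skeleton looks roughly like this (hypothetical class names, not my actual code; the threshold check and file cleanup live in the reader/writer, which are elided here, and the codec name also has to be registered via SPI in META-INF/services/org.apache.lucene.codecs.Codec so the index can be opened again later):

    import java.io.IOException;

    import org.apache.lucene.codecs.Codec;
    import org.apache.lucene.codecs.FilterCodec;
    import org.apache.lucene.codecs.StoredFieldsFormat;
    import org.apache.lucene.codecs.StoredFieldsReader;
    import org.apache.lucene.codecs.StoredFieldsWriter;
    import org.apache.lucene.index.FieldInfos;
    import org.apache.lucene.index.SegmentInfo;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.IOContext;

    // Everything is delegated to the wrapped codec except the stored fields
    // format, which is where large values would be diverted to external files
    // and replaced in the index by a small file-number reference.
    public class ExternalFieldsCodec extends FilterCodec {

      private final StoredFieldsFormat storedFields;

      public ExternalFieldsCodec() {
        super("ExternalFieldsCodec", Codec.getDefault());
        storedFields = new ExternalStoredFieldsFormat(delegate.storedFieldsFormat());
      }

      @Override
      public StoredFieldsFormat storedFieldsFormat() {
        return storedFields;
      }

      private static class ExternalStoredFieldsFormat extends StoredFieldsFormat {
        private final StoredFieldsFormat delegate;

        ExternalStoredFieldsFormat(StoredFieldsFormat delegate) {
          this.delegate = delegate;
        }

        @Override
        public StoredFieldsReader fieldsReader(Directory dir, SegmentInfo si,
                                               FieldInfos fn, IOContext context) throws IOException {
          // A real implementation wraps this reader to resolve file-number
          // references back into the original bytes.
          return delegate.fieldsReader(dir, si, fn, context);
        }

        @Override
        public StoredFieldsWriter fieldsWriter(Directory dir, SegmentInfo si,
                                               IOContext context) throws IOException {
          // A real implementation wraps this writer to spill values above a
          // size threshold to their own files.
          return delegate.fieldsWriter(dir, si, context);
        }
      }
    }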

The implementation is still very experimental, and there are lots of details to be worked out; for example, I haven't yet measured the performance impact of deletions, which could be pretty significant. It would be really great if someone with intimate knowledge of Lucene's indexing internals could review it: I'd be happy to share the code and my list of TODOs and questions if there's any interest. In the meantime, I thought it was worth reporting that the approach does seem to be worth pursuing.

-Mike
