UPDATE: I went with method 1, i.e. keeping IndexReader instances open between requests. Which brings me back to the original questions: is there any way to quantify the impact of not closing a particular IndexReader? Does it depend on the number of segments per index, open file count, etc.?
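In case it helps anyone following along: what I've gathered so far is that the cost is mostly per-segment. A reader you hold open pins the file handles of every segment in its point-in-time view, including segments the writer has since merged away (those files can't be deleted from disk until the last reader referencing them is released). Something like the sketch below can put rough numbers on a cached reader; note the method and variable names are mine, and `ramBytesUsed()` on the segment readers is only available in later 4.x releases (4.9+, where SegmentReader implements Accountable):

```java
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.SegmentReader;

public class ReaderFootprint {

    // Rough footprint probe for one cached point-in-time reader.
    // Not exact accounting - just the two levers that matter:
    // segment count (pinned file handles) and per-segment heap.
    public static long probe(DirectoryReader reader) {
        // One leaf per segment; each leaf keeps its own files open
        // until the reader's refcount drops to zero.
        System.out.println("segments pinned: " + reader.leaves().size());

        long heapBytes = 0;
        for (AtomicReaderContext ctx : reader.leaves()) {
            // Covers terms index, norms, doc values state, etc.
            // Requires Lucene 4.9+ (Accountable).
            heapBytes += ((SegmentReader) ctx.reader()).ramBytesUsed();
        }
        System.out.println("approx heap held: " + heapBytes + " bytes");
        return heapBytes;
    }
}
```

There are no index-level locks held by a reader, by the way - the write lock belongs to the IndexWriter - so handles and heap are the two things this stale-reader caching actually costs.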
On Thu, Oct 10, 2013 at 7:01 PM, Vitaly Funstein <vfunst...@gmail.com> wrote:

> Hello,
>
> I am trying to weigh some ideas for implementing paged search
> functionality in our system, which has these basic requirements:
>
> - Using Solr is not an option (at the moment).
> - Any Lucene 4.x version can be used.
> - Result pagination is driven by the user application code.
> - User app can request a subset of results, without sequentially
>   iterating from the start, by specifying start/end of range. The subset
>   must correspond to the exact part of the full set matching those
>   offsets, had the full set been requested to begin with - i.e. for each
>   query, the result set must be "stable".
> - Result set must also be detached from live data, i.e. concurrent
>   mutations must not be reflected in the results, throughout the
>   lifecycle of the whole set.
>
> At the moment, I have come up with two different approaches to solve
> this, and would like some input.
>
> In each case, the common part is to use ReaderManager, tied to the
> IndexWriter on the index. For each new query received, call
> ReaderManager.maybeRefresh(), followed by acquire(), but also do the
> refresh in the background, on a timer - this is as recommended by the
> docs. But here are the differences.
>
> 1. Initial idea
>    - When a new query is executed, cache the DirectoryReader instance
>      returned by acquire(), associating it with the query itself.
>    - Use a simple custom Collector that slurps in all doc ids for
>      matches, and keeps them in memory, in a plain array.
>    - Subsequent requests for individual result "pages" for that query
>      use the cached reader, to meet the "snapshot" requirement,
>      referencing doc ids at the requested offsets, i.e.
>      IndexReader.document(id)... or I might use DocValues - that's
>      still TBD; the key is that I reuse the previously collected doc ids.
>    - When the app is done with the results, it indicates so and I call
>      ReaderManager.release(); all collected ids are also cleared.
> 2. Alternate method
>    - On query execution, fully materialize result objects from the
>      search and persist them in binary form in a secondary index. These
>      are basically serialized POJOs, indexed by a unique combination of
>      requester/query/position ids.
>    - Once generated, these results never change until deleted from the
>      secondary index due to app-driven cleanup.
>    - Result block requests run against this index, not the live data.
>    - After materializing the result set, the original IndexReader (from
>      the primary index) is released.
>    - Thus, IndexReader instances are only kept around during query
>      handling.
>
> So the questions I have here are:
>
> - Is my assumption correct that once opened, a particular IndexReader
>   instance cannot see subsequent changes to the index it was opened on?
>   If so, does every open imply an inline commit on the writer?
> - What is the cost of keeping readers around in method 1, preventing
>   them from closing - in terms of memory, file handles and locks?
>
> Of course, in either approach, I plan on using a global result set limit
> to prevent misuse, similar to how a database might set a limit on open
> result cursors. But this limit would be dependent on the method chosen
> from above, so any hints would be appreciated.
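To make the common part concrete, here is roughly what the method 1 flow looks like against the Lucene 4.x API (a sketch - class and method names are mine, and the collector is the naive "slurp all ids" one described above). Note that maybeRefresh() gives you a near-real-time reopen without committing the writer, which answers the first question: the pinned reader is frozen at acquire() time, and no inline commit is implied.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ReaderManager;
import org.apache.lucene.search.Scorer;

public class SnapshotSearch {

    // The "slurp all ids" collector: records every matching global doc id.
    static final class AllDocIdsCollector extends Collector {
        final List<Integer> ids = new ArrayList<Integer>();
        private int docBase; // global offset of the current segment

        @Override public void setScorer(Scorer scorer) { /* scores not needed */ }
        @Override public void collect(int doc) { ids.add(docBase + doc); } // segment-local -> global
        @Override public void setNextReader(AtomicReaderContext ctx) { docBase = ctx.docBase; }
        @Override public boolean acceptsDocsOutOfOrder() { return false; } // index order => stable pages
    }

    // Method 1: refresh, pin a point-in-time reader, collect all ids.
    // The returned reader MUST eventually go back through mgr.release(),
    // or its segments stay pinned forever. Method 2 would release it
    // right here, after materializing results.
    public static DirectoryReader snapshot(ReaderManager mgr, Query q, List<Integer> out)
            throws IOException {
        mgr.maybeRefresh();                     // NRT reopen; does NOT commit the writer
        DirectoryReader reader = mgr.acquire(); // refcount++, this view is now frozen
        AllDocIdsCollector c = new AllDocIdsCollector();
        new IndexSearcher(reader).search(q, c);
        out.addAll(c.ids);
        return reader; // cache alongside the ids; pages read via reader.document(id)
    }
}
```

Page requests then just index into the cached id list for the requested start/end range and call reader.document(id) (or read DocValues) on the cached reader, so concurrent writes never leak into an already-open result set.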