Hello, I am trying to weigh some ideas for implementing paged search functionality in our system, which has these basic requirements:
- Using Solr is not an option (at the moment).
- Any Lucene 4.x version can be used.
- Result pagination is driven by the user application code.
- The user app can request a subset of results, without sequentially iterating from the start, by specifying the start/end of a range. The subset must correspond to the exact part of the full set that matches the specified offsets, as if that part had been requested to begin with; i.e., for each query, the result set must be "stable".
- The result set must also be detached from live data, i.e. concurrent mutations must not be reflected in the results, throughout the lifecycle of the whole set.

At the moment, I have come up with two different approaches to solve this, and would like some input.

In each case, the common part is to use a ReaderManager tied to the IndexWriter on the index. For each new query received, call ReaderManager.maybeRefresh() followed by acquire(), but also do the refresh in the background, on a timer - this is as recommended by the docs (a sketch of this common setup is at the end of this message). But here are the differences.

1. Initial idea

- When a new query is executed, I cache the DirectoryReader instance returned by acquire(), associating it with the query itself.
- Use a simple custom Collector that slurps in all doc ids for matches and keeps them in memory, in a plain array (see the second sketch below).
- Subsequent requests for individual result "pages" for that query use the cached reader, to meet the "snapshot" requirement, referencing doc ids at the requested offsets, i.e. IndexReader.document(id)... or I might use DocValues - that's still TBD; the key point is that I reuse the previously collected doc ids.
- When the app is done with the results, it indicates so, and I call ReaderManager.release(); all collected ids are also cleared.

2. Alternate method

- On query execution, fully materialize result objects from the search and persist them in binary form in a secondary index. These are basically serialized POJOs, indexed by a unique combination of requester/query/position ids (see the third sketch below).
- Once generated, these results never change until they are deleted from the secondary index by app-driven cleanup.
- Result block requests run against this secondary index, not against the live data.
- After the result set is materialized, the original IndexReader (on the primary index) is released.
- Thus, IndexReader instances are only kept around during query handling.

So the questions I have here are:

- Is my assumption correct that once opened, a particular IndexReader instance cannot see subsequent changes to the index it was opened on? If so, does every open imply an inline commit on the writer?
- What is the cost of keeping readers around in method 1, preventing them from closing - in terms of memory, file handles and locks?

Of course, in either approach, I plan on using a global limit on result sets to prevent misuse, similar to how a database might limit the number of open cursors. But this limit would depend on the method chosen, so any hints would be appreciated.
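For concreteness, here is a minimal sketch of the common ReaderManager setup described above, on Lucene 4.x. The SearchService wrapper, the one-second refresh interval, and the error handling are placeholders of mine, not settled design:

```java
import java.io.IOException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.ReaderManager;

// Shared, ref-counted access to near-real-time readers on the primary index.
class SearchService implements AutoCloseable {

    private final ReaderManager readerManager;
    private final ScheduledExecutorService refresher =
            Executors.newSingleThreadScheduledExecutor();

    SearchService(IndexWriter writer) throws IOException {
        // Tie the manager to the writer so uncommitted changes become
        // visible to newly refreshed readers.
        this.readerManager = new ReaderManager(writer, true);
        // Background refresh on a timer, so the per-query maybeRefresh()
        // below is usually a cheap no-op.
        refresher.scheduleWithFixedDelay(new Runnable() {
            @Override
            public void run() {
                try {
                    readerManager.maybeRefresh();
                } catch (IOException e) {
                    // log and retry on the next tick
                }
            }
        }, 1, 1, TimeUnit.SECONDS);
    }

    DirectoryReader acquire() throws IOException {
        readerManager.maybeRefresh();   // opportunistic refresh on the query path
        return readerManager.acquire(); // ref-counted; caller must release()
    }

    void release(DirectoryReader reader) throws IOException {
        readerManager.release(reader);
    }

    @Override
    public void close() throws IOException {
        refresher.shutdown();
        readerManager.close();
    }
}
```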
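The simple collector from the initial idea could look like the following, using the Lucene 4.x Collector API; the class name is mine. setNextReader() supplies each segment's docBase, which is needed to turn per-segment ids into top-level doc ids:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

// Gathers every matching doc id as a top-level (reader-wide) id.
class AllDocIdsCollector extends Collector {

    private final List<Integer> docIds = new ArrayList<Integer>();
    private int docBase;

    @Override
    public void setScorer(Scorer scorer) throws IOException {
        // scores are irrelevant; we only keep ids
    }

    @Override
    public void setNextReader(AtomicReaderContext context) throws IOException {
        // remember this segment's offset so collected ids are absolute
        docBase = context.docBase;
    }

    @Override
    public void collect(int doc) throws IOException {
        docIds.add(docBase + doc);
    }

    @Override
    public boolean acceptsDocsOutOfOrder() {
        // insist on in-order collection so the array reflects index order,
        // which keeps page boundaries deterministic
        return false;
    }

    int[] getDocIds() {
        int[] ids = new int[docIds.size()];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = docIds.get(i);
        }
        return ids;
    }
}
```

Paging would then run along these lines, where query, cachedReader, start and end come from the surrounding request handling:

```java
AllDocIdsCollector collector = new AllDocIdsCollector();
new IndexSearcher(cachedReader).search(query, collector);
int[] ids = collector.getDocIds();

// later, serving the page [start, end) against the same cached reader:
for (int i = start; i < Math.min(end, ids.length); i++) {
    Document doc = cachedReader.document(ids[i]);
    // ... map stored fields (or switch to DocValues, as mentioned) ...
}
```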
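And for the alternate method, one way to shape the secondary index is one stored Lucene document per materialized result, keyed so that a single TermQuery fetches one exact position. The field names, key scheme, and the assumption that results arrive as already-serialized byte arrays are all placeholders:

```java
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.util.BytesRef;

// One stored document per materialized result in the secondary index.
class ResultStore {

    private final IndexWriter resultsWriter; // writer on the secondary index

    ResultStore(IndexWriter resultsWriter) {
        this.resultsWriter = resultsWriter;
    }

    void store(String requesterId, String queryId, int position,
               byte[] serializedPojo) throws IOException {
        Document doc = new Document();
        // Composite key: one TermQuery retrieves one exact position.
        doc.add(new StringField("key",
                requesterId + "/" + queryId + "/" + position, Field.Store.NO));
        // Coarser key so cleanup can delete a whole result set at once.
        doc.add(new StringField("requesterQuery",
                requesterId + "/" + queryId, Field.Store.NO));
        doc.add(new StoredField("payload", serializedPojo));
        resultsWriter.addDocument(doc);
    }

    byte[] fetch(IndexSearcher resultsSearcher, String requesterId,
                 String queryId, int position) throws IOException {
        TopDocs hits = resultsSearcher.search(new TermQuery(
                new Term("key", requesterId + "/" + queryId + "/" + position)), 1);
        if (hits.totalHits == 0) {
            return null;
        }
        BytesRef payload = resultsSearcher
                .doc(hits.scoreDocs[0].doc).getBinaryValue("payload");
        return payload == null ? null : BytesRef.deepCopyOf(payload).bytes;
    }

    void deleteResultSet(String requesterId, String queryId) throws IOException {
        resultsWriter.deleteDocuments(
                new Term("requesterQuery", requesterId + "/" + queryId));
    }
}
```

The extra requesterQuery field is there so the app-driven cleanup becomes a single deleteDocuments(Term) call per result set, rather than one delete per position.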