UPDATE: I went with method 1, i.e. keeping IndexReader instances open between requests. Which brings me back to the original questions: is there any way to quantify the impact of not closing a particular IndexReader? Does it depend on the number of segments per index, open file count, etc.?
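In case it helps anyone following along: what I've gathered so far is that the cost is mostly per-segment. A reader you hold open pins the file handles of every segment in its point-in-time view, including segments the writer has since merged away (those files can't be deleted from disk until the last reader referencing them is released). Something like the sketch below can put rough numbers on a cached reader; note the method and variable names are mine, and `ramBytesUsed()` on the segment readers is only available in later 4.x releases (4.9+, where SegmentReader implements Accountable):

```java
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.SegmentReader;

public class ReaderFootprint {

    // Rough footprint probe for one cached point-in-time reader.
    // Not exact accounting - just the two levers that matter:
    // segment count (pinned file handles) and per-segment heap.
    public static long probe(DirectoryReader reader) {
        // One leaf per segment; each leaf keeps its own files open
        // until the reader's refcount drops to zero.
        System.out.println("segments pinned: " + reader.leaves().size());

        long heapBytes = 0;
        for (AtomicReaderContext ctx : reader.leaves()) {
            // Covers terms index, norms, doc values state, etc.
            // Requires Lucene 4.9+ (Accountable).
            heapBytes += ((SegmentReader) ctx.reader()).ramBytesUsed();
        }
        System.out.println("approx heap held: " + heapBytes + " bytes");
        return heapBytes;
    }
}
```

There are no index-level locks held by a reader, by the way - the write lock belongs to the IndexWriter - so handles and heap are the two things this stale-reader caching actually costs.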
On Thu, Oct 10, 2013 at 7:01 PM, Vitaly Funstein <vfunst...@gmail.com> wrote:

> Hello,
>
> I am trying to weigh some ideas for implementing paged search
> functionality in our system, which has these basic requirements:
>
> - Using Solr is not an option (at the moment).
> - Any Lucene 4.x version can be used.
> - Result pagination is driven by the user application code.
> - User app can request a subset of results, without sequentially
>   iterating from the start, by specifying start/end of range. The subset
>   must correspond to the exact part of the full set matching those
>   offsets, had the full set been requested to begin with - i.e. for each
>   query, the result set must be "stable".
> - Result set must also be detached from live data, i.e. concurrent
>   mutations must not be reflected in the results, throughout the
>   lifecycle of the whole set.
>
> At the moment, I have come up with two different approaches to solve
> this, and would like some input.
>
> In each case, the common part is to use ReaderManager, tied to the
> IndexWriter on the index. For each new query received, call
> ReaderManager.maybeRefresh(), followed by acquire(), but also do the
> refresh in the background, on a timer - this is as recommended by the
> docs. But here are the differences.
>
> 1. Initial idea
>    - When a new query is executed, cache the DirectoryReader instance
>      returned by acquire(), associating it with the query itself.
>    - Use a simple custom Collector that slurps in all doc ids for
>      matches, and keeps them in memory, in a plain array.
>    - Subsequent requests for individual result "pages" for that query
>      use the cached reader, to meet the "snapshot" requirement,
>      referencing doc ids at the requested offsets, i.e.
>      IndexReader.document(id)... or I might use DocValues - that's
>      still TBD; the key is that I reuse the previously collected doc ids.
>    - When the app is done with the results, it indicates so and I call
>      ReaderManager.release(); all collected ids are also cleared.
> 2. Alternate method
>    - On query execution, fully materialize result objects from the
>      search and persist them in binary form in a secondary index. These
>      are basically serialized POJOs, indexed by a unique combination of
>      requester/query/position ids.
>    - Once generated, these results never change until deleted from the
>      secondary index due to app-driven cleanup.
>    - Result block requests run against this index, not the live data.
>    - After materializing the result set, the original IndexReader (from
>      the primary index) is released.
>    - Thus, IndexReader instances are only kept around during query
>      handling.
>
> So the questions I have here are:
>
> - Is my assumption correct that once opened, a particular IndexReader
>   instance cannot see subsequent changes to the index it was opened on?
>   If so, does every open imply an inline commit on the writer?
> - What is the cost of keeping readers around in method 1, preventing
>   them from closing - in terms of memory, file handles and locks?
>
> Of course, in either approach, I plan on using a global result set limit
> to prevent misuse, similar to how a database might set a limit on open
> result cursors. But this limit would be dependent on the method chosen
> from above, so any hints would be appreciated.
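To make the common part concrete, here is roughly what the method 1 flow looks like against the Lucene 4.x API (a sketch - class and method names are mine, and the collector is the naive "slurp all ids" one described above). Note that maybeRefresh() gives you a near-real-time reopen without committing the writer, which answers the first question: the pinned reader is frozen at acquire() time, and no inline commit is implied.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ReaderManager;
import org.apache.lucene.search.Scorer;

public class SnapshotSearch {

    // The "slurp all ids" collector: records every matching global doc id.
    static final class AllDocIdsCollector extends Collector {
        final List<Integer> ids = new ArrayList<Integer>();
        private int docBase; // global offset of the current segment

        @Override public void setScorer(Scorer scorer) { /* scores not needed */ }
        @Override public void collect(int doc) { ids.add(docBase + doc); } // segment-local -> global
        @Override public void setNextReader(AtomicReaderContext ctx) { docBase = ctx.docBase; }
        @Override public boolean acceptsDocsOutOfOrder() { return false; } // index order => stable pages
    }

    // Method 1: refresh, pin a point-in-time reader, collect all ids.
    // The returned reader MUST eventually go back through mgr.release(),
    // or its segments stay pinned forever. Method 2 would release it
    // right here, after materializing results.
    public static DirectoryReader snapshot(ReaderManager mgr, Query q, List<Integer> out)
            throws IOException {
        mgr.maybeRefresh();                     // NRT reopen; does NOT commit the writer
        DirectoryReader reader = mgr.acquire(); // refcount++, this view is now frozen
        AllDocIdsCollector c = new AllDocIdsCollector();
        new IndexSearcher(reader).search(q, c);
        out.addAll(c.ids);
        return reader; // cache alongside the ids; pages read via reader.document(id)
    }
}
```

Page requests then just index into the cached id list for the requested start/end range and call reader.document(id) (or read DocValues) on the cached reader, so concurrent writes never leak into an already-open result set.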