Hello, I am trying to weigh some ideas for implementing paged search functionality in our system, which has these basic requirements:
- Using Solr is not an option (at the moment).
- Any Lucene 4.x version can be used.
- Result pagination is driven by the user application code.
- The user app can request a subset of results, without sequentially iterating from the start, by specifying the start/end of a range. The subset must correspond to the exact part of the full set that matches the specified offsets, as if that part had been requested to begin with; i.e., for each query, the result set must be "stable".
- The result set must also be detached from live data, i.e. concurrent mutations must not be reflected in the results, throughout the lifecycle of the whole set.

At the moment, I have come up with two different approaches to solve this, and would like some input.

In each case, the common part is to use a ReaderManager tied to the IndexWriter on the index. For each new query received, call ReaderManager.maybeRefresh() followed by acquire(), but also do the refresh in the background, on a timer - this is as recommended by the docs (a sketch of this common setup is at the end of this message). But here are the differences.

1. Initial idea

- When a new query is executed, I cache the DirectoryReader instance returned by acquire(), associating it with the query itself.
- Use a simple custom Collector that slurps in all doc ids for matches and keeps them in memory, in a plain array (see the second sketch below).
- Subsequent requests for individual result "pages" for that query use the cached reader, to meet the "snapshot" requirement, referencing doc ids at the requested offsets, i.e. IndexReader.document(id)... or I might use DocValues - that's still TBD; the key point is that I reuse the previously collected doc ids.
- When the app is done with the results, it indicates so, and I call ReaderManager.release(); all collected ids are also cleared.

2. Alternate method

- On query execution, fully materialize result objects from the search and persist them in binary form in a secondary index. These are basically serialized POJOs, indexed by a unique combination of requester/query/position ids (see the third sketch below).
- Once generated, these results never change until they are deleted from the secondary index by app-driven cleanup.
- Result block requests run against this secondary index, not against the live data.
- After the result set is materialized, the original IndexReader (on the primary index) is released.
- Thus, IndexReader instances are only kept around during query handling.

So the questions I have here are:

- Is my assumption correct that once opened, a particular IndexReader instance cannot see subsequent changes to the index it was opened on? If so, does every open imply an inline commit on the writer?
- What is the cost of keeping readers around in method 1, preventing them from closing - in terms of memory, file handles and locks?

Of course, in either approach, I plan on using a global limit on result sets to prevent misuse, similar to how a database might limit the number of open cursors. But this limit would depend on the method chosen, so any hints would be appreciated.
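For concreteness, here is a minimal sketch of the common ReaderManager setup described above, on Lucene 4.x. The SearchService wrapper, the one-second refresh interval, and the error handling are placeholders of mine, not settled design:

```java
import java.io.IOException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.ReaderManager;

// Shared, ref-counted access to near-real-time readers on the primary index.
class SearchService implements AutoCloseable {

    private final ReaderManager readerManager;
    private final ScheduledExecutorService refresher =
            Executors.newSingleThreadScheduledExecutor();

    SearchService(IndexWriter writer) throws IOException {
        // Tie the manager to the writer so uncommitted changes become
        // visible to newly refreshed readers.
        this.readerManager = new ReaderManager(writer, true);
        // Background refresh on a timer, so the per-query maybeRefresh()
        // below is usually a cheap no-op.
        refresher.scheduleWithFixedDelay(new Runnable() {
            @Override
            public void run() {
                try {
                    readerManager.maybeRefresh();
                } catch (IOException e) {
                    // log and retry on the next tick
                }
            }
        }, 1, 1, TimeUnit.SECONDS);
    }

    DirectoryReader acquire() throws IOException {
        readerManager.maybeRefresh();   // opportunistic refresh on the query path
        return readerManager.acquire(); // ref-counted; caller must release()
    }

    void release(DirectoryReader reader) throws IOException {
        readerManager.release(reader);
    }

    @Override
    public void close() throws IOException {
        refresher.shutdown();
        readerManager.close();
    }
}
```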
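The simple collector from the initial idea could look like the following, using the Lucene 4.x Collector API; the class name is mine. setNextReader() supplies each segment's docBase, which is needed to turn per-segment ids into top-level doc ids:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

// Gathers every matching doc id as a top-level (reader-wide) id.
class AllDocIdsCollector extends Collector {

    private final List<Integer> docIds = new ArrayList<Integer>();
    private int docBase;

    @Override
    public void setScorer(Scorer scorer) throws IOException {
        // scores are irrelevant; we only keep ids
    }

    @Override
    public void setNextReader(AtomicReaderContext context) throws IOException {
        // remember this segment's offset so collected ids are absolute
        docBase = context.docBase;
    }

    @Override
    public void collect(int doc) throws IOException {
        docIds.add(docBase + doc);
    }

    @Override
    public boolean acceptsDocsOutOfOrder() {
        // insist on in-order collection so the array reflects index order,
        // which keeps page boundaries deterministic
        return false;
    }

    int[] getDocIds() {
        int[] ids = new int[docIds.size()];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = docIds.get(i);
        }
        return ids;
    }
}
```

Paging would then run along these lines, where query, cachedReader, start and end come from the surrounding request handling:

```java
AllDocIdsCollector collector = new AllDocIdsCollector();
new IndexSearcher(cachedReader).search(query, collector);
int[] ids = collector.getDocIds();

// later, serving the page [start, end) against the same cached reader:
for (int i = start; i < Math.min(end, ids.length); i++) {
    Document doc = cachedReader.document(ids[i]);
    // ... map stored fields (or switch to DocValues, as mentioned) ...
}
```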
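And for the alternate method, one way to shape the secondary index is one stored Lucene document per materialized result, keyed so that a single TermQuery fetches one exact position. The field names, key scheme, and the assumption that results arrive as already-serialized byte arrays are all placeholders:

```java
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.util.BytesRef;

// One stored document per materialized result in the secondary index.
class ResultStore {

    private final IndexWriter resultsWriter; // writer on the secondary index

    ResultStore(IndexWriter resultsWriter) {
        this.resultsWriter = resultsWriter;
    }

    void store(String requesterId, String queryId, int position,
               byte[] serializedPojo) throws IOException {
        Document doc = new Document();
        // Composite key: one TermQuery retrieves one exact position.
        doc.add(new StringField("key",
                requesterId + "/" + queryId + "/" + position, Field.Store.NO));
        // Coarser key so cleanup can delete a whole result set at once.
        doc.add(new StringField("requesterQuery",
                requesterId + "/" + queryId, Field.Store.NO));
        doc.add(new StoredField("payload", serializedPojo));
        resultsWriter.addDocument(doc);
    }

    byte[] fetch(IndexSearcher resultsSearcher, String requesterId,
                 String queryId, int position) throws IOException {
        TopDocs hits = resultsSearcher.search(new TermQuery(
                new Term("key", requesterId + "/" + queryId + "/" + position)), 1);
        if (hits.totalHits == 0) {
            return null;
        }
        BytesRef payload = resultsSearcher
                .doc(hits.scoreDocs[0].doc).getBinaryValue("payload");
        return payload == null ? null : BytesRef.deepCopyOf(payload).bytes;
    }

    void deleteResultSet(String requesterId, String queryId) throws IOException {
        resultsWriter.deleteDocuments(
                new Term("requesterQuery", requesterId + "/" + queryId));
    }
}
```

The extra requesterQuery field is there so the app-driven cleanup becomes a single deleteDocuments(Term) call per result set, rather than one delete per position.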