Have a look at this article if you have not already read it:
http://blog.mikemccandless.com/2011/06/lucenes-near-real-time-search-is-fast.html
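
In case it helps alongside the article, here is a minimal sketch of one way to keep searches from waiting on a slow reopen: only the background thread performs the refresh, and search threads just read whatever reader was published last. RefreshingHolder and its reopenIfChanged function are hypothetical names of mine, not Lucene classes. The function stands in for a reopen call that returns null when nothing has changed, like the openIfChanged call in your method. The sketch leaves out closing the old reader, which would need reference counting so that in-flight searches are not cut off.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

/**
 * Hypothetical holder (not a Lucene class): search threads read the current
 * value without blocking while one background thread swaps in fresh versions.
 */
final class RefreshingHolder<T> {
    private final AtomicReference<T> current;
    // Assumed stand-in for a reopen call: returns null when nothing changed.
    private final UnaryOperator<T> reopenIfChanged;

    RefreshingHolder(T initial, UnaryOperator<T> reopenIfChanged) {
        this.current = new AtomicReference<>(initial);
        this.reopenIfChanged = reopenIfChanged;
    }

    /** Called by search threads; never waits for a refresh in progress. */
    T get() {
        return current.get();
    }

    /**
     * Called only by the background thread.
     * Returns true if a new value was published.
     */
    boolean refreshOnce() {
        T old = current.get();
        // This call may be slow; searches keep using 'old' in the meantime.
        T fresh = reopenIfChanged.apply(old);
        if (fresh == null) {
            return false; // nothing changed, keep serving the old value
        }
        current.set(fresh);
        // Not shown: releasing 'old' once no search is still using it.
        return true;
    }
}
```

The background thread that already wakes up every 1.2 seconds could call refreshOnce() after it updates the index, so an occasional long reopen delays only that thread and not the incoming searches. I believe Lucene 3.x also ships a SearcherManager class that handles this kind of swap together with the reference counting, but please check the javadocs for your exact version.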

On Thu, Aug 14, 2014 at 11:16 PM, Michael Jennings <mike.c.jenni...@gmail.com> wrote:

> Hi everyone,
>
> I'm a bit of a Lucene newb, but a fairly experienced Java developer. Hope
> someone can give me some clues as to what I may be doing wrong.
>
> In essence I've got a lucene index built off of a database table that gets
> updated at a rate of about 1 row changing every 2 seconds or so. I've got a
> webapp whose sole purpose in life is to provide a simple front end for
> searching this table.
>
> The table in question lives in an Oracle db (not that Java cares) and it
> has 2 datetime/timestamp columns; ent_dtm and upd_dtm. When a new row gets
> inserted into the table, a trigger sets the ent_dtm to be "right now". When
> a row gets updated, a trigger sets the upd_dtm to be "right now".
>
> queries like: SELECT COL1, COL2,... COLn from THE_TABLE where ENT_DTM >
> (some timestamp) are very fast, as are queries like:
>
> SELECT COL1, COL2,... COLn from THE_TABLE where UPD_DTM > (some timestamp)
>
> These are the sorts of queries I use to keep my lucene index "in synch"
> with the table and these queries are fast and there are no issues with
> them.
>
> As you would expect, each Document in my lucene index roughly corresponds
> to a row in THE_TABLE, including 2 fields called "ent_dtm" and "upd_dtm"
>
> THE_TABLE has a primary key which I will call THE_ID. Correspondingly, a
> Document in the Lucene index has a field called "the_id"
>
> values of "the_id" are typically numbers (Field.Store.YES,
> Field.Index.NOT_ANALYZED_NO_NORMS) with the exception of a "special" value
> of "newest". The Document with the field "the_id" with the value of
> "newest" contains just 2 more fields, ent_dtm and upd_dtm.
>
> This Document is just used to keep track of "what's the newest thing in
> Lucene's world"
>
> So this is what my webapp is doing:
>
> In a background thread, every 1.2 seconds it checks the Lucene index for
> "what's the newest thing in my world" (call that X) uses that to hit the
> database asking it in essence "have you got anything newer in your world
> than X", if it returns say 3 rows newer than X, call the newest of those
> rows Y.
>
> Then, this background thread updates the Document with the_id="newest" with
> Y then goes to sleep again for 1.2 seconds. Lather, rinse, repeat.
>
> Incoming search requests attempt to use a "Near Real Time" IndexReader
> (with an IndexSearcher wrapped around it) to search the index.
>
> Again, everything seems to do what it says on the box.
>
> My problem is that I can't seem to avoid the occasional 100 second pause
> while IndexReader "refreshes itself".
>
> I create my one-and-only shared IndexReader thusly:
>
> indexReader = IndexReader.open(indexWriter, true);
>
> and I check if it needs to be refreshed by calling indexReader.isCurrent()
>
> and I "refresh" it with the following method:
>
> public static IndexReader freshVersionOf(IndexReader indexReader) throws IOException {
>     StopWatch stopWatch = new StopWatch();
>     final IndexReader newReader = IndexReader.openIfChanged(indexReader, true);
>     logger.info("IndexReader.openIfChanged() took " + stopWatch.elapsedSeconds() + " seconds");
>     if (newReader == null) {
>         return indexReader;
>     } else {
>         indexReader.close();
>         return newReader;
>     }
> }
>
> Which is basically a Lucene method moved into a static method in my own
> code (my method closes the old indexReader, that's the only difference)
>
> Sometimes IndexReader.openIfChanged(indexReader, true); takes what seems
> like a crapload of time. If I don't "freshen" the IndexReader, it doesn't
> see the latest-and-greatest timestamp (ie. what is newest in the Lucene
> world).
> I've tried doing indexWriter.commit() in my background thread, but
> that can take on the order of 100 seconds as well.
>
> Anyway, all the searching and updating of the index is all working just
> fine, it's just that I'm seeing these occasional long periods of time which
> seem to be unavoidable.
>
> Any suggestions of things to try would be appreciated!
>
> PS. I'm using Lucene 3.6 which it seems lots of people have used
> successfully in the past, so I'm guessing the "use the newer Lucene" won't
> necessarily help me.
>
> --
> Mike Jennings