I've been asked to do a project which provides full-text search for a large database of articles. The expectation is that most of the articles are fairly small (<2k bytes). There will be an initial population of around 400,000 articles. There will then be approximately 2000 new articles added each day (they need to be added in "real time" (within a few minutes of arrival), but will be spread out during the day). So, roughly another 700,000 articles each year.
I've read enough to believe that having a lucene database of several million articles is doable. And, adding 2000 articles per day wouldn't seem to be that many. My concern is the real-time nature of the application. I'm a bit nervous (perhaps without justification) at simply growing one monolithic lucene database. Should there be a crash, the database will be unusable and I'll have to rebuild from scratch (which, based on my experience, would be hours of time). Some of my thoughts were: 1) having monthly databases and using MultiSearcher to search across them. That way my exposure for a corrupted database is limited to this month's database. This would also seem to give me somewhat better control--meaning a) if the search was generating lots of hits, I could display the results a month at a time and not bury them with output. It would also spread their search CPU out better and not prevent other individuals from doing a search. If there were very few results, I could sleep between each month's search and again, not lock everyone else out from searches. 2) Have a "this month's" searchable and an "everything else" searchable. At the beginning of each month, I would consolidate the previous month's database into the "everything else" searchable. This would give more consistent results for relevancy ranked searches. But, it means that a bad search could return lots of results. Has anyone else dealt with a similar problem? Am I expecting too much from Lucene running on a single machine (or should I be looking at Hadoop?). Any comments or links to previous discussions on this topic would be appreciated. Scott