How many documents are there in the system? You can approximate it as: 20000 files * avg(docs/file)
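
As a rough illustration of that estimate, here is a small sketch. Only the 20000 comes from the thread. The docs-per-file averages are made-up inputs, the function names are hypothetical, and sending anything above 1 billion straight to partitions is a simplification of the rules of thumb given in this reply, not something I have measured.

```python
# Rough sizing sketch, not a benchmark. The thresholds are the rules of
# thumb from this reply; the docs-per-file averages are invented inputs.

SINGLE_INDEX_LIMIT = 200_000_000     # below this: one index on a decent box
BEEFIER_HOST_LIMIT = 1_000_000_000   # up to this: beefier host + Lucene 4 (a guess)


def estimate_total_docs(num_files: int, avg_docs_per_file: float) -> int:
    """Approximate the total as files * average documents per file."""
    return int(num_files * avg_docs_per_file)


def suggest_layout(total_docs: int) -> str:
    """Map a document count onto the rules of thumb (simplified)."""
    if total_docs < SINGLE_INDEX_LIMIT:
        return "single index"
    if total_docs <= BEEFIER_HOST_LIMIT:
        return "single index on a beefier host"
    # Assumption: beyond 1 billion we fall back to a handful of partitions.
    return "partitioned index (5-10 partitions)"


if __name__ == "__main__":
    # Hypothetical averages; replace with numbers measured on real files.
    for avg in (1_000, 20_000, 100_000):
        total = estimate_total_docs(20_000, avg)
        print(f"{avg:>7} docs/file -> {total:>13,} docs -> {suggest_layout(total)}")
```

The point is only that the average number of documents per file decides which bucket you land in, so it is worth measuring on a sample of real files first.
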
From my understanding, your queries will just be lookups by document ID. (Q: are those IDs unique across files, or do you need to filter by filename as well?) If that will be the only use case, then maybe you should consider some other lookup system; an Ehcache store offloaded to and persisted on disk might work just as well.

If you are anywhere below 200 million documents, I'd say you should go with a single index that contains all the data, on a decent box (2-4 CPUs, 4-8 GB RAM). On a slightly beefier host with Lucene 4 (try various codecs for speed/memory usage), I think you could go up to 1 billion documents.

If you plan on more complex queries, such as "given a position in a file, identify the document that contains it", then the number of documents should be reconsidered. In the worst-case scenario I would go with a partitioned index (5-10 partitions, but not thousands).

On Tue, Dec 6, 2011 at 11:03, Rui Wang <rw...@ebi.ac.uk> wrote:

> Hi Guys,
>
> Thank you very much for your answers.
>
> I will do some profiling on memory usage, but is there any documentation
> on how Lucene uses/allocates the memory?
>
> Best wishes,
> Rui Wang
>
>
> On 6 Dec 2011, at 06:11, KARTHIK SHIVAKUMAR wrote:
>
> > hi
> >
> >>> would the memory usage go through the roof?
> >
> > Yup ....
> >
> > My past experience got me pickels in there...
> >
> > with regards
> > karthik
> >
> > On Mon, Dec 5, 2011 at 11:28 PM, Rui Wang <rw...@ebi.ac.uk> wrote:
> >
> >> Hi All,
> >>
> >> We are planning to use lucene in our project, but not entirely sure about
> >> some of the design decisions were made. Below are the details, any
> >> comments/suggestions are more than welcome.
> >>
> >> The requirements of the project are below:
> >>
> >> 1. We have tens of thousands of files, their size ranging from 500M to a
> >> few terabytes, and majority of the contents in these files will not be
> >> accessed frequently.
> >>
> >> 2. We are planning to keep less accessed contents outside of our database,
> >> store them on the file system.
> >>
> >> 3. We also have code to get the binary position of these contents in the
> >> files. Using these binary positions, we can quickly retrieve the contents
> >> and convert them into our domain objects.
> >>
> >> We think Lucene provides a scalable solution for storing and indexing
> >> these binary positions, so the idea is that each piece of the content in
> >> the files will a document, each document will have at least an ID field to
> >> identify to content and a binary position field contains the starting and
> >> stop position of the content. Having done some performance testing, it
> >> seems to us that Lucene is well capable of doing this.
> >>
> >> At the moment, we are planning to create one Lucene index per file, so if
> >> we have new files to be added to the system, we can simply generate a new
> >> index. The problem is do with searching, this approach means that we need
> >> to create an new IndexSearcher every time a file is accessed through our
> >> web service. We knew that it is rather expensive to open a new
> >> IndexSearcher, and are thinking of using some kind of pooling mechanism.
> >> Our questions are:
> >>
> >> 1. Is this one index per file approach a viable solution? What do you
> >> think about pooling IndexSearcher?
> >>
> >> 2. If we have many IndexSearchers opened at the same time, would the
> >> memory usage go through the roof? I couldn't find any document on how
> >> Lucene use allocate memory.
> >>
> >> Thank you very much for your help.
> >>
> >> Many thanks,
> >> Rui Wang
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>
> >
> > --
> > *N.S.KARTHIK
> > R.M.S.COLONY
> > BEHIND BANK OF INDIA
> > R.M.V 2ND STAGE
> > BANGALORE
> > 560094*
>