Thanks for the suggestion. Yes, we are using "no_norms".
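For context, the memory figures quoted in the thread below (1 byte per doc per field for norms, ~60 bytes of per-String overhead, 625M documents, 6 fields) work out roughly as follows. This is a back-of-the-envelope sketch, not a measurement; the constants come from the advice in the thread, and actual FieldCache/norms footprints depend on the Lucene version:

```java
// Back-of-the-envelope memory estimates for the index discussed in this
// thread: 625M docs, 6 fields, sorting on a (mostly unique) timestamp.
public class MemoryEstimate {
    static final long DOCS = 625000000L;
    static final int FIELDS = 6;

    static double gb(long bytes) {
        return bytes / (1024.0 * 1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        // Norms: 1 byte per document per field with norms enabled.
        long normsBytes = DOCS * FIELDS;

        // Sorting on a numeric (long) timestamp: the FieldCache holds
        // one 8-byte long per document.
        long longSortBytes = DOCS * 8;

        // Sorting on a String timestamp: ~60 bytes of per-String
        // overhead, and nearly every value is unique.
        long stringSortBytes = DOCS * 60;

        System.out.printf("norms:       %.2f GB%n", gb(normsBytes));      // ~3.49 GB
        System.out.printf("long sort:   %.2f GB%n", gb(longSortBytes));   // ~4.66 GB
        System.out.printf("string sort: %.2f GB%n", gb(stringSortBytes)); // ~34.92 GB
    }
}
```

Note that the ~4.66 GB needed just to sort 625M long timestamps already exceeds the 4 GB heap mentioned later in the thread, which is consistent with the OutOfMemoryError being reported.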
-----Original Message-----
From: Mark Harwood [mailto:markharw...@yahoo.co.uk]
Sent: Tuesday, August 16, 2011 10:12 AM
To: java-user@lucene.apache.org
Subject: Re: What kind of System Resources are required to index 625 million row table...???

Check "norms" are disabled on your fields, because they'll cost you 1 byte x NumberOfDocs x numberOfFieldsWithNormsEnabled.

On 16 Aug 2011, at 15:11, Bennett, Tony wrote:
> Thank you for your response.
>
> You are correct, we are sorting on timestamp.
> Timestamp has microsecond granularity, and we are
> storing it as "NumericField".
>
> We are sorting on timestamp so that we can give our
> users the most "current" matches, since we are limiting
> the number of responses to about 1000. We are concerned
> that limiting the number of responses without sorting
> may give the user the "oldest" matches, which is not
> what they want.
>
> Your suggestion about reducing the granularity of the
> sort is interesting. We must retain the granularity
> of the original timestamp for index maintenance purposes,
> but we could add another field with a granularity of
> "date" instead of "date+time", which would be used for
> sorting only.
>
> -tony
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Tuesday, August 16, 2011 5:54 AM
> To: java-user@lucene.apache.org
> Subject: Re: What kind of System Resources are required to index 625 million row table...???
>
> About your OOM: Grant asked a question that's pretty important,
> namely how many unique terms are in the field(s) you sorted on? At a
> guess, you tried sorting on your timestamp, and your timestamp has
> millisecond or finer granularity, so there are 625M of them.
>
> Memory requirements for sorting grow with the number of *unique*
> terms, so you might be able to reduce the sorting requirements
> dramatically if you can use a coarser time granularity.
> And if you're storing your timestamp as a string type, that's
> even worse; there are 60 or so bytes of overhead for
> each string... see NumericField...
>
> And if you can't reduce the granularity of the timestamp, there
> are some interesting techniques for reducing the memory
> requirements of timestamps that you sort on that we can discuss...
>
> Luke can answer these questions if you point it at your index,
> but it may take a while to examine your index, so be patient.
>
> Best,
> Erick
>
> On Mon, Aug 15, 2011 at 5:55 PM, Bennett, Tony <bennett.t...@con-way.com> wrote:
>> Thanks for the quick response.
>>
>> As to your questions:
>>
>> Can you talk a bit more about what the search part of this is?
>> What are you hoping to get that you don't already have by adding in search?
>> Choices for fields can have impact on performance, memory, etc.
>>
>> We currently have an "exact match" search facility, which uses SQL.
>> We would like to add "text search" capabilities...
>> ...initially, having the ability to search the 229-character field for a
>> given word or phrase, instead of an exact match.
>> A future enhancement would be to add a synonym list.
>> As to "field choice", yes, it is possible that all fields would be involved
>> in the search...
>> ...in the interest of full disclosure, the fields are:
>> - corp  - corporation that owns the document
>> - type  - document type
>> - tmst  - creation timestamp
>> - xmlid - XML namespace ID
>> - tag   - metadata qualifier
>> - data  - actual metadata (example: carton of red 3-ring binders)
>>
>> Was this single threaded or multi-threaded? How big was the resulting
>> index?
>>
>> The search would be a threaded application.
>>
>> How big was the resulting index?
>>
>> The index that was built was 70 GB in size.
>>
>> Have you tried increasing the heap size?
>>
>> We have increased the heap up to 4 GB... on an 8 GB machine...
>> That's why we'd like a methodology for calculating memory requirements,
>> to see if this application is even feasible.
>>
>> Thanks,
>> -tony
>>
>> -----Original Message-----
>> From: Grant Ingersoll [mailto:gsing...@apache.org]
>> Sent: Monday, August 15, 2011 2:33 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: What kind of System Resources are required to index 625 million row table...???
>>
>> On Aug 15, 2011, at 2:39 PM, Bennett, Tony wrote:
>>
>>> We are examining the possibility of using Lucene to provide Text Search
>>> capabilities for a 625 million row DB2 table.
>>>
>>> The table has 6 fields, all of which must be stored in the Lucene index.
>>> The largest column is 229 characters; the others are 8, 12, 30, and 1...
>>> ...with an additional column that is an 8-byte integer (i.e. a 'C' long long).
>>
>> Can you talk a bit more about what the search part of this is? What are you
>> hoping to get that you don't already have by adding in search? Choices for
>> fields can have impact on performance, memory, etc.
>>
>>> We have written a test app on a development system (AIX 6.1),
>>> and have successfully indexed 625 million rows...
>>> ...which took about 22 hours.
>>
>> Was this single threaded or multi-threaded? How big was the resulting index?
>>
>>> When writing the "search" application... we find a simple version works;
>>> however, if we add a Filter or a "sort" to it... we get an "out of memory"
>>> exception.
>>
>> How many terms do you have in your index and in the field you are
>> sorting/filtering on? Have you tried increasing the heap size?
>>
>>> Before continuing our research, we'd like to find a way to determine
>>> what system resources are required to run this kind of application...???
>>
>> I don't know that there is a straightforward answer here with the
>> information you've presented. It can depend on how you intend to
>> search/sort/filter/facet, etc.
>> The general rule of thumb is that when you get over 100M documents you
>> need to shard, but you also have pretty small documents, so your mileage
>> may vary. I've seen indexes in your range on a single machine (for small
>> docs) with low search volumes, but that isn't to say it will work for you
>> without more insight into your documents, etc.
>>
>>> In other words, how do we calculate the memory needs...???
>>>
>>> Have others created a similar sized index to run on a single "shared"
>>> server...???
>>
>> Off the cuff, I think you are pushing the capabilities of doing this on a
>> single machine, especially the one you have spec'd out below.
>>
>>> Current Environment:
>>>
>>> Lucene Version: 3.2
>>> Java Version: J2RE 6.0 IBM J9 2.4 AIX ppc64-64 build jvmap6460-20090215_29883
>>>               (i.e. 64-bit Java 6)
>>> OS: AIX 6.1
>>> Platform: PPC (IBM P520)
>>> Cores: 2
>>> Memory: 8 GB
>>> JVM memory: -Xms4072m -Xmx4072m
>>>
>>> Any guidance would be greatly appreciated.
>>>
>>> -tony
>>
>> --------------------------------------------
>> Grant Ingersoll
>> Lucid Imagination
>> http://www.lucidimagination.com
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
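A footnote on the coarser-granularity sort field discussed in the thread: one way to derive a day-resolution value from the existing microsecond timestamp is to integer-divide by the number of microseconds per day. The sketch below uses only the JDK; the field name "tmst_day" is made up for illustration, and the Lucene indexing step is only described in the comments:

```java
// Sketch: deriving a day-granularity sort key from a microseconds-since-
// epoch timestamp, as suggested in the thread. Sorting on this value
// costs one cache entry per *unique day* instead of one per unique
// microsecond, which is what drives the memory requirement down.
public class DayGranularity {
    static final long MICROS_PER_DAY = 24L * 60 * 60 * 1000 * 1000;

    // Truncate a microsecond timestamp to whole days since the epoch.
    static long toDay(long timestampMicros) {
        return timestampMicros / MICROS_PER_DAY;
    }

    public static void main(String[] args) {
        long ts1 = 1313500000000000L; // two timestamps on the same day
        long ts2 = 1313510000000000L;
        System.out.println(toDay(ts1) == toDay(ts2)); // prints "true"
        // With Lucene 3.x this value would be indexed as a separate
        // NumericField (e.g. "tmst_day") used only for sorting, while
        // the full-precision "tmst" field is kept for index maintenance.
    }
}
```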