Thanks for the suggestion. Yes, we are using "no_norms".
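For context, the memory figures quoted in the thread below (1 byte per doc per field for norms, ~60 bytes of per-String overhead, 625M documents, 6 fields) work out roughly as follows. This is a back-of-the-envelope sketch, not a measurement; the constants come from the advice in the thread, and actual FieldCache/norms footprints depend on the Lucene version:

```java
// Back-of-the-envelope memory estimates for the index discussed in this
// thread: 625M docs, 6 fields, sorting on a (mostly unique) timestamp.
public class MemoryEstimate {
    static final long DOCS = 625000000L;
    static final int FIELDS = 6;

    static double gb(long bytes) {
        return bytes / (1024.0 * 1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        // Norms: 1 byte per document per field with norms enabled.
        long normsBytes = DOCS * FIELDS;

        // Sorting on a numeric (long) timestamp: the FieldCache holds
        // one 8-byte long per document.
        long longSortBytes = DOCS * 8;

        // Sorting on a String timestamp: ~60 bytes of per-String
        // overhead, and nearly every value is unique.
        long stringSortBytes = DOCS * 60;

        System.out.printf("norms:       %.2f GB%n", gb(normsBytes));      // ~3.49 GB
        System.out.printf("long sort:   %.2f GB%n", gb(longSortBytes));   // ~4.66 GB
        System.out.printf("string sort: %.2f GB%n", gb(stringSortBytes)); // ~34.92 GB
    }
}
```

Note that the ~4.66 GB needed just to sort 625M long timestamps already exceeds the 4 GB heap mentioned later in the thread, which is consistent with the OutOfMemoryError being reported.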
-----Original Message-----
From: Mark Harwood [mailto:markharw...@yahoo.co.uk]
Sent: Tuesday, August 16, 2011 10:12 AM
To: java-user@lucene.apache.org
Subject: Re: What kind of System Resources are required to index 625 million row table...???

Check "norms" are disabled on your fields, because they'll cost you 1 byte x NumberOfDocs x numberOfFieldsWithNormsEnabled.

On 16 Aug 2011, at 15:11, Bennett, Tony wrote:
> Thank you for your response.
>
> You are correct, we are sorting on timestamp.
> Timestamp has microsecond granularity, and we are
> storing it as "NumericField".
>
> We are sorting on timestamp so that we can give our
> users the most "current" matches, since we are limiting
> the number of responses to about 1000. We are concerned
> that limiting the number of responses without sorting
> may give the user the "oldest" matches, which is not
> what they want.
>
> Your suggestion about reducing the granularity of the
> sort is interesting. We must retain the granularity
> of the original timestamp for index maintenance purposes,
> but we could add another field with a granularity of
> "date" instead of "date+time", which would be used for
> sorting only.
>
> -tony
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Tuesday, August 16, 2011 5:54 AM
> To: java-user@lucene.apache.org
> Subject: Re: What kind of System Resources are required to index 625 million row table...???
>
> About your OOM: Grant asked a question that's pretty important,
> namely how many unique terms are in the field(s) you sorted on? At a
> guess, you tried sorting on your timestamp, and your timestamp has
> millisecond or finer granularity, so there are 625M of them.
>
> Memory requirements for sorting grow with the number of *unique*
> terms, so you might be able to reduce the sorting requirements
> dramatically if you can use a coarser time granularity.
> And if you're storing your timestamp as a string type, that's
> even worse; there are 60 or so bytes of overhead for
> each string... see NumericField...
>
> And if you can't reduce the granularity of the timestamp, there
> are some interesting techniques for reducing the memory
> requirements of timestamps that you sort on that we can discuss...
>
> Luke can answer these questions if you point it at your index,
> but it may take a while to examine your index, so be patient.
>
> Best,
> Erick
>
> On Mon, Aug 15, 2011 at 5:55 PM, Bennett, Tony <bennett.t...@con-way.com> wrote:
>> Thanks for the quick response.
>>
>> As to your questions:
>>
>> Can you talk a bit more about what the search part of this is?
>> What are you hoping to get that you don't already have by adding in search?
>> Choices for fields can have impact on performance, memory, etc.
>>
>> We currently have an "exact match" search facility, which uses SQL.
>> We would like to add "text search" capabilities...
>> ...initially, having the ability to search the 229-character field for a
>> given word or phrase, instead of an exact match.
>> A future enhancement would be to add a synonym list.
>> As to "field choice", yes, it is possible that all fields would be involved
>> in the search...
>> ...in the interest of full disclosure, the fields are:
>> - corp  - corporation that owns the document
>> - type  - document type
>> - tmst  - creation timestamp
>> - xmlid - XML namespace ID
>> - tag   - metadata qualifier
>> - data  - actual metadata (example: carton of red 3-ring binders)
>>
>> Was this single threaded or multi-threaded? How big was the resulting
>> index?
>>
>> The search would be a threaded application.
>>
>> How big was the resulting index?
>>
>> The index that was built was 70 GB in size.
>>
>> Have you tried increasing the heap size?
>>
>> We have increased the heap up to 4 GB... on an 8 GB machine...
>> That's why we'd like a methodology for calculating memory requirements,
>> to see if this application is even feasible.
>>
>> Thanks,
>> -tony
>>
>> -----Original Message-----
>> From: Grant Ingersoll [mailto:gsing...@apache.org]
>> Sent: Monday, August 15, 2011 2:33 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: What kind of System Resources are required to index 625 million row table...???
>>
>> On Aug 15, 2011, at 2:39 PM, Bennett, Tony wrote:
>>
>>> We are examining the possibility of using Lucene to provide Text Search
>>> capabilities for a 625 million row DB2 table.
>>>
>>> The table has 6 fields, all of which must be stored in the Lucene index.
>>> The largest column is 229 characters; the others are 8, 12, 30, and 1...
>>> ...with an additional column that is an 8-byte integer (i.e. a 'C' long long).
>>
>> Can you talk a bit more about what the search part of this is? What are you
>> hoping to get that you don't already have by adding in search? Choices for
>> fields can have impact on performance, memory, etc.
>>
>>> We have written a test app on a development system (AIX 6.1),
>>> and have successfully indexed 625 million rows...
>>> ...which took about 22 hours.
>>
>> Was this single threaded or multi-threaded? How big was the resulting index?
>>
>>> When writing the "search" application... we find a simple version works;
>>> however, if we add a Filter or a "sort" to it... we get an "out of memory"
>>> exception.
>>
>> How many terms do you have in your index and in the field you are
>> sorting/filtering on? Have you tried increasing the heap size?
>>
>>> Before continuing our research, we'd like to find a way to determine
>>> what system resources are required to run this kind of application...???
>>
>> I don't know that there is a straightforward answer here with the
>> information you've presented. It can depend on how you intend to
>> search/sort/filter/facet, etc.
>> The general rule of thumb is that when you get over 100M documents you
>> need to shard, but you also have pretty small documents, so your mileage
>> may vary. I've seen indexes in your range on a single machine (for small
>> docs) with low search volumes, but that isn't to say it will work for you
>> without more insight into your documents, etc.
>>
>>> In other words, how do we calculate the memory needs...???
>>>
>>> Have others created a similar sized index to run on a single "shared"
>>> server...???
>>
>> Off the cuff, I think you are pushing the capabilities of doing this on a
>> single machine, especially the one you have spec'd out below.
>>
>>> Current Environment:
>>>
>>> Lucene Version: 3.2
>>> Java Version: J2RE 6.0 IBM J9 2.4 AIX ppc64-64 build jvmap6460-20090215_29883
>>>               (i.e. 64-bit Java 6)
>>> OS: AIX 6.1
>>> Platform: PPC (IBM P520)
>>> Cores: 2
>>> Memory: 8 GB
>>> JVM memory: -Xms4072m -Xmx4072m
>>>
>>> Any guidance would be greatly appreciated.
>>>
>>> -tony
>>
>> --------------------------------------------
>> Grant Ingersoll
>> Lucid Imagination
>> http://www.lucidimagination.com
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
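A footnote on the coarser-granularity sort field discussed in the thread: one way to derive a day-resolution value from the existing microsecond timestamp is to integer-divide by the number of microseconds per day. The sketch below uses only the JDK; the field name "tmst_day" is made up for illustration, and the Lucene indexing step is only described in the comments:

```java
// Sketch: deriving a day-granularity sort key from a microseconds-since-
// epoch timestamp, as suggested in the thread. Sorting on this value
// costs one cache entry per *unique day* instead of one per unique
// microsecond, which is what drives the memory requirement down.
public class DayGranularity {
    static final long MICROS_PER_DAY = 24L * 60 * 60 * 1000 * 1000;

    // Truncate a microsecond timestamp to whole days since the epoch.
    static long toDay(long timestampMicros) {
        return timestampMicros / MICROS_PER_DAY;
    }

    public static void main(String[] args) {
        long ts1 = 1313500000000000L; // two timestamps on the same day
        long ts2 = 1313510000000000L;
        System.out.println(toDay(ts1) == toDay(ts2)); // prints "true"
        // With Lucene 3.x this value would be indexed as a separate
        // NumericField (e.g. "tmst_day") used only for sorting, while
        // the full-precision "tmst" field is kept for index maintenance.
    }
}
```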