Re: Sharding Techniques

2011-05-13 Thread Toke Eskildsen
On Fri, 2011-05-13 at 12:11 +0200, Samarendra Pratap wrote:
> Comparison between - single index Vs 21 indexes
> Total Size - 18 GB
> Queries run - 500
> % improvement - roughly 18%
I was expecting a lot more. Could you test whether this is an IO-issue by selecting a slow query and performing the e

Re: Sharding Techniques

2011-05-13 Thread Samarendra Pratap
Hi Tom, Thanks for pointing me to something important (phrase queries) which I wasn't thinking of. We are using synonyms which get expanded at run time. I'll have to give it a thought. We are not using synonyms at indexing time because that would make it hard to change the list. We are not using

RE: Sharding Techniques

2011-05-12 Thread Burton-West, Tom
Hi Samar, Have you looked at top or iostat or other monitoring utilities to see if you are cpu bound vs I/O bound? With 225 term queries, it's possible that you are I/O bound. I suspect you need to think about seek time and caching. For each unique field:term combination lucene has to look up

Re: Sharding Techniques

2011-05-11 Thread Ian Lea
I'm sure that you should try building one large index and convert to NumericField wherever you can. I'm convinced that will be faster - but as ever, the proof will be in the numbers. On repeated terms, I believe that lucene will search multiple times. If so, I'd guess it is just something that ha

Re: Sharding Techniques

2011-05-11 Thread Ian Lea
ion to the file will not cause more IO as it has to skip > those bytes and write it at the end of file. > > Regards > Ganesh > > - Original Message - > From: "Burton-West, Tom" > To: > Sent: Tuesday, May 10, 2011 9:46 PM > Subject: RE: Sharding T

Re: Sharding Techniques

2011-05-11 Thread Samarendra Pratap
Hi Tom, The more responses I get in this thread, the more I feel that our application needs optimization. 350 GB and less than 2 seconds!!! That's much more than my expectation :-) (in current scenario). *characteristics of slow queries:* there are a few reasons for greater search time

Re: Sharding Techniques

2011-05-10 Thread Ganesh
in GB's. Small addition or deletion to the file will not cause more IO as it has to skip those bytes and write it at the end of file. Regards Ganesh - Original Message - From: "Burton-West, Tom" To: Sent: Tuesday, May 10, 2011 9:46 PM Subject: RE: Sharding Techni

RE: Sharding Techniques

2011-05-10 Thread Burton-West, Tom
Hi Samar,
>> Normal queries go fine under 500 ms but when people start searching "anything" some queries take up to > 100 seconds. Don't you think distributing smaller indexes on different machines would reduce the average search time. (Although I have a feeling that search time for smaller

Re: Sharding Techniques

2011-05-10 Thread Samarendra Pratap
Hi Mike, *"I think the usual approach is to create multiple mirrored copies (slaves) rather than sharding"* This is what caught my eye. We do have mirrors, and in fact a good number of them. 6 servers are being used for serving regular queries (2 are for specific queries that do take time) and e

Re: Sharding Techniques

2011-05-10 Thread Mike Sokolov
Down to basics, Lucene searches work by locating terms and resolving documents from them. For standard term queries, a term is located by a process akin to binary search. That means that it uses log(n) seeks to get the term. Let's say you have 10M terms in your corpus. If you stored that in a si
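
The log(n) argument above can be put into rough numbers. The following is a minimal sketch, not Lucene code: it counts worst-case binary-search steps for one term lookup, using the 10M-term figure from the message. The function names, the shard count of 21 (taken from elsewhere in this thread) and the assumption that terms split evenly across shards are illustrative only; in practice many terms occur in every shard, so the sharded cost is likely to be higher than shown.

```python
import math


def seeks_single(total_terms: int) -> int:
    """Worst-case binary-search steps to locate one term in a single index."""
    return math.ceil(math.log2(total_terms))


def seeks_sharded(total_terms: int, shards: int) -> int:
    """Worst-case steps when the same term must be located in every shard.

    Assumption (optimistic): terms are split evenly, so each shard's
    term dictionary holds total_terms / shards entries.
    """
    terms_per_shard = math.ceil(total_terms / shards)
    return shards * math.ceil(math.log2(terms_per_shard))


if __name__ == "__main__":
    n = 10_000_000
    print("single index:", seeks_single(n))       # 24
    print("21 shards:   ", seeks_sharded(n, 21))  # 399
```

Under these assumptions each shard lookup is only slightly cheaper than the single-index lookup, while the number of lookups grows with the number of shards.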

Re: Sharding Techniques

2011-05-10 Thread Samarendra Pratap
Thanks to Johannes - I am looking into katta. Seems promising. to Toke - Great explanation. That's what I was looking for. I'll come back and share my experience. Thank you very much. On Tue, May 10, 2011 at 1:31 PM, Toke Eskildsen wrote: > On Mon, 2011-05-09 at 13:56 +0200, Samarendra Prata

Re: Sharding Techniques

2011-05-10 Thread Toke Eskildsen
On Mon, 2011-05-09 at 13:56 +0200, Samarendra Pratap wrote: > We have an index directory of 30 GB which is divided into 3 subdirectories > (idx1, idx2, idx3) which are again divided into 21 sub-subdirectories > (idx1-1, idx1-2, , idx2-1, , idx3-1, , idx3-21). So each part is about ½ G

Re: Sharding Techniques

2011-05-10 Thread Johannes Zillmann
On May 10, 2011, at 9:42 AM, Samarendra Pratap wrote: > Hi, > Though we have 30 GB total index, size of the indexes that are used > in 75%-80% searches is 5 GB. and we have average search time around 700 ms. > (yes, we have optimized index). > > Could someone please throw some light on my origin

Re: Sharding Techniques

2011-05-10 Thread Samarendra Pratap
Hi, Though we have 30 GB total index, size of the indexes that are used in 75%-80% searches is 5 GB. and we have average search time around 700 ms. (yes, we have optimized index). Could someone please throw some light on my original doubt!!! If I want to keep smaller indexes on different servers

Re: Sharding Techniques

2011-05-09 Thread Ganesh
We are using a similar technique to yours. We keep smaller indexes and use ParallelMultiSearcher to search across the indexes. Keeping smaller indexes is good as indexing and index optimization would be faster. There will be a small delay while searching across the indexes. 1. What is your search time?

Re: Sharding Techniques

2011-05-09 Thread Ian Lea
> ... > 1. I've not tested my application with single index as initially (a few > years back) we thought smaller the index size (7 indexes for default 80% > searches) the faster the search time would be ... Possibly. Maybe it will be acceptable to make some searches a bit slower in order to make

Re: Sharding Techniques

2011-05-09 Thread Samarendra Pratap
Hi Ian, Thanks for sharing your knowledge and to-the-point answers. 1. I've not tested my application with a single index, as initially (a few years back) we thought the smaller the index size (7 indexes for default 80% searches) the faster the search time would be. Anyway I'll give it a try and share t

Re: Sharding Techniques

2011-05-09 Thread Ian Lea
30 GB isn't that big by Lucene standards. Have you considered or tried just having one large index? If necessary you could restrict searches to particular "indexes", or groups thereof, by a field in the combined index, preferably used as a filter. If the slow searches have to search across 63 sep
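
The idea of one combined index with a field that records the original "index" can be illustrated with a toy inverted index. This is a sketch only and does not use the Lucene API: the field name `part`, the sample documents and the `search` helper are all hypothetical. It shows the shape of the approach, where the filter is a set of allowed document ids that is intersected with the hits for the query term.

```python
from collections import defaultdict

# Hypothetical documents; "part" records which former sub-index each came from.
docs = [
    {"id": 0, "part": "idx1", "text": "lucene sharding"},
    {"id": 1, "part": "idx1", "text": "index size"},
    {"id": 2, "part": "idx2", "text": "lucene index"},
    {"id": 3, "part": "idx3", "text": "lucene filter"},
]


def build_postings(documents, field):
    """Map each whitespace-separated token of `field` to the set of doc ids."""
    postings = defaultdict(set)
    for doc in documents:
        for token in doc[field].split():
            postings[token].add(doc["id"])
    return postings


text_postings = build_postings(docs, "text")
part_postings = build_postings(docs, "part")


def search(term, parts=None):
    """Return sorted doc ids matching `term`, optionally limited to `parts`."""
    hits = text_postings.get(term, set())
    if parts is not None:
        # The "filter": union of doc ids belonging to the allowed parts.
        allowed = set().union(*(part_postings.get(p, set()) for p in parts))
        hits = hits & allowed
    return sorted(hits)


if __name__ == "__main__":
    print(search("lucene"))                    # [0, 2, 3]
    print(search("lucene", ["idx2", "idx3"]))  # [2, 3]
```

Because the allowed set depends only on the chosen parts and not on the query, it can be computed once and reused across searches, which is the usual reason for preferring a filter over an extra query clause.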