Single filter instance with different searchers

2010-11-03 Thread Samarendra Pratap
Hi. We have a large index (~28 GB) which is split across three different
directories, each representing a country. Each of these country-wise indexes
is further split, on the basis of last update date, into 21 smaller
indexes. The index is updated once a day.

A user can search within any one country and can choose a last-update-date
range plus some other criteria.

When the server application starts, index readers (and hence searchers) are
created for each of the small indexes (21 x 3) and put in an array.
Depending on the options (country and last update date) chosen by the user we
pick the searchers for the matching date range/country and create a new
ParallelMultiSearcher instance.
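
For reference, the per-request construction looks roughly like this (a sketch only,
Lucene 2.9 style; the class and method names are just for illustration, not our real code):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ParallelMultiSearcher;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.search.Searcher;

class SearcherFactory
{
    // "picked" is the subset of the startup-time IndexReaders matching
    // the chosen country / last update date
    static Searcher searcherFor(IndexReader[] picked) throws IOException
    {
        Searchable[] searchers = new Searchable[picked.length];
        for (int i = 0; i < picked.length; i++)
            searchers[i] = new IndexSearcher(picked[i]); // cheap; the costly reader is shared
        return new ParallelMultiSearcher(searchers);
    }
}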

Now my question is - can I use a single filter (caching filter) instance for
every search (possibly on different searchers)?
===

E.g.
For the first search I create a filter for experience = 4 years and save it.

If another search, for a different country (and hence a different index), also
has the same experience criterion, i.e. 4 years, can I use the same filter
instance for the second search too?

I have tested this a little and, surprisingly, I got correct results.
I was wondering if this is the correct way, or do I need to create different
filters for each searcher (or index reader) instance?
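
To make the question concrete, this is roughly the kind of reuse I mean (only a sketch,
Lucene 2.9 style; the field name "experience" and the values are just examples, not our
real schema):

import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

class FilterReuseExample
{
    static void searchBoth(Searcher searcherCountry1, Searcher searcherCountry2, Query query)
            throws IOException
    {
        // built once, for the first search, and saved
        Filter experienceFilter = new CachingWrapperFilter(
                new QueryWrapperFilter(new TermQuery(new Term("experience", "4"))));

        // the very same filter instance used with searchers over different indexes
        TopDocs hits1 = searcherCountry1.search(query, experienceFilter, 50);
        TopDocs hits2 = searcherCountry2.search(query, experienceFilter, 50);
    }
}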

Thanks in advance.

-- 
Regards,
Samar


Re: Single filter instance with different searchers

2010-11-08 Thread Samarendra Pratap
Hi Erick, thanks for the reply.
 Your answer has puzzled me more, because what I am seeing is not what you
describe, or I am not able to grasp your meaning.
 I have written a small program which does exactly what my original question
described. Here I am creating a CachingWrapperFilter on one index and reusing it
on the other indexes. This single filter gives me the expected results from each
of the indexes. I would appreciate it if you could throw some light on this.

I have given the output after the program ends


// following program is compiled with java6

import org.apache.lucene.index.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.search.*;
import org.apache.lucene.search.spans.*;
import org.apache.lucene.store.*;
import org.apache.lucene.document.*;
import org.apache.lucene.queryParser.*;
import org.apache.lucene.util.*;

import java.util.*;

public class FilterTest
{
    protected Directory[] dirs;
    protected Analyzer a;
    protected Searcher[] searchers;
    protected QueryParser qp;
    protected Filter f;
    protected Hashtable filters;

    public FilterTest()
    {
        // create analyzer
        a = new StandardAnalyzer(Version.LUCENE_29);
        // create query parser
        qp = new QueryParser(Version.LUCENE_29, "content", a);
        // initialize "filters" Hashtable
        filters = new Hashtable();
    }

    protected void createDirectories(int length)
    {
        // create specified number of RAM directories
        dirs = new Directory[length];
        for(int i=0;i<length;i++)
            dirs[i] = new RAMDirectory();
    }

    ...

Erick Erickson wrote:

> I'm assuming you're down in Lucene land. Unless somehow you've
> gotten 63 separate filters when you think you only have one, I don't
> think what you're doing will work. Or I'm failing to understand what
> you're doing at all.
>
> The problem is I expect each of your indexes starts with document
> 1. So your Filter is really a bit set keyed by Lucene document ID.
>
> So applying filter 2 to index 54 will NOT do what you want. What I
> suspect you're seeing is that applying your filter is producing enough
> results from index 54 (to continue my example) to fool you into
> thinking it's working.
>
> Try running the query with and without the filter on each of your indexes,
> perhaps as a control including a restrictive clause in the query
> to do the same thing your filter is doing. Or construct the filter new
> for comparison. If the numbers continue to be the same, I clearly
> don't understand something! 
>
> Best
> Erick
>
> On Wed, Nov 3, 2010 at 6:05 AM, Samarendra Pratap  >wrote:
>
> > Hi. We have a large index (~ 28 GB) which is distributed in three
> different
> > directories, each representing a country. Each of these country wise
> > indexes
> > is further distributed on the basis of last update date into 21 smaller
> > indexes. This index is updated once in a day.
> >
> > A user can search into any of one country and can choose last update date
> > plus some other criteria.
> >
> > When the server application starts, index readers and hence searchers are
> > created for each of the small indexes (21 x 3) and put in an array.
> > Depending on the option (country and last update date) chosen by user we
> > pick the searchers of correct date range/country and create a new
> > ParallelMultiSearcher instance.
> >
> > Now my question is - can I use single filter (caching filter) instance
> for
> > every search (may be on different searchers)?
> >
> >
> ===
> >
> > e.g
> > for first search i create an filter of experience 4 years and save it.
> >
> > if another search for a different country (and hence difference index)
> also
> > has same experience criteria, i.e. 4 years, can i use the same filter
> > instance for second search too?
> >
> > i have tested a little for this and surprisingly i have got correct
> > results.
> > i was wondering if this is the correct way. or do i need to create
> > different
> > filters for each searcher (or index reader) instance?
> >
> > Thanks in advance.
> >
> > --
> > Regards,
> > Samar
> >
>



-- 
Regards,
Samar

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: Single filter instance with different searchers

2010-11-08 Thread Samarendra Pratap
> doc.get("content")));
>}
>is.close();
>}
>private void log(String msg) {
>System.out.println(msg);
>}
>private void populateIndexes() throws IOException {
>popOne(_ram1);
>popOne(_ram2);
>popOne(_ram3);
>}
>
>private void popOne(Directory dir) throws IOException {
>IndexWriter iw = new IndexWriter(dir, _std, MaxFieldLength.LIMITED);
>Document doc = new Document();
>doc.add(new Field("content", "common " +
> Double.toString(Math.random()), Field.Store.YES, Field.Index.ANALYZED,
> Field.TermVector.YES));
>iw.addDocument(doc);
>
>doc = new Document();
>doc.add(new Field("content", "common " +
> Double.toString(Math.random()), Field.Store.YES, Field.Index.ANALYZED,
> Field.TermVector.YES));
>iw.addDocument(doc);
>
>iw.close();
>}
>
>
>Directory _ram1 = new RAMDirectory();
>Directory _ram2 = new RAMDirectory();
>Directory _ram3 = new RAMDirectory();
>Analyzer _std = new StandardAnalyzer(Version.LUCENE_29);
> }
>
> output
> where lid: ### is the Lucene doc ID returned in scoreDocs
> ***
>
> dumping ram1
> lid: 0, content: common 0.11100571422470962
> lid: 1, content: common 0.31555863707233567
> dumping ram2
> lid: 0, content: common 0.01235509997022377
> lid: 1, content: common 0.7017712652104814
> dumping ram3
> lid: 0, content: common 0.9472403989314128
> lid: 1, content: common 0.7105628402082196
> dumping multi
> lid: 0, content: common 0.11100571422470962
> lid: 1, content: common 0.31555863707233567
> lid: 2, content: common 0.01235509997022377
> lid: 3, content: common 0.7017712652104814
> lid: 4, content: common 0.9472403989314128
> lid: 5, content: common 0.7105628402082196
>
>
>
>
> On Mon, Nov 8, 2010 at 3:33 AM, Samarendra Pratap  >wrote:
>
> > Hi Erick, Thanks for the reply.
> >  Your answer have puzzled me more because what I am able to view is not
> > what you say or I am not able to grasp your meaning.
> >  I have written a small program which is exactly what my original
> question
> > was. Here I am creating a CachingWrapperFilter on one index and reusing
> it
> > on other indexes. This single filter gives me results as expected from
> each
> > of the index. I will appreciate if you can throw some light.
> >
> > I have given the output after the program ends
> >
> >
> >
> 
> > // following program is compiled with java6
> >
> > import org.apache.lucene.index.*;
> > import org.apache.lucene.analysis.*;
> > import org.apache.lucene.analysis.standard.*;
> > import org.apache.lucene.search.*;
> > import org.apache.lucene.search.spans.*;
> > import org.apache.lucene.store.*;
> > import org.apache.lucene.document.*;
> > import org.apache.lucene.queryParser.*;
> > import org.apache.lucene.util.*;
> >
> > import java.util.*;
> >
> > public class FilterTest
> > {
> > protected Directory[] dirs;
> >  protected Analyzer a;
> > protected Searcher[] searchers;
> > protected QueryParser qp;
> >  protected Filter f;
> > protected Hashtable filters;
> >
> >  public FilterTest()
> > {
> > // create analyzer
> >  a = new StandardAnalyzer(Version.LUCENE_29);
> > // create query parser
> > qp = new QueryParser(Version.LUCENE_29, "content", a);
> >  // initialize "filters" Hashtable
> > filters = new Hashtable();
> >  }
> >
> > protected void createDirectories(int length)
> > {
> >  // create specified number of RAM directories
> > dirs = new Directory[length];
> > for(int i=0;i<length;i++)
> > dirs[i] = new RAMDirectory();
> > }
> >
> > protected void createIndexes() throws Exception
> > {
> > /* create indexes for each directory.
> >  each index contains two documents.
> > every document contains one term, unique across all indexes, one term
> > unique across single index and one term common to all indexes
> >  */
> > for(int i=0;i<dirs.length;i++)
> > {
> >  IndexWriter iw = new IndexWriter(dirs[i], a, true,
> > IndexWriter.MaxFieldLength.LIMITED);
> >
> > Document d = new Document();
> >  // unique id across all indexes
> > d.add(new Field("id", ""+(i*2+1), Field.Store.YES,

Re: Single filter instance with different searchers

2010-11-09 Thread Samarendra Pratap
Thanks Erick for your insight.

I'd appreciate it if someone could throw more light on it.

Thanks

On Tue, Nov 9, 2010 at 11:27 PM, Erick Erickson wrote:

> I'm going to have to leave answering that to people with more
> familiarity with the underlying code than I have...
>
> That said, I'd #guess# that you'll be OK because I'd #guess# that
> filters are maintained on a per-reader basis and the results
> are synthesized when combined in a MultiSearcher.
>
> But that's all a guess
>
> Best
> Erick
>
> On Tue, Nov 9, 2010 at 2:48 AM, Samarendra Pratap  >wrote:
>
> > Thanks Erick, you cleared some of my confusions. But I still have a
> doubt.
> >
> >  As you can see in previous example code I am re-creating parallel multi
> > searcher for each search. (This is the actual scenario on production
> > servers)
> > The ParallelMultiSearcher constructor is taking different combination of
> > searchers each time. It means that the same document may be assigned a
> > different docid for next search.
> >
> > So my primary question is - Will the cached results from a filter created
> > with one multi searcher, work fine with another multi searcher?
> (Underlying
> > IndexSearchers are opened only once. It is combination of IndexSearchers
> > which is varying for each search.)
> >
> >  I have tested it with my real code and sample indexes and it gives me a
> > feeling that results are correct, but I am not able to understand how,
> > given
> > my above confusion.
> >
> > Can you suggest me a something with another curiosity - Which option will
> > be
> > more efficient - 1. MultiSearchers  (either recreating for each search or
> > reusing cached ones) with different searchers or 2. having a single index
> > for all last update date criteria and using filters for different
> > combinations of last update dates.
> >  As I wrote in my previous mail we have different physical indexes based
> on
> > different ranges of update dates. We select appropriate indexes based on
> > the
> > user selected options.
> >
> > On Tue, Nov 9, 2010 at 4:25 AM, Erick Erickson  > >wrote:
> >
> > > Ignore my previous, I thought you were constructing your own filters.
> > What
> > > you're doing should
> > > be OK.
> > >
> > > Here's the source of my confusion.  Each of your indexes has Lucene
> > > document
> > > IDs starting at
> > > 0. In your example, you have two docs/index. So, if you created a
> Filter
> > > via
> > > lower-level
> > > calls, it could not be applied across different indexes. See the
> > discussion
> > > here:
> > > http://www.gossamer-threads.com/lists/lucene/java-user/106376. That
> is,
> > > the bit in your Filter for index0, doc0 would be the same bit as in
> > index1,
> > > doc0.
> > >
> > > But, that's not what you are doing. The (Parallel)MultiSearcher takes
> > > care of mapping these doc IDs appropriately for you so you don't have
> to
> > > worry about
> > > what I was thinking about. Here's a program that illustrates this. It
> > > creates
> > > three RAMDirectories then  dumps the Lucene doc ID from each. Then it
> > > creates
> > > a multisearcher from the same three dirs and walks that, dumping the
> > Lucene
> > > doc ID.
> > > You'll see that the doc IDs change even though the contents are the
> > > same
> > >
> > > Again, though, this isn't a problem because you are using a
> > MultiSearcher,
> > > which
> > > takes care of this for you.
> > >
> > > Which is yet another reason to never, never, never count on lucene doc
> > IDs
> > > outside their context!
> > >
> > > Output at the end..
> > >
> > > import org.apache.lucene.analysis.Analyzer;
> > > import org.apache.lucene.analysis.standard.StandardAnalyzer;
> > > import org.apache.lucene.document.Document;
> > > import org.apache.lucene.document.Field;
> > > import org.apache.lucene.index.IndexWriter;
> > > import org.apache.lucene.search.*;
> > > import org.apache.lucene.store.Directory;
> > > import org.apache.lucene.store.RAMDirectory;
> > > import org.apache.lucene.util.Version;
> > >
> > > import java.io.IOException;
> > >
> > > import static org.apache.lucene.index.IndexWriter.*;
> > >
> > > public class EoeTest {

Re: asking about index verification tools

2010-11-16 Thread Samarendra Pratap
It is not guaranteed that every term will be indexed. There is a limit on the
maximum number of terms per field (as of Lucene 3.0, and maybe earlier too).
Check out
http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/index/IndexWriter.html#setMaxFieldLength(int)
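
For example, a sketch against the Lucene 2.9/3.0 IndexWriter API (the LIMITED default
caps each field at 10,000 terms):

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;

class UnlimitedTermsWriter
{
    static IndexWriter open(Directory dir, Analyzer analyzer) throws IOException
    {
        // MaxFieldLength.LIMITED indexes only the first 10,000 terms of a field;
        // anything beyond that is silently dropped. UNLIMITED removes the cap.
        IndexWriter iw = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
        // the limit can also be changed later on the writer itself
        iw.setMaxFieldLength(Integer.MAX_VALUE);
        return iw;
    }
}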

On Tue, Nov 16, 2010 at 11:36 AM, Yakob  wrote:

> hello all,
> I would like to ask about lucene index. I mean I created a simple
> program that created lucene indexes and stored it in a folder. also I
> had use a diagnostic tools name Luke to be able to lurk inside lucene
> index and find out its content. and I know that lucene is a standard
> framework when it come to building a search engine. but I just wanted
> to be sure that lucene indexes every term that existed in a file.
>
> I mean is there a way for me or some tools out there to verify that
> the index contains in lucene indexes is dependable? and not a single
> term went missing there?
>
> I know that this is subjective question but I just wanted to hear your
> two cents.
> thanks though. :-)
>
> tl;dr: how can we know that the index in lucene is correct?
>
> --
> http://jacobian.web.id
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


-- 
Regards,
Samar


Sharding Techniques

2011-05-09 Thread Samarendra Pratap
Hi list,
 We have an index directory of 30 GB which is divided into 3 subdirectories
(idx1, idx2, idx3), which are each again divided into 21 sub-subdirectories
(idx1-1, idx1-2, ..., idx2-1, ..., idx3-1, ..., idx3-21).

We are running with java 1.6, lucene 2.9 (going to upgrade to 3.1 very
soon), linux (fedora core - kernel 2.6.17-13.1), reiserfs.

We have almost 40 fields in each index (is it bad to have so many
fields?). Most of them are id-based fields.
We are using 8 servers for search, each of which receives approximately
3000 queries/hour in peak hours, and a search time of more than 1 second is
considered bad (is it really bad?) as per the business requirement.

For the past few months we have been experiencing issues (load and search time) on
our search servers, due to which I am looking into sharding techniques. Can
someone guide me or give me pointers where I can read more and test?

Keeping parts of the index on different servers, searching on all of them and then
merging the results - what could be the best approach?

Let me tell you that most queries use only 6-7 indexes and 4-5 fields (to
search in), but some queries (searching all the data) require all the
indexes and are the primary cause of the performance degradation.

Any suggestions/ideas are greatly appreciated. And furthermore, will
sharding (or something similar) really reduce search time? (Load is a less
severe issue when compared to search time.)


-- 
Regards,
Samar


Re: Sharding Techniques

2011-05-09 Thread Samarendra Pratap
Hi Ian,
 Thanks for sharing your knowledge and to-the-point answers.

1. I've not tested my application with a single index, as initially (a few
years back) we thought the smaller the index size (7 indexes for the default 80% of
searches) the faster the search time would be. Anyway, I'll give it a try and
share the experience.

2. For sharing/caching, we create index readers once when the server starts and
use these throughout the server's life (1 day). At search time, the
number of indexes to be read is decided by analyzing the search parameters.
IndexSearchers are created on the persistent IndexReaders and finally a
ParallelMultiSearcher is created from these IndexSearchers (I hope this is
not a problem, or is it???)

3. I had gone through the link you provided and some of the things are
already implemented (e.g. readOnly=true, NIOFSDirectory, optimizing, etc.).
We are using filters for some of the fields and caching those filters in
memory, through a hashtable.

Will reducing the number of tokens in a particular field of the index reduce the
search time (or CPU, memory, etc.)?

E.g. I have 11 documents and the tokens in a field (fld1) are
1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9 and 2.0.

The query is - fld1:[ 1.0 TO 2.0 ]

Would it make any difference if the tokens in the documents (in the same field)
were
1,1,1,1,1,1,1,1,1,2
??
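
What I am really trying to gauge is how many distinct terms such a range has to
enumerate, since (as far as I understand) that, plus the postings read for each term,
is where the work goes. A little helper I could use to check it (Lucene 2.9 API; just an
illustration, not our production code):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

class RangeTermCount
{
    // counts the distinct terms of a field that a string range like fld1:[1.0 TO 2.0]
    // has to enumerate; the range work is roughly proportional to this count
    static int countTermsInRange(IndexReader reader, String field,
                                 String lower, String upper) throws IOException
    {
        TermEnum te = reader.terms(new Term(field, lower));
        int count = 0;
        try
        {
            do
            {
                Term t = te.term();
                if (t == null || !t.field().equals(field) || t.text().compareTo(upper) > 0)
                    break;
                count++;
            } while (te.next());
        }
        finally
        {
            te.close();
        }
        return count;
    }
}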



On Mon, May 9, 2011 at 6:36 PM, Ian Lea  wrote:

> 30Gb isn't that big by lucene standards.  Have you considered or tried
> just having one large index?  If necessary you could restrict searches
> to particular "indexes", or groups thereof, by a field in the combined
> index, preferably used as a filter.  If the slow searches have to
> search across 63 separate indexes it is perhaps not surprising that
> they are slow.  What do you do about sharing or caching
> searcher/reader instances?  There are lots of useful tips on
> http://wiki.apache.org/lucene-java/ImproveSearchingSpeed.
>
> 40 fields isn't that many - should be fine.
>
> On sharding/scaling/etc,
>
> http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr
> looks well worth a read.
>
>
> --
> Ian.
>
> On Mon, May 9, 2011 at 12:56 PM, Samarendra Pratap 
> wrote:
> > Hi list,
> >  We have an index directory of 30 GB which is divided into 3
> subdirectories
> > (idx1, idx2, idx3) which are again divided into 21 sub-subdirectories
> > (idx1-1, idx1-2, , idx2-1, , idx3-1, , idx3-21).
> >
> > We are running with java 1.6, lucene 2.9 (going to upgrade to 3.1 very
> > soon), linux (fedora core - kernel 2.6.17-13.1), reiserfs.
> >
> > We have almost 40 fields in each index (is it a bad to have so many
> > fields?). most of them are id based fields.
> > We are using 8 servers for search, and each of which receives
> approximately
> > 3000/hour queries in peak hour and search time of more than 1 second is
> > considered bad (is it really bad?) as per the business requirement.
> >
> > Since past few months we are experiencing issues (load and search time)
> on
> > our search servers, due to which I am looking for sharding techniques.
> Can
> > someone guide or give me pointers where i can read more and test?
> >
> > Keeping parts of indexes on different servers search on all of them and
> then
> > merging the results - what could be the best approach?
> >
> > Let me tell you that most queries use only 6-7 indexes and 4 - 5 fields
> (to
> > search for) but some queries (searching all the data) require all the
> > indexes and are primary cause of the performance degradation.
> >
> > Any suggestions/ideas are greatly appreciated. And further more will
> > sharding (or similar thing) really reduce search time? (load is a less
> > severe issue when compared to search time)
> >
> >
> > --
> > Regards,
> > Samar
> >
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


-- 
Regards,
Samar


Re: Sharding Techniques

2011-05-10 Thread Samarendra Pratap
Hi,
 Though we have a 30 GB total index, the size of the indexes that are used
in 75%-80% of searches is 5 GB, and we have an average search time of around 700 ms
(yes, the index is optimized).

Could someone please throw some light on my original doubt!!!
If I want to keep smaller indexes on different servers so that CPU and
memory may be used better, how can I aggregate the results of a query from
each of the servers? One thing I know is RMI, which I studied a few years
back, but that was too slow (or so I thought at the time). What are the other
techniques?

Is 1 second a bad search time for the following?
total index size: 30 GB
index size used in 80% of searches - 5 GB
number of fields - 40
most of the fields are numeric fields
one big "contents" field with 500 - 1000 words
around 3500 queries/hour in peak hours
on average a query uses 7 fields (1 big, 6 small) with 25-30 tokens

Are there any benchmarks against which I can compare the performance of my
application? Or any approximate formula which can guide me in
calculating (using system parameters and index/search stats) the "best"
expected search time?

Thanks in advance

On Tue, May 10, 2011 at 9:59 AM, Ganesh  wrote:

> We are using similar technique as yours. We keep smaller indexes and use
> ParallelMultiSearcher to search across the index. Keeping smaller indexes is
> good as index and index optimzation would be faster.  There will be small
> delay while searching across the indexes.
>
> 1. What is your search time?
> 2. Is your index optimized?
>
> I have a doubt, If we keep the indexes size to 30 GB then each file size
> (fdt, fdx etc) would in GB's. Small addition or deletion to the file will
> not cause more IO as it has to skip those bytes and write it at the end of
> file.
>
> Regards
> Ganesh
>
>
>
> - Original Message -
> From: "Samarendra Pratap" 
> To: 
> Sent: Monday, May 09, 2011 5:26 PM
> Subject: Sharding Techniques
>
>
> > Hi list,
> > We have an index directory of 30 GB which is divided into 3
> subdirectories
> > (idx1, idx2, idx3) which are again divided into 21 sub-subdirectories
> > (idx1-1, idx1-2, , idx2-1, , idx3-1, , idx3-21).
> >
> > We are running with java 1.6, lucene 2.9 (going to upgrade to 3.1 very
> > soon), linux (fedora core - kernel 2.6.17-13.1), reiserfs.
> >
> > We have almost 40 fields in each index (is it a bad to have so many
> > fields?). most of them are id based fields.
> > We are using 8 servers for search, and each of which receives
> approximately
> > 3000/hour queries in peak hour and search time of more than 1 second is
> > considered bad (is it really bad?) as per the business requirement.
> >
> > Since past few months we are experiencing issues (load and search time)
> on
> > our search servers, due to which I am looking for sharding techniques.
> Can
> > someone guide or give me pointers where i can read more and test?
> >
> > Keeping parts of indexes on different servers search on all of them and
> then
> > merging the results - what could be the best approach?
> >
> > Let me tell you that most queries use only 6-7 indexes and 4 - 5 fields
> (to
> > search for) but some queries (searching all the data) require all the
> > indexes and are primary cause of the performance degradation.
> >
> > Any suggestions/ideas are greatly appreciated. And further more will
> > sharding (or similar thing) really reduce search time? (load is a less
> > severe issue when compared to search time)
> >
> >
> > --
> > Regards,
> > Samar
> >
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


-- 
Regards,
Samar


Re: Sharding Techniques

2011-05-10 Thread Samarendra Pratap
Thanks
 to Johannes - I am looking into katta. Seems promising.
 to Toke - Great explanation. That's what I was looking for.

 I'll come back and share my experience.
Thank you very much.


On Tue, May 10, 2011 at 1:31 PM, Toke Eskildsen wrote:

> On Mon, 2011-05-09 at 13:56 +0200, Samarendra Pratap wrote:
> >  We have an index directory of 30 GB which is divided into 3
> subdirectories
> > (idx1, idx2, idx3) which are again divided into 21 sub-subdirectories
> > (idx1-1, idx1-2, , idx2-1, , idx3-1, , idx3-21).
>
> So each part is about ½ GB in size? That gives you a serious logistic
> overhead. You state later that you only update the index once a day, so
> it would seem that you have no need for the fast update times that such
> small indexes give you. My guess is that you will get faster search
> times by using a single index.
>
>
> Down to basics, Lucene searches work by locating terms and resolving
> documents from them. For standard term queries, a term is located by a
> process akin to binary search. That means that it uses log(n) seeks to
> get the term. Let's say you have 10M terms in your corpus. If you stored
> that in a single field in a single index with a single segment, it would
> take log(10M) ~= 24 seeks to locate a term. This is of course very
> simplified.
>
> When you have 63 indexes, log(n) works against you. Even with the
> unrealistic assumption that the 10M terms are evenly distributed and
> without duplicates, the number of seeks for a search that hits all parts
> will still be 63 * log(10M/63) ~= 63 * 18 = 1134. And we haven't even
> begun to estimate the merging part.
>
> Due to caching, a seek is not equal to the storage being hit, but the
> probability for a storage hit rises with the number of seeks and the
> inevitable term duplicates when splitting the index.
>
> > We have almost 40 fields in each index (is it a bad to have so many
> > fields?). most of them are id based fields.
>
> Nah, our index is about 40GB with 100+ fields and 8M documents. We use a
> single index, optimized to 5 segments. Response times for raw searches
> are a few ms, while response times for the full package (heavy faceting)
> is generally below 300ms. Our queries are mostly simple boolean queries
> across 13 fields.
>
> > Keeping parts of indexes on different servers search on all of them and
> then
> > merging the results - what could be the best approach?
>
> Locate your bottleneck. Some well-placed log statements or a quick peek
> with visualvm (comes with the Oracle JVM) should help a lot.
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


-- 
Regards,
Samar


Re: Sharding Techniques

2011-05-10 Thread Samarendra Pratap
Hi Mike,
*"I think the usual approach is to create multiple mirrored copies (slaves)
rather than sharding"*
This is where my eyes stuck.

 We do have mirrors and in-fact a good number of those. 6 servers are being
used for serving regular queries (2 are for specific queries that do take
time) and each of them receives around 3-3.5 K queries per hour in peak
hours.

 The problem is that the interface being used by end users has a lot of
options plus a few text boxes where they can type up to 64 words each. (and
unfortunately i am not able to reduce these things as these are business
requirements)

 Normal queries go fine under 500 ms but when people start searching
"anything" some queries take up to > 100 seconds. Don't you think
distributing smaller indexes on different machines would reduce the average
search time. (Although I have a feeling that search time for smaller queries
may be slightly increased)


On Tue, May 10, 2011 at 6:32 PM, Mike Sokolov  wrote:

>
>  Down to basics, Lucene searches work by locating terms and resolving
>> documents from them. For standard term queries, a term is located by a
>> process akin to binary search. That means that it uses log(n) seeks to
>> get the term. Let's say you have 10M terms in your corpus. If you stored
>> that in a single field in a single index with a single segment, it would
>> take log(10M) ~= 24 seeks to locate a term. This is of course very
>> simplified.
>>
>> When you have 63 indexes, log(n) works against you. Even with the
>> unrealistic assumption that the 10M terms are evenly distributed and
>> without duplicates, the number of seeks for a search that hits all parts
>> will still be 63 * log(10M/63) ~= 63 * 18 = 1134. And we haven't even
>> begun to estimate the merging part.
>>
> This is true, but if the indexes are kept on 63 separate servers, those
> seeks will be carried out in parallel.  The OP did indicate his indexes
> would be on different servers, I think?  I still agree with your overall
> point - at this scale a single server is probably best.  And if there are
> performance issues, I think the usual approach is to create multiple
> mirrored copies (slaves) rather than sharding.  Sharding is useful for very
> large indexes: indexes to big to store on disk and cache in memory on one
> commodity box
>
> -Mike
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


-- 
Regards,
Samar


Re: Sharding Techniques

2011-05-11 Thread Samarendra Pratap
Hi Tom,
 The more responses I get in this thread, the more I feel that our
application needs optimization.

350 GB and less than 2 seconds!!! That's much more than my expectation :-)
(in the current scenario).

*Characteristics of slow queries:*
 There are a few reasons for the greater search time.

 1. Two of our fields contain decimal values but are not NumericFields :( .
These fields are searched as a range. Whenever the ranges are larger and/or
both of the fields are used in a search, the search time and server load go
high. I have already started work to convert them to NumericField - but
suggestions and experiences are most welcome.
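
The conversion I have started is roughly along these lines (a sketch with the Lucene 2.9
NumericField/NumericRangeQuery API; "fld1" is only a placeholder name):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;

class NumericFieldSketch
{
    // indexing: store the decimal value as a trie-encoded numeric field
    static void addDecimal(Document doc, double value)
    {
        doc.add(new NumericField("fld1", Field.Store.NO, true).setDoubleValue(value));
    }

    // searching: a numeric range instead of a string range over the same field
    static Query decimalRange(double lower, double upper)
    {
        return NumericRangeQuery.newDoubleRange("fld1", lower, upper, true, true);
    }
}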

2. When queries (without the two fields mentioned above) have a lot of
words/phrases, search time is high. E.g. I took a query with around 80 unique
terms (not words) in 5 fields. These terms occur repeatedly and total
225 terms (non-unique). This particular query took 4.2 seconds. The 15
indexes used for this query were of a total size of 5 GB.
Is 225 terms (80 unique) a very big number?

And yes, slow queries are always slow, but obviously high load adds
to their slowness.




Here I have another curiosity about something I noticed.
If I have a query like the following:


title:xyz title:xyz title:xyz title:xyz title:xyz title:xyz title:xyz
title:xyz title:xyz title:xyz title:xyz

*Will Lucene search for the term 11 times, or will it reuse the results of the
first term?*

If the latter is true (which I think it is), is there any particular reason, or could it
be optimized inside Lucene?
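
Independently of whether Lucene already reuses the work internally, this is the kind of
de-duplication I am considering doing on our side before searching (only a sketch,
assuming a flat BooleanQuery of term clauses):

import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;

class ClauseDedup
{
    // drops repeated (occur, query) clauses, e.g. eleven identical title:xyz terms
    static BooleanQuery dedupe(BooleanQuery original)
    {
        BooleanQuery result = new BooleanQuery();
        Set<String> seen = new HashSet<String>();
        for (BooleanClause clause : original.getClauses())
        {
            String key = clause.getOccur() + " " + clause.getQuery().toString();
            if (seen.add(key))
                result.add(clause.getQuery(), clause.getOccur());
        }
        return result;
    }
}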


On Tue, May 10, 2011 at 9:46 PM, Burton-West, Tom wrote:

> Hi Samar,
>
> >>Normal queries go fine under 500 ms but when people start searching
> >>"anything" some queries take up to > 100 seconds. Don't you think
> >>distributing smaller indexes on different machines would reduce the
> average
> >>.search time. (Although I have a feeling that search time for smaller
> queries
> >>may be slightly increased)
>
> What are the characteristics of your slow queries?  Can you give examples?
>   Are the slow queries always slow or only under heavy load?   What the
> bottleneck is and whether splitting into smaller indexes would help depends
> on just what your bottleneck is. It's not clear that your index is large
> enough that the size of the index is causing your bottleneck.
>
> We run indexes of about 350GB with average response times under 200ms and
> 99th percentile reponse times of under 2 seconds. (We have a very low qps
> rate however).
>
>
> Tom
>
>
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


-- 
Regards,
Samar


Re: Sharding Techniques

2011-05-13 Thread Samarendra Pratap
Hi Tom,
 Thanks for pointing me to something important (phrase queries) which I
wasn't thinking of.

 We are using synonyms which get expanded at run time. I'll have to give it
a thought.

 We are not using synonyms at indexing time due to the lack of flexibility in
changing the list. We are not using a synonym analyzer either, because of the
issues related to synonyms of varying word length (any comments?). These
synonyms are expanded at the time of query formation and they do contain
phrases, in fact a good number (if not a big one) of those.
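
For what it is worth, the query-time expansion is along these lines (a sketch only;
"ny" / "new york" are made-up synonyms, not our real list):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.TermQuery;

class SynonymExpansion
{
    // expand one user term into (term OR multi-word synonym phrase) at query-building time
    static BooleanQuery expand(String field)
    {
        BooleanQuery expanded = new BooleanQuery();
        expanded.add(new TermQuery(new Term(field, "ny")), BooleanClause.Occur.SHOULD);

        PhraseQuery synonym = new PhraseQuery(); // a multi-word synonym stays a phrase
        synonym.add(new Term(field, "new"));
        synonym.add(new Term(field, "york"));
        expanded.add(synonym, BooleanClause.Occur.SHOULD);
        return expanded;
    }
}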

I would also like to share the following results of initial testing:

Comparison between - single index vs 21 indexes
Total size - 18 GB
Queries run - 500
% improvement - roughly 18%

Guys here, however, were expecting more :-), but that's a good enough reason
to go for a single index.

(details of index and queries are there in the thread)


On Fri, May 13, 2011 at 12:18 AM, Burton-West, Tom wrote:

> Hi Samar,
>
> Have you looked at top or iostat or other monitoring utilities to see if
> you are cpu bound vs I/O bound?
>
> With 225 term queries, it's possible that you are I/O bound.
>
> I suspect you need to think about seek time and caching. For each unique
> field:term combination lucene has to look up the postings for that term in
> the index.  Additionally for any phrase, lucene has to additionally look up
> the positions data for each term in the phrase. (In our case phrase searches
> are very expensive as our positions (*prx) index is about 8 times as large
> as our frq index) So for 225 terms including some number of phrases, that is
> a lot of disk seeks.  To the extent that the terms are close together in the
> index and various buffer caches contain adjacent terms, you might not
> actually have 225 seeks, but I suspect there will still be a lot.
>
> Although Lucene implements a number of caches (and you should take a look
> at your cache hit ratios), Lucene depends on the OS disk cache to cache
> postings data for individual terms. Most unix/linux OS's use free memory for
> disk caching.  How much memory is available on the machine after the JVM
> gets it allocation?
>
> Have you considered running cache warming queries of your most frequent
> terms/phrases so that the data is in the OS disk cache?
>
> Tom
>
>
> >> When queries (without two fields mentioned above) have a lot of
> >>words/phrases search time is high. E.g I took a query with around 80
> unique
> >>terms (not words) in 5 fields. These terms occur repeatedly and become
> total
> >>225 terms (non-unique). This particular query took 4.2 seconds. the 15
> >>indexes used for this query were of total size 5 G.
> >>Are 225 terms (80 unique) is a very big number?
>
> -Original Message-
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


-- 
Regards,
Samar


Re: Rewriting an index without losing 'hidden' data

2011-05-17 Thread Samarendra Pratap
Hi, I know it is too late to answer the question (sorry Chris) but I thought
it could be useful to share things (even late).
I was just going through the mails and I realised that we did this a few
months back.

*Objective: To add a new field to existing index without re-writing the
whole index.*

We have an index ("primary index") to which we want to add a new field,  say
"tags".
Source of the data is database.

I am adding pseudo code here

Create an index "index 2" with just two fields "id" (which is also a unique
identifier in main index) and "tags" (keep it stored) from database (source
of data).

Open a new IndexWriter ("index 3")

Now run a loop over all the documents of "Primary Index" with increasing
order of doc-id
Get document of current doc-id (starting from zero)
 Find the value of "id" field
Search this value in in secondary index in the same ("id") field.  (or
directly get the document through IndexReader and termVector). You should
get only one document.
 If document is found
Add this document to "index 3"
If document is not found
 Add a blank document to "index 3" (to maintain the doc-id order)

(After the loop is finished, the doc-ids and fields of "primary index" and
"index 3" will be in order, i.e. document at doc id 5 in "index 3" and in
"primary index" would be representing the same document of the database with
different fields)

Open a *ParallelReader* (this is the key :-) ) and add both indexes
(the "primary index" and "index 3") one by one.
Open an IndexWriter and use addIndexes(IndexReader) to create a single
index.
The final index will contain the primary index plus the "tags" field. :-)
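
In code, my understanding of the steps is roughly the following (a sketch in Lucene
2.9/3.0 style; it assumes the primary index has no deletions and that "id" is a stored
field):

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.ParallelReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.store.Directory;

class AddFieldWithParallelReader
{
    // step 1: build "index 3" whose doc-ids line up with the primary index
    // (assumes the primary index has no deletions, so doc-ids 0..maxDoc-1 are all live)
    static void buildAlignedIndex(IndexReader primary, IndexReader index2,
                                  Directory index3Dir, Analyzer a) throws IOException
    {
        IndexWriter w3 = new IndexWriter(index3Dir, a, true, IndexWriter.MaxFieldLength.UNLIMITED);
        for (int docId = 0; docId < primary.maxDoc(); docId++)
        {
            String id = primary.document(docId).get("id"); // "id" must be a stored field
            TermDocs td = index2.termDocs(new Term("id", id));
            if (td.next())
                w3.addDocument(index2.document(td.doc())); // carries the stored "tags" field
            else
                w3.addDocument(new Document());            // blank doc keeps doc-ids aligned
            td.close();
        }
        w3.close();
    }

    // step 2: view primary + "index 3" side by side and write them out as one index
    static void merge(Directory primaryDir, Directory index3Dir,
                      Directory finalDir, Analyzer a) throws IOException
    {
        ParallelReader pr = new ParallelReader();
        pr.add(IndexReader.open(primaryDir));
        pr.add(IndexReader.open(index3Dir));
        IndexWriter out = new IndexWriter(finalDir, a, true, IndexWriter.MaxFieldLength.UNLIMITED);
        out.addIndexes(new IndexReader[] { pr });
        out.close();
        pr.close();
    }
}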


I request the list to comment if there could be any issue with that.


My question then follows -
I tried this with a NumericField (as "tags") but it didn't work.
My guess (excuse me for guessing without deeper investigation) is that this
is because NumericField is not a Field; it is an AbstractField.

Irrespective of the correctness of my guess, can someone give me a hint or
point me to something which can help me do the same process successfully
for NumericField as well?

I hope to hear from learned people.


On Fri, Apr 8, 2011 at 9:38 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> Unfortunately, updateDocument replaces the *entire* previous document
> with the new one.
>
> The ability to update a single indexed field (either replace that
> field entirely, or, change only certain token occurrences within it),
> while leaving all other indexed fields in the document unaffected, has
> been a long requested big missing feature in Lucene.  We call it
> "incremental field updates".
>
> There have been some healthy discussions on the dev list, that have
> worked out a good rough design (eg see
> http://markmail.org/thread/lsfjhpiblzymkfcn).  Also, recent
> improvements in how buffered deletes are handled should make it alot
> easier for updates to "piggyback" using that same packet stream
> approach.  So... I think there is hope some day that we'll get this
> into Lucene.
>
> Mike
>
> http://blog.mikemccandless.com
>
> On Fri, Apr 8, 2011 at 11:00 AM, Ian Lea  wrote:
> > Unfortunately you just can't do this.  Might be possible if all fields
> > were stored but evidently they are not in your index.  For unstored
> > fields, the Document object will not contain the data that was passed
> > in when the doc was originally added.
> >
> > I believe there might be a way of recreating some of the missing data
> > via TermFreqVector but that has always sounded dodgy and lossy to me.
> >
> > The safest way is to reindex, however painful it might be.  Maybe you
> > could take the opportunity to upgrade lucene at the same time!
> >
> >
> > --
> > Ian.
> >
> >
> > On Fri, Apr 8, 2011 at 3:44 PM, Chris Bamford
> >  wrote:
> >> Hi,
> >>
> >> I recently discovered that I need to add a single field to every
> document in an existing (very large) index.  Reindexing from scratch is not
> an option I want to consider right now, so I wrote a utility to add the
> field by rewriting the index - but this seemed to lose some of the fields
> (indexed, but not stored?).  In fact, it shrunk a 12Gb index down to 4.2Gb -
> clearly not what I wanted.  :-)
> >> What am I doing wrong?
> >>
> >> My technique was:
> >>
> >>  Analyzer analyser = new StandardAnalyzer();
> >>  IndexSearcher searcher = new IndexSearcher(indexPath);
> >>  IndexWriter indexWriter = new IndexWriter(indexPath, analyser);
> >>  Hits hits = matchAllDocumentsFromIndex(searcher);
> >>
> >>  for (int i=0; i < hits.length(); i++) {
> >>  Document doc = hits.doc(i);
> >>  String id = doc.get("unique-id");
> >>  doc.add(new Field("newField", newValue, Field.Store.YES,
> Field.Index.UN_TOKENIZED));
> >>  indexWriter.updateDocument(new Term("unique-id", id), doc);
> >>  }
> >>
> >>  searcher.close();
> >>  indexWriter.optimize();
> >>  indexWriter.close();
> >>
> >> Note that my matchAllDocumentsFromIndex() does get the right 

Re: about analyzer for searching location

2010-04-16 Thread Samarendra Pratap
Hi. I don't think you need a different analyzer. Read about
PhraseQuery.
If you are using the parse() method of QueryParser, enclose the searched string
in extra double quotes, which must obviously be escaped:

Query q = qp.parse("\"united states\"");


2010/4/15 Ian.huang 

> Hi All,
>
> I am implementing a search function for address by hibernate search which
> is based on lucene. The class definition as following:
>
> @Indexed
> public class Address implements Cloneable
> {
> @DocumentId
> private int id;
> @Field
> private String addrCountry;
> private String addrDesc;
> @Field
> private String addrLineOne;
> private String addrLineTwo;
> @Field
> private String addrCity;
> ..
>
> As you see, addrCountry, addrLineone and addrCity are fields for search. I
> am using default analyzer in index & search. So I think country name like
> United States would be indexed as two terms United, and states.
>
> In addition, during search, a search keyword like United states, or Salt
> lake city would be tokenized as two or three single words.
>
> As result, any address fields contain united, city would be returned. like
> United Kingdom, but actually I want to get a result of united states.
>
> My expected result as following:
>
> if someone searches for "united" it should return "united states" and
> "united kingdom".
>
> if someone searches for "united states" it should return "united states",
> and not "united kingdom".
>
> I hope the analyzer can generate term with multiple words. say, united
> states to united states. I think standardanalyzer would analyze united
> states to united and states?
>
> A different example: if search keyword is parking lot in Salt Lake City,
> the generated terms to search need to be: parking lot and Salt Lake City,
> not parking,lot,salt,lake and city.
>
> I wonder if any analyzer can help me to implement my requirement. It would
> be better to use dictionary based solution, then I can manage some search
> terms that could have multiple words.
>
> thanks
>
> Ian




-- 
Regards,
Samar


Re: about analyzer for searching location

2010-04-19 Thread Samarendra Pratap
Well... you are 50% right.

When you write

 Query q = qp.parse("\"united states\"");

it does search for two separate tokens, "united" and "states", but checks that
they occur sequentially. So the above search will look for documents
where the token "states" comes right after "united".

*Note* that since it checks tokens sequentially it may also find documents
where some non-tokenizable characters or stop words exist between "united"
and "states", e.g. *united and states* (here "and" is a stop word).

TermQuery would work the way you said in your reply, i.e. it would search for
a single token "united states", which is not what you want.



On Mon, Apr 19, 2010 at 3:33 PM, Ian.huang  wrote:

> Does a token of "united states" exist in index if using standard analyzer.
> My understanding is, united and states are separately stored in index, but
> not as "united states". So, if I build a query like Query q =
> qp.parse("\"united states\""); It would not return any result. Am I right?
>
> Ian
>
> --
> From: "Samarendra Pratap" 
> Sent: Friday, April 16, 2010 9:02 PM
> To: 
> Subject: Re: about analyzer for searching location
>
>  Hi. I don't think you need a different analyzer. Read about
>> PhraseQuery<
>> http://lucene.apache.org/java/2_9_0/api/core/org/apache/lucene/search/PhraseQuery.html
>> >.
>>
>> If you are using parse() method of QueryParser. Enclose the searched
>> string
>> in extra double quotes, which must obviously be escaped.
>>
>> Query q = qp.parse("\"united states\"");
>>
>>
>> 2010/4/15 Ian.huang 
>>
>>  Hi All,
>>>
>>> I am implementing a search function for address by hibernate search which
>>> is based on lucene. The class definition as following:
>>>
>>> @Indexed
>>> public class Address implements Cloneable
>>> {
>>> @DocumentId
>>> private int id;
>>> @Field
>>> private String addrCountry;
>>> private String addrDesc;
>>> @Field
>>> private String addrLineOne;
>>> private String addrLineTwo;
>>> @Field
>>> private String addrCity;
>>> ..
>>>
>>> As you see, addrCountry, addrLineone and addrCity are fields for search.
>>> I
>>> am using default analyzer in index & search. So I think country name like
>>> United States would be indexed as two terms United, and states.
>>>
>>> In addition, during search, a search keyword like United states, or Salt
>>> lake city would be tokenized as two or three single words.
>>>
>>> As result, any address fields contain united, city would be returned.
>>> like
>>> United Kingdom, but actually I want to get a result of united states.
>>>
>>> My expected result as following:
>>>
>>> if someone searches for "united" it should return "united states" and
>>> "united kingdom".
>>>
>>> if someone searches for "united states" it should return "united states",
>>> and not "united kingdom".
>>>
>>> I hope the analyzer can generate term with multiple words. say, united
>>> states to united states. I think standardanalyzer would analyze united
>>> states to united and states?
>>>
>>> A different example: if search keyword is parking lot in Salt Lake City,
>>> the generated terms to search need to be: parking lot and Salt Lake City,
>>> not parking,lot,salt,lake and city.
>>>
>>> I wonder if any analyzer can help me to implement my requirement. It
>>> would
>>> be better to use dictionary based solution, then I can manage some search
>>> terms that could have multiple words.
>>>
>>> thanks
>>>
>>> Ian
>>>
>>
>>
>>
>>
>> --
>> Regards,
>> Samar
>>
>>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


-- 
Regards,
Samar


Reopening a Searcher for each request

2010-04-22 Thread Samarendra Pratap
Greetings to all.
 I have read at so many places that we should not open a Searcher for each
request for the sake of performance, but I have always been wondering
whether it is actually Searcher or Reader?

 I have a group of index amounting to 23G which actually contains of
different index directories. The structure is something like following

Main directory
|
|_ country1
| |___ country1-time1 (actual index)
| |___ country1-time2 (actual index)
| |___ country1-time3 (actual index)
|
|_ country2
 |___ country2-time1 (actual index)
 |___ country2-time2 (actual index)
 |___ country2-time3 (actual index)

 When the application starts I open IndexReaders on all of the actual index
directories (country1-time1, country1-time2, ..., country2-time3) and keep
them in a pool.

 At search time, IndexSearchers are created by selecting the
appropriate IndexReaders from the pool. These IndexSearchers in turn are
used to create a ParallelMultiSearcher. The constructors of IndexSearcher and
ParallelMultiSearcher are run for every request.

 Now I believe that creating a pool of ParallelMultiSearchers itself is a
good idea, but *I wanted to know if reopening IndexSearchers will really
degrade performance, irrespective of the IndexReaders being opened once*.

In my performance tests (which may not be very comprehensive) I didn't find
any noticeable difference.

Please throw some light.


-- 
Regards,
Samar


Re: Reopening a Searcher for each request

2010-04-22 Thread Samarendra Pratap
Thanks Mike.
That solved a query which was itching my mind for a long time.

On Thu, Apr 22, 2010 at 4:41 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> It's the IndexReader that's costly to open/warm, so ideally it should
> be opened once and shared.
>
> The Searchers do very little on construction so re-creating per query
> should be OK.
>
> Mike
>
> On Thu, Apr 22, 2010 at 6:38 AM, Samarendra Pratap 
> wrote:
> > Greetings to all.
> >  I have read at so many places that we should not open a Searcher for
> each
> > request for the sake of performance, but I have always been wondering
> > whether it is actually Searcher or Reader?
> >
> >  I have a group of index amounting to 23G which actually contains of
> > different index directories. The structure is something like following
> >
> > Main directory
> > |
> > |_ country1
> > | |___ country1-time1 (actual index)
> > | |___ country1-time2 (actual index)
> > | |___ country1-time3 (actual index)
> > |
> > |_ country2
> > |___ country2-time1 (actual index)
> > |___ country2-time2 (actual index)
> > |___ country2-time3 (actual index)
> >
> >  When application starts I open IndexReaders on all of actual index
> > directories (country1-time1, country1-tim2,  country2-time3) and keep
> > them in a pool.
> >
> >  At the time of search, IndexSearchers are created by selecting the
> > appropriate IndexReaders from the pool. These IndexSearchers in turn are
> > used to create a ParallelMultiSearcher. Constructors of IndexSearcher and
> > ParallelMultiSearcher are run for every request.
> >
> >  Now I believe that creating a pool of ParallelMultiSearcher itself is a
> > good idea but* I wanted to know if reopening **IndexSearchers** will
> really
> > degrade performance irrespective of **IndexReaders** being opened once*.
> >
> > In my performance tests (which may not be very comprehensive) I didn't
> find
> > any noticeable difference.
> >
> > Please throw some light.
> >
> >
> > --
> > Regards,
> > Samar
> >
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


-- 
Regards,
Samar


Re: Reopening a Searcher for each request

2010-04-24 Thread Samarendra Pratap
No! It's not like this in my code. That code opens an IndexReader every time
newIndexSearcher() is called.

In my code it is something like this -


IndexReader[] irs;
// irs is a global array containing IndexReaders which are opened when the
// application starts
.
.
IndexSearcher[] getIndexSearchers(IndexReader[] irs)
{
    IndexSearcher[] iss = new IndexSearcher[irs.length];
    for(int i=0;i<irs.length;i++)
        iss[i] = new IndexSearcher(irs[i]);
    return iss;
}

...) throws IOException
{
    // by checking country and time related parameters, correct elements
    // are chosen from the complete array of IndexReaders (irs) to pass to the function above
    for(int i=0;i<...)
    {
        ...
        return (new
            ParallelMultiSearcher(getIndexSearchers((IndexReader[])irs[i])));
    }

    // ideally this code should never be executed
    return (new
        ParallelMultiSearcher(getIndexSearchers(prepareReaders(...))));
}



2010/4/24 Ivan Liu 

> like this?
>  public synchronized IndexSearcher newIndexSearcher() {
>  try {
> //   semaphore.acquire();
>   if (null == indexSearcher) {
>Directory directory = FSDirectory.open(new
> File(Config.DB_DIR+"/rssindex"));
>indexSearcher = new IndexSearcher(IndexReader.open(directory, true));
>   } else {
>IndexReader indexReader = indexSearcher.getIndexReader();
>IndexReader newIndexReader = indexReader.reopen();
>if (newIndexReader!=indexReader) {
>
> indexReader.close();
> indexSearcher.close();
>
>
> indexSearcher = new IndexSearcher(newIndexReader);
>}
>   }
>   return indexSearcher;
>  } catch (CorruptIndexException e) {
>   log.error(e.getMessage(),e);
>   return null;
>  } catch (IOException e) {
>   log.error(e.getMessage(),e);
>   return null;
>  }finally{
> //   semaphore.release();
>  }
>  }
>
> 2010/4/22 Samarendra Pratap 
>
> > Thanks Mike.
> > That solved a query which was itching my mind for a long time.
> >
> > On Thu, Apr 22, 2010 at 4:41 PM, Michael McCandless <
> > luc...@mikemccandless.com> wrote:
> >
> > > It's the IndexReader that's costly to open/warm, so ideally it should
> > > be opened once and shared.
> > >
> > > The Searchers do very little on construction so re-creating per query
> > > should be OK.
> > >
> > > Mike
> > >
> > > On Thu, Apr 22, 2010 at 6:38 AM, Samarendra Pratap <
> samarz...@gmail.com>
> > > wrote:
> > > > Greetings to all.
> > > >  I have read at so many places that we should not open a Searcher for
> > > each
> > > > request for the sake of performance, but I have always been wondering
> > > > whether it is actually Searcher or Reader?
> > > >
> > > >  I have a group of index amounting to 23G which actually contains of
> > > > different index directories. The structure is something like
> following
> > > >
> > > > Main directory
> > > > |
> > > > |_ country1
> > > > | |___ country1-time1 (actual index)
> > > > | |___ country1-time2 (actual index)
> > > > | |___ country1-time3 (actual index)
> > > > |
> > > > |_ country2
> > > > |___ country2-time1 (actual index)
> > > > |___ country2-time2 (actual index)
> > > > |___ country2-time3 (actual index)
> > > >
> > > >  When application starts I open IndexReaders on all of actual index
> > > > directories (country1-time1, country1-tim2,  country2-time3) and
> > keep
> > > > them in a pool.
> > > >
> > > >  At the time of search, IndexSearchers are created by selecting the
> > > > appropriate IndexReaders from the pool. These IndexSearchers in turn
> > are
> > > > used to create a ParallelMultiSearcher. Constructors of IndexSearcher
> > and
> > > > ParallelMultiSearcher are run for every request.
> > > >
> > > >  Now I believe that creating a pool of ParallelMultiSearcher itself
> is
> > a
> > > > good idea but* I wanted to know if reopening **IndexSearchers** will
> > > really
> > > > degrade performance irrespective of **IndexReaders** being opened
> > once*.
> > > >
> > > > In my performance tests (which may not be very comprehensive) I
> didn't
> > > find
> > > > any noticeable difference.
> > > >
> > > > Please throw some light.
> > > >
> > > >
> > > > --
> > > > Regards,
> > > > Samar
> > > >
> > >
> > > -
> > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > >
> > >
> >
> >
> > --
> > Regards,
> > Samar
> >
>
>
>
> --
> 冲浪板
>
> my blog:冲浪板 <http://chonglangban.appspot.com/>
> my site:Keji Technology <http://kejiblog.appspot.com/>
>



-- 
Regards,
Samar


Right memory for search application

2010-04-27 Thread Samarendra Pratap
Hi.
 I am looking for some guidance on the right memory options for my search
server application. How much memory should a Lucene-based application be
given?

 Until a few days back I was running my search server on Java 1.4 with the memory
option "-Xmx3600m", which was running quite fine. After upgrading the JVM to
*Java 6* we noticed a few times that our application fell idle (perhaps
hung). It was not serving requests although the port was open. I changed
the "-Xmx" from 3600m to 5000m.
 Currently it is running OK, but this created a curiosity in my mind about how
to find the right memory size for a Lucene-based application (or any Java
application). I believe it heavily depends on index size, but how do I
calculate it (at least approximately) without hit and trial?
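
One simple thing I can think of, to get an approximate number instead of pure hit and
trial, is to log the actual heap usage from inside the application (just a sketch):

class HeapStats
{
    // prints the current JVM heap usage; -Xmx shows up as maxMemory()
    static void logHeap()
    {
        Runtime rt = Runtime.getRuntime();
        long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
        long maxMb = rt.maxMemory() / (1024 * 1024);
        System.out.println("heap used: " + usedMb + "m of max " + maxMb + "m");
    }
}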

The details of my application and server are as follows.

*[ system configuration ]*
# uname -a
Linux xxx 2.6.17-13.2.xxx.finalsmp #1 SMP Wed May 9 17:27:56 IST 2007 x86_64
x86_64 x86_64 GNU/Linux

*[ java version ]*
# /usr/lib/jre1.6.0_20/bin/java -version
java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)

*[ application command line ]*
# java -Xmx5000m -server -classpath
lib/lucene-core-2.9.2.jar:lib/mysql-connector-j
ava-3.1.11-bin.jar:lib/search.jar com.xxx.xxx.SearchServer conf/search.conf

*[ memory information ]*
# cat /proc/meminfo
MemTotal:  8156660 kB
MemFree: 46924 kB
Buffers: 53768 kB
Cached:3194224 kB
SwapCached: 60 kB
Active:6086272 kB
Inactive:   919276 kB
SwapTotal: 2096472 kB
SwapFree:  2095432 kB
Dirty:1148 kB
Writeback:   0 kB
AnonPages: 3756496 kB
Mapped:  24800 kB
Slab:  1058124 kB
SReclaimable:15144 kB
SUnreclaim:1042980 kB
PageTables:  12232 kB
NFS_Unstable:0 kB
Bounce:  0 kB
CommitLimit:   6174800 kB
Committed_AS:  4588900 kB
VmallocTotal: 34359738367 kB
VmallocUsed:264904 kB
VmallocChunk: 34359473387 kB

*More info*
The search server application is running on a few servers, each of which contains
the same copy of the index.
Total size of index: 23 GB
Total documents in index: 18,000,000
Maximum fields per index: 56 (all analyzed, 4 small fields stored; the field
"contents" is the largest one, created from a text as large as 5 - 8 KB)
Query frequency: 1 query per second per server (in peak hours)
Queries take around 800 - 1000 ms each

*Other memory related things in Search application*
There is really nothing else which will consume noticeable memory. No
database connection is made. The application simply returns the IDs from the
index based on the query.

Any help on choosing the right memory settings is much appreciated.

-- 
Regards,
Samar


Re: Right memory for search application

2010-04-27 Thread Samarendra Pratap
Hi Ian. Thanks for the pointers.

Here are my answers -

1. Our default option is sorting by score; however, almost 8% of searches
sort on a field (mmddHHMMSS). This field is indexed as a string (not as
NumericField or DateField).

2. We open the readers when the application starts, but a Searcher is opened
for every request over the appropriate readers. A few days back I got an
answer on this very forum that reopening a searcher is not as big an issue as
reopening a reader. However, we are *not closing Searchers* after a request is
served. Could that cause a memory problem? Shouldn't GC handle it all once the
reference to the Searcher becomes unreachable?

3. I can't profile on the production servers, but running "top" shows that
right after the search server starts, resident memory (RES) is about 500m and
virtual memory (VIRT) is around 5400m. RES grows to 1g within 5 minutes and to
4g within half an hour, while VIRT stays the same.

4. Right! I'll do some memory profiling. I hope I'll get some hints from that
too.

Do the above points (especially #2) suggest anything I can do besides
profiling?
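
On the closing question in point 2: in Lucene 2.9, closing an IndexSearcher
that was constructed from an IndexReader does not close the reader, so (if my
understanding is right) the per-request searcher can be released explicitly
without touching the pooled readers. A minimal sketch, where pickSearchables()
is a made-up stand-in for whatever code selects the country/date searchers and
query, filter and nDocs are placeholders:

    Searcher searcher = new ParallelMultiSearcher(pickSearchables(country, dateRange));
    try {
        TopDocs hits = searcher.search(query, filter, nDocs);
        // ... build the response from hits.scoreDocs ...
    } finally {
        searcher.close(); // releases the per-request searcher; the pooled IndexReaders stay open
    }

Whether this actually changes memory behaviour is another question; an unclosed
Searcher that becomes unreachable should indeed be collected, so any leak is
more likely elsewhere.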



On Tue, Apr 27, 2010 at 6:53 PM, Ian Lea  wrote:

> There is no simple answer.  However your app does sound to be using
> rather a lot of memory for what you describe as simple searches.
>
> Are you using lucene sorting?  That can use lots of memory.  How are
> you using/reusing searchers/readers?  Having multiple ones open, or
> failing to close old ones, will use more memory.  Does memory usage
> grow then stabilize or keep on growing?
>
> A memory profiler/heap dump could tell you what is really using all the
> space.
>
>
> --
> Ian.
>

Re: Right memory for search application

2010-04-27 Thread Samarendra Pratap
I have got a lot of valuable information in this thread so far.
Thanks to all.

In my last mail I mentioned only two fields because the others' usage was
negligible and I thought they were not important. But now that *Toke* has
explained the formula, I think sorting on those fields would also be consuming
a large part of the memory. There are 2 other sorting fields, one of which is
used for both ascending and descending sorting.

Within the next couple of days (or maybe a week) I'll be

1. profiling my application,
2. analyzing and tuning the GC options (starting with GC logging, as sketched
below)


However, I have a few more questions:

1. Tom wrote:

*Have you checked that your machine is correctly identified as a server
and has optimized GC settings?*

*I did not understand the meaning of "correctly identified as a server". Can
you please help me understand?*
2. *Should I change the type of the fields?*
As I said in my first mail, I have 56 fields in my index; most of them contain
a numeric value or one of a few system-defined values (e.g. the gender field
can only contain "male", "female", or "unknown"). Only 7 fields are indexed
with user-defined values.
All the fields are created with *Field(String name, String value, Field.Store
store, Field.Index index)*, which makes them plain string fields. Is it
*always* a good idea to use the specific classes (NumericField, DateField,
etc.)? We do not have a space problem, if that matters. (See the short sketch
after these questions.)

3. *Is there any advice on the number of fields?*
*Somewhere on the net I read that instead of keeping different types of values
in different fields (e.g. field1:value1, field2:value2, ...), one should keep
the different values in a single field (e.g. field:field1_value1,
field:field2_value2, ...). But I could not confirm this anywhere else. Any
comments?*

4. Ian wrote:

*Sorting by score down to the second will use a lot of memory.  Can you
make it less granular?*

Is it less painful to sort on two fields, first on yymmdd and then on
yymmddHHMMSS, than to sort just on the latter? (Naturally it should use the
second field only where required, but technically ...?)
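
The sketch referred to in question 2: indexing the timestamp as a NumericField
next to (or instead of) the plain string field, and sorting on it numerically.
This is my illustration against the Lucene 2.9 API, not code from the
application; the field names and values are made up, and it assumes (as I
recall) that the 2.9 FieldCache handles the trie-encoded terms when sorting
with SortField.LONG.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.NumericField;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;

    Document doc = new Document();
    // current approach: timestamp as a plain, not-analyzed string
    doc.add(new Field("lastmod_str", "100427185301", Field.Store.NO, Field.Index.NOT_ANALYZED));
    // alternative: the same value as an indexed NumericField
    NumericField lastmod = new NumericField("lastmod_num", Field.Store.NO, true);
    lastmod.setLongValue(100427185301L); // yymmddHHMMSS packed into a long
    doc.add(lastmod);

    // sorting on the numeric field keeps one long per document in the FieldCache
    // instead of one cached String per (nearly unique) timestamp value
    Sort byDate = new Sort(new SortField("lastmod_num", SortField.LONG, true)); // true = descending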


Thanks again for the invaluable support I am getting from here.

- Samar

On Wed, Apr 28, 2010 at 9:12 AM, Lance Norskog  wrote:

> Solr's timestamp representation (TrieDateField) is tuned for space and
> speed. It has a compressed representation, and sorts with far less
> space than Strings.
>
> Also you get something called a date facet, which lets you bucketize
> facet searches by time block.
>
> On Tue, Apr 27, 2010 at 1:02 PM, Toke Eskildsen 
> wrote:
> > Samarendra Pratap [samarz...@gmail.com] wrote:
> >> 1. Our default option is sort by score, however almost 8% of searches
> use
> >> sorting on a field (mmddHHMMSS). This field is indexed as string
> (not as
> >> NumericField or DateField).
> >
> > Guessing that the timestamp is practically unique for each document,
> sorting by String takes up a bit more than
> > 18M * (40 bytes + 2 * "mmddHHMMSS".length() bytes) ~= 1.2 GB of RAM
> as the Strings are cached. Coupled with the normal overhead of just opening
> an index of your size (500MB by your measurements?), I would have guessed
> that 3600MB would definitely be enough to open the index and do sorted
> searches.
> >
> > I realize that fiddling with production servers is dangerous, but
> connecting with JConsole and forcing a garbage collection might be
> acceptable? That should enable you to determine whether you're leaking
> memory or if it's just the JVM being greedy. I'd guess you're leaking though,
> as HotSpot does not normally allocate up to the limit if it does not need
> to.
> >
> > Anyway, changing to one of the optimized fields for sorting dates should
> shave 1 GB off the memory requirement, so I'll recommend doing that no
> matter what the main cause of your memory problems is.
> >
> > Regards,
> > Toke Eskildsen
> >
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>
>
>


-- 
Regards,
Samar


Re: Right memory for search application

2010-04-28 Thread Samarendra Pratap
Great explanation Erick.
Thanks. I'll try that.

On Wed, Apr 28, 2010 at 8:27 PM, Erick Erickson wrote:

> Quick reply to question (4). Not quite right, you're not gaining anything,
> you still have a yymmddHHMMSS field that will consume all the memory
> when you sort by it.
>
> Remember that the controlling part here is the number of unique values. So
> think about two fields, yymmdd and HHMMSS. Use the HHMMSS as the
> secondary sort and yymmdd as the primary. This will sort correctly since
> any time HHMMSS is used, the yymmdd is guaranteed to be equal.
>
> And you can extend this ad nauseum. For instance, you could use 6
> fields, yy, mm, dd, HH, MM, SS and have a very small number of
> unique values in each using really tiny amounts of memory to sort down
> to the second in this case.
>
> Best
> Erick
>
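
A minimal sketch of the two-field sort Erick describes, assuming separate
yymmdd and HHMMSS string fields exist in the index (my illustration against the
Lucene 2.9 API, not code from the thread):

    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;

    // primary sort on the low-cardinality day field, secondary on time of day;
    // each field caches far fewer unique strings than a single yymmddHHMMSS field
    Sort byDateTime = new Sort(new SortField[] {
        new SortField("yymmdd", SortField.STRING),
        new SortField("HHMMSS", SortField.STRING)
    });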

Grouping results on the basis of a field

2005-11-21 Thread Samarendra Pratap
Hi,
  I am using Lucene 1.4.3. The basic functionality of the search is
simple: put in the keyword "java" and it will show you all the books
containing the java keyword.
  Now I have to add a feature which also shows the names of the top authors (let's
say the top 5 authors), with their number of books, i.e. the authors who have
the maximum number of books in this result set.
  Currently I am traversing through all the results, finding the value of the
author field and putting the values in a hash (to keep count of authors and the
number of their books), but this takes too much time. Is there any efficient
way of doing this?
  I have 15 documents in my index. The field "author" can obviously
contain more than one value.
  
My current code for finding the top 5 authors by number of books is the
following.
  
  tfd = searcher.search(query, null, nDocs, new Sort(sortCol));
   StringTokenizer fieldToken = null;
  HashMap fieldMap = new HashMap();
  
  for(int i=0;i
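
The archived message is cut off above. Based on the description, the counting
loop presumably looked something like the sketch below (a reconstruction under
assumptions: tfd is the TopFieldDocs returned by the search, the author field
holds comma-separated names, and java.util.* is imported), followed by picking
the top 5 entries from the map:

    for (int i = 0; i < tfd.scoreDocs.length; i++) {
        Document doc = searcher.doc(tfd.scoreDocs[i].doc);   // loads the whole document
        fieldToken = new StringTokenizer(doc.get("author"), ",");
        while (fieldToken.hasMoreTokens()) {
            String author = fieldToken.nextToken().trim();
            Integer count = (Integer) fieldMap.get(author);
            fieldMap.put(author, new Integer(count == null ? 1 : count.intValue() + 1));
        }
    }
    // sort the (author, count) entries by descending count and keep the first 5
    List entries = new ArrayList(fieldMap.entrySet());
    Collections.sort(entries, new Comparator() {
        public int compare(Object a, Object b) {
            return ((Integer) ((Map.Entry) b).getValue()).intValue()
                 - ((Integer) ((Map.Entry) a).getValue()).intValue();
        }
    });
    List top5 = entries.subList(0, Math.min(5, entries.size()));

Loading every hit document is what makes this slow; a common alternative (a
different technique from the one in the original message) is to count per-author
document frequencies against a BitSet of the hits by walking TermDocs for each
author term, which avoids loading stored fields at all.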