Re: Multiple index performance

2008-08-18 Thread Antony Bowesman
[EMAIL PROTECTED] wrote: Thanks Anthony for your response, I did not know about that field. You make your own fields in Lucene; it is not something Lucene gives you. But still I have a problem, and it is about privacy. The users are concerned about privacy and so, we thought we could have all

Re: Multiple index performance

2008-08-18 Thread Antony Bowesman
Cyndy wrote: I want to keep user text files indexed separately, I will have about 10,000 users and each user may have about 20,000 short files, and I need to keep privacy. So the idea is to have one folder with the text files and index for each user, so when search will be done, it will be poin

Score Boosting

2008-08-18 Thread blazingwolf7
Hi, I am currently working on the score calculation part of Lucene, and I have encountered a part that I do not understand. return raw * Similarity.decodeNorm(norms[doc]); // normalize for field As can be seen from the code above, the Similarity method decodeNorm() will be called to decode the
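The byte being decoded above is a field norm: Lucene 2.x stores one byte per field per document and expands it back to a float at search time. A self-contained sketch of that lossy one-byte encoding follows (3-bit mantissa, 5-bit exponent, zero-exponent offset 15); the real implementation lives in org.apache.lucene.util.SmallFloat, and Similarity.decodeNorm reads from a precomputed 256-entry table rather than computing this on the fly.

```java
// Sketch of the lossy one-byte norm encoding used by Lucene 2.x
// (3-bit mantissa, 5-bit exponent, zero-exponent offset 15).
// Illustration only; the real code is org.apache.lucene.util.SmallFloat.
public class NormSketch {

    // Compress a float into a single byte (lossy).
    public static byte encodeNorm(float f) {
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> (24 - 3);              // keep sign, exponent, 3 mantissa bits
        if (smallfloat <= ((63 - 15) << 3)) {
            return (bits <= 0) ? (byte) 0 : (byte) 1;   // underflow: zero or smallest positive
        }
        if (smallfloat >= ((63 - 15) << 3) + 0x100) {
            return -1;                                  // overflow: largest representable value
        }
        return (byte) (smallfloat - ((63 - 15) << 3));
    }

    // Expand the byte back to an approximate float.
    public static float decodeNorm(byte b) {
        if (b == 0) return 0.0f;
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        // 1.0f and 0.5f survive the round trip exactly; most values are approximated.
        System.out.println(decodeNorm(encodeNorm(1.0f)));   // 1.0
        System.out.println(decodeNorm(encodeNorm(0.5f)));   // 0.5
        System.out.println(decodeNorm(encodeNorm(0.123f))); // nearest representable value
    }
}
```

Because only 256 distinct values exist, decoded norms are approximate; this is the same byte that gets multiplied into `raw` in the quoted line.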

Multiple index performance

2008-08-18 Thread Cyndy
Hello, I am new into Lucene and I want to make sure what I am trying to do will not hit performance. My scenario is the following: I want to keep user text files indexed separately, I will have about 10,000 users and each user may have about 20,000 short files, and I need to keep privacy. So the

Simple Query Question

2008-08-18 Thread DanaWhite
For some reason I am thinking I read somewhere that if you queried something like: "Eiffel Tower" Lucene would execute the query "Eiffel AND Tower". Basically I am trying to ask: does Lucene automatically replace spaces with the AND operator? Thanks Dana
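The thread's answer is not shown in the preview, but Lucene's QueryParser defaults to OR, not AND: space-separated terms become optional clauses unless you call setDefaultOperator to switch to AND. A toy illustration of what the default operator does (this is not Lucene's parser, just a sketch of the clause-joining behavior):

```java
// Toy illustration (NOT Lucene's QueryParser): the default operator
// decides how whitespace-separated terms are joined. Lucene defaults
// to OR; QueryParser.setDefaultOperator(...) switches it to AND.
public class DefaultOperatorDemo {

    static String parse(String query, String defaultOp) {
        return String.join(" " + defaultOp + " ", query.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(parse("Eiffel Tower", "OR"));  // Eiffel OR Tower  (Lucene's default)
        System.out.println(parse("Eiffel Tower", "AND")); // Eiffel AND Tower
    }
}
```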

Re: Testing for field existence

2008-08-18 Thread Daniel Noll
Karsten F. wrote: Hi Bill, you should not use a prefix-query (*), because in a first step Lucene would generate a list of all terms in this field, and then search for all these terms, which is senseless. That's not quite an accurate description of what it does, as it is nowhere near as slow as doi

Re: Fields with the same name?? - Was Re: Payloads and tokenizers

2008-08-18 Thread Antony Bowesman
Doron Cohen wrote: The API definitely doesn't promise this. AFAIK implementation wise it happens to be like this but I can be wrong and plus it might change in the future. It would make me nervous to rely on this. I made some tests and it 'seems' to work, but I agree, it also makes me nervous

Re: Appropriate disk optimization for large index?

2008-08-18 Thread mattspitz
Mmmkay. I think I'll wait, then. Thank you so much for your help. I really appreciate it. Also, I really dig Lucene, so thanks for your hard work! -Matt Michael McCandless-2 wrote: > > > mattspitz wrote: > >> Is there no way to ensure consistency on the disk with 2.3.2? > > Unfortunately

Re: Appropriate disk optimization for large index?

2008-08-18 Thread Michael McCandless
mattspitz wrote: Is there no way to ensure consistency on the disk with 2.3.2? Unfortunately no. This is a little off-topic, but is it worth upgrading to 2.4 right now if I've got a very stable system already implemented with 2.3.2? I don't really want to introduce oddities because I'm u

Re: Appropriate disk optimization for large index?

2008-08-18 Thread mattspitz
Thanks for your replies! Is there no way to ensure consistency on the disk with 2.3.2? This is a little off-topic, but is it worth upgrading to 2.4 right now if I've got a very stable system already implemented with 2.3.2? I don't really want to introduce oddities because I'm using an "unfinish

Re: Appropriate disk optimization for large index?

2008-08-18 Thread Michael McCandless
mattspitz wrote: Are the index files synced on writer.close()? No, they aren't. Not until 2.4 (trunk). Thank you so much for your help. I think the seek time is the issue, especially considering the high merge factor and the fact that the segments are scattered all over the disk. You

Re: Appropriate disk optimization for large index?

2008-08-18 Thread mattspitz
Mike- Are the index files synced on writer.close()? Thank you so much for your help. I think the seek time is the issue, especially considering the high merge factor and the fact that the segments are scattered all over the disk. Will a faster disk cache access affect the optimization and merg

Re: Appropriate disk optimization for large index?

2008-08-18 Thread Michael McCandless
mattspitz wrote: So, my indexing is done in "rounds", where I pull a bunch of documents from the database, index them, and flush them to disk. I manually call "flush()" because I need to ensure that what's on disk is accurate with what I've pulled from the database. On each round, then,

Re: Appropriate disk optimization for large index?

2008-08-18 Thread mattspitz
So, my indexing is done in "rounds", where I pull a bunch of documents from the database, index them, and flush them to disk. I manually call "flush()" because I need to ensure that what's on disk is accurate with what I've pulled from the database. On each round, then, I flush to disk. I set t

Re: Appropriate disk optimization for large index?

2008-08-18 Thread Otis Gospodnetic
Matt, One important bit that you didn't mention is what your maxBufferedSize setting is. If it's too low you will see lots of IO. Increasing it means less IO, but more JVM heap need. Is your disk IO caused by searches or indexing only? Otis -- Sematext -- http://sematext.com/ -- Lucene - Sol

Re: Index of Lucene

2008-08-18 Thread Otis Gospodnetic
Is that really 1 byte for each document? Not 1 byte for each field of each document? Thanks, Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Doron Cohen <[EMAIL PROTECTED]> > To: java-user@lucene.apache.org > Sent: Monday, August 18, 2008

RE: Testing for field existence

2008-08-18 Thread Steven A Rowe
Hi Bill, A simpler suggestion, assuming you need to test for the existence of just one particular field: rather than adding a field containing a list of all indexed fields for a particular document, as Karsten suggested, you could just add a field with a constant value when the field you want t

Re: Testing for field existence

2008-08-18 Thread Erick Erickson
Karsten's saying that prefix and wildcard queries require a bunch of work. Specifically, Lucene assembles a list of all terms that match the query and then, conceptually at least, forms a huge OR query with one clause for each term. Say, for instance, you have the following values indexed in a fiel
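Erick's description of the rewrite can be sketched in a few lines: conceptually, every indexed term matching the prefix becomes one clause of a big OR. Lucene does this against the term dictionary during query rewriting, and in 2.x it throws TooManyClauses once the expansion exceeds BooleanQuery's clause limit (1024 by default). A self-contained sketch of the concept, not Lucene's implementation:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of how a prefix query conceptually rewrites into one big OR:
// collect every indexed term that starts with the prefix, then treat
// each one as a clause. Lucene walks the term dictionary to do this
// and caps the expansion at BooleanQuery's clause limit.
public class PrefixRewriteSketch {

    static List<String> expand(String prefix, List<String> indexedTerms) {
        List<String> clauses = new ArrayList<>();
        for (String term : indexedTerms) {
            if (term.startsWith(prefix)) {
                clauses.add(term);
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("apple", "applet", "apply", "banana");
        // "app*" conceptually becomes (apple OR applet OR apply)
        System.out.println(String.join(" OR ", expand("app", terms)));
    }
}
```

This also shows why a bare `*` prefix is pathological: it matches every term in the field, so the "OR of everything" clause list is as large as the term dictionary itself.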

RE: Testing for field existence

2008-08-18 Thread Bill.Chesky
Karsten, Thanks for the feedback. Not sure I understand the reasoning behind not using the "*" prefix (do you have a link possibly?). But I see what you are getting at with the additional field. I'll give it a try. Thanks for the help. regards, Bill -Original Message- From: Kars

Re: Testing for field existence

2008-08-18 Thread Karsten F.
Hi Bill, you should not use a prefix-query (*), because in a first step Lucene would generate a list of all terms in this field, and then search for all these terms, which is senseless. I would suggest inserting a new field "myFields" which contains as value the names of all fields for this docum

Re: Search Result Filtering

2008-08-18 Thread Ian Lea
Hi Lucene range queries and filters work on string comparison, not numeric. You'll need to pad out any numeric fields you want to use in a range to a consistent length. There may be a class floating around that does this - NumberUtils or NumberTools or something like that. -- Ian. On Mon, A
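The class Ian is thinking of is org.apache.lucene.document.NumberTools in Lucene 2.x, whose longToString/stringToLong pair encodes longs (including negatives, via an offset) so that string order matches numeric order. A minimal sketch of the underlying idea for non-negative longs only, using plain zero-padding:

```java
// Minimal sketch of numeric padding for string-based range queries,
// assuming NON-NEGATIVE longs only: zero-pad every value to a fixed
// width so that lexicographic order matches numeric order. Lucene 2.x
// ships org.apache.lucene.document.NumberTools, which also handles
// negative values via an offset encoding.
public class PadNumbers {

    static final int WIDTH = 19; // enough digits for any non-negative long

    static String pad(long n) {
        if (n < 0) throw new IllegalArgumentException("sketch handles non-negative values only");
        return String.format("%0" + WIDTH + "d", n);
    }

    public static void main(String[] args) {
        // Unpadded, "9" sorts after "10" as strings; padded, order is numeric again.
        System.out.println("9".compareTo("10") > 0);       // true
        System.out.println(pad(9).compareTo(pad(10)) < 0); // true
        System.out.println(pad(42));                       // 0000000000000000042
    }
}
```

Index the padded string, and build range queries over the padded form of both endpoints; every value in the field must use the same width or the ordering breaks.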

Search Result Filtering

2008-08-18 Thread nukie
Hi! I've made a sample program for testing lucene : package indexer; import com.sun.xml.internal.bind.v2.schemagen.xmlschema.Occurs; import com.sun.xml.internal.ws.util.StringUtils; import java.io.BufferedWriter; import java.io.File; import java.io.FileWriter; import java.util.Random; import

Testing for field existence

2008-08-18 Thread Bill.Chesky
Hello, I am creating fields for documents like this: String name = ... String value = ... doc.add(new Field(name, value, Field.Store.NO, Field.Index.UN_TOKENIZED)); On the query side, sometimes I want to search for documents for which a given field, say 'foo' is equal to a giv

Re: windows file system cache

2008-08-18 Thread Mark Miller
Mark Miller wrote: Mark Miller wrote: Robert Stewart wrote: Anyone else run on Windows? We have index around 26 GB in size. Seems file system cache ends up taking up nearly all available RAM (26 GB out of 32 GB on 64-bit box). Lucene process is around 5 GB, so very little left over for que

Re: windows file system cache

2008-08-18 Thread Michael McCandless
Toke Eskildsen wrote: Lucene process is around 5 GB, so very little left over for queries, etc, and box starts swapping during searches. Not so fine and also unexpected. Are you sure that what you're seeing is swapping and not just flushing of the write-cache? Are you observing the disk-a

Re: windows file system cache

2008-08-18 Thread Toke Eskildsen
On Sat, 2008-08-16 at 07:40 -0400, Robert Stewart wrote: > Anyone else run on Windows? We have index around 26 GB in size. > Seems file system cache ends up taking up nearly all available RAM > (26 GB out of 32 GB on 64-bit box). Sounds fine so far. If the RAM isn't used for anything else, the

Re: Index of Lucene

2008-08-18 Thread Doron Cohen
On Mon, Aug 18, 2008 at 7:28 AM, blazingwolf7 <[EMAIL PROTECTED]>wrote: > > Thanks for the info. But do you know where this is actually performed in > Lucene? I mean the method involved that will calculate the value before > storing it into the index. I traced it to one method known as lengthNorm()
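The method blazingwolf7 traced it to, lengthNorm(), is where the index-time length normalization happens: in Lucene 2.x DefaultSimilarity it returns 1/sqrt(number of terms in the field), so shorter fields get larger norms and score higher. A self-contained sketch of that formula (the real result is then squeezed through the one-byte norm encoding, which is why decoded norms are approximate):

```java
// Sketch of DefaultSimilarity.lengthNorm in Lucene 2.x: the norm is
// 1/sqrt(number of terms in the field), so shorter fields score higher.
// The value is computed at index time and compressed to one byte per
// field per document before being stored.
public class LengthNormSketch {

    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        System.out.println(lengthNorm(1));   // 1.0
        System.out.println(lengthNorm(4));   // 0.5
        System.out.println(lengthNorm(100)); // 0.1
    }
}
```

Subclassing Similarity and overriding lengthNorm is the usual way to change this behavior, e.g. to disable length normalization entirely by returning a constant.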

Re: Fields with the same name?? - Was Re: Payloads and tokenizers

2008-08-18 Thread Doron Cohen
> > payload and the other part for storing, i.e. something like this: >> >>Token token = new Token(...); >>token.setPayload(...); >>SingleTokenTokenStream ts = new SingleTokenTokenStream(token); >> >>Field f1 = new Field("f","some-stored-content",Store.YES,Index.NO); >>Field f2