RE: Lucene Java 3.0.0 RC1 now available for testing

2009-11-17 Thread Uwe Schindler
Yes, just add the field to the Document twice (with the same name) to achieve this. Using the same name is no problem, as no relation exists between stored and inverted fields; Lucene has always internally created "two fields" with the same name. You can still do this, but if you want to compress, you…
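
A minimal sketch of that two-field approach (Lucene 3.0 API; the field name "body" is a placeholder):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class TwoFieldDoc {
        // Add the same text under one name twice: once inverted, once stored.
        static Document makeDoc(String text) {
            Document doc = new Document();
            // Searchable copy: analyzed into the inverted index, not stored.
            doc.add(new Field("body", text, Field.Store.NO, Field.Index.ANALYZED));
            // Stored copy: returned verbatim by IndexReader.document(), not inverted.
            doc.add(new Field("body", text, Field.Store.YES, Field.Index.NO));
            return doc;
        }
    }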

Re: Lucene Java 3.0.0 RC1 now available for testing

2009-11-17 Thread Otis Gospodnetic
Well, I think some people will be for hiding complexity, while others will be for being in control and having transparency. Think how surprised one would be to find an extra field in their index, say, when looking at it with Luke. :) Otis -- Sematext is hiring -- http://sematext.com/about…

Re: Why Lucene takes longer time for the first query and less for subsequent ones

2009-11-17 Thread Dinh
Hi, Thanks for your feedback. I have checked again and found that this behavior is rather consistent, so the OS cache and Lucene warm-up may well have a big impact. Regards, Dinh

Re: Lucene Java 3.0.0 RC1 now available for testing

2009-11-17 Thread Glen Newton
I understand the reasons, but - if I may ask so late in the game - was this the best way to do this? From a user (developer) perspective, this is an implementation issue. Couldn't this have been done behind the scenes, so that when I asked for Field.Index.ANALYZED && Field.Store.COMPRESS, instead…

Re: Lucene Java 3.0.0 RC1 now available for testing

2009-11-17 Thread Chris Lu
So will I need to use two fields, one analyzed and the other binary, to replace each previously compressed field? -- Chris Lu - Instant Scalable Full-Text Search On Any Database/Application site: http://www.dbsight.net demo: http://search.dbsight.com Lucene…

Re: Lucene Java 3.0.0 RC1 now available for testing

2009-11-17 Thread Mark Miller
Here is some of the history: https://issues.apache.org/jira/browse/LUCENE-652 https://issues.apache.org/jira/browse/LUCENE-1960 Glen Newton wrote: > Could someone point me to the rationale for the removal of COMPRESSED fields? I've looked at http://people.apache.org/~uschindler/staging-a…

RE: Lucene Java 3.0.0 RC1 now available for testing

2009-11-17 Thread Uwe Schindler
Because you can do the compression yourself by just adding a binary stored field with the compressed content. And then you can use any algorithm, even bz2 or whatever. The problem is that the compressed fields caused lots of problems and special cases during merging, because they were always decompressed…
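
A minimal sketch of that do-it-yourself route, using the CompressionTools helper in org.apache.lucene.document (the field name "body" is a placeholder; any other codec could produce the byte[] instead):

    import java.util.zip.DataFormatException;
    import org.apache.lucene.document.CompressionTools;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class CompressedStoredField {
        // Index an analyzed, uncompressed copy; store a compressed binary copy.
        static Document makeDoc(String text) {
            Document doc = new Document();
            doc.add(new Field("body", text, Field.Store.NO, Field.Index.ANALYZED));
            doc.add(new Field("body", CompressionTools.compressString(text),
                              Field.Store.YES));
            return doc;
        }

        // At search time the application decompresses the stored bytes itself.
        static String getBody(Document doc) throws DataFormatException {
            return CompressionTools.decompressString(doc.getBinaryValue("body"));
        }
    }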

Re: Lucene Java 3.0.0 RC1 now available for testing

2009-11-17 Thread Glen Newton
Could someone point me to the rationale for the removal of COMPRESSED fields? I've looked at http://people.apache.org/~uschindler/staging-area/lucene-3.0.0-rc1/changes/Changes.html#3.0.0.changes_in_runtime_behavior but it is a little light on the 'why' of this change. My fault - of course - f…

Lucene Java 3.0.0 RC1 now available for testing

2009-11-17 Thread Uwe Schindler
Hello Lucene users, On behalf of the Lucene dev community (a growing community far larger than just the committers) I would like to announce the first release candidate for Lucene Java 3.0. Please download and check it out - take it for a spin and kick the tires. If all goes well, we hope…

Re: Why Lucene takes longer time for the first query and less for subsequent ones

2009-11-17 Thread Erick Erickson
The "usual" recommendation is just to fire up a series of warmup queries at startup if you really require the first queries to be fast. Best Erick On Tue, Nov 17, 2009 at 2:43 PM, Scott Ribe wrote: > > Most likely due to the operating system caching the relevant portions of > the > > index after

Re: Why Lucene takes longer time for the first query and less for subsequent ones

2009-11-17 Thread Scott Ribe
> Most likely due to the operating system caching the relevant portions of the index after the first set of queries. I have enough RAM to keep the Lucene indexes in memory all the time, so I "dd ... > /dev/null" the files at boot, and also perform a single query to force JIT of the query code. T…

Re: Why Lucene takes longer time for the first query and less for subsequent ones

2009-11-17 Thread Otis Gospodnetic
Hello, Most likely due to the operating system caching the relevant portions of the index after the first set of queries. Otis -- Sematext is hiring -- http://sematext.com/about/jobs.html?mls Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR

Re: token positions

2009-11-17 Thread Michael McCandless
The character offset info is only stored if you enable Field.TermVector.WITH_OFFSETS or WITH_POSITIONS_OFFSETS on the field. Then, it can only be retrieved if you get the term vectors for that document and locate the term & specific occurrence that you're interested in. This is likely quite a bit…
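
A sketch of both halves (the field name "body" is a placeholder):

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermPositionVector;
    import org.apache.lucene.index.TermVectorOffsetInfo;

    public class OffsetExample {
        // Index time: ask for positions+offsets to be kept in the term vector.
        static Document makeDoc(String text) {
            Document doc = new Document();
            doc.add(new Field("body", text, Field.Store.YES, Field.Index.ANALYZED,
                              Field.TermVector.WITH_POSITIONS_OFFSETS));
            return doc;
        }

        // Search time: pull the vector for one document, locate the term.
        static void printOffsets(IndexReader reader, int docId, String term)
                throws IOException {
            TermPositionVector tpv =
                    (TermPositionVector) reader.getTermFreqVector(docId, "body");
            int idx = tpv.indexOf(term); // -1 if the term isn't in this doc
            if (idx < 0) return;
            for (TermVectorOffsetInfo o : tpv.getOffsets(idx)) {
                System.out.println(o.getStartOffset() + "-" + o.getEndOffset());
            }
        }
    }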

token positions

2009-11-17 Thread Christopher Tignor
Hello, Hoping someone might clear up a question for me: When tokenizing we provide the start and end character offsets for each token, locating it within the source text. If I tokenize the text "word" and then search for the term "word" in the same field, how can I recover this character offset i…

Why Lucene takes longer time for the first query and less for subsequent ones

2009-11-17 Thread Dinh
Hi all, I made a list of 4 simple, single-term queries and ran 4 searches via Lucene, and found that when a term is searched for the first time, Lucene takes quite a bit of time to handle it. - Query A 00:27:28,781 INFO LuceneSearchService:151 - Internal search took 328.21463ms 00:27:28,781 INFO…

Re: Use of AllTermDocs with custom scorer

2009-11-17 Thread Peter Keegan
> But if re-creating the entire file on each reopen isn't a problem for you then there's no need to change this :) It's actually created after IndexWriter.commit(), but same idea. If we needed real-time indexing, or if disk I/O got excessive, I'd go with separate files per segment. > Hmm -- if…

Re: Use of AllTermDocs with custom scorer

2009-11-17 Thread Michael McCandless
On Tue, Nov 17, 2009 at 8:58 AM, Peter Keegan wrote: > The external data is just an array of fixed-length records, one for each Lucene document. Indexes are updated at regular intervals in one JVM. A searcher JVM opens the index and reads all the fixed-length records into RAM. Given an index…

Re: Use of AllTermDocs with custom scorer

2009-11-17 Thread Michael McCandless
On Tue, Nov 17, 2009 at 10:23 AM, Peter Keegan wrote: >> This is a generic solution, but just make sure you don't do the map lookup for every doc collected, if you can help it, else that'll slow down your search. > What I just learned is that a Scorer is created for each segment (lights on!…

Token character positions

2009-11-17 Thread Christopher Tignor
Hello, Hoping someone might clear up a question for me: When tokenizing we provide the start and end character offsets for each token, locating it within the source text. If I tokenize the text "word" and then search for the term "word" in the same field, how can I recover this character offset i…

Re: Use of AllTermDocs with custom scorer

2009-11-17 Thread Peter Keegan
> This is a generic solution, but just make sure you don't do the map lookup for every doc collected, if you can help it, else that'll slow down your search. What I just learned is that a Scorer is created for each segment (lights on!). So, couldn't I just do the subreader->docBase map lookup once…
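
A sketch of that once-per-segment mapping as a 2.9 Collector (the float[] of per-document external values is this example's assumption):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Collector;
    import org.apache.lucene.search.Scorer;

    public class ExternalDataCollector extends Collector {
        private final float[] externalData; // one value per index-wide docId (assumed)
        private int docBase;

        public ExternalDataCollector(float[] externalData) {
            this.externalData = externalData;
        }

        @Override
        public void setNextReader(IndexReader reader, int docBase) {
            // Called once per segment: remember the base here instead of doing
            // a subreader -> docBase map lookup for every collected doc.
            this.docBase = docBase;
        }

        @Override
        public void setScorer(Scorer scorer) { /* not needed for this sketch */ }

        @Override
        public void collect(int doc) throws IOException {
            float value = externalData[docBase + doc]; // index-wide docId
            // ... use 'value' however the application scores/collects hits
        }

        @Override
        public boolean acceptsDocsOutOfOrder() {
            return true; // order doesn't matter for a simple array lookup
        }
    }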

Re: Use of AllTermDocs with custom scorer

2009-11-17 Thread Peter Keegan
The external data is just an array of fixed-length records, one for each Lucene document. Indexes are updated at regular intervals in one JVM. A searcher JVM opens the index and reads all the fixed-length records into RAM. Given an index-wide docId, the custom scorer can quickly access the corresponding…

Re: What's 'java -server' option ?

2009-11-17 Thread Wenbo Zhao
Right! This just emphasizes the word 'ironic' :-) 2009/11/17 Michael McCandless : > Remember that, like Lucene, if you give this query to google: > java -server > It means "find all docs that contain java and do not contain server". > I'm sure this has messed up a great many people trying t…

Re: Use of AllTermDocs with custom scorer

2009-11-17 Thread Michael McCandless
On Mon, Nov 16, 2009 at 6:38 PM, Peter Keegan wrote: >> Can you remap your external data to be per segment? > That would provide the tightest integration but would require a major redesign. Currently, the external data is in a single file created by reading a stored field after the Lucene in…

Re: What's 'java -server' option ?

2009-11-17 Thread Michael McCandless
Remember that, like Lucene, if you give this query to google: java -server It means "find all docs that contain java and do not contain server". I'm sure this has messed up a great many people trying to figure out command line options ;) The fix is to put the -server in double quotes: java…
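
For the Lucene side of the joke, QueryParser really does treat a bare leading '-' as a prohibited clause (the field name "content" is a placeholder):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.util.Version;

    public class MinusMeansNot {
        public static void main(String[] args) throws Exception {
            QueryParser qp = new QueryParser(Version.LUCENE_30, "content",
                    new StandardAnalyzer(Version.LUCENE_30));
            // Prints: content:java -content:server  (server is a MUST_NOT clause)
            System.out.println(qp.parse("java -server"));
        }
    }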

Custom sorting

2009-11-17 Thread Ganesh
Hello all, I have millions of records in the database, and 75% of those records need to be sorted. Does 2.9 provide a facility for custom sorting (avoiding loading all records)? Regards Ganesh
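
2.9 does expose a pluggable, per-segment comparator API. A hypothetical sketch for an int field, where the FieldCache arrays are loaded segment by segment rather than for the whole index (the field name "myIntField" is a placeholder):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.FieldCache;
    import org.apache.lucene.search.FieldComparator;
    import org.apache.lucene.search.FieldComparatorSource;

    public class IntComparatorSource extends FieldComparatorSource {
        @Override
        public FieldComparator newComparator(final String field, final int numHits,
                int sortPos, boolean reversed) {
            return new FieldComparator() {
                private final int[] slots = new int[numHits]; // one value per queue slot
                private int[] current;                        // this segment's values
                private int bottom;                           // weakest value in the queue

                @Override public void setNextReader(IndexReader reader, int docBase)
                        throws IOException {
                    current = FieldCache.DEFAULT.getInts(reader, field); // per segment
                }
                @Override public int compare(int slot1, int slot2) {
                    return slots[slot1] < slots[slot2] ? -1
                         : slots[slot1] > slots[slot2] ? 1 : 0;
                }
                @Override public void setBottom(int slot) { bottom = slots[slot]; }
                @Override public int compareBottom(int doc) {
                    return bottom < current[doc] ? -1 : bottom > current[doc] ? 1 : 0;
                }
                @Override public void copy(int slot, int doc) { slots[slot] = current[doc]; }
                @Override public Comparable value(int slot) {
                    return Integer.valueOf(slots[slot]);
                }
            };
        }
    }

    // Usage: new Sort(new SortField("myIntField", new IntComparatorSource()))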

RE: javadoc questions/inconsistencies

2009-11-17 Thread Uwe Schindler
The PriorityQueue is fixed-size; it cannot grow (please note, it is *not* Java's PQ, it's Lucene's own implementation!). TopDocs will contain only n documents in its scoreDocs array, but the reported total hit count will reflect all matches! - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.…
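
In other words (a hypothetical sketch; searcher and someQuery are assumed):

    import java.io.IOException;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;

    public class TopNVsTotal {
        // A query matching ~1000 docs, asked for the top 100:
        static void show(IndexSearcher searcher, Query someQuery) throws IOException {
            TopDocs td = searcher.search(someQuery, 100);
            System.out.println(td.scoreDocs.length); // at most 100 (the queue's size)
            System.out.println(td.totalHits);        // ~1000 (every doc that matched)
        }
    }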

javadoc questions/inconsistencies

2009-11-17 Thread Cristian Vat
Hello all, Sorry if this is off-topic or already discussed/documented somewhere. Regarding the Lucene 2.9.1 javadoc: in Searcher, the method "TopDocs search(Query query, int n)" says "Finds the top n hits for query." However, if I do a search(someQuery, 100) which gets me 1000 results, all results are a…