Re: Searching for instances within a document

2008-07-10 Thread Ajay Lakhani
Hi James,

Try this:

Searcher searcher = new IndexSearcher(dir);
QueryParser parser = new QueryParser("content", new StandardAnalyzer());
Query query = parser.parse(queryString);

HashSet queryTerms = new HashSet();
query.extractTerms(queryTerms);

Hits hits = searcher.search(query);

IndexReader reader = IndexReader.open(dir);

for (int i = 0; i < hits.length(); i++) {
  Document d = hits.doc(i);
  Field fid = d.getField("id");
  Field ftitle = d.getField("title");
  System.out.println("id is " + fid.stringValue());
  System.out.println("title is " + ftitle.stringValue());

  // note: this only works if "content" was indexed with term vectors enabled
  TermFreqVector tfv = reader.getTermFreqVector(hits.id(i), "content");
  String[] terms = tfv.getTerms();
  int[] freqs = tfv.getTermFrequencies(); // get the frequencies

  // for each term in the query
  for (Iterator iter = queryTerms.iterator(); iter.hasNext();) {
    Term term = (Term) iter.next();

    // for each term in the vector
    for (int j = 0; j < terms.length; j++) {
      if (terms[j].equals(term.text())) {
        System.out.println("frequency of term [" + term.text() + "] is " + freqs[j]);
      }
    }
  }
}
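
The TermFreqVector lookup above assumes the "content" field was indexed with
term vectors enabled. As a standalone sketch (plain Java, no Lucene classes;
the class and method names below are made up for illustration), the counting
that getTerms()/getTermFrequencies() hand back amounts to this:

```java
import java.util.HashMap;
import java.util.Map;

// Standalone sketch: count how often each term occurs in one document's
// text, mirroring what a term frequency vector reports for that document.
public class TermCounts {

    // Lower-cases and splits on whitespace -- a rough approximation of
    // what StandardAnalyzer does for plain words (it also strips
    // punctuation, which this sketch does not).
    static Map<String, Integer> countTerms(String text) {
        Map<String, Integer> freqs = new HashMap<String, Integer>();
        for (String token : text.toLowerCase().split("\\s+")) {
            Integer n = freqs.get(token);
            freqs.put(token, n == null ? 1 : n + 1);
        }
        return freqs;
    }

    // Frequency of a single query term within the document text.
    static int frequencyOf(String term, String text) {
        Integer n = countTerms(text).get(term.toLowerCase());
        return n == null ? 0 : n;
    }

    public static void main(String[] args) {
        String doc = "aaa bbb aaa ccc aaa";
        System.out.println("frequency of term [aaa] is " + frequencyOf("aaa", doc)); // 3
    }
}
```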

Let me know if this helps.
Cheers
AJ

2008/7/10 Karl Wettin <[EMAIL PROTECTED]>:

> Maybe you are looking for the document TermFreqVector?
>
>
>   karl
>
> 9 jul 2008 kl. 15.49 skrev jnance:
>
>
>> Hi,
>>
>> I am indexing lots of text files and need to see how many times a certain
>> word comes up in each text file. Right now I have this constructor for
>> "search":
>>
>> static void search(Searcher searcher, String queryString) throws
>> ParseException, IOException {
>> QueryParser parser = new QueryParser("content", new
>> StandardAnalyzer());
>> Query query = parser.parse(queryString);
>> Hits hits = searcher.search(query);
>>
>> int hitCount = hits.length();
>> if (hitCount == 0) {
>> System.out.println("0 documents contain the word
>> \"" + queryString +
>> ".\"");
>> }
>> else {
>> System.out.println(hitCount + " documents contain
>> the word \"" +
>> queryString + ".\"");
>> }
>> }
>>
>> This tells me how many documents contain the word I'm looking for... but
>> how
>> do I get it to tell me how many times the word occurs within that
>> document?
>>
>> Thanks,
>>
>> James
>> --
>> View this message in context:
>> http://www.nabble.com/Searching-for-instances-within-a-document-tp18362075p18362075.html
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>>
>> -
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>>
>


Highlighting terms with different style

2008-07-10 Thread jim

Hi

Is it possible to highlight more than one term with Highlighter, but
with a different style for each term?


1st term with SimpleHTMLFormatter("<span class=\"highlight1\">", "</span>");
2nd term with SimpleHTMLFormatter("<span class=\"highlight2\">", "</span>");
..
n-th term with SimpleHTMLFormatter("<span class=\"highlightN\">", "</span>");

or, for the following code

 SimpleHTMLFormatter formatter =
new SimpleHTMLFormatter("<span class=\"highlight\">", "</span>");

Highlighter highlighter = new Highlighter(formatter, scorer);

to use something like this:

SimpleHTMLFormatter[] formatters = {
    new SimpleHTMLFormatter("<span class=\"highlight1\">", "</span>"),
    new SimpleHTMLFormatter("<span class=\"highlight2\">", "</span>"),
    new SimpleHTMLFormatter("<span class=\"highlightN\">", "</span>") };

Highlighter highlighter = new Highlighter(formatters, scorer);
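
Highlighter itself only accepts a single Formatter, but Formatter is an
interface, so one custom implementation can pick a different style per term.
The dispatch inside such a formatter could look like this standalone sketch
(plain Java; the class name is mine, and in real code the method body would
sit inside Formatter.highlightTerm(originalText, tokenGroup)):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: map each query term to its own CSS class, the way a custom
// Formatter implementation could do per-token inside highlightTerm().
public class PerTermStyler {

    private final Map<String, String> termToClass = new HashMap<String, String>();

    PerTermStyler(String... terms) {
        // assign highlight1, highlight2, ... in the order the terms are given
        for (int i = 0; i < terms.length; i++) {
            termToClass.put(terms[i].toLowerCase(), "highlight" + (i + 1));
        }
    }

    // What highlightTerm() would return for one token of the fragment.
    String highlightTerm(String originalText) {
        String cssClass = termToClass.get(originalText.toLowerCase());
        if (cssClass == null) {
            return originalText;  // not a query term: leave it unchanged
        }
        return "<span class=\"" + cssClass + "\">" + originalText + "</span>";
    }

    public static void main(String[] args) {
        PerTermStyler styler = new PerTermStyler("lucene", "search");
        System.out.println(styler.highlightTerm("Lucene"));
        System.out.println(styler.highlightTerm("search"));
        System.out.println(styler.highlightTerm("the"));
    }
}
```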




Re: newbie question

2008-07-10 Thread Chris Bamford

Hi John,

Just continuing from an earlier question where I asked you how to handle 
strings like "from:fred flintston*" (sorry I have lost the original email).
You advised me to write my own BooleanQuery and add to it Prefix- / 
Term- / Phrase- Querys as appropriate.  I have done so, but am having 
trouble with the result - my PhraseQueries just do not get any hits at 
all  :-(
My code looks for quotes - if it finds them, it treats the quoted phrase 
as a PhraseQuery and sets the slop factor to 0.

so,  an input of:

   subject:"Good Morning"

results in a PhraseQuery (which I add to my BooleanQuery and then dump 
with toString()) of:


   +subject:"good morning"

... which fails.
However, if I break it into 2 TermQuerys, it works (but that's not what 
I want).


What am I missing?

Thanks,

- Chris




Can we update a field on the current index

2008-07-10 Thread Aditi Goyal
Hi,

I want to modify a field on the current index. Can it be done?
From what I have heard, we cannot update the index in place. The document
has to be reindexed by deleting it and then indexing it again.


Thanks,
Aditi


Re: newbie question (for John Griffin)

2008-07-10 Thread Chris Bamford

Hi John,

Further to my question below, I did some back-to-basics investigation of 
PhraseQueries and found that even basic ones fail for me...
I found the attached code on the Internet (see 
http://affy.blogspot.com/2003/04/codebit-examples-for-all-of-lucenes.html) 
and this fails too...  Can you explain why?  I would expect the first 
test to deliver 2 hits.


I have tried with Lucene 2.0 and 2.3.2 jars and both fail.

Thanks again,

- Chris



Chris Bamford wrote:

Hi John,

Just continuing from an earlier question where I asked you how to 
handle strings like "from:fred flintston*" (sorry I have lost the 
original email).
You advised me to write my own BooleanQuery and add to it Prefix- / 
Term- / Phrase- Querys as appropriate.  I have done so, but am having 
trouble with the result - my PhraseQueries just do not get any hits at 
all  :-(
My code looks for quotes - if it finds them, it treats the quoted 
phrase as a PhraseQuery and sets the slop factor to 0.

so,  an input of:

   subject:"Good Morning"

results in a PhraseQuery (which I add to my BooleanQuery and then dump 
with toString()) of:


   +subject:"good morning"

... which fails.
However, if I break it into 2 TermQuerys, it works (but that's not 
what I want).


What am I missing?

Thanks,

- Chris





--

*Chris Bamford*
Senior Development Engineer 

/Email / MSN/   [EMAIL PROTECTED]
/Tel/   +44 (0)1344 381814  /Skype/ c.bamford

/*
 * To change this template, choose Tools | Templates
 * and open the template in the editor.
 */

package Experimental;


import java.io.IOException;
import java.util.LinkedList;

import junit.framework.TestCase;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.*;


public class TestRunQueries extends TestCase {

public void testQueries() throws Exception {

		RAMDirectory indexStore = new RAMDirectory();
		Query query = null;

		String docs[] = {
			"aaa bbb ccc",
			"aaa ddd eee",
			"aaa ddd fff",
			"aaa dee fff",
			"AAA fff ggg",
			"ggg hhh iii",
			"123 123 1z2",	// document containing z for 
	// WildcardQuery example.
			"999 123 123",	// document for fuzzy search.
			"9x9 123 123",	// document for fuzzy search.
			"99 123 123",	// document for fuzzy search.
			"xxx yyy zzz"
		};
		
		IndexWriter writer = new IndexWriter(indexStore, 
			new StandardAnalyzer(), true);
		for (int j = 0; j < docs.length; j++) {
			Document d = new Document();
			d.add(new Field("body", docs[j], Field.Store.YES, Field.Index.UN_TOKENIZED));
			writer.addDocument(d);
		}
		writer.close();

		IndexReader indexReader = IndexReader.open(indexStore);

		System.out.println("\n** PhraseQuery Example **");
		System.out.println("NOTE: 2 documents are found " 
			+ "with 'aaa ddd' in order.");
		System.out.println("--" 
			+ "");
		PhraseQuery pq = new  PhraseQuery();		
		pq.add(new Term("body", "aaa"));
		pq.add(new Term("body", "ddd"));
		TestRunQueries.runQueryAndDisplayResults(indexStore, pq);

		System.out.println("");
		System.out.println("NOTE: ZERO documents are" 
			+ " found with 'xxx ddd' in order.");
		System.out.println("---" 
			+ "---");
		pq = new  PhraseQuery();		
		pq.add(new Term("body", "xxx"));
		pq.add(new Term("body", "ddd"));
		TestRunQueries.runQueryAndDisplayResults(indexStore, pq);
	}

	public static void runQueryAndDisplayResults(Directory indexStore, Query q) throws IOException {
		IndexSearcher searcher = new IndexSearcher(indexStore);

System.out.println("runQueryAndDisplayResults: query = " + q.toString());
		Hits hits = searcher.search(q);
		int _length = hits.length();
		System.out.println("HITS: " + _length);
		for (int i = 0; i < _length; i++) {
			Document doc = hits.doc(i);
			Field field = doc.getField("body");
			System.out.println("  value: " + field.stringValue());
		}
		searcher.close();
	}
}

Re: Can we update a field on the current index

2008-07-10 Thread Michael McCandless


Yes, to update a single field you must delete the entire document and
then re-index a new one.
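
(If I remember right, IndexWriter.updateDocument(Term, Document) wraps that
delete+add pair up for you; the key point is that the whole document is
replaced. Semantically it behaves like this standalone sketch -- plain Java
maps, my own class name, not Lucene code:)

```java
import java.util.HashMap;
import java.util.Map;

// Semantic sketch of updateDocument(idTerm, newDoc): the old document is
// removed entirely and the re-built replacement is added; there is no way
// to patch a single field in place.
public class UpdateSketch {

    private final Map<String, Map<String, String>> docsById =
            new HashMap<String, Map<String, String>>();

    void updateDocument(String id, Map<String, String> newDoc) {
        docsById.remove(id);       // delete the old document entirely...
        docsById.put(id, newDoc);  // ...then add the replacement
    }

    Map<String, String> doc(String id) {
        return docsById.get(id);
    }

    public static void main(String[] args) {
        UpdateSketch index = new UpdateSketch();
        Map<String, String> v1 = new HashMap<String, String>();
        v1.put("title", "old");
        v1.put("body", "unchanged text");
        index.updateDocument("42", v1);

        // To change one field, the caller must re-supply *all* fields.
        Map<String, String> v2 = new HashMap<String, String>(v1);
        v2.put("title", "new");
        index.updateDocument("42", v2);

        System.out.println(index.doc("42").get("title")); // new
        System.out.println(index.doc("42").get("body"));  // unchanged text
    }
}
```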


There is some work underway, or at least a Jira issue opened, towards  
improving this situation, here:


https://issues.apache.org/jira/browse/LUCENE-1231

But it will be some time before that's available.

Mike

Aditi Goyal wrote:


Hi,

I want to modify a field on the current index. Can it be done?
From what I have heard, we cannot update the index in place. The document
has to be reindexed by deleting it and then indexing it again.


Thanks,
Aditi






Re: Searching for instances within a document

2008-07-10 Thread jnance

Yes, the term frequency vector is exactly what I needed. Thanks!

-James


Ajay Lakhani wrote:
> 
> Hi James,
> 
> Try this:
> 
> Searcher searcher = new IndexSearcher(dir);
> QueryParser parser = new QueryParser("content", new
> StandardAnalyzer());
> Query query = parser.parse(queryString);
> 
> HashSet queryTerms = new HashSet();
> query.extractTerms(queryTerms);
> 
> Hits hits = searcher.search(query);
> 
> IndexReader reader = IndexReader.open(dir);
> 
> for (int i =0; i < hits.length() ; i ++){
>   Document d = hits.doc(i);
>   Field fid = d.getField("id");
>   Field ftitle = d.getField("title");
>   System.out.println("id is " + fid.stringValue());
>   System.out.println("title is " + ftitle.stringValue());
> 
>   TermFreqVector tfv = reader.getTermFreqVector(hits.id(i),
> "content");
>   String[] terms = tfv.getTerms();
>   int [] freqs = tfv.getTermFrequencies();//get the frequencies
> 
>   // for each term in the query
>   for (Iterator iter = queryTerms.iterator(); iter.hasNext();) {
> Term term = (Term) iter.next();
> 
> // for each term in the vector
> for (int j = 0; j < terms.length; j++) {
>   if (terms[j].equals(term.text())) {
> System.out.println("frequency of term ["+ term.text() +"] is "
> +
> freqs[j] );
>   }
> }
>   }
> }
> 
> Let me know if this helps.
> Cheers
> AJ
> 
> 2008/7/10 Karl Wettin <[EMAIL PROTECTED]>:
> 
>> Maybe you are looking for the document TermFreqVector?
>>
>>
>>   karl
>>
>> 9 jul 2008 kl. 15.49 skrev jnance:
>>
>>
>>> Hi,
>>>
>>> I am indexing lots of text files and need to see how many times a
>>> certain
>>> word comes up in each text file. Right now I have this constructor for
>>> "search":
>>>
>>> static void search(Searcher searcher, String queryString) throws
>>> ParseException, IOException {
>>> QueryParser parser = new QueryParser("content", new
>>> StandardAnalyzer());
>>> Query query = parser.parse(queryString);
>>> Hits hits = searcher.search(query);
>>>
>>> int hitCount = hits.length();
>>> if (hitCount == 0) {
>>> System.out.println("0 documents contain the word
>>> \"" + queryString +
>>> ".\"");
>>> }
>>> else {
>>> System.out.println(hitCount + " documents
>>> contain
>>> the word \"" +
>>> queryString + ".\"");
>>> }
>>> }
>>>
>>> This tells me how many documents contain the word I'm looking for... but
>>> how
>>> do I get it to tell me how many times the word occurs within that
>>> document?
>>>
>>> Thanks,
>>>
>>> James
>>>
>>>
>>>
>>>
>>
>>
>>
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Searching-for-instances-within-a-document-tp18362075p18381743.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Best practice for updating an index when reindexing is not an option

2008-07-10 Thread Christopher Kolstad
Hi.

Currently using Lucene 2.3.2 in a tomcat webapp. We have an action
configured that performs reindexing on our staging server. However, our live
server cannot reindex since it does not have the necessary DTD files to
process the XML.

To update the index on the live server we perform a subversion update on the
lucene index directory.
Unfortunately this makes it necessary to stop the IndexSearcher while the
SubversionUpdate is doing its thing.

We now have a requirement from our customer not to disable search.

So my idea was to copy the index directory to another directory and then
switch the IndexSearcher from the original index directory to the temporary
directory.
Then perform the Subversion update, and when done, switch the IndexSearcher
back to the original (now, updated) index directory.

Does anyone have any other suggestions on how to update the index directory
from subversion without having to disable the IndexSearcher?

BR
Christopher

-- 
Regards,
Christopher Kolstad
=
|100 little bugs in the code, debug one, |
|recompile, 101 little bugs in the code |
=

E-mail: [EMAIL PROTECTED] (University)
[EMAIL PROTECTED] (Home)
[EMAIL PROTECTED] (Job)


Re: Payloads and SpanScorer

2008-07-10 Thread Grant Ingersoll

I'm not fully following what you want.  Can you explain a bit more?

Thanks,
Grant

On Jul 9, 2008, at 2:55 PM, Peter Keegan wrote:

If a SpanQuery is constructed from one or more BoostingTermQuery(s), the
payloads on the terms are never processed by the SpanScorer. It seems to me
that you would want the SpanScorer to score the document both on the spans
distance and the payload score. So, either the SpanScorer would have to
process the payloads (duplicating the code in BoostingSpanScorer), or
perhaps SpanScorer could access the BoostingSpanScorers, or maybe there's
another approach.

Any thoughts on how to accomplish this?

Peter


--
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ











Re: Payloads and SpanScorer

2008-07-10 Thread Peter Keegan
Suppose I create a SpanNearQuery phrase with the terms "long range missiles"
and some slop factor. Each term is actually a BoostingTermQuery. Currently,
the score computed by SpanNearQuery.SpanScorer is based on the sloppy
frequency of the terms and their weights (this is fine). But even though
each term is actually a BoostingTermQuery, the BoostingTermScorer (and
therefore 'processPayload') is never invoked for this type of query.

I was looking for a way to have SpanNearQuery (also SpanOrQuery,
SpanFirstQuery) recognize that the terms in the phrase should boost the
overall score based on the payloads assigned to them. Thus the score from
the SpanNearQuery would be higher if:

a) the terms have payloads that boost their scores
b) the terms are positionally next to each other (minimal slop - as it works
now)
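
One possible combination, just to sketch the arithmetic I have in mind (my
own formula and class name, not an existing Lucene scorer):

```java
// Sketch of one way a payload-aware span scorer could combine the two
// signals: scale the slop-based span score by the average payload boost
// of the matched terms. This is an illustrative formula, not Lucene code.
public class PayloadSpanScore {

    static float combined(float spanScore, float[] payloadBoosts) {
        if (payloadBoosts.length == 0) {
            return spanScore;  // no payloads seen: plain span score
        }
        float sum = 0f;
        for (float b : payloadBoosts) {
            sum += b;
        }
        return spanScore * (sum / payloadBoosts.length);
    }

    public static void main(String[] args) {
        // tight span AND boosted terms -> highest score
        System.out.println(combined(2.0f, new float[] {1.5f, 1.5f})); // 3.0
        // same span, neutral payloads -> unchanged span score
        System.out.println(combined(2.0f, new float[] {1.0f, 1.0f})); // 2.0
    }
}
```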


Does this make sense?

Peter

On Thu, Jul 10, 2008 at 9:21 AM, Grant Ingersoll <[EMAIL PROTECTED]>
wrote:

> I'm not fully following what you want.  Can you explain a bit more?
>
> Thanks,
> Grant
>
>
> On Jul 9, 2008, at 2:55 PM, Peter Keegan wrote:
>
>  If a SpanQuery is constructed from one or more BoostingTermQuery(s), the
>> payloads on the terms are never processed by the SpanScorer. It seems to
>> me
>> that you would want the SpanScorer to score the document both on the spans
>> distance and the payload score. So, either the SpanScorer would have to
>> process the payloads (duplicating the code in BoostingSpanScorer), or
>> perhaps SpanScorer could access the BoostingSpanScorers, or maybe there's
>> another approach.
>>
>> Any thoughts on how to accomplish this?
>>
>> Peter
>>
>
> --
> Grant Ingersoll
> http://www.lucidimagination.com
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
>
>
>
>
>
>
>
>
>
>


Re: Best practice for updating an index when reindexing is not an option

2008-07-10 Thread Michael McCandless


Why does SubversionUpdate require shutting down the IndexSearcher?   
What goes wrong?


You might want to switch instead to rsync.

A Lucene index is fundamentally write once, so, syncing changes over  
should simply be copying over new files and removing now-deleted  
files.  You won't be able to remove files held open by the  
IndexSearcher, but, once the IndexSearcher restarts you'd then be able  
to delete those files on the next sync.
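
The file-set arithmetic behind that sync is simple; a standalone sketch
(my own class name, hypothetical segment file names in the example):

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Sketch of syncing a write-once index: files present in the updated
// index but not in the live copy get copied over; files the live copy
// still has but the updated index dropped get deleted (deletion is
// deferred for any file a searcher still holds open).
public class IndexSync {

    static Set<String> filesToCopy(Set<String> updated, Set<String> live) {
        Set<String> copy = new TreeSet<String>(updated);
        copy.removeAll(live);    // only the new segment files
        return copy;
    }

    static Set<String> filesToDelete(Set<String> updated, Set<String> live) {
        Set<String> del = new TreeSet<String>(live);
        del.removeAll(updated);  // files the new commit no longer references
        return del;
    }

    public static void main(String[] args) {
        Set<String> updated = new HashSet<String>();
        updated.add("segments_2");
        updated.add("_1.cfs");
        Set<String> live = new HashSet<String>();
        live.add("segments_1");
        live.add("_0.cfs");
        System.out.println("copy:   " + filesToCopy(updated, live));
        System.out.println("delete: " + filesToDelete(updated, live));
    }
}
```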


Mike

Christopher Kolstad wrote:


Hi.

Currently using Lucene 2.3.2 in a tomcat webapp. We have an action
configured that performs reindexing on our staging server. However, our live
server cannot reindex since it does not have the necessary DTD files to
process the XML.

To update the index on the live server we perform a subversion update on the
lucene index directory.
Unfortunately this makes it necessary to stop the IndexSearcher while the
SubversionUpdate is doing its thing.

We now have a requirement from our customer not to disable search.

So my idea was to copy the index directory to another directory and then
switch the IndexSearcher from the original index directory to the temporary
directory.
Then perform the Subversion update, and when done, switch the IndexSearcher
back to the original (now, updated) index directory.

Does anyone have any other suggestions on how to update the index directory
from subversion without having to disable the IndexSearcher?

BR
Christopher

--
Regards,
Christopher Kolstad
=
|100 little bugs in the code, debug one, |
|recompile, 101 little bugs in the code |
=

E-mail: [EMAIL PROTECTED] (University)
[EMAIL PROTECTED] (Home)
[EMAIL PROTECTED] (Job)






Re: .fdt file

2008-07-10 Thread Yonik Seeley
On Thu, Jul 10, 2008 at 1:42 AM, blazingwolf7 <[EMAIL PROTECTED]> wrote:
> Well, I am trying to extract the URL and contentLength from the ".fdt" file.
> I am planning to use both of these values in a filter to remove certain
> links to be display in the search result. The problem is, I am told not to
> use the IndexReader to retrieve these values for each document found
> matching with the query.
>
> So now, instead, I will have to retrieve the entire .fdt file, extract both
> the values and store it into an arraylist which will be use later.  I am
> having problem extracting the entire file without using all the seek()
> method to determine the position of the document.
>
> Any suggestion?

You're trying to do things at too low a level (bypassing Lucene's
public APIs).
I suggested earlier that you index the URL untokenized, and then use
the FieldCache.  That will allow you to easily retrieve a String[] of
all the URLs.

-Yonik


> Yonik Seeley wrote:
>>
>> On Wed, Jul 9, 2008 at 11:13 PM, blazingwolf7 <[EMAIL PROTECTED]>
>> wrote:
>>> Sorry,but I am still quite new to Lucene. What exactly is "cp"?
>>
>> The unix command for copy (hence the smiley).
>>
>> Some of your recent questions seem to be suffering from an XY problem:
>> http://www.perlmonks.org/index.pl?node_id=542341
>> You may get more help by explaining what you are trying to do.
>>
>> -Yonik
>>
>>> Yonik Seeley wrote:

 On Wed, Jul 9, 2008 at 9:01 PM, blazingwolf7 <[EMAIL PROTECTED]>
 wrote:
> I had recently found out that Lucene will retrieve the content of a
> document
> from a file ".fdt". I am trying to retrieve the entire file in one go
> instead of retrieving it based on document number. can it be done?

 "cp" can retrieve the file on one go ;-)

 Other than that, the format is documented here:
 http://lucene.apache.org/java/docs/fileformats.html
 But I'm not sure why retrieving by document number won't work for you.

 -Yonik
>>
>>
>>
>>
>
> --
> View this message in context: 
> http://www.nabble.com/.fdt-file-tp18373913p18376301.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
>
>




RE: performance feedback

2008-07-10 Thread Beard, Brian
Currently the default setting is being used with our setup, so
autoCommit is true. I'll set this to false to see if it improves.

Question: If autoCommit is false, does this apply to optimization also?
During an hour-long optimization that gets killed in the middle, will the
index be left in the state it was in before the optimization started?

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik
Seeley
Sent: Wednesday, July 09, 2008 12:06 PM
To: java-user@lucene.apache.org
Subject: Re: performance feedback

On Wed, Jul 9, 2008 at 11:35 AM, Beard, Brian <[EMAIL PROTECTED]>
wrote:
> I will try tweaking RAM, and check about autoCommit=false. It's on the
> future agenda to multi-thread through the index writer. The indexing
> time I quoted includes the document creation time which would
definitely
> improve with multi-threading.
>
> I'm doing batch updates of up to 1000 a pop, and closing and
re-opening
> the IndexWriter in between.

autoCommit=false will definitely help, and there is normally no reason
not to use it.
Bigger batches (or a single batch) will also help indexing speed.  A
single IndexWriter session can now avoid copying stored fields on
segment merges.

-Yonik








Re: newbie question (for John Griffin) - fixed

2008-07-10 Thread Chris Bamford

Hi John,

Please ignore my earlier questions on this subject, as I have got to the 
bottom of it.
I was not passing each word in the phrase as a separate Term to the 
query; instead I was passing the whole string (doh!).
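
For the record, the splitting that fixed it looks roughly like this
(standalone sketch, my own class name; in the real code each element then
becomes a Term added to the PhraseQuery):

```java
// Sketch of the fix: a PhraseQuery needs one Term per word, so the quoted
// phrase must be analyzed/split first. Lower-case + whitespace split
// approximates what StandardAnalyzer does to plain words at index time.
public class PhraseTerms {

    static String[] termsFor(String quotedPhrase) {
        // each element then becomes: pq.add(new Term("subject", element));
        return quotedPhrase.toLowerCase().split("\\s+");
    }

    public static void main(String[] args) {
        for (String t : termsFor("Good Morning")) {
            System.out.println(t);
        }
    }
}
```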


Thanks.

- Chris

Chris Bamford wrote:

Hi John,

Further to my question below, I did some back-to-basics investigation 
of PhraseQueries and found that even basic ones fail for me...
I found the attached code on the Internet (see 
http://affy.blogspot.com/2003/04/codebit-examples-for-all-of-lucenes.html) 
and this fails too...  Can you explain why?  I would expect the first 
test to deliver 2 hits.


I have tried with Lucene 2.0 and 2.3.2 jars and both fail.

Thanks again,

- Chris



Chris Bamford wrote:

Hi John,

Just continuing from an earlier question where I asked you how to 
handle strings like "from:fred flintston*" (sorry I have lost the 
original email).
You advised me to write my own BooleanQuery and add to it Prefix- / 
Term- / Phrase- Querys as appropriate.  I have done so, but am having 
trouble with the result - my PhraseQueries just do not get any hits 
at all  :-(
My code looks for quotes - if it finds them, it treats the quoted 
phrase as a PhraseQuery and sets the slop factor to 0.

so,  an input of:

   subject:"Good Morning"

results in a PhraseQuery (which I add to my BooleanQuery and then 
dump with toString()) of:


   +subject:"good morning"

... which fails.
However, if I break it into 2 TermQuerys, it works (but that's not 
what I want).


What am I missing?

Thanks,

- Chris










--

*Chris Bamford*
Senior Development Engineer 

/Email / MSN/   [EMAIL PROTECTED]
/Tel/   +44 (0)1344 381814  /Skype/ c.bamford





Re: performance feedback

2008-07-10 Thread Yonik Seeley
On Thu, Jul 10, 2008 at 11:13 AM, Beard, Brian <[EMAIL PROTECTED]> wrote:
> Question: If autoCommit is false, does this apply to optimization also?
> During an hour-long optimization that gets killed in the middle, will the
> index be left in the state it was in before the optimization started?

Yes.  But the longest merge is the biggest, so that would probably
happen almost as often with autoCommit=true too.

-Yonik




Re: Sorting case-insensitively

2008-07-10 Thread Paul J. Lucas

On Jul 9, 2008, at 10:14 PM, Chris Hostetter wrote:

I'm going to guess you have a doc where that field doesn't have a value.

ordinarily that's fine, but maybe SortComparator doesn't handle
that case very well.


But how does the built-in STRING sort work with null values then?  And
how do I make a SortComparator that works?



what's the full stack trace look like?


See below.

- Paul


java.lang.NullPointerException
	at org.apache.lucene.search.SortComparator$1.compare(SortComparator.java:54)
	at org.apache.lucene.search.FieldSortedHitQueue.lessThan(FieldSortedHitQueue.java:125)
	at org.apache.lucene.util.PriorityQueue.upHeap(PriorityQueue.java:139)
	at org.apache.lucene.util.PriorityQueue.put(PriorityQueue.java:53)
	at org.apache.lucene.util.PriorityQueue.insertWithOverflow(PriorityQueue.java:78)
	at org.apache.lucene.search.FieldSortedHitQueue.insertWithOverflow(FieldSortedHitQueue.java:108)
	at org.apache.lucene.search.TopFieldDocCollector.collect(TopFieldDocCollector.java:61)
	at org.apache.lucene.search.TermScorer.score(TermScorer.java:76)
	at org.apache.lucene.search.TermScorer.score(TermScorer.java:61)
	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:146)
	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:124)
	at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:100)
	at org.apache.lucene.search.Hits.<init>(Hits.java:77)
	at org.apache.lucene.search.Searcher.search(Searcher.java:55)





Re: Payloads and SpanScorer

2008-07-10 Thread Grant Ingersoll
Makes sense.  It was always my intent to implement things like
PayloadNearQuery, see http://wiki.apache.org/lucene-java/Payload_Planning

I think it would make sense to develop these and I would be happy to
help shepherd a patch through, but am not in a position to generate
said patch at this moment in time.


On Jul 10, 2008, at 9:59 AM, Peter Keegan wrote:

Suppose I create a SpanNearQuery phrase with the terms "long range missiles"
and some slop factor. Each term is actually a BoostingTermQuery. Currently,
the score computed by SpanNearQuery.SpanScorer is based on the sloppy
frequency of the terms and their weights (this is fine). But even though
each term is actually a BoostingTermQuery, the BoostingTermScorer (and
therefore 'processPayload') is never invoked for this type of query.

I was looking for a way to have SpanNearQuery (also SpanOrQuery,
SpanFirstQuery) recognize that the terms in the phrase should boost the
overall score based on the payloads assigned to them. Thus the score from
the SpanNearQuery would be higher if:

a) the terms have payloads that boost their scores
b) the terms are positionally next to each other (minimal slop - as it works
now)


Does this make sense?

Peter

On Thu, Jul 10, 2008 at 9:21 AM, Grant Ingersoll <[EMAIL PROTECTED]>
wrote:


I'm not fully following what you want.  Can you explain a bit more?

Thanks,
Grant


On Jul 9, 2008, at 2:55 PM, Peter Keegan wrote:

If a SpanQuery is constructed from one or more BoostingTermQuery(s), the
payloads on the terms are never processed by the SpanScorer. It seems to me
that you would want the SpanScorer to score the document both on the spans
distance and the payload score. So, either the SpanScorer would have to
process the payloads (duplicating the code in BoostingSpanScorer), or
perhaps SpanScorer could access the BoostingSpanScorers, or maybe there's
another approach.

Any thoughts on how to accomplish this?

Peter



--
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ












--
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ











Re: Payloads and SpanScorer

2008-07-10 Thread Peter Keegan
I may take a crack at this. Any more thoughts you may have on the
implementation are welcome, but I don't want to distract you too much.

Thanks,
Peter


On Thu, Jul 10, 2008 at 1:30 PM, Grant Ingersoll <[EMAIL PROTECTED]>
wrote:

> Makes sense.  It was always my intent to implement things like
> PayloadNearQuery, see http://wiki.apache.org/lucene-java/Payload_Planning
>
> I think it would make sense to develop these and I would be happy to help
> shepherd a patch through, but am not in a position to generate said patch at
> this moment in time.
>
>
> On Jul 10, 2008, at 9:59 AM, Peter Keegan wrote:
>
>  Suppose I create a SpanNearQuery phrase with the terms "long range
>> missiles"
>> and some slop factor. Each term is actually a BoostingTermQuery.
>> Currently,
>> the score computed by SpanNearQuery.SpanScorer is based on the sloppy
>> frequency of the terms and their weights (this is fine). But even though
>> each term is actually a BoostingTermQuery, the BoostingTermScorer (and
>> therefore 'processPayload') is never invoked for this type of query.
>>
>> I was looking for a way to have SpanNearQuery (also SpanOrQuery,
>> SpanFirstQuery) recognize that the terms in the phrase should boost the
>> overall score based on the payloads assigned to them. Thus the score from
>> the SpanNearQuery would be higher if :
>>
>> a) the terms have payloads that boost their scores
>> b) the terms are positionally next to each other (minimal slop - as it
>> works
>> now)
>>
>>
>> Does this make sense?
>>
>> Peter
>>
>> On Thu, Jul 10, 2008 at 9:21 AM, Grant Ingersoll <[EMAIL PROTECTED]>
>> wrote:
>>
>>  I'm not fully following what you want.  Can you explain a bit more?
>>>
>>> Thanks,
>>> Grant
>>>
>>>
>>> On Jul 9, 2008, at 2:55 PM, Peter Keegan wrote:
>>>
>>> If a SpanQuery is constructed from one or more BoostingTermQuery(s), the
>>> payloads on the terms are never processed by the SpanScorer. It seems to
>>> me that you would want the SpanScorer to score the document both on the
>>> spans distance and the payload score. So, either the SpanScorer would
>>> have to process the payloads (duplicating the code in
>>> BoostingSpanScorer), or perhaps SpanScorer could access the
>>> BoostingSpanScorers, or maybe there's another approach.

 Any thoughts on how to accomplish this?

 Peter


>>> --
>>> Grant Ingersoll
>>> http://www.lucidimagination.com
>>>
>>> Lucene Helpful Hints:
>>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>>> http://wiki.apache.org/lucene-java/LuceneFAQ
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> -
>>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>>> For additional commands, e-mail: [EMAIL PROTECTED]
>>>
>>>
>>>
> --
> Grant Ingersoll
> http://www.lucidimagination.com
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
>
>
>
>
>
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


Re: Sorting case-insensitively

2008-07-10 Thread Chris Hostetter

: But how does the built-in STRING sort work with null values then?  And how do
: I make a SortComparator that works?

Built in string sorting uses FieldCache.DEFAULT.getStringIndex() ... any 
doc without a value ends up without an assignment in StringIndex.order[], 
so it gets the default value of 0.  In most cases only the order[] needs 
to be consulted and the ints are compared -- in the case of a 
RemoteSearcher or something like it, the StringIndex.lookup[] can be 
consulted using the index from StringIndex.order[] -- and the 0th slot 
of StringIndex.lookup[] is null.  
(see FieldSortedHitQueue.comparatorString to see what I mean)
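
The null-at-slot-zero convention described above can be sketched like so (against the Lucene 2.3-era FieldCache API; the helper class and method names are made up for illustration):

```java
import org.apache.lucene.search.FieldCache;

// Sketch of how built-in STRING sorting sees missing values:
// order[doc] indexes into lookup[], a doc with no value keeps the
// default 0, and lookup[0] is null by construction.
public class StringIndexSketch {
    public static boolean hasValue(FieldCache.StringIndex si, int doc) {
        return si.lookup[si.order[doc]] != null;
    }
}
```

In practice the StringIndex comes from FieldCache.DEFAULT.getStringIndex(reader, field).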

But that doesn't really relate to what you're doing, because it doesn't 
even deal with the SortComparator class -- it returns an 
anonymous direct subclass of ScoreDocComparator.

I suspect the ScoreDocComparator in SortComparator is buggy in that it 
assumes there will be a cachedValue for every ScoreDoc -- even though 
there is no guarantee of that.  If you could submit a test case that 
reproduces this using a trivial subclass (just return the original String 
as the Comparable), that would help us verify the bug and the fix.

Assuming I'm right, I don't really have any good workaround suggestion 
for you beyond overriding newComparator() in your SortComparator subclass 
to explicitly test for null yourself.
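
For reference, the trivial subclass mentioned above (just returning the original string as the Comparable) is simply this; the access modifier is widened to public only so it can be exercised directly:

```java
import org.apache.lucene.search.SortComparator;

// Trivial SortComparator subclass for reproducing the bug:
// the term text itself is used as the Comparable.
public class IdentitySortComparator extends SortComparator {
    public Comparable getComparable(String termtext) {
        return termtext;
    }
}
```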

: > what's the full stack trace look like?
: 
: See below.

thanks ... I just wanted to double-check there wasn't some atypical code 
path involved (RemoteMultiSearcher or anything)


-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: how to get IndexReader Remote?

2008-07-10 Thread Chris Hostetter

: I have a remote MultiSearcher, bound via 
: Naming.bind("rmi://"+IP+":"+PORT+"/"+NAME, RemoteSearchable)
: , but MultiSearcher doesn't have getIndexReader().
: How do I get an IndexReader?
: How to get IndexReader?

It's not possible to get a remote IndexReader ... that's the main 
distinction between the Searchable interface and the Searcher class -- the 
methods available in Searchable can be (decently) streamed across the wire 
-- but returning a whole IndexReader remotely would be ... insane ... in 
most cases.


-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Boolean expression for no terms OR matching a wildcard

2008-07-10 Thread Ronald Rudy
I need to perform a query for a term that may or may not have values,  
and I need to check for the conditions where either no terms are  
indexed OR any and ALL indexed terms match a wildcard.


For example, say the following values were indexed as terms in the  
field "myfield" in the three documents:


1) terms "abc123" and "abcdef123"
2) terms "abc123", "def123" and "abcdef123"
3) no terms

I want my query with a wildcard search of "+myfield:abc*123" to match  
on both 1 and 3 but NOT 2.


Is this possible?

Thanks,
  - Ron

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



how to get total hit count for each Searchable?

2008-07-10 Thread xin liu
Hi,
I have individual index files for Audio, Image and PDF files. We build common 
meta fields for them. When I search for a string, I want the search to default 
to returning mixed results from these 3 different indexes, ranked by relevancy. 
But I also want to know the hit count for each individual index type. For 
example, I want to get:
Mixed total hit count: 105, with the first 10 HitItems.
Total hits in Audio: 73
Total hits in Image: 17
Total hits in PDF: 15

Right now, I'm doing it the following way:
1. Get one Searchable instance each for the Audio, Image, and PDF indexes;
2. Construct a ParallelMultiSearcher with the above 3 Searchables as 
parameters; call its search to get the total hit count and the first 10 hit items;
3. Call the Audio searchable to get the total hit count for Audio;
4. Call the Image searchable to get the total hit count for Image;
5. Call the PDF searchable to get the total hit count for PDF.

So, Lucene will need to do 6 search operations across these 3 indexes, and 
performance will definitely be an issue.

Can any experts give me some advice? Thanks!

Tony



   

Re: .fdt file

2008-07-10 Thread blazingwolf7

Thanks. I think I will follow the advice. But just for the sake of curiosity,
can what I suggested be done?


Yonik Seeley wrote:
> 
> On Thu, Jul 10, 2008 at 1:42 AM, blazingwolf7 <[EMAIL PROTECTED]>
> wrote:
>> Well, I am trying to extract the URL and contentLength from the ".fdt"
>> file.
>> I am planning to use both of these values in a filter to remove certain
>> links to be display in the search result. The problem is, I am told not
>> to
>> use the IndexReader to retrieve these values for each document found
>> matching with the query.
>>
>> So now, instead, I will have to retrieve the entire .fdt file, extract
>> both
>> the values and store it into an arraylist which will be use later.  I am
>> having problem extracting the entire file without using all the seek()
>> method to determine the position of the document.
>>
>> Any suggestion?
> 
> You're trying to do things at too low of a level (bypassing Lucene's
> public APIs)
> I suggested earlier that you index the URL untokenized, and then use
> the FieldCache.  That will allow you to easily retrieve a String[] of
> all the URLs.
> 
> -Yonik
> 
> 
>> Yonik Seeley wrote:
>>>
>>> On Wed, Jul 9, 2008 at 11:13 PM, blazingwolf7 <[EMAIL PROTECTED]>
>>> wrote:
 Sorry,but I am still quite new to Lucene. What exactly is "cp"?
>>>
>>> The unix command for copy (hence the smiley).
>>>
>>> Some of your recent questions seem to be suffering from an XY problem:
>>> http://www.perlmonks.org/index.pl?node_id=542341
>>> You may get more help by explaining what you are trying to do.
>>>
>>> -Yonik
>>>
 Yonik Seeley wrote:
>
> On Wed, Jul 9, 2008 at 9:01 PM, blazingwolf7 <[EMAIL PROTECTED]>
> wrote:
>> I had recently found out that Lucene will retrieve the content of a
>> document
>> from a file ".fdt". I am trying to retrieve the entire file in one go
>> instead of retrieving it based on document number. can it be done?
>
> "cp" can retrieve the file on one go ;-)
>
> Other than that, the format is documented here:
> http://lucene.apache.org/java/docs/fileformats.html
> But I'm not sure why retrieving by document number won't work for you.
>
> -Yonik
>>>
>>> -
>>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>>> For additional commands, e-mail: [EMAIL PROTECTED]
>>>
>>>
>>>
>>
>> --
>> View this message in context:
>> http://www.nabble.com/.fdt-file-tp18373913p18376301.html
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>>
>> -
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>>
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/.fdt-file-tp18373913p18394786.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: .fdt file

2008-07-10 Thread Grant Ingersoll


On Jul 10, 2008, at 1:42 AM, blazingwolf7 wrote:



Well, I am trying to extract the URL and contentLength from the ".fdt"
file. I am planning to use both of these values in a filter to remove
certain links to be displayed in the search result. The problem is, I am
told not to use the IndexReader to retrieve these values for each document
found matching with the query.


Are you implying that using the IR would solve your problem, but for  
some reason your architect (or whatever you call the person making  
the decisions) told you not to?  If so, can you explain the  
reasoning a bit more?





So now, instead, I will have to retrieve the entire .fdt file, extract
both the values and store them into an arraylist which will be used later.
I am having problems extracting the entire file without using all the
seek() calls to determine the position of the document.

Any suggestion?


Yonik Seeley wrote:


On Wed, Jul 9, 2008 at 11:13 PM, blazingwolf7  
<[EMAIL PROTECTED]>

wrote:

Sorry,but I am still quite new to Lucene. What exactly is "cp"?


The unix command for copy (hence the smiley).

Some of your recent questions seem to be suffering from an XY  
problem:

http://www.perlmonks.org/index.pl?node_id=542341
You may get more help by explaining what you are trying to do.

-Yonik


Yonik Seeley wrote:


On Wed, Jul 9, 2008 at 9:01 PM, blazingwolf7 <[EMAIL PROTECTED] 
>

wrote:
I had recently found out that Lucene will retrieve the content  
of a

document
from a file ".fdt". I am trying to retrieve the entire file in  
one go

instead of retrieving it based on document number. can it be done?


"cp" can retrieve the file on one go ;-)

Other than that, the format is documented here:
http://lucene.apache.org/java/docs/fileformats.html
But I'm not sure why retrieving by document number won't work for  
you.


-Yonik


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





--
View this message in context: 
http://www.nabble.com/.fdt-file-tp18373913p18376301.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



--
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ








-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: .fdt file

2008-07-10 Thread blazingwolf7

Well, according to him, using the reader to access the index every time a
document is found to retrieve certain values is inefficient. Meaning if
there are 500k documents, the index will be accessed 500k times, which might
affect the performance of the search.

So I am instructed to retrieve all the necessary values at the beginning of
the search and store them; later the values will be retrieved from there. I
am cracking my head trying to do that %-|
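
Yonik's earlier suggestion (index the URL untokenized and use the FieldCache) does exactly that up-front loading in one call. A minimal sketch against the Lucene 2.x API; the field name "url" and the class name are assumptions:

```java
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;

public class UrlCacheSketch {
    // Loads every URL once, indexed by Lucene document number. Requires
    // "url" to have been indexed as a single untokenized term per document.
    // The first call fills the cache; later calls on the same reader are
    // cheap, so per-hit lookups become simple array accesses.
    public static String[] loadUrls(IndexReader reader) throws IOException {
        return FieldCache.DEFAULT.getStrings(reader, "url");
    }
}
```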


Grant Ingersoll-6 wrote:
> 
> 
> On Jul 10, 2008, at 1:42 AM, blazingwolf7 wrote:
> 
>>
>> Well, I am trying to extract the URL and contentLength from the  
>> ".fdt" file.
>> I am planning to use both of these values in a filter to remove  
>> certain
>> links to be display in the search result. The problem is, I am told  
>> not to
>> use the IndexReader to retrieve these values for each document found
>> matching with the query.
> 
> Are you implying that using the IR would solve your problem, but for  
> some reason your architect (or whatever you call the person making  
> the decisions) told you not to?  If so, can you explain the  
> reasoning a bit more?
> 
>>
>>
>> So now, instead, I will have to retrieve the entire .fdt file,  
>> extract both
>> the values and store it into an arraylist which will be use later.   
>> I am
>> having problem extracting the entire file without using all the seek()
>> method to determine the position of the document.
>>
>> Any suggestion?
>>
>>
>> Yonik Seeley wrote:
>>>
>>> On Wed, Jul 9, 2008 at 11:13 PM, blazingwolf7  
>>> <[EMAIL PROTECTED]>
>>> wrote:
 Sorry,but I am still quite new to Lucene. What exactly is "cp"?
>>>
>>> The unix command for copy (hence the smiley).
>>>
>>> Some of your recent questions seem to be suffering from an XY  
>>> problem:
>>> http://www.perlmonks.org/index.pl?node_id=542341
>>> You may get more help by explaining what you are trying to do.
>>>
>>> -Yonik
>>>
 Yonik Seeley wrote:
>
> On Wed, Jul 9, 2008 at 9:01 PM, blazingwolf7 <[EMAIL PROTECTED] 
> >
> wrote:
>> I had recently found out that Lucene will retrieve the content  
>> of a
>> document
>> from a file ".fdt". I am trying to retrieve the entire file in  
>> one go
>> instead of retrieving it based on document number. can it be done?
>
> "cp" can retrieve the file on one go ;-)
>
> Other than that, the format is documented here:
> http://lucene.apache.org/java/docs/fileformats.html
> But I'm not sure why retrieving by document number won't work for  
> you.
>
> -Yonik
>>>
>>> -
>>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>>> For additional commands, e-mail: [EMAIL PROTECTED]
>>>
>>>
>>>
>>
>> -- 
>> View this message in context:
>> http://www.nabble.com/.fdt-file-tp18373913p18376301.html
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>>
>> -
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
> 
> --
> Grant Ingersoll
> http://www.lucidimagination.com
> 
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
> 
> 
> 
> 
> 
> 
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/.fdt-file-tp18373913p18395519.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: newbie question (for John Griffin)

2008-07-10 Thread John Griffin
Chris,

The code you refer to in the blog is 5 years old! Some of the code is no
longer valid with the newer Lucene jars. I wouldn't use it to test anything.


My suspicion is that your index itself is suspect. Let's see the code you
use to build the index with a small data set that will show what you are
trying to accomplish.

BUT FIRST! Look at your built index with Luke before doing this to make sure
that what you THINK you have in your index is really what you have.

Luke is at http://www.getopt.org/luke/. This is probably THE most important
tool you'll have in your arsenal and is pretty easy to use. You can query
your index with it and see if it responds the way you think it should. You
can enter your subject:"Good Morning" query and see what happens. If Luke
can't find what you're querying for, then your code won't either. 

John G.


-Original Message-
From: Chris Bamford [mailto:[EMAIL PROTECTED] 
Sent: Thursday, July 10, 2008 5:58 AM
To: java-user@lucene.apache.org
Subject: Re: newbie question (for John Griffin)

Hi John,

Further to my question below, I did some back-to-basics investigation of 
PhraseQueries and found that even basic ones fail for me...
I found the attached code on the Internet (see 
http://affy.blogspot.com/2003/04/codebit-examples-for-all-of-lucenes.html) 
and this fails too...  Can you explain why?  I would expect the first 
test to deliver 2 hits.

I have tried with Lucene 2.0 and 2.3.2 jars and both fail.

Thanks again,

- Chris



Chris Bamford wrote:
> Hi John,
>
> Just continuing from an earlier question where I asked you how to 
> handle strings like "from:fred flintston*" (sorry I have lost the 
> original email).
> You advised me to write my own BooleanQuery and add to it Prefix- / 
> Term- / Phrase- Querys as appropriate.  I have done so, but am having 
> trouble with the result - my PhraseQueries just do not get any hits at 
> all  :-(
> My code looks for quotes - if it finds them, it treats the quoted 
> phrase as a PhraseQuery and sets the slop factor to 0.
> so,  an input of:
>
>subject:"Good Morning"
>
> results in a PhraseQuery (which I add to my BooleanQuery and then dump 
> with toString()) of:
>
>+subject:"good morning"
>
> ... which fails.
> However, if I break it into 2 TermQuerys, it works (but that's not 
> what I want).
>
> What am I missing?
>
> Thanks,
>
> - Chris
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>


-- 

*Chris Bamford*
Senior Development Engineer 

/Email / MSN/   [EMAIL PROTECTED]
/Tel/   +44 (0)1344 381814  /Skype/ c.bamford



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: newbie question (for John Griffin) - fixed

2008-07-10 Thread John Griffin
Chris,

-Original Message-
From: Chris Bamford [mailto:[EMAIL PROTECTED] 
Sent: Thursday, July 10, 2008 9:15 AM
To: java-user@lucene.apache.org
Subject: Re: newbie question (for John Griffin) - fixed

Hi John,

Please ignore my earlier questions on this subject, as I have got to the 
bottom of it.
I was not passing each word in the phrase as a separate Term to the 
query; 

==I'm not sure I understand. You want a phrase query so they should be
==passed as a phrase in quotes.


instead I was passing the whole string (doh!).
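
In code, the fix looks like this — a sketch, with the terms already analyzed (i.e. lowercased, as StandardAnalyzer would produce them):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

public class PhraseSketch {
    public static PhraseQuery goodMorning() {
        PhraseQuery pq = new PhraseQuery();
        pq.setSlop(0);                           // exact phrase
        pq.add(new Term("subject", "good"));     // one Term per word,
        pq.add(new Term("subject", "morning"));  // not one Term per phrase
        return pq;
    }
}
```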

Thanks.

- Chris

Chris Bamford wrote:
> Hi John,
>
> Further to my question below, I did some back-to-basics investigation 
> of PhraseQueries and found that even basic ones fail for me...
> I found the attached code on the Internet (see 
> http://affy.blogspot.com/2003/04/codebit-examples-for-all-of-lucenes.html)

> and this fails too...  Can you explain why?  I would expect the first 
> test to deliver 2 hits.
>
> I have tried with Lucene 2.0 and 2.3.2 jars and both fail.
>
> Thanks again,
>
> - Chris
>
>
>
> Chris Bamford wrote:
>> Hi John,
>>
>> Just continuing from an earlier question where I asked you how to 
>> handle strings like "from:fred flintston*" (sorry I have lost the 
>> original email).
>> You advised me to write my own BooleanQuery and add to it Prefix- / 
>> Term- / Phrase- Querys as appropriate.  I have done so, but am having 
>> trouble with the result - my PhraseQueries just do not get any hits 
>> at all  :-(
>> My code looks for quotes - if it finds them, it treats the quoted 
>> phrase as a PhraseQuery and sets the slop factor to 0.
>> so,  an input of:
>>
>>subject:"Good Morning"
>>
>> results in a PhraseQuery (which I add to my BooleanQuery and then 
>> dump with toString()) of:
>>
>>+subject:"good morning"
>>
>> ... which fails.
>> However, if I break it into 2 TermQuerys, it works (but that's not 
>> what I want).
>>
>> What am I missing?
>>
>> Thanks,
>>
>> - Chris
>>
>> -
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>
>
> 
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]


-- 

*Chris Bamford*
Senior Development Engineer 

/Email / MSN/   [EMAIL PROTECTED]
/Tel/   +44 (0)1344 381814  /Skype/ c.bamford


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Deletions

2008-07-10 Thread John Griffin
Guys (and Gals),

 

A question on index deletions, what exactly happens to the Lucene document
numbers in an index when a document is deleted? Let's say I have a 5 doc
index.

 

Document #  Doc
0           doc1
1           doc2
2           doc3
3           doc4
4           doc5

If document #2 (doc3) is deleted, is this what I'm left with?

Document #  Doc
0           doc1
1           doc2
2           doc4
3           doc5

This is my assumption. If not, what DOES happen?

 

TIA

 

John G.

 

 

 



Re: Can we update a field on the current index

2008-07-10 Thread Aditi Goyal
Thanks Mike for your valuable time.

Regards,
Aditi

On Thu, Jul 10, 2008 at 5:36 PM, Michael McCandless <
[EMAIL PROTECTED]> wrote:

>
> Yes you must delete the entire document and then re-index a new one, to
> update a single Field.
>
> There is some work underway, or at least a Jira issue opened, towards
> improving this situation, here:
>
>https://issues.apache.org/jira/browse/LUCENE-1231
>
> But it will be some time before that's available.
>
> Mike
>
>
> Aditi Goyal wrote:
>
>  Hi,
>>
>> I want to modify a field on the current index. Can it be done?
>> For what I have heard that we cannot update the index . It has to be
>> reindexed by deleting and then indexing again.
>>
>>
>> Thanks,
>> Aditi
>>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
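
As an aside on the delete-then-re-add cycle Mike describes above: since Lucene 2.1, IndexWriter.updateDocument performs both steps in one call. A sketch, in which the "id" field name is an assumption for illustration:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;

public class UpdateSketch {
    public static void replaceDoc(Directory dir, String id, Document newDoc)
            throws Exception {
        // Open the existing index (create=false).
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), false);
        // Deletes every document whose "id" term matches, then adds newDoc.
        writer.updateDocument(new Term("id", id), newDoc);
        writer.close();
    }
}
```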


Re: Deletions

2008-07-10 Thread Anshum
Hi John,

In the case of deletions, the delete is deferred. The doc is just marked
as deleted, leaving a void in the numbering of docs. The actual shifting of
document ids happens only when you optimize the index; at that point the
marked docs are physically removed from the index.
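
The numbering behavior can be sketched like so (assuming a writable IndexReader; the class and method names are made up for illustration):

```java
import java.io.IOException;
import org.apache.lucene.index.IndexReader;

public class DeletionSketch {
    // Marks one document deleted and reports {maxDoc, numDocs}.
    // maxDoc stays the same -- the number is not reused until the index
    // is optimized -- while numDocs drops by one.
    public static int[] deleteAndCount(IndexReader reader, int docNum)
            throws IOException {
        reader.deleteDocument(docNum);
        return new int[] { reader.maxDoc(), reader.numDocs() };
    }
}
```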

Hope that clears the doubt :)

--
Anshum Gupta
Naukri Labs!

On Fri, Jul 11, 2008 at 8:24 AM, John Griffin <[EMAIL PROTECTED]>
wrote:

> Guys (and Gals),
>
>
>
> A question on index deletions, what exactly happens to the Lucene document
> numbers in an index when a document is deleted? Let's say I have a 5 doc
> index.
>
>
>
> Document #  Doc
>
> 0  doc1
>
> 1  doc2
>
> 2  doc3
>
> 3  doc4
>
> 4  doc5
>
>
>
> If doc 2 is deleted, is this what I'm left with?
>
>
>
> Document #  Doc
>
> 0  doc1
>
> 1  doc2
>
> 2  doc4
>
> 3  doc5
>
>
>
> This is my assumption. If not, what DOES happen?
>
>
>
> TIA
>
>
>
> John G.
>
>
>
>
>
>
>
>


-- 
--
The facts expressed here belong to everybody, the opinions to me.
The distinction is yours to draw