SV: Find "latest" document (before a certain date)

2007-08-29 Thread Per Lindberg
 
> From: Karl Wettin [mailto:[EMAIL PROTECTED]]
> On 28 Aug 2007, at 17:48, Per Lindberg wrote:
> 
> > Now, I want to search the content, and return only the
> > LATEST found document with each id. To complicate
> > things a bit, I want the latest before a given date. In other
> > words, for each id pick only the one with the highest date
> > less than x.
> 
> Given you added documents with version time stamp in chronological
> order, how about using a RangeQuery and pick the hit with the
> greatest document number?

Yep, that did the trick! There seems to be no Filter that can do
the final picking of the highest date, so I had to do that after the
search.

I use IndexSearcher.search with a RangeFilter;
I presume that's just as efficient as a RangeQuery?
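
For reference, a rough sketch of what I ended up doing (Lucene 2.x-era
classes; the index path, field names and date format are just examples):

    // find the latest version of one id on or before a cutoff date, relying on
    // documents having been added in chronological order (per Karl's tip)
    IndexSearcher searcher = new IndexSearcher("/path/to/index");
    Query query = new TermQuery(new Term("id", "some-id"));
    Filter beforeDate = RangeFilter.Less("date", "20070801");  // dates indexed as yyyyMMdd strings
    Hits hits = searcher.search(query, beforeDate);

    int latest = -1;
    for (int i = 0; i < hits.length(); i++) {
        // the highest internal document number is the most recently added match
        latest = Math.max(latest, hits.id(i));
    }
    Document newest = (latest >= 0) ? searcher.doc(latest) : null;
    searcher.close();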

Thanks!
Per



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: SV: Find "latest" document (before a certain date)

2007-08-29 Thread tom
Tom Roberts is out of the office until 3rd September 2007 and will get back to 
you on his return.

http://www.luxonline.org.uk
http://www.lux.org.uk




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Indexer / Searcher holding deleted files

2007-08-29 Thread Aleksander M. Stensby
Hello everyone. I have a system where an indexing-process is running  
several times a day, adding documents, and performing an optimize() at the  
end of every run.
In addition, we have a web-application (running in tomcat) that is used to  
perform searches on the index(es).


The problem (probably because of my lack of knowledge) is that when the
indexer has performed its optimize routine, marking the old files as (deleted)
in the filesystem (a Unix system), the files are not actually removed, because
Tomcat is keeping them open... So, as you can all imagine, the LVM volume
keeps growing... The problem of course solves itself through a Tomcat restart,
but performing restarts every other day or so is not an ideal solution...


I presume it's the IndexReader and/or IndexSearcher that is keeping the
files held open (in the web application). So, I was wondering if
any of you have any input on how I can release the files (or, ideally, just
delete them right after the optimize routine in the indexer).


Thank you very much for any feedback!

--
Aleksander M. Stensby
Senior Software Developer
Integrasco A/S

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Indexer / Searcher holding deleted files

2007-08-29 Thread Mark Miller

Reopen the Searchers/Readers that are holding the files open.

Aleksander M. Stensby wrote:
Hello everyone. I have a system where an indexing-process is running 
several times a day, adding documents, and performing an optimize() at 
the end of every run.
In addition, we have a web-application (running in tomcat) that is 
used to perform searches on the index(es).


The problem (probably because of my lack of knowledge) is that when 
the indexer has performed its optimize routine, marking the files as 
(deleted) in the filesystem (a unix system), the files are not 
deleted, because tomcat is keeping the files locked... SO as you can 
all imagine, the lvm is ever growing... Problem of course solves 
itself through a Tomcat-restart, but that's not a very ideal solution 
to perform restarts every other day or so...


I presume its the IndexReader and/or IndexSearcher that is keeping the 
files locked for deletion (in the web-application). So, i was 
wondering if any of you have any input on how I can release the files 
(or actually just delete them after the optimize routine in the 
indexer right away...


Than you very much for any feedback!



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Find "latest" document (before a certain date)

2007-08-29 Thread Karl Wettin


On 29 Aug 2007, at 12:29, Per Lindberg wrote:


how about using a RangeQuery and pick the hit with the
greatest document number?


Yep, that did the trick! There seems to be no Filter that can do
the final picking of the highest date, so I had to do that after the
search.

I use IndexSearcher.search with a RangeFilter,
I presume that's just as efficient as a RangeQuery?


It depends, especially on how you reuse the filter.

Benchmark to be sure.


--
kalle


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Highlighter that works with phrase and span queries

2007-08-29 Thread Mark Miller
It kind of is a contrib -- it's really just a new Scorer class (with some 
auxiliary helper classes) for the old contrib Highlighter. Since the 
contrib Highlighter is pretty hardened at this point, I figured that was 
the best way to go. Or do you mean something different?


- Mark

Mike Klaas wrote:

Mark,

I'm still interested in integrating this into Solr--this is a feature 
that has been requested a few times.  It would be easier to do so if 
it were a contrib/...


thanks for the great work,
-Mike

On 27-Aug-07, at 4:21 AM, Mark Miller wrote:

I am a bit unclear about your question. The patch you mention extends 
the original Highlighter to support phrase and span queries. It does 
not include any major performance increases over the original 
Highlighter (in fact, it takes a bit longer to Highlight a Span or 
Phrase query than it does to just highlight Terms).


Will it be released with the next version of Lucene? Doesn't look 
like it, but anything is possible. A few people are using it, but 
there has not been widespread interest that I have seen. My guess is 
that there are just not enough people trying to highlight Span 
queries -- which I'd blame on a lack of Span support in the default 
Lucene Query syntax.


Whether it is included soon or not, the code works well and I will 
continue to support it.


- Mark

Michael Stoppelman wrote:
Is this jar going to be in the next release of lucene? Also, are these the
same as the changes in the following patch:
https://issues.apache.org/jira/secure/attachment/12362653/spanhighlighter10.patch

-M

On 6/27/07, Mark Miller <[EMAIL PROTECTED]> wrote:




I have not looked at any highlighting code yet. Is there already an
extension of PhraseQuery that has getSpans()?

Currently I am using this code originally by M. Harwood:

    Term[] phraseQueryTerms = ((PhraseQuery) query).getTerms();

    SpanQuery[] clauses = new SpanQuery[phraseQueryTerms.length];
    for (int i = 0; i < phraseQueryTerms.length; i++) {
        clauses[i] = new SpanTermQuery(phraseQueryTerms[i]);
    }

    SpanNearQuery sp = new SpanNearQuery(clauses,
            ((PhraseQuery) query).getSlop(), false);
    sp.setBoost(query.getBoost());

I don't think it is perfect logic for PhraseQuery's edit distance, but
it approximates extremely well in most cases.

I wonder if this approach to Highlighting would be worth it in the end.
Certainly, it would seem to require that you store offsets or you would
have to re-tokenize anyway.

Some more interesting "stuff" on the current Highlighter methods:

We can gain a lot of speed on the implementation of the current
Highlighter if we grab from the source text in bigger chunks. Ronnie's
Highlighter appears to be faster than the original due to two things: he
doesn't have to re-tokenize text and he rebuilds the original document
in large pieces. Depending on how you want to look at it, he loses most
of the speed gained from just looking at the Query tokens instead of all
tokens to pulling the Term offset information (which appears pretty slow).

If you use a SimpleAnalyzer on docs around 1800 tokens long, you can
actually match the speed of Ronnie's highlighter with the current
highlighter if you just rebuild the highlighted documents in bigger
pieces, i.e. instead of going through each token and adding the source
text that it covers, build up the offset information until you get
another hit and then pull from the source text into the highlighted text
in one big piece rather than a token's worth at a time. Of course this is
not compatible with the way the Fragmenter currently works. If you use
the StandardAnalyzer instead of SimpleAnalyzer, Ronnie's highlighter
wins because it takes so darn long to re-analyze.

It is also interesting to note that it is very difficult to see a
gain in using TokenSources to build a TokenStream. Using the
StandardAnalyzer, it takes docs that are 1800 tokens just to be as fast
as re-analyzing. Notice I didn't say fast, but "as fast". Anything
smaller, or if you're using a simpler analyzer, and TokenSources is
certainly not worth it. It just takes too long to pull TermVector info.


- Mark



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Highlighter that works with phrase and span queries

2007-08-29 Thread Mark Miller
The patch you refer to should include the javadoc/source code. If that 
is not sufficient, drop me a line privately and I will email you all of 
the source code / javadoc.


- Mark

Michael Stoppelman wrote:

Ah, much clearer now. It seems that the jar file is just the class files. Is
the source/javadoc code somewhere else?

-M

On 8/27/07, Mark Miller <[EMAIL PROTECTED]> wrote:
  

I am a bit unclear about your question. The patch you mention extends
the original Highlighter to support phrase and span queries. It does not
include any major performance increases over the original Highlighter
(in fact, it takes a bit longer to Highlight a Span or Phrase query than
it does to just highlight Terms).

Will it be released with the next version of Lucene? Doesn't look like
it, but anything is possible. A few people are using it, but there has
not been widespread interest that I have seen. My guess is that there
are just not enough people trying to highlight Span queries -- which I'd
blame on a lack of Span support in the default Lucene Query syntax.

Whether it is included soon or not, the code works well and I will
continue to support it.

- Mark

Michael Stoppelman wrote:

Is this jar going to be in the next release of lucene? Also, are these the
same as the changes in the following patch:
https://issues.apache.org/jira/secure/attachment/12362653/spanhighlighter10.patch

-M

On 6/27/07, Mark Miller <[EMAIL PROTECTED]> wrote:

I have not looked at any highlighting code yet. Is there already an
extension of PhraseQuery that has getSpans()?

Currently I am using this code originally by M. Harwood:

    Term[] phraseQueryTerms = ((PhraseQuery) query).getTerms();

    SpanQuery[] clauses = new SpanQuery[phraseQueryTerms.length];
    for (int i = 0; i < phraseQueryTerms.length; i++) {
        clauses[i] = new SpanTermQuery(phraseQueryTerms[i]);
    }

    SpanNearQuery sp = new SpanNearQuery(clauses,
            ((PhraseQuery) query).getSlop(), false);
    sp.setBoost(query.getBoost());

I don't think it is perfect logic for PhraseQuery's edit distance, but
it approximates extremely well in most cases.

I wonder if this approach to Highlighting would be worth it in the end.
Certainly, it would seem to require that you store offsets or you would
have to re-tokenize anyway.

Some more interesting "stuff" on the current Highlighter methods:

We can gain a lot of speed on the implementation of the current
Highlighter if we grab from the source text in bigger chunks. Ronnie's
Highlighter appears to be faster than the original due to two things: he
doesn't have to re-tokenize text and he rebuilds the original document
in large pieces. Depending on how you want to look at it, he loses most
of the speed gained from just looking at the Query tokens instead of all
tokens to pulling the Term offset information (which appears pretty slow).

If you use a SimpleAnalyzer on docs around 1800 tokens long, you can
actually match the speed of Ronnie's highlighter with the current
highlighter if you just rebuild the highlighted documents in bigger
pieces, i.e. instead of going through each token and adding the source
text that it covers, build up the offset information until you get
another hit and then pull from the source text into the highlighted text
in one big piece rather than a token's worth at a time. Of course this is
not compatible with the way the Fragmenter currently works. If you use
the StandardAnalyzer instead of SimpleAnalyzer, Ronnie's highlighter
wins because it takes so darn long to re-analyze.

It is also interesting to note that it is very difficult to see a
gain in using TokenSources to build a TokenStream. Using the
StandardAnalyzer, it takes docs that are 1800 tokens just to be as fast
as re-analyzing. Notice I didn't say fast, but "as fast". Anything
smaller, or if you're using a simpler analyzer, and TokenSources is
certainly not worth it. It just takes too long to pull TermVector info.

- Mark




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Caching IndexSearcher in a webapp [was: Find "latest" document (before a certain date)]

2007-08-29 Thread Per Lindberg

> From: Karl Wettin [mailto:[EMAIL PROTECTED]]

> On 29 Aug 2007, at 12:29, Per Lindberg wrote:
> 
> >> how about using a RangeQuery and pick the hit with the
> >> greatest document number?
> >
> > Yep, that did the trick! There seems to be no Filter that can do
> > the final picking of the highest date, so I had to do that after the
> > search.
> >
> > I use IndexSearcher.search with a RangeFilter,
> > I presume that's just as efficient as a RangeQuery?
> 
> It depends, especially on how you reuse the filter.

For each search request (it's a webapp) I currently create
a new IndexSearcher, new Filter and new Sort, call
searcher.search(query, filter, sorter) and later searcher.close().

The literature says that it is desirable to cache the IndexSearcher,
but there's no mention of the memory cost! Since it is said to
take a long time to create, I presume that the IndexSearcher
reads the index files and keeps a lot of stuff in memory, so
the thought of caching one for each HttpSession gives me bad vibes.

(Also keeping open files; the file locking scheme in NTFS
can prevent Tomcat from doing hot redeploy if the webapp
has open files).

> Benchmark to be sure

So far searches with Lucene feel astonishingly fast! Yay! :-)




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Caching IndexSearcher in a webapp [was: Find "latest" document (before a certain date)]

2007-08-29 Thread Karl Wettin


On 29 Aug 2007, at 14:32, Per Lindberg wrote:


For each search request (it's a webapp) I currently create
a new IndexSearcher, new Filter and new Sort, call
searcher.search(query, filter, sorter) and later searcher.close().


You really want to reuse the IndexSearcher until new data has
been added to the index. I suppose the same thing goes for filters
and perhaps even sorts?

Start here:

http://wiki.apache.org/lucene-java/ImproveSearchingSpeed


--
kalle



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Caching IndexSearcher in a webapp [was: Find "latest" document (before a certain date)]

2007-08-29 Thread Patrick Turcotte
Hi,

Answers in the text.
> For each search request (it's a webapp) I currently create
> a new IndexSearcher, new Filter and new Sort, call
> searcher.search(query, filter, sorter) and later searcher.close().
>
> The literature says that it is desirable to cache the IndexSearcher,
> but there's no mention of the memory cost! Since it is said to
> take a long time to create, I presume that the IndexSearcher
> reads the index files and keeps a lot of stuff in memory, so
> the thought of caching one for each HttpSession gives me bad vibes.

Why don't you put it into the context scope
[servletContext.setAttribute("index", indexSearcher)]?

You can have it initialized on startup with init() and cleaned up on
shutdown with destroy().
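
A minimal sketch of that idea (Servlet API plus Lucene 2.x; the listener
class name, attribute name and index path are just illustrations):

    import java.io.IOException;
    import javax.servlet.ServletContextEvent;
    import javax.servlet.ServletContextListener;
    import org.apache.lucene.search.IndexSearcher;

    public class SearcherLifecycleListener implements ServletContextListener {

        // open one shared IndexSearcher at startup and publish it in the context
        public void contextInitialized(ServletContextEvent sce) {
            try {
                IndexSearcher searcher = new IndexSearcher("/path/to/index");
                sce.getServletContext().setAttribute("index", searcher);
            } catch (IOException e) {
                throw new RuntimeException("could not open index", e);
            }
        }

        // close it again on shutdown so the container can release the files
        public void contextDestroyed(ServletContextEvent sce) {
            IndexSearcher searcher =
                (IndexSearcher) sce.getServletContext().getAttribute("index");
            try {
                if (searcher != null) searcher.close();
            } catch (IOException e) {
                // nothing useful to do on shutdown
            }
        }
    }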

Hope this helps.

Patrick

>
> (Also keeping open files; the file locking scheme in NTFS
> can prevent Tomcat from doing hot redeploy if the webapp
> has open files).
>
> > Benchmark to be sure
>
> So far searches with Lucene feel astonishingly fast! Yay! :-)
>
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Indexer / Searcher holding deleted files

2007-08-29 Thread Aleksander M. Stensby

Hmm, yeah, well that's what I do now...
Shouldn't it be sufficient to do:

searcher.close();
(...)
searcher = new IndexSearcher(indexPath);

Or?

And maybe wrap that in
if (searcher.getIndexReader().hasDeletions())
and possibly (!searcher.getIndexReader().isCurrent())

thanks,
Aleksander


On Wed, 29 Aug 2007 13:15:33 +0200, Mark Miller <[EMAIL PROTECTED]>  
wrote:



Reopen the Searchers/Readers that are holding the files open.

Aleksander M. Stensby wrote:
Hello everyone. I have a system where an indexing-process is running  
several times a day, adding documents, and performing an optimize() at  
the end of every run.
In addition, we have a web-application (running in tomcat) that is used  
to perform searches on the index(es).


The problem (probably because of my lack of knowledge) is that when the  
indexer has performed its optimize routine, marking the files as  
(deleted) in the filesystem (a unix system), the files are not deleted,  
because tomcat is keeping the files locked... SO as you can all  
imagine, the lvm is ever growing... Problem of course solves itself  
through a Tomcat-restart, but that's not a very ideal solution to  
perform restarts every other day or so...


I presume its the IndexReader and/or IndexSearcher that is keeping the  
files locked for deletion (in the web-application). So, i was wondering  
if any of you have any input on how I can release the files (or  
actually just delete them after the optimize routine in the indexer  
right away...


Than you very much for any feedback!



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]






--
Aleksander M. Stensby
Senior Software Developer
Integrasco A/S
[EMAIL PROTECTED]
Tlf.: +47 41 22 82 72

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Indexer / Searcher holding deleted files

2007-08-29 Thread Erick Erickson
I'd guess that you're not closing *all* of your searchers.
Which is reinforced somewhat by the fact that bouncing your
Tomcat instance cleans things up. Do you perhaps open
a reader in the initialization code that never gets closed?

Erick

On 8/29/07, Aleksander M. Stensby <[EMAIL PROTECTED]> wrote:
>
> Hmm, yeah, well thats what I do now...
> Shouldn't it be sufficient to do:
>
> searcher.close();
> (...)
> searcher = new IndexSearcher(indexPath);
>
>
> Or?
>
> And maybe wrap that in
> if(searcher.getIndexReader.hasDeletions())
> and possibly (!searcher.getIndexReader.isCurrent())
>
> thanks,
> Aleksander
>
>
> On Wed, 29 Aug 2007 13:15:33 +0200, Mark Miller <[EMAIL PROTECTED]>
> wrote:
>
> > Reopen the Searchers/Readers that are holding the files open.
> >
> > Aleksander M. Stensby wrote:
> >> Hello everyone. I have a system where an indexing-process is running
> >> several times a day, adding documents, and performing an optimize() at
> >> the end of every run.
> >> In addition, we have a web-application (running in tomcat) that is used
> >> to perform searches on the index(es).
> >>
> >> The problem (probably because of my lack of knowledge) is that when the
> >> indexer has performed its optimize routine, marking the files as
> >> (deleted) in the filesystem (a unix system), the files are not deleted,
> >> because tomcat is keeping the files locked... SO as you can all
> >> imagine, the lvm is ever growing... Problem of course solves itself
> >> through a Tomcat-restart, but that's not a very ideal solution to
> >> perform restarts every other day or so...
> >>
> >> I presume its the IndexReader and/or IndexSearcher that is keeping the
> >> files locked for deletion (in the web-application). So, i was wondering
> >> if any of you have any input on how I can release the files (or
> >> actually just delete them after the optimize routine in the indexer
> >> right away...
> >>
> >> Than you very much for any feedback!
> >>
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
>
>
>
> --
> Aleksander M. Stensby
> Senior Software Developer
> Integrasco A/S
> [EMAIL PROTECTED]
> Tlf.: +47 41 22 82 72
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


Custom normalization in Similarity

2007-08-29 Thread Emmanuel Franquemagne

Hello,

I'd like to know if there is a way to apply a custom correction to the
similarity norm before it is written.
Ideally, we would do this by extending the Similarity class,
but encodeNorm, which would be the best place to do it, is a static
method, so overriding it is of no use.


Is there a reason why this method is static? And is there any solution 
that could allow us to do this value correction?


Thanks for any help,
Emmanuel

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Custom normalization in Similarity

2007-08-29 Thread Mark Miller
I think that encodeNorm and decodeNorm on Similarity are really just 
utility methods for norm encode/decode. It would be nice to be able to 
override those if you wanted to change the encode/decode method, but you 
should be able to modify the norm elsewhere. Actual access to the norm 
information is handled in IndexReader and this behavior can be 
overridden. This short example would hide the norms stored in the index 
and instead return 1.0f for each doc:


public class FakeNormsIndexReader extends FilterIndexReader {

    byte[] ones = SegmentReader.createFakeNorms(maxDoc());

    public FakeNormsIndexReader(IndexReader in) {
        super(in);
    }

    public synchronized byte[] norms(String field) throws IOException {
        return ones;
    }

    public synchronized void norms(String field, byte[] result, int offset) {
        System.arraycopy(ones, 0, result, offset, maxDoc());
    }
}
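
To use it, you would wrap the reader your application searches with; a quick
sketch (the index path is just an example):

    IndexReader reader = new FakeNormsIndexReader(IndexReader.open("/path/to/index"));
    IndexSearcher searcher = new IndexSearcher(reader);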

- Mark

Emmanuel Franquemagne wrote:

Hello,

I'd like to know if there is a way to perform custom correction to the 
similarity norm before it is written
At best, we wished we could do this by extending the Similarity class, 
but encodeNorm, that would be the best place to do it, is a static 
method and thus it's no use to override it.


Is there a reason why this method is static? And is there any solution 
that could allow us to do this value correction?


Thanks for any help,
Emmanuel

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Postal Code Radius Search

2007-08-29 Thread Mike
I've searched the mailing list archives, the web, read the FAQ, etc and I
don't see anything relevant so here it goes…

I'm trying to implement a radius based searching based on zip/postal codes.
 (The user enters their zip code and I show nearby matches under x miles
away sorted by linear distance.)  I already have the data required to pull
this off (zip codes, long/lat coordinates, etc.)   Extreme accuracy is not a
requirement.  It just needs to be an approximation (plus or minus a few
miles.)

What I'm looking for is a little direction.  How have others implemented
this type of search?  What are the pros/cons of various methods?  I have a
few ideas but obviously none of them are very good or I guess I wouldn't be
here asking.  ;)

By the way, my index is updated about every 10 minutes and holds about
25,000 records.  However, this may increase in the next year or so to
hundreds of thousands.  So whatever I do needs to be fairly scalable.  The
items being searched as well as the people searching will be located all
over the world.   Some areas may be busier than others so there is an
opportunity for caching more common locales.

Thank you for your time.  I'd appreciate any suggestions that you can give.

- Mike


Re: Postal Code Radius Search

2007-08-29 Thread Will Johnson
a CustomScoreQuery combined with a FieldCacheSource that holds the
lat/lon might work.


- will


On Aug 29, 2007, at 11:15 AM, Mike wrote:

I've searched the mailing list archives, the web, read the FAQ, etc  
and I

don't see anything relevant so here it goes…

I'm trying to implement a radius based searching based on zip/ 
postal codes.
 (The user enters their zip code and I show nearby matches under x  
miles
away sorted by linear distance.)  I already have the data required  
to pull
this off (zip codes, long/lat coordinates, etc.)   Extreme accuracy  
is not a
requirement.  It just needs to be an approximation (plus or minus a  
few

miles.)

What I'm looking for is a little direction.  How have others  
implemented
this type of search?  What are the pros/cons of various methods?  I  
have a
few ideas but obviously none of them are very good or I guess I  
wouldn't be

here asking.  ;)

By the way, my index is updated about every 10 minutes and holds about
25,000 records.  However, this may increase in the next year or so to
hundreds of thousands.  So whatever I do needs to be fairly  
scalable.  The
items being searched as well as the people searching will be  
located all

over the world.   Some areas may be busier than others so there is an
opportunity for caching more common locals.

Thank you for your time.  I'd appreciate any suggestions that you  
can give.


- Mike




SV: Caching IndexSearcher in a webapp [was: Find "latest" document (before a certain date)]

2007-08-29 Thread Per Lindberg
Kalle and Patrick: many thanks for the suggestions!

Caching the IndexSearcher in the ServletContext sounds like a very good idea.
However, I have to index a number of databases, each with a different Lucene
index. So keeping an IndexSearcher for each may come with a prohibitive
memory cost. But as far as I can tell, speed is not a problem; the cost of creating
a new IndexSearcher for each search is outweighed by HTTP protocol latency.

Thanks again!




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: indexing fields with multiplicity

2007-08-29 Thread Karl Wettin


On 28 Aug 2007, at 21:41, Tim Sturge wrote:


Hi,

I have fields which have high multiplicity; for example I have a  
topic with 1000 names, 500 of which are "USA" and 200 are "United  
States of America".


Previously I was indexing "USA USA .(500x).. USA United States of  
America .(200x).. United States of America" as as single field. The  
problem is that this causes this field to be less weighted for  
"USA" than a topic with a single name "USA".


So what I am now going to do is call

for (i = 0 ; i < 500 ; i++) {
   document.add(new Field("anchor","USA"));
}


Why do you do this? What is the effect you are looking for?


--
karl

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Postal Code Radius Search

2007-08-29 Thread Steven Rowe
Mike wrote:
> I've searched the mailing list archives, the web, read the FAQ, etc and I
> don't see anything relevant so here it goes…
> 
> I'm trying to implement a radius based searching based on zip/postal codes.

Here is a selection of interesting threads from the Lucene ML with
relevant info:







The standard answer seems to be something like:

1. Index latitude and longitude fields with fixed length
(left-zero-padded) integral values - shift the decimal point to the
right to the desired level of discriminability.  (In your case, convert
the postal codes to lats/longs.)

2. Do a range query on both your lat and your long fields to collect
hits inside a bounding box with your target at the center and with sides
of length double the desired radius.

3. Optionally, sort (and filter) the results by distance from your
target, displaying only those within the desired radius.  If you leave
out this step, you'll get some hits that are outside of the desired
radius - in between the bounding circle and the bounding box.
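
A rough sketch of steps 2 and 3 with Lucene 2.x-era classes; pad() and
haversineMiles() are hypothetical helpers you would write yourself, and the
field names and coordinates are made up:

    double targetLat = 44.7392, targetLon = -93.1234;  // looked up from the user's postal code
    double radiusMiles = 25.0;
    double dLat = radiusMiles / 69.0;                                         // ~69 miles per degree of latitude
    double dLon = radiusMiles / (69.0 * Math.cos(Math.toRadians(targetLat))); // degrees of longitude shrink with latitude

    // step 2: bounding box as two string range clauses over the padded fields
    BooleanQuery box = new BooleanQuery();
    box.add(new RangeQuery(new Term("lat", pad(targetLat - dLat)),
                           new Term("lat", pad(targetLat + dLat)), true),
            BooleanClause.Occur.MUST);
    box.add(new RangeQuery(new Term("lon", pad(targetLon - dLon)),
                           new Term("lon", pad(targetLon + dLon)), true),
            BooleanClause.Occur.MUST);
    Hits hits = searcher.search(box);

    // step 3: keep only hits within the true radius, remembering the distance for sorting
    for (int i = 0; i < hits.length(); i++) {
        Document d = hits.doc(i);
        double miles = haversineMiles(targetLat, targetLon,
                Double.parseDouble(d.get("latRaw")),
                Double.parseDouble(d.get("lonRaw")));
        if (miles <= radiusMiles) {
            // collect (d, miles) and sort by miles afterwards
        }
    }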

Steve

-- 
Steve Rowe
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: indexing fields with multiplicity

2007-08-29 Thread Tim Sturge
I'm looking for a boost when the anchor text is more commonly associated
with one topic than another. For example, the United States of America
is called "USA" by a lot of people. The United Space Alliance is also
called "USA", but by far fewer people.


If I just index them both with "USA" once, they will rank equally. I 
want the United States of America to rank higher.


Tim

Karl Wettin wrote:


On 28 Aug 2007, at 21:41, Tim Sturge wrote:


Hi,

I have fields which have high multiplicity; for example I have a 
topic with 1000 names, 500 of which are "USA" and 200 are "United 
States of America".


Previously I was indexing "USA USA .(500x).. USA United States of 
America .(200x).. United States of America" as as single field. The 
problem is that this causes this field to be less weighted for "USA" 
than a topic with a single name "USA".


So what I am now going to do is call

for (i = 0 ; i < 500 ; i++) {
document.add(new Field("anchor","USA"));
}


Why do you do this? What is the effect you are looking for?





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Postal Code Radius Search

2007-08-29 Thread Charles Patridge
Here is an example of getting all the zipcodes within a certain radius --
something I did in SAS, but I am sure you can convert the formula into
another language.

http://www.sconsig.com/sastips/tip00156.htm

Chuck Patridge

Charles Patridge
Full Capture Solutions, Inc.
333 Roberts Street, Suite 400
East Hartford, CT 06108
Phone: 860-291-9517 x 106
Email: [EMAIL PROTECTED]
-Original Message-
From: Steven Rowe [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, August 29, 2007 12:37 PM
To: java-user@lucene.apache.org
Subject: Re: Postal Code Radius Search

Mike wrote:
> I've searched the mailing list archives, the web, read the FAQ, etc
and I
> don't see anything relevant so here it goes...
> 
> I'm trying to implement a radius based searching based on zip/postal
codes.

Here is a selection of interesting threads from the Lucene ML with
relevant info:







The standard answer seems to be something like:

1. Index latitude and longitude fields with fixed length
(left-zero-padded) integral values - shift the decimal point to the
right to the desired level of discriminability.  (In your case, convert
the postal codes to lats/longs.)

2. Do a range query on both your lat and your long fields to collect
hits inside a bounding box with your target at the center and with sides
of length double the desired radius.

3. Optionally, sort (and filter) the results by distance from your
target, displaying only those within the desired radius.  If you leave
out this step, you'll get some hits that are outside of the desired
radius - inbetween the bounding circle and the bounding box.

Steve

-- 
Steve Rowe
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Postal Code Radius Search

2007-08-29 Thread Charles Patridge
Will,

http://www.sconsig.com/sastips/tip00156.htm

This is an example, written in SAS code, that I used to find all zipcodes
within a certain radius; it should be easy to convert to another language.

HTH,
Chuck P.

Charles Patridge
Full Capture Solutions, Inc.
333 Roberts Street, Suite 400
East Hartford, CT 06108
Phone: 860-291-9517 x 106
Email: [EMAIL PROTECTED]

-Original Message-
From: Will Johnson [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, August 29, 2007 11:46 AM
To: java-user@lucene.apache.org
Subject: Re: Postal Code Radius Search

a CustomScoreQuery combined with a FieldCacheSource that holds the  
the lat/lon might work.

- will


On Aug 29, 2007, at 11:15 AM, Mike wrote:

> I've searched the mailing list archives, the web, read the FAQ, etc  
> and I
> don't see anything relevant so here it goes...
>
> I'm trying to implement a radius based searching based on zip/ 
> postal codes.
>  (The user enters their zip code and I show nearby matches under x  
> miles
> away sorted by linear distance.)  I already have the data required  
> to pull
> this off (zip codes, long/lat coordinates, etc.)   Extreme accuracy  
> is not a
> requirement.  It just needs to be an approximation (plus or minus a  
> few
> miles.)
>
> What I'm looking for is a little direction.  How have others  
> implemented
> this type of search?  What are the pros/cons of various methods?  I  
> have a
> few ideas but obviously none of them are very good or I guess I  
> wouldn't be
> here asking.  ;)
>
> By the way, my index is updated about every 10 minutes and holds about
> 25,000 records.  However, this may increase in the next year or so to
> hundreds of thousands.  So whatever I do needs to be fairly  
> scalable.  The
> items being searched as well as the people searching will be  
> located all
> over the world.   Some areas may be busier than others so there is an
> opportunity for caching more common locals.
>
> Thank you for your time.  I'd appreciate any suggestions that you  
> can give.
>
> - Mike


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: indexing fields with multiplicity

2007-08-29 Thread Karl Wettin


On 29 Aug 2007, at 19:13, Tim Sturge wrote:

I'm looking for a boost when the anchor text is more commonly  
associated with one topic than another. For example the United  
States of America
is called "USA" by a lot of people. The United Space Alliance is  
also called "USA" but by many less people.


If I just index them both with "USA" once, they will rank equally.  
I want the United States of America to rank higher.


Why not use Field#setBoost(float)?


--
karl




Tim

Karl Wettin wrote:


On 28 Aug 2007, at 21:41, Tim Sturge wrote:


Hi,

I have fields which have high multiplicity; for example I have a  
topic with 1000 names, 500 of which are "USA" and 200 are "United  
States of America".


Previously I was indexing "USA USA .(500x).. USA United States of  
America .(200x).. United States of America" as as single field.  
The problem is that this causes this field to be less weighted  
for "USA" than a topic with a single name "USA".


So what I am now going to do is call

for (i = 0 ; i < 500 ; i++) {
document.add(new Field("anchor","USA"));
}


Why do you do this? What is the effect you are looking for?





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Highlighter that works with phrase and span queries

2007-08-29 Thread Mike Klaas
I just meant whether it would live in a lucene release (somewhere  
under contrib/) or just in JIRA.  Would including the functionality  
in Solr help get it into lucene?


-Mike

On 29-Aug-07, at 4:58 AM, Mark Miller wrote:

It kind of is a contrib -- its really just a new Scorer class (with  
some axillary helper classes) for the old contrib Highlighter.  
Since the contrib Highlighter is pretty hardened at this point, I  
figured that was the best way to go. Or do you mean something  
different?


- Mark

Mike Klaas wrote:

Mark,

I'm still interested in integrating this into Solr--this is a  
feature that has been requested a few times.  It would be easier  
to do so if it were a contrib/...


thanks for the great work,
-Mike

On 27-Aug-07, at 4:21 AM, Mark Miller wrote:

I am a bit unclear about your question. The patch you mention  
extends the original Highlighter to support phrase and span  
queries. It does not include any major performance increases over  
the original Highlighter (in fact, it takes a bit longer to  
Highlight a Span or Phrase query than it does to just highlight  
Terms).


Will it be released with the next version of Lucene? Doesn't look  
like it, but anything is possible. A few people are using it, but  
there has not been widespread interest that I have seen. My guess  
is that there are just not enough people trying to highlight Span  
queries -- which I'd blame on a lack of Span support in the  
default Lucene Query syntax.


Whether it is included soon or not, the code works well and I  
will continue to support it.


- Mark

Michael Stoppelman wrote:
Is this jar going to be in the next release of lucene? Also, are  
these the

same as the changes in the following patch:
https://issues.apache.org/jira/secure/attachment/12362653/ 
spanhighlighter10.patch


-M

On 6/27/07, Mark Miller <[EMAIL PROTECTED]> wrote:



I have not looked at any highlighting code yet. Is there  
already an



extension


of PhraseQuery that has getSpans() ?



Currently I am using this code originally by M. Harwood:

    Term[] phraseQueryTerms = ((PhraseQuery) query).getTerms();

    SpanQuery[] clauses = new SpanQuery[phraseQueryTerms.length];
    for (int i = 0; i < phraseQueryTerms.length; i++) {
        clauses[i] = new SpanTermQuery(phraseQueryTerms[i]);
    }

    SpanNearQuery sp = new SpanNearQuery(clauses,
            ((PhraseQuery) query).getSlop(), false);
    sp.setBoost(query.getBoost());

I don't think it is perfect logic for PhraseQuery's edit  
distance, but

it approximates extremely well in most cases.

I wonder if this approach to Highlighting would be worth it in  
the end.
Certainly, it would seem to require that you store offsets or  
you would

have to re-tokenize anyway.

Some more interesting "stuff" on the current Highlighter methods:

We can gain a lot of speed on the implementation of the current
Highlighter if we grab from the source text in bigger chunks.  
Ronnie's
Highlighter appears to be faster than the original due to two  
things: he
doesn't have to re-tokenize text and he rebuilds the original  
document
in large pieces. Depending on how you want to look at it, he  
loses most
of the speed gained from just looking at the Query tokens  
instead of all
tokens to pulling the Term offset information (which appears  
pretty slow).


If you use a SimpleAnalyzer on docs around 1800 tokens long,  
you can

actually match the speed of Ronnies highlighter with the current
highlighter if you just rebuild the highlighted documents in  
bigger
pieces i.e. instead of going through each token and adding the  
source

text that it covers, build up the offset information until you get
another hit and then pull from the source text into the  
highlighted text
in one big piece rather than a tokens worth at a time. Of  
course this is
not compatible with the way the Fragmenter currently works. If  
you use
the StandardAnalyzer instead of SimpleAnalyzer, Ronnie's  
highlighter

wins because it takes so darn long to re-analyze.

It is also interesting to note that it is very difficult to see  
in a

gain in using TokenSources to build a TokenStream. Using the
StandardAnalyzer, it takes docs that are 1800 tokens just to be  
as fast

as re-analyzing. Notice I didn't say fast, but "as fast". Anything
smaller, or if you're using a simpler analyzer, and  
TokenSources is
certainly not worth it. It just takes too long to pull  
TermVector info.


- Mark



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Large Index Architecture

2007-08-29 Thread Michael J. Prichard

Hello All,

I want to hear from those out there that have large (i.e. 50 GB+) 
indexes on how they have designed their architecture.  I currently have 
an index for email that is 10 GB and growing.  Right now there are no 
issues with it but I am about to get into an even bigger use for the 
software which will surely require access to much larger indexes.  
Should I begin to break off indexes into separate chunks and then search 
across them as needed?


For example, maybe break it out by "Month_Year" or "Day_Month_Year"?
Ideas and experienced practices welcome!


Thanks,
Michael

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: indexing fields with multiplicity

2007-08-29 Thread Tim Sturge

That's exactly my question. I feel like

for (i = 0 ; i < <n> ; i++) {
    document.add(new Field("anchor","USA"));
}

is exactly equivalent to

field = new Field("anchor","USA");
field.setBoost(<boost>);
document.add(field);

but I don't know the function that relates <n> and <boost>. I feel like
there's a correct information-theoretical answer and I'd like to know
what it is.


Tim

Karl Wettin wrote:


On 29 Aug 2007, at 19:13, Tim Sturge wrote:

I'm looking for a boost when the anchor text is more commonly 
associated with one topic than another. For example the United States 
of America
is called "USA" by a lot of people. The United Space Alliance is also 
called "USA" but by many less people.


If I just index them both with "USA" once, they will rank equally. I 
want the United States of America to rank higher.


Why not use Field#setBoost(float)?





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: indexing fields with multiplicity

2007-08-29 Thread Karl Wettin


On 29 Aug 2007, at 21:37, Tim Sturge wrote:


That's exactly my question. I feel like

for (i = 0 ; i <  ; i++) {
document.add(new Field("anchor","USA"));
}

is exactly equivalent to

field = new Field("anchor","USA"));
field.setBoost();
document.add(field);

but I don't know the function that relates  and . I feel  
like there's a correct information-theorectical answer and I'd like  
to know what it is.


You would have to refactor norm(t,d) in this computation:

http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/search/Similarity.html

However, the field boost is merged into the document boost, so it might
not translate as easily as you want. Perhaps payloads and
BoostingTermQuery fit your needs better.
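
For what it's worth, a minimal sketch of the norm(t,d) route, assuming the
Lucene 2.x Similarity API (whether the resulting weighting is what you want
is a separate question):

    // ignore field length so that repeating "USA" 500 times is not penalized;
    // the repetitions then only contribute through term frequency
    public class FlatLengthSimilarity extends DefaultSimilarity {
        public float lengthNorm(String fieldName, int numTokens) {
            return 1.0f;
        }
    }

    // set it on both sides:
    // writer.setSimilarity(new FlatLengthSimilarity());
    // searcher.setSimilarity(new FlatLengthSimilarity());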



--
karl





Tim

Karl Wettin wrote:


On 29 Aug 2007, at 19:13, Tim Sturge wrote:

I'm looking for a boost when the anchor text is more commonly  
associated with one topic than another. For example the United  
States of America
is called "USA" by a lot of people. The United Space Alliance is  
also called "USA" but by many less people.


If I just index them both with "USA" once, they will rank  
equally. I want the United States of America to rank higher.


Why not use Field#setBoost(float)?





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



How to speed-up index opening

2007-08-29 Thread Antoine Baudoux

Hello,

	I have an application with a 2GB index. A lot of documents (up to
10,000 per day) are added to / deleted from this index.


My customer would like a maximum delay of 7 minutes between a media item
being added to the system and it becoming searchable through the index.


So each 7 minutes or so i have to re-open the index.

My index takes 1.5 minutes to open. That's a long time compared to the 7
minutes.


What are the key parameters influencing the opening time?

Is it proportional to the number of documents? to the number of  
distinct terms? If i reduce the number of terms will it load faster?


Is there another way to improve index opening speed?


thanks!

Antoine

--
Antoine Baudoux
Development Manager
[EMAIL PROTECTED]
Tél.: +32 2 333 58 44
GSM: +32 499 534 538
Fax.: +32 2 648 16 53




Re: How to speed-up index opening

2007-08-29 Thread Chris Lu
Hi, Antoine,

It does take a long time to open the index reader.
One thing you could do is to put new documents into one smaller index and
re-open only that; it should be much faster.

Also, to avoid any service downtime, keep the current index reader open,
open a new index reader, and only then close the previous one.
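
A minimal sketch of that swap (the currentSearcher field and lock object are
assumptions of this sketch, not Lucene API):

    // open the new searcher first, then swap the reference the webapp uses
    IndexSearcher fresh = new IndexSearcher("/path/to/index");   // example path
    IndexSearcher stale;
    synchronized (lock) {
        stale = currentSearcher;
        currentSearcher = fresh;      // new requests now see the fresh searcher
    }
    // in real code, close only after requests still using 'stale' have finished
    if (stale != null) stale.close();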

-- 
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes

On 8/29/07, Antoine Baudoux <[EMAIL PROTECTED]> wrote:
>
> Hello,
>
> I have an application with a 2GB index. A lot of documents (up to
> 10.000 per day) are added/deleted to this index.
>
> My customer would like to have a Maximum of 7 minutes delay between a
> media added to the system and its searchability through the index.
>
> So each 7 minutes or so i have to re-open the index.
>
> My index takes 1.5 minute to open. Its a long time compared to the 7
> minutes.
>
> What are the key parameters influencing the opening time?
>
> Is it proportional to the number of documents? to the number of
> distinct terms? If i reduce the number of terms will it load faster?
>
> Is there another way to improve index opening speed?
>
>
> thanks!
>
> Antoine
>
> --
> Antoine Baudoux
> Development Manager
> [EMAIL PROTECTED]
> Tél.: +32 2 333 58 44
> GSM: +32 499 534 538
> Fax.: +32 2 648 16 53
>
>
>


Re: Large Index Architecture

2007-08-29 Thread Chris Lu
Index partitioning should be a good idea.
It'll save a lot of time on index merging and incremental indexing.

In my experience, partition size really depends on CPU, hard disk speed,
and memory size. Nowadays, with a Core 2 Duo, a 10G size for each chunk
should be good.
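
If you do partition, the partitions can still be searched as one index; a
sketch with Lucene 2.x-era classes (the paths are just examples):

    Searchable[] parts = new Searchable[] {
        new IndexSearcher("/indexes/mail_2007_07"),
        new IndexSearcher("/indexes/mail_2007_08"),
    };
    Searcher searcher = new MultiSearcher(parts);
    Hits hits = searcher.search(query);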

-- 
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes

On 8/29/07, Michael J. Prichard <[EMAIL PROTECTED]> wrote:
>
> Hello All,
>
> I want to hear from those out there that have large (i.e. 50 GB+)
> indexes on how they have designed their architecture.  I currently have
> an index for email that is 10 GB and growing.  Right now there are no
> issues with it but I am about to get into an even bigger use for the
> software which will surely require access to much larger indexes.
> Should I begin to break off indexes into separate chunks and then search
> across them as needed?
>
> For example, maybe break it out by "Month_Year Or "Day_Month_Year"?
> Ideas and experienced practices welcome!
>
> Thanks,
> Michael
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


Can a Lucene field be renamed in a Lucene index?

2007-08-29 Thread George Aroush
Hi everyone,

I have the following need and I wonder what my options are, or if anyone has
run into it and has a solution / suggestion.

I'm indexing a SQL database.  Each table is a Lucene index.  Now, in table
"A", I have a field called "Foo".  When I index it into Lucene, I also end
up with a field called "Foo".  Later on, the SQL database administrator,
will change the field name from "Foo" to "Bar".  Once this happens, any new
records added to table "A" will be indexed into Lucene as "Bar".

The issue is this, Lucene index for table "A" now has documents with some
having a field called "Foo" and others with "Bar".  This is problematic
because now a user can't just search for "Foo:dog", but must search for
"Foo:dog Bar:dog".

So, what are my options here?  No, I can't re-index.  Ideally, I would like
to be able to say to Lucene, "rename the field 'Foo' to 'Bar' in the index
'A'" (even if it means using private APIs).  Is this possible?  Have you run
into this problem?  What was your solution?

Regards,

-- George


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Can a Lucene field be renamed in a Lucene index?

2007-08-29 Thread Erik Hatcher

there was just this thread here recently:

  


hope that helps.

Erik


On Aug 29, 2007, at 10:03 PM, George Aroush wrote:


Hi everyone,

I have the following need and I wander what are my options or if  
anyone run

into it and has a solution / suggestion.

I'm indexing a SQL database.  Each table is a Lucene index.  Now,  
in table
"A", I have a field called "Foo".  When I index it into Lucene, I  
also end
up with a field called "Foo".  Later on, the SQL database  
administrator,
will change the field name from "Foo" to "Bar".  Once this happens,  
any new

records added to table "A" will be indexed into Lucene as "Bar".

The issue is this, Lucene index for table "A" now has documents  
with some
having a field called "Foo" and others with "Bar".  This is  
problematic
because now a user can't just search for "Foo:dog", but must search  
for

"Foo:dog Bar:dog".

So, what are my options here?  No, I can't re-index.  Ideally, I  
would like
to be able to say to Lucene, "rename the field 'Foo' to 'Bar' in  
the index
'A'" (even if it means using private APIs).  Is this possible?   
Have you run

into this problem?  What was your solution?

Regards,

-- George


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Can a Lucene field be renamed in a Lucene index?

2007-08-29 Thread George Aroush
Just read the thread.  Unfortunately, it doesn't offer a solution.

Is it possible to write a tool that will read the source index, and write it
to an output index with the field renamed?  No, the raw-text is not stored
in the Lucene index.

Thanks.

-- George

> -Original Message-
> From: Erik Hatcher [mailto:[EMAIL PROTECTED] 
> Sent: Wednesday, August 29, 2007 10:10 PM
> To: java-user@lucene.apache.org
> Subject: Re: Can a Lucene field be renamed in a Lucene index?
> 
> there was just this thread here recently:
> 
> tf4310043.html#a12269902>
> 
> hope that helps.
> 
>   Erik
> 
> 
> On Aug 29, 2007, at 10:03 PM, George Aroush wrote:
> 
> > Hi everyone,
> >
> > I have the following need and I wander what are my options or if 
> > anyone run into it and has a solution / suggestion.
> >
> > I'm indexing a SQL database.  Each table is a Lucene index. 
>  Now, in 
> > table "A", I have a field called "Foo".  When I index it 
> into Lucene, 
> > I also end up with a field called "Foo".  Later on, the SQL 
> database 
> > administrator, will change the field name from "Foo" to 
> "Bar".  Once 
> > this happens, any new records added to table "A" will be 
> indexed into 
> > Lucene as "Bar".
> >
> > The issue is this, Lucene index for table "A" now has 
> documents with 
> > some having a field called "Foo" and others with "Bar".  This is 
> > problematic because now a user can't just search for "Foo:dog", but 
> > must search for "Foo:dog Bar:dog".
> >
> > So, what are my options here?  No, I can't re-index.  
> Ideally, I would 
> > like to be able to say to Lucene, "rename the field 'Foo' 
> to 'Bar' in 
> > the index
> > 'A'" (even if it means using private APIs).  Is this possible?   
> > Have you run
> > into this problem?  What was your solution?
> >
> > Regards,
> >
> > -- George
> >
> >
> > 
> -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Can a Lucene field be renamed in a Lucene index?

2007-08-29 Thread Chris Lu
The easiest solution would be to change the SQL to
 select Bar as Foo, ... from your_table

Use an alias and maintain everything as before.

If it's not a solution, you may need to re-index everything.
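
As a stopgap until a full re-index is possible, the search side could also
query both field names; a sketch with Lucene 2.x-era classes (the analyzer
and query text are just examples, assuming the MultiFieldQueryParser variant
that takes a field array):

    // expands the user's query across the old and the new field name
    String[] fields = new String[] { "Foo", "Bar" };
    QueryParser parser = new MultiFieldQueryParser(fields, new StandardAnalyzer());
    Query q = parser.parse("dog");
    Hits hits = searcher.search(q);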

-- 
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes

On 8/29/07, George Aroush <[EMAIL PROTECTED]> wrote:
>
> Hi everyone,
>
> I have the following need and I wander what are my options or if anyone
> run
> into it and has a solution / suggestion.
>
> I'm indexing a SQL database.  Each table is a Lucene index.  Now, in table
> "A", I have a field called "Foo".  When I index it into Lucene, I also end
> up with a field called "Foo".  Later on, the SQL database administrator,
> will change the field name from "Foo" to "Bar".  Once this happens, any
> new
> records added to table "A" will be indexed into Lucene as "Bar".
>
> The issue is this, Lucene index for table "A" now has documents with some
> having a field called "Foo" and others with "Bar".  This is problematic
> because now a user can't just search for "Foo:dog", but must search for
> "Foo:dog Bar:dog".
>
> So, what are my options here?  No, I can't re-index.  Ideally, I would
> like
> to be able to say to Lucene, "rename the field 'Foo' to 'Bar' in the index
> 'A'" (even if it means using private APIs).  Is this possible?  Have you
> run
> into this problem?  What was your solution?
>
> Regards,
>
> -- George
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


Re: How to speed-up index opening

2007-08-29 Thread Michael Busch
Chris Lu wrote:
> Hi, Antoine,
> 
> It does take a long time to open the index reader.
> One thing you could do is to put new documents into one smaller index and
> re-open it, it should be much faster.
> 

We're planning to add a reopen() method to IndexReader that should
significantly speed up reopening a reader:
http://issues.apache.org/jira/browse/LUCENE-743.

This feature should be part of the next Lucene release.

- Michael

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Reduce copy error

2007-08-29 Thread Nguyen Manh Tien
When I run Nutch, I always get this error in the reduce task, and it runs
very slowly afterwards.
Does anyone know how to solve this problem?

Here is the log:
java.io.IOException: Insufficient space
at 
org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem$InMemoryOutputStream.write(InMemoryFileSystem.java:174)
at 
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:39)
at java.io.DataOutputStream.write(DataOutputStream.java:90)
at java.io.FilterOutputStream.write(FilterOutputStream.java:80)
at 
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.writeChunk(ChecksumFileSystem.java:326)
at 
org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunk(FSOutputSummer.java:140)
at 
org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:122)
at 
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.close(ChecksumFileSystem.java:310)
at 
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:49)
at 
org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:64)
at 
org.apache.hadoop.mapred.MapOutputLocation.getFile(MapOutputLocation.java:253)
at 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:673)
at 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:631)