Strategies for updating indexes.

2005-04-05 Thread Lee Turner
Hi

 

I was wondering whether anyone has any experience of multithreaded
updates to indexes.  In the web app I am working on there are additions,
updates and deletes that need to happen to the index throughout the
runtime of the application.  Also, the application is run in a cluster
with each app server having its own index.  This means that periodically
each app server is going to have to go through a re-indexing process to
make sure that its index has all the changes from the other app servers
in it.  This process can take a few seconds so if another update to the
index occurs at this time it will need to be queued in some way to make
sure it happens after the re-indexing.

 

I was just wondering if anyone had any pointers for doing this kind of
thing.  Any help would be greatly appreciated.

 

Many thanks

Lee

 

 

Lee Turner | Java Developer | Oyster Partners 

D. +44 (0)20 74461229 
T. +44 (0)20 7446 7500 

www.oyster.com

 



Re: Strategies for updating indexes.

2005-04-05 Thread Jens Kraemer
Hi,
please see comments below.

On Tue, Apr 05, 2005 at 08:38:04AM +0100, Lee Turner wrote:
> Hi
> 
> I was wondering whether anyone has any experience of multithreaded
> updates to indexes.  I the web app I am working on there are additions,
> updates and deletes that need to happen to the index throughout the
> runtime of the application.  Also, the application is run in a cluster
> with each app server having its own index.  This means that periodically
> each app server is going to have to go through a re-indexing process to
> make sure that its index has all the changes from the other app servers
> in it.  This process can take a few seconds so if another update to the
> index occurs at this time it will need to be queued in some way to make
> sure it happens after the re-indexing.
> 
> I was just wondering if anyone had any pointers for doing this kind of
> thing.  Any help would be gratefully appreciated.

I usually have a service class wrapping all access to the Lucene index,
which has a queue into which my Servlets or Actions put the documents to be
updated or added.  There is a single instance of this class for the
whole web app, and a thread that regularly wakes up and processes the
elements of the queue.

Note the queue has to be thread-safe or has to be synchronized
externally.

Since there is only one instance of this service class, it is the only
one that will ever write to the index (provided the same index is not used
by other applications).

During re-indexing, the thread that regularly processes the queue is
paused.  After re-indexing it is started again, processing all pending
changes from the queue.  The re-indexing itself takes place in another
thread, which in my case is started by Quartz.


hope this helps you somehow,

Jens


-- 
webit! Gesellschaft für neue Medien mbH  www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer   [EMAIL PROTECTED]
Schnorrstraße 76  Telefon +49 351 46766 0
D-01069 Dresden  Telefax +49 351 46766 66
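A minimal sketch of the queue-wrapping service described above (class,
method and timing choices are illustrative assumptions, not Jens' actual
code; pause()/resume() are what a re-indexing job would call around its run):

import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

// Single instance per web app; Servlets/Actions only ever call enqueue().
public class IndexUpdateService implements Runnable {

    private final LinkedList queue = new LinkedList(); // pending Documents
    private volatile boolean paused = false;           // set during re-indexing
    private final String indexDir;

    public IndexUpdateService(String indexDir) {
        this.indexDir = indexDir;
        new Thread(this, "index-updater").start();
    }

    public void enqueue(Document doc) {
        synchronized (queue) { queue.addLast(doc); }
    }

    public void pause()  { paused = true;  }  // called before re-indexing
    public void resume() { paused = false; }  // called after re-indexing

    public void run() {
        while (true) {
            try {
                Thread.sleep(5000);             // wake up regularly
                if (paused) continue;
                List batch = new ArrayList();
                synchronized (queue) {          // drain the queue atomically
                    batch.addAll(queue);
                    queue.clear();
                }
                if (batch.isEmpty()) continue;
                IndexWriter writer =
                    new IndexWriter(indexDir, new StandardAnalyzer(), false);
                for (int i = 0; i < batch.size(); i++) {
                    writer.addDocument((Document) batch.get(i));
                }
                writer.close();
            } catch (Exception e) {
                e.printStackTrace();            // real code should log/recover
            }
        }
    }
}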

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Strategies for updating indexes.

2005-04-05 Thread Lee Turner
Hi

Thank you for replying so quickly.  I am very pleased, as I have just started 
down the road of implementing a solution which is very nearly the same as the 
one you describe below.  It is good to know that I am not heading down a dead 
end.  I hadn't thought about the re-indexing thread pausing the queue to stop 
it processing while the re-indexing takes place.  I will also take a look at 
Quartz.

Your input is very much appreciated

Many thanks
Lee


-Original Message-
From: Jens Kraemer [mailto:[EMAIL PROTECTED] 
Sent: 05 April 2005 09:30
To: java-user@lucene.apache.org
Subject: Re: Strategies for updating indexes.

Hi,
please see comments below.

On Tue, Apr 05, 2005 at 08:38:04AM +0100, Lee Turner wrote:
> Hi
> 
> I was wondering whether anyone has any experience of multithreaded
> updates to indexes.  I the web app I am working on there are additions,
> updates and deletes that need to happen to the index throughout the
> runtime of the application.  Also, the application is run in a cluster
> with each app server having its own index.  This means that periodically
> each app server is going to have to go through a re-indexing process to
> make sure that its index has all the changes from the other app servers
> in it.  This process can take a few seconds so if another update to the
> index occurs at this time it will need to be queued in some way to make
> sure it happens after the re-indexing.
> 
> I was just wondering if anyone had any pointers for doing this kind of
> thing.  Any help would be gratefully appreciated.

I usually have a service class wrapping all access to the lucene index,
which has a queue where my Servlets or Actions put the documents to be
updated or added in.  There is a single instance of this class for the
whole web app, and a thread regularly waking up and processing the
elements of the queue.

Note the queue has to be threadsafe or has to be synchronized
externally. 

Since there is only one instance of this service class, it is the only
one who will ever write to the index (if the same index is not used by
other applications).

During re-indexing the thread regularly processing the queue will
be paused. After re-indexing it is started again, processing all pending 
changes from the queue. The re-indexing itself takes place in another
thread, which is started by quartz in my case.


hope this helps you somehow,

Jens


-- 
webit! Gesellschaft für neue Medien mbH  www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer   [EMAIL PROTECTED]
Schnorrstraße 76  Telefon +49 351 46766 0
D-01069 Dresden  Telefax +49 351 46766 66

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Strategies for updating indexes.

2005-04-05 Thread Nestel, Frank IZ/HZA-IOL
Hi,

we are using a very cautious method for batch updating.

We have long-running (hours) updates on our index, but a 
complete reindexing would take even longer (days).  I 
guess our strategy could be scaled down to hours or even
less.

So what we do is keep two instances
of the index.  There is a file which contains a link to the
index currently used for reading.  The search application
accesses the index through a thin transparent API which
watches this file and can switch between one search request
and the next, but keeps the IndexReader open as long
as no switching is needed.

When we reindex, we duplicate the "read" instance of the 
index and then do a selective update on this duplicate. 
Note we need some disk space and time for this.  After indexing
there is some checking and comparison between the old and new
instances of the index.  Only if this looks successful do we 
toggle the above "pointer" file, and the transparent API switches
to the new index.  If human inspection still finds the
checked index bad, we can always switch back to the previous,
presumably "good" index. 

Cheers,
Frank
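A minimal sketch of the pointer-file switching described above (file format,
class and method names are assumptions for illustration; closing the old
searcher while requests may still be using it needs more care than shown):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.search.IndexSearcher;

// Thin wrapper that follows a "pointer" file whose single line names the
// index directory currently approved for reading.
public class SwitchingSearcher {

    private final String pointerFile;
    private String currentDir;
    private IndexSearcher searcher;

    public SwitchingSearcher(String pointerFile) {
        this.pointerFile = pointerFile;
    }

    // Called at the start of every search request.
    public synchronized IndexSearcher getSearcher() throws IOException {
        String dir = readPointer();
        if (searcher == null || !dir.equals(currentDir)) {
            IndexSearcher old = searcher;
            searcher = new IndexSearcher(dir);  // switch to the new index
            currentDir = dir;
            if (old != null) old.close();       // beware in-flight searches
        }
        return searcher;
    }

    private String readPointer() throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(pointerFile));
        try {
            return in.readLine().trim();
        } finally {
            in.close();
        }
    }
}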

>-Original Message-
>From: Lee Turner [mailto:[EMAIL PROTECTED] 
>Sent: Tuesday, April 05, 2005 9:38 AM
>To: java-user@lucene.apache.org
>Subject: Strategies for updating indexes.
>
>
>Hi
>
> 
>
>I was wondering whether anyone has any experience of 
>multithreaded updates to indexes.  I the web app I am working 
>on there are additions, updates and deletes that need to 
>happen to the index throughout the runtime of the application. 
> Also, the application is run in a cluster with each app 
>server having its own index.  This means that periodically 
>each app server is going to have to go through a re-indexing 
>process to make sure that its index has all the changes from 
>the other app servers in it.  This process can take a few 
>seconds so if another update to the index occurs at this time 
>it will need to be queued in some way to make sure it happens 
>after the re-indexing.
>
> 
>
>I was just wondering if anyone had any pointers for doing this 
>kind of thing.  Any help would be gratefully appreciated.
>
> 
>
>Many thanks
>
>Lee
>
> 
>
> 
>
>Lee Turner | Java Developer | Oyster Partners 
>
>D. +44 (0)20 74461229 
>T. +44 (0)20 7446 7500 
>
>www.oyster.com
>
> 
>
>
>___
>__
>Internet communications are not secure and therefore Oyster 
>Partners Ltd does not accept legal responsibility for the 
>contents of this message. Any views or opinions presented are 
>solely those of the author and do not necessarily represent 
>those of Oyster Partners Ltd.
>

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re[2]: exact match

2005-04-05 Thread Yura Smolsky
Hello, Erik.


EH> On Apr 4, 2005, at 4:34 PM, Yura Smolsky wrote:
>> Hello, java-user.
>>
>> I have documents with tokenized, indexes and stored field. This field
>> contain one-two words usually. I need to be able to search exact
>> matches for two words.
>> For example search "John" should return documents with field
>> containing "John" only, not "John Doe" or "John Foo".
>>
>> Any ideas?
EH> Use an untokenized field to search on in the case of finding an exact
EH> match.

Are there no other ways to achieve this?

Yura Smolsky.



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Re[2]: exact match

2005-04-05 Thread Erik Hatcher
On Apr 5, 2005, at 5:44 AM, Yura Smolsky wrote:
EH> On Apr 4, 2005, at 4:34 PM, Yura Smolsky wrote:
Hello, java-user.
I have documents with tokenized, indexes and stored field. This field
contain one-two words usually. I need to be able to search exact
matches for two words.
For example search "John" should return documents with field
containing "John" only, not "John Doe" or "John Foo".
Any ideas?
EH> Use an untokenized field to search on in the case of finding an 
exact
EH> match.

And no other ways to reach this?
Not that I know of.  Could you give us a more concrete example of what 
you're trying to achieve?

Erik
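A minimal sketch of the untokenized-field approach suggested above, using the
Field.Keyword/Field.Text factory methods of that era (field names are made up
for the example):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class ExactMatchExample {
    public static void main(String[] args) {
        Document d1 = new Document();
        d1.add(Field.Text("name", "John"));          // tokenized, for ordinary queries
        d1.add(Field.Keyword("name_exact", "John")); // whole value as one term

        Document d2 = new Document();
        d2.add(Field.Text("name", "John Doe"));
        d2.add(Field.Keyword("name_exact", "John Doe"));

        // Only d1 matches: the untokenized value must equal "John" exactly
        // (case and whitespace included, since it is never analyzed).
        Query exact = new TermQuery(new Term("name_exact", "John"));
        System.out.println(exact);
    }
}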
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: scalability w/ number of fields

2005-04-05 Thread Bill Au
The compound index structure is meant for indexes with a large number of fields.
I was watching the files in the index directory of my compound index while
it was being optimized.  The IndexWriter that I used was set to use
compound file.
It looks to me that Lucene first combined all existing segments into a new
multifile segment, then it converted this multifile segment into the
compound format.
So I think the data for the entire index is actually being written to
disk twice.
Is there any way to configure Lucene to write the data only once, into a compound
segment, without first writing a multifile segment?

Bill

On Apr 4, 2005 6:40 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> They are all indexed (and they all need to be under the current design).
> 
> -Yonik
> 
> On Apr 4, 2005 6:16 PM, Doug Cutting <[EMAIL PROTECTED]> wrote:
> > Yonik Seeley wrote:
> > > I know Lucene is very scalable in many ways, but how about number of 
> > > fieldnames?
> > >
> > > We have an index using around 6000 unique fieldnames,
> >
> > How many of these fields are indexed?  At this point I would recommend
> > against having more than a handful of indexed fields.  If the fields are
> > only stored, then it shouldn't make much difference.
> >
> > Doug
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
>

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Time taken in Indexing when the index is already huge

2005-04-05 Thread Will Allen
I would recommend not optimizing your index that often.  Another solution is to 
use a MultiSearcher and keep one fully optimized primary index, and an 
unoptimized secondary index that you add to.  Then search against both.  During 
off-peak hours you could merge the secondary index into your primary index, 
then optimize.
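A minimal sketch of that arrangement (index paths and the calling schedule are
assumptions; MultiSearcher covers the combined search, IndexWriter.addIndexes()
the off-peak merge):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class TwoIndexExample {

    // Search both the optimized primary index and the small secondary one.
    public static MultiSearcher openSearcher() throws Exception {
        Searchable[] searchables = {
            new IndexSearcher("/indexes/primary"),
            new IndexSearcher("/indexes/secondary")
        };
        return new MultiSearcher(searchables);
    }

    // Off-peak: fold the secondary index into the primary, then optimize once.
    public static void mergeAndOptimize() throws Exception {
        IndexWriter writer =
            new IndexWriter("/indexes/primary", new StandardAnalyzer(), false);
        Directory[] secondary =
            { FSDirectory.getDirectory("/indexes/secondary", false) };
        writer.addIndexes(secondary);
        writer.optimize();
        writer.close();
        // afterwards the secondary index would be emptied/recreated
    }
}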

-Original Message-
From: Goel, Nikhil [mailto:[EMAIL PROTECTED]
Sent: Monday, April 04, 2005 10:14 PM
To: java-user@lucene.apache.org
Subject: Time taken in Indexing when the index is already huge


Hi, 

   

I have been using lucene-1.3.jar for quite some time and we are using another 
library to store the index in DB. 

When we started indexing, writer.optimize() used to take in the range of 
600-800 milliseconds to return, but now our index has grown to huge proportions 
(around 10 MB), and writer.optimize() is taking around 30-40 seconds, which is 
not acceptable for our solution.  I put timings around writer.optimize() and it 
is the call which takes most of this time. 

 

So I am just wondering if someone is facing the same problem indexing data when 
the index is already large, or whether there is another way to manage such a 
large index.

 

Here is the simple code which we use to index the data:

IndexWriter writer = new IndexWriter(dbDirectory, new StandardAnalyzer(), false); // create an IndexWriter

writer.addDocument(doc); // doc is of type org.apache.lucene.document.Document

writer.optimize(); // this is the call which takes most of the time and is responsible for the delay

writer.close(); // the IndexWriter is closed

 

 

The time taken by the optimize() call grows a lot as the index gets larger.  I 
looked in Erik Hatcher and Otis Gospodnetić's book too, but everywhere it 
says Lucene is quite scalable and doesn't have trouble indexing even huge 
amounts of data.  Can anyone please provide some insight into this?

 

Thanks.

Nikhil

 

 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Strategies for updating indexes.

2005-04-05 Thread Otis Gospodnetic
If you take this approach, keep in mind that you will also need to
handle regular application shutdowns, and also try to catch some
crashes/errors, in order to flush your in-memory queue of items
scheduled for indexing, and write them to disk.

Feel free to post the code, if you want and can, so people don't have
to reinvent this.

Otis
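A minimal sketch of the shutdown handling Otis mentions, flushing whatever is
still queued out to disk via a JVM shutdown hook (the queued items are assumed
to be small Serializable records such as document ids, not Lucene Documents;
file name and types are illustrative):

import java.io.File;
import java.io.FileOutputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.List;

public class QueueFlushHook {

    // Register once at startup: on normal JVM shutdown, write whatever is
    // still queued for indexing to disk so it can be replayed on next start.
    public static void install(final List pendingQueue, final File backupFile) {
        Runtime.getRuntime().addShutdownHook(new Thread() {
            public void run() {
                try {
                    synchronized (pendingQueue) {
                        ObjectOutputStream out = new ObjectOutputStream(
                                new FileOutputStream(backupFile));
                        out.writeObject(new ArrayList(pendingQueue)); // items must be Serializable
                        out.close();
                    }
                } catch (Exception e) {
                    e.printStackTrace(); // last-ditch effort; nothing else to do
                }
            }
        });
    }
}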


--- Jens Kraemer <[EMAIL PROTECTED]> wrote:
> Hi,
> please see comments below.
> 
> On Tue, Apr 05, 2005 at 08:38:04AM +0100, Lee Turner wrote:
> > Hi
> > 
> > I was wondering whether anyone has any experience of multithreaded
> > updates to indexes.  I the web app I am working on there are
> additions,
> > updates and deletes that need to happen to the index throughout the
> > runtime of the application.  Also, the application is run in a
> cluster
> > with each app server having its own index.  This means that
> periodically
> > each app server is going to have to go through a re-indexing
> process to
> > make sure that its index has all the changes from the other app
> servers
> > in it.  This process can take a few seconds so if another update to
> the
> > index occurs at this time it will need to be queued in some way to
> make
> > sure it happens after the re-indexing.
> > 
> > I was just wondering if anyone had any pointers for doing this kind
> of
> > thing.  Any help would be gratefully appreciated.
> 
> I usually have a service class wrapping all access to the lucene
> index,
> which has a queue where my Servlets or Actions put the documents to
> be
> updated or added in.  There is a single instance of this class for
> the
> whole web app, and a thread regularly waking up and processing the
> elements of the queue.
> 
> Note the queue has to be threadsafe or has to be synchronized
> externally. 
> 
> Since there is only one instance of this service class, it is the
> only
> one who will ever write to the index (if the same index is not used
> by
> other applications).
> 
> During re-indexing the thread regularly processing the queue will
> be paused. After re-indexing it is started again, processing all
> pending 
> changes from the queue. The re-indexing itself takes place in another
> thread, which is started by quartz in my case.
> 
> 
> hope this helps you somehow,
> 
> Jens
> 
> 
> -- 
> webit! Gesellschaft für neue Medien mbH  www.webit.de
> Dipl.-Wirtschaftsingenieur Jens Krämer   [EMAIL PROTECTED]
> Schnorrstraße 76  Telefon +49 351 46766 0
> D-01069 Dresden  Telefax +49 351 46766 66
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: scalability w/ number of fields

2005-04-05 Thread Yonik Seeley
Optimize performance update (with tons of indexed fields):

We had a timing bug... ignore the hour I first reported.  Here are the
current numbers:

indexed_fields=6791  index_size=3.9GB  optimize_time=21min
indexed_fields=3216  index_size=2.0GB  optimize_time=9min
indexed_fields=2080  index_size=1.4GB  optimize_time=4min

It's a little apples-to-oranges since we simply removed some of the
fields to test a lower field count (and hence the index size also goes
down).

-Yonik

On Apr 4, 2005 5:38 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> I know Lucene is very scalable in many ways, but how about number of 
> fieldnames?
> 
> We have an index using around 6000 unique fieldnames,
> 450,000 documents, and a total index size of 4GB.   It's very
> sparse... documents don't have that many fields, but the number of
> different fieldtypes is huge.
> 
> An optimize of this index took about an hour (mergefactor 10, compound index)
> This is on enterprise hardware (fast SCSI raid, 6GB RAM, dual 2.8GHz Xeon).
> The JVM was Java5 with 2.5GB heap.
> 
> This seems very long... anyone have any insights?
> We'll be running more tests to see if decreasing the number of fields
> has an impact.
> 
> -Yonik
>

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: PHP-Lucene Integration

2005-04-05 Thread Giovanni Novelli
As Lucene's native language is Java, it is most natural to access its 
functionality through JSP; still, the idea of accessing Lucene from PHP is 
interesting, as PHP is perhaps the most widely deployed server-side 
scripting language. 
I think the way to provide access to the Lucene API from PHP should be as 
general and clean as possible, so in my opinion the natural approach is a 
single layer that interoperates with the Lucene API: an Apache module.  A 
PHP API to call such a module from PHP would then be needed. 

Having an Apache module for Lucene as a component of the Lucene project 
would help Lucene spread beyond the PHP development arena alone.

On Mar 27, 2005 5:49 AM, Owen Densmore <[EMAIL PROTECTED]> wrote:
> 
> Thanks all for the interesting responses. Sorry for being a bit late
> in responding!
> 
> -- Owen
> 
> Owen Densmore - http://backspaces.net - http://redfish.com -
> [EMAIL PROTECTED]
> 
> Begin forwarded message:
> 
> > From: "Philippe Ombredanne" <[EMAIL PROTECTED]>
> > Subject: RE: PHP-Lucene Integration
> >
> > Owen,
> > very interesting!
> > Anything (code) you can share?
> 
> Hi Philippe. We will definitely make our code available. I suspect,
> however, it is not terribly interesting. But if simply useful as a
> "case study" that would still be good.
> 
> > From: Dawid Weiss <[EMAIL PROTECTED]>
> > Subject: Re: PHP-Lucene Integration
> >
> > Your implementation and ideas sound very interesting, Owen. Can we see
> > the system anywhere in public (and play with it?)
> 
> We'll send a link to the site fairly soon. We're having our final
> review tomorrow, and should have a good idea when we can let folks look
> at it.
> 
> >> We are hoping the institute can afford to have us work on true
> >> clustering techniques such as Carrot2 uses. (Thanks to Dawid and all
> >> the Poznan University folks who's papers were so stimulating!)
> >
> > You are very welcome. We are also academic, so in the feeling of
> > brotherhood we might help you set up a demo on-line clustering server
> > free of charge. There really is not better clustering technique than
> > the one devised to a particular problem and it seems like you found
> > that niche. Although it's always worth experimenting with other stuff
> > just for the sake of comparison. Just let me know if you're interested
> > (if we can access the 'feed' of those plain search results I can set
> > up the clustering demo in a few minutes, really).
> 
> This would be really great! Indeed, we'd like to help SFI to be a bit
> more involved with exploring their collection with innovative, research
> oriented methods.
> 
> Some of the staff at SFI are excited by DSpace, for example, and we'd
> be interested in helping them explore its use in the lucene/clustering
> context. That, and their use of Dublin Core for cataloging their
> future work might be of general interest here in the mail list too.
> 
> > > We did do a
> >> quick LSA SVD on a random set of the papers to see what the
> >> performance (both CPU and good clustering) would be like. Our
> >> results are encouraging, and I think the frequent phrases approach
> >> would be best for this collection.
> >
> > It is always going to be challanging if you attempt to cluster the
> > entire collection, you know. I'm (or rather: I will be) working on
> > algorithm's extensions to deal with full text documents.
> 
> We're mainly using Abstracts and other meta data (Title, Authors, Key
> phrases, Abstracts, Dates, and so on). These are reasonably small:
> Abstracts are 150 words on the average over the current 1122 document
> collection. If we include the title and key phrases, we get 172
> words/doc.
> 
> I suspect we could safely limit the abstracts to the first few
> sentences too, getting us to a much smaller number. Indeed, if we
> tossed the abstracts altogether, and used just titles and key phrases,
> we're down to less than 20 words/doc! I bet simply using reasonable
> preprocessing we could get small enough "snippets" as to be workable.
> 
> > Dawid
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
>


Re: PHP-Lucene Integration

2005-04-05 Thread Andi Vajda
As an alternative, you could also take the approach taken for PyLucene: 
compile the Java code with GCJ and generate bindings for Python with SWIG.
SWIG supports a number of languages in addition to Python such as Ruby, PHP, 
Perl, and a bunch more.

For more information, see:
  http://pylucene.osafoundation.org
  http://www.python.org/pycon/2005/papers/27/paper.txt
  http://www.swig.org
  http://gcc.gnu.org/java
As a matter of fact, a team of people is working on such a construction for 
Ruby at the moment.

Andi..
On Tue, 5 Apr 2005, Giovanni Novelli wrote:
As Lucene native language is Java it should be more natural to access its
functionalities through JSP; anyway the idea of accessing Lucene
functionalities seems interesting as PHP is perhaps most widely deployed
server side scripting language.
I think that the way to provide access to Lucene API in PHP development
should be more general and clean as possible, so in my opinion the natural
way should be based on a single layer that interoperates with Lucene API: an
Apache module. Then should be needed a PHP API to call such module from PHP.
Having an Apache module for Lucene as a component of Lucene project should
allow the spread of Lucene not only in PHP development arena.
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Can not delete cfs file

2005-04-05 Thread Gusenbauer Stefan
Gusenbauer Stefan wrote:

>Erik Hatcher wrote:
>
>  
>
>>On Apr 3, 2005, at 3:33 PM, Gusenbauer Stefan wrote:
>>
>>
>>
>>>Sorry for beeing late!
>>>Only the test code wouldn't be very useful for understanding because
>>>there are a lot of dependencies in the other code. I can explain what
>>>I do: I open an IndexWrite close it then open an IndexReader close it
>>>and open an IndexWriter then close it. Then i try to delete the files
>>>from the index and only the cfs file i cannot delete. I try to get
>>>out the code which is involved later on. I get no failure message
>>>only the
>>>that the fail could not be removed.
>>>  
>>>
>>This is on Windows, I presume?
>>
>>Erik
>>
>>
>>-
>>To unsubscribe, e-mail: [EMAIL PROTECTED]
>>For additional commands, e-mail: [EMAIL PROTECTED]
>>
>>
>>
>>
>>
>Yes, I think it is a java with windows problem because since a use call
>System.gc before deleting the file the streams are released.
>Stefan
>
>-
>To unsubscribe, e-mail: [EMAIL PROTECTED]
>For additional commands, e-mail: [EMAIL PROTECTED]
>
>
>
>  
>
I found the bug: I simply hadn't closed the IndexReader in the test case.
Shame on me for assuming that Lucene had a bug!
Stefan

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



QueryParser: open ended range queries

2005-04-05 Thread Yonik Seeley
Was there any later thread on the QueryParser supporting open ended
range queries after this:
http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg07973.html

Just curious.  I plan on overriding the current getRangeQuery() anyway
since it currently doesn't run the endpoints through the analyzer.

-Yonik

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: QueryParser: open ended range queries

2005-04-05 Thread Erik Hatcher
On Apr 5, 2005, at 2:49 PM, Yonik Seeley wrote:
Just curious.  I plan on overriding the current getRangeQuery() anyway
since it currently doesn't run the endpoints through the analyzer.
What will you do when multiple tokens are returned from the analyzer?
Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: QueryParser: open ended range queries

2005-04-05 Thread Yonik Seeley
For numeric fields, this will never happen.
For text fields, I could either:
  1) just use the first token generated (yuck)
  2) not run it through the analyzer (v1.0)
  3) run it through an analyzer specific to range and prefix queries (post v1.0)

Since I know the schema, I can pick and choose different methods for
different field types.  Generic lucene isn't as lucky and has to guess
(hence the ugly try-to-parse-as-a-date code).

An example of why option 3 may be needed: consider the recently posted
ISOLatinFilter that strips accents.  If one indexes text:applé and it
gets indexed as text:apple, then a range query of text:[applé TO
orange] won't find that document.

Of course you can't just run it through the normal analyzer either,
since then text:[a to z] probably won't work (a will get stopped out,
etc).  Also, the normal analyzer may expand things into synonyms, etc.

-Yonik
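A minimal sketch of such an override, assuming the protected
getRangeQuery(field, part1, part2, inclusive) hook in the QueryParser of that
era, and simply taking the first token each endpoint produces (option 1
above); the class name is made up:

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

// Runs range endpoints through the analyzer before building the RangeQuery.
public class AnalyzingRangeQueryParser extends QueryParser {

    private final Analyzer analyzer;

    public AnalyzingRangeQueryParser(String field, Analyzer analyzer) {
        super(field, analyzer);
        this.analyzer = analyzer;
    }

    protected Query getRangeQuery(String field, String part1, String part2,
                                  boolean inclusive) throws ParseException {
        return super.getRangeQuery(field, analyze(field, part1),
                                   analyze(field, part2), inclusive);
    }

    private String analyze(String field, String text) {
        try {
            TokenStream ts = analyzer.tokenStream(field, new StringReader(text));
            Token t = ts.next();          // keep only the first token
            ts.close();
            return t != null ? t.termText() : text;
        } catch (IOException e) {
            return text;                  // fall back to the raw endpoint
        }
    }
}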


On Apr 5, 2005 3:43 PM, Erik Hatcher <[EMAIL PROTECTED]> wrote:
> 
> On Apr 5, 2005, at 2:49 PM, Yonik Seeley wrote:
> > Just curious.  I plan on overriding the current getRangeQuery() anyway
> > since it currently doesn't run the endpoints through the analyzer.
> 
> What will you do when multiple tokens are returned from the analyzer?
> 
> Erik

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: filter search

2005-04-05 Thread Chris Hostetter
:
: is it possible to filter the hits returned from a certain query?. for
: example if I have a search like this:
:   Query searchQuery = queryParser.parse( query );
:   Hits  results = m_searcher.search( searchQuery );
: is there a way to use the results and find out how many of the returned
: documents their url ends with com, and how many ends with net and so
: on... without the need to form a new query?.

There was a recent thread regarding this issue, with several
implementations suggested, each with different pros/cons depending on
the specifics of your situation...

http://mail-archives.apache.org/mod_mbox/lucene-java-user/200503.mbox/[EMAIL 
PROTECTED]



-Hoss
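For reference, a minimal sketch of one straightforward way to do it without
issuing new queries: iterate the Hits of the original search and bucket on a
stored field (this assumes a stored "url" field holding a bare hostname, and
it fetches every matching document, so it is not the cheapest option):

import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.search.Hits;

public class SuffixCounter {

    // Bucket the hits of one search by the suffix of a stored "url" field.
    public static Map countBySuffix(Hits hits) throws Exception {
        Map counts = new HashMap();               // e.g. "com" -> Integer
        for (int i = 0; i < hits.length(); i++) { // note: fetches each doc
            String url = hits.doc(i).get("url");  // the field must be stored
            if (url == null) continue;
            String suffix = url.substring(url.lastIndexOf('.') + 1);
            Integer old = (Integer) counts.get(suffix);
            counts.put(suffix, new Integer(old == null ? 1 : old.intValue() + 1));
        }
        return counts;
    }
}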


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re[2]: exact match

2005-04-05 Thread Chris Hostetter
: >> I have documents with tokenized, indexes and stored field. This field
: >> contain one-two words usually. I need to be able to search exact
: >> matches for two words.
: >> For example search "John" should return documents with field
: >> containing "John" only, not "John Doe" or "John Foo".
: >>
: >> Any ideas?
: EH> Use an untokenized field to search on in the case of finding an exact
: EH> match.
:
: And no other ways to reach this?

are there any cases in which you ever want to search the field for
tokenized values?

if not, then you can just use an analyzer that knows about this special
field and "tokenizes" any value it gets into a single token that is an
exact match.

if you sometimes need exact matches, and sometimes need "word" matches (for
the sake of argument let's assume your tokens are simple whitespace
separated words) then you're going to need some way of knowing which case
you want when you parse the query -- the easy way to go is with a separate
field like Erik described.  if you have some other usecase, then you can
index the field using an analyzer that generates a single unparsed "token"
for the whole string followed by some marker token that isn't likely to
appear in your data (ie: the token "_BOOYA!_" would probably work),
followed by the more conventional tokens -- giving the first of those
tokens a very high position increment.

Then if you want an exact match, your custom query parsing code could
generate either a Phrase or Span query containing a single Term for your
input, followed by the marker Term (ie: "_BOOYA!_").  a "regular" token
based search would work just as it did before.





-Hoss
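A minimal sketch of the "single token" analyzer idea, for the simple case
where a field only ever needs exact matching (you would typically apply it to
just that field, e.g. with a per-field analyzer wrapper); lowercasing the
value is an assumption:

import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

// Turns an entire field value into one token, so the indexed term is the
// exact (lowercased) string and an exact-match query is a single TermQuery.
public class SingleTokenAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, final Reader reader) {
        return new TokenStream() {
            private boolean done = false;
            public Token next() throws IOException {
                if (done) return null;
                done = true;
                StringBuffer sb = new StringBuffer();
                char[] buf = new char[256];
                for (int n = reader.read(buf); n != -1; n = reader.read(buf)) {
                    sb.append(buf, 0, n);
                }
                String text = sb.toString().toLowerCase();
                return new Token(text, 0, text.length());
            }
        };
    }
}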


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Wiki formatting changes

2005-04-05 Thread Chris Hostetter

The wiki appears to have undergone some style changes recently; the layout
is a lot different now (and in my opinion: cleaner) but a side effect
seems to be that some page formatting which used to work no
longer does.

Specifically, subSection headings that have leading whitespace, ie...

 == Utility to pad the numbers ==

...show up verbatim in the page, but removing the space...

== Utility to pad the numbers ==

...seems to fix the problem.


does anyone know what exactly changed?  is there an easy config option
that can be toggled to get the old behavior, or do we just need to slowly
tweak all of the existing docs (there are quite a few with this problem)
to eliminate the whitespace?


-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Strategies for updating indexes.

2005-04-05 Thread Paul Smith

Otis Gospodnetic wrote:
If you take this approach, keep in mind that you will also need to
handle regular application shutdowns, and also try to catch some
crashes/errors, in order to flush your in-memory queue of items
scheduled for indexing, and write them to disk.
Feel free to post the code, if you want and can, so people don't have
to reinvent this.
Otis
 

This is where using something like JMS to store persistent messages of 
items for indexing in a JMS queue is useful. 

We are about to go down this road using ActiveMQ 
(http://activemq.codehaus.org, very nice product, Apache licensed), that 
way notifications of change are never lost, and you can disconnect the 
indexer from the application itself (it could be a separate process or 
in-process, it just needs to be able to read the JMS queue).  With 
ActiveMQ you can even embed the JMS server instance inside the VM of 
your application too, which is very useful for a single instance, and 
can be easily broken out to be used in a clustered environment.

cheers,
Paul Smith
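A minimal sketch of the producing side of this, using only the generic
javax.jms API (the ConnectionFactory comes from whatever broker you configure,
and the queue name is made up); the indexer would consume the same queue with
a MessageListener:

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.Destination;
import javax.jms.MessageProducer;
import javax.jms.Session;
import javax.jms.TextMessage;

public class IndexUpdateNotifier {

    // Send the id of a changed document to the indexing queue.
    public static void notifyChanged(ConnectionFactory factory, String docId)
            throws Exception {
        Connection conn = factory.createConnection();
        try {
            Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Destination queue = session.createQueue("index.updates");
            MessageProducer producer = session.createProducer(queue);
            TextMessage msg = session.createTextMessage(docId);
            producer.send(msg);   // persistent by default, survives restarts
        } finally {
            conn.close();
        }
    }
}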
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Fwd: Wiki formatting changes

2005-04-05 Thread Erik Hatcher
I suppose this should be addressed to Leo...  anything we can do about 
the issue mentioned below regarding wiki formatting?

Thanks,
Erik
Begin forwarded message:
From: Chris Hostetter <[EMAIL PROTECTED]>
Date: April 5, 2005 5:56:28 PM EDT
To: java-user@lucene.apache.org
Subject: Wiki formatting changes
Reply-To: java-user@lucene.apache.org
he wiki appears to have undergone some style cahnges recently, the 
layout
is a lot different now (and in my opinion: cleaner) but a side effect
seems to be that some page formatting which used to work no
longer does

Specifically, subSection headings that have leading whitespace, ie...
 == Utility to pad the numbers ==
...show up verbatim in the page, but removing the space...
== Utility to pad the numbers ==
...seems to fix the problem.
does anyone know what exactly changed?  is there an easy config option
that can be toggled to get the old behavior, or do we just need to 
slowly
tweak all of the existing docs (there are quite a few with this 
problem)
to eliminate the whitespace?

-Hoss
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


php/lucene integration: SFI working papers visualization

2005-04-05 Thread Owen Densmore
Hi folks.  As promised, here is the first beta access to the php/lucene  
work we were discussing earlier.

The url to the php front-end to the SFI working papers Lucene search is:
  http://webdev.santafe.edu/research/publications/redfish/wpSearch.php
This provides a fairly simple search dialog, returning a list of  
relevant documents.

The paragraph style returned data has links for continued searching on  
much of the data within the document's meta data: Authors, Keywords,  
and a similarity search .. similar documents.  It also has a "Browse  
paper in context" icon which launches a Flash graphical navigation  
tool.  It also has links into the rest of the SFI site: pdf/postscript  
for the papers, and an abstract page.

The "similar documents" search is a generalization of the example in  
the book by Erik and Otis: use a document's contents to form a  
secondary search.  Our default is authors^2 & all text.  But we've  
generalized it "inside" so that any primary search can be broken into  
any secondary search.  Thus simply using authors is a "co-author"  
similarity search.

The servlet is available via:
  http://webdev.santafe.edu:8080/redfish/servlet
It provides only raw text; all formatting and adaptation to other web  
tools (Flash etc) is done via php.  Many capabilities of the servlet  
are not available via php at this point.  We default everything so that  
errors are minimized.  Thus beaming into the url w/o any parameters  
returns a canned search.  Note that it returns more than one search --  
a batch search of many searches is one of the servlets features.

This should let folks play with the critter.  Let us know if you find  
bugs or odd behaviors .. or find it useful even!  :)

-- Owen
Owen Densmore - http://backspaces.net - http://redfish.com -  
[EMAIL PROTECTED]

Here are some details for those interested.
The meta data fields available are:
Number   Working paper number
TitleWorking paper title
Author   Comma separated list of Authors
Abstract Working paper abstract.
Keywords Comma separated list of keyphrases
Format   Specifies availability of pdf, ps, none
We "manufacture" a few more fields from the above:
Text Fake field: Title+Keywords+Abstract
All  Fake field: All .. "Text"+Number+Author+Format
Date Fake field: /MM from Number
We typically just search All, augmenting with "Author:Crutchfield" if  
we want a specific field included in the search.  We use the built-in  
query parser.

The php interface does not provide an abstract but that can be done  
through the servlet "api".  For example, this search:
   
http://webdev.santafe.edu:8080/redfish/servlet?s=Author: 
Crutchfield&p=Abstract
..would return Jim Crutchfield's 55 abstracts, along with the rank and  
paper number.  Boy, is it FAST!

The URL api is:
cmd=search   Perform a search using params below.  Results
 have a search header with the query and number of hits,
			 followed by the individual search results unless the "p"
			 parameter is used.
   =debugPrint diagnostic info
   =like Return documents that are like the document given in the
 s=Number:xxx search string.  Note the search string must be
			 fully specified, due to the default search field, f= being
			 used to specify how the similarity search is performed.  I.e.
			 the similarity search is done with a search string of
			 :
			 The parameters (l=,p=,M=,m=) can be used to control the return
			 format and quantity.  See examples below.  This command is fine
			 for now, but is "in beta" and could revert to use of document
			 term vectors.
s=Searches (| separated list)
 A set of N searches to be made, separated by the |  
character.
s2=search|minRank2|maxResults2
 The search to use for the "like" command.  It has three  
parts,
			 separated by "|".  The first is a search, formatted like a print
			 field (p=) below, constructed from the parts of the first search.
			 The second and third parts are a minRank, maxResults pair to
			 be applied during the second search.  As an example:
			 s2=Author([Author])^2 Text([Text])|0.01|100
		 would use the Authors and Text fields of the first search (s=)
			 to construct the second search, using a minRank, maxResults
			 of 0.01 and 100.  The results are formatted according to the
			 p= field below, generally "matrix".
p=PrintField|PrintFormat with PrintTags|"matrix"
 If a field name is provided, search results are printed as:
			 [Rank]\t[Number]\t[]
			 If the printField contains any []'s, the search results are
			 custom formatted using tags.  Thus "[Number]" would return just
			 the number for the search.
   =matrix   Return matrix of hits for N searches.  Results have a  
header
	 with N queries/labels, tab separated, preceded by an  
additional
			 "DocNo." label.  The search results have the doc number followed
			 by N ranks, tab separated, corresponding to

wildcarded phrase queries

2005-04-05 Thread Erik Hatcher
I have a need to implement wildcarded phrase queries, such as this:
"apach? luc*"
which would match "apache lucene", for example.  This needs to also 
support ordered and unordered proximity like SpanNearQuery does:

"apach? luc*"~10
I presume I'm going to have to key off of SpanQuery with some 
specialized subclasses.

What approach do you recommend for implementing something like this?
Thanks,
Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: wildcarded phrase queries

2005-04-05 Thread Chuck Williams
Erik Hatcher writes (4/5/2005 5:57 PM):
I have a need to implement wildcarded phrase queries, such as this:
"apach? luc*"
which would match "apache lucene", for example.  This needs to also 
support ordered and unordered proximity like SpanNearQuery does:

"apach? luc*"~10
I presume I'm going to have to key off of SpanQuery with a some 
specialized subclasses.

What approach do you recommend for implementing something like this?
Hi Erik,
Might it be as easy as creating a SpanWildcardQuery that transforms into 
a SpanOrQuery of SpanTermQuery's, and then using a SpanNearQuery of 
SpanWildcardQuery's?  You could use a WildcardTermEnum to generate the 
list of terms for the SpanOrQuery.  This would have some issues, like 
computing the idf as the sum over all the pattern-matched terms, but it 
looks like that issue exists with WildcardQuery too.  I haven't 
done much with SpanQuery's, so this might not work out so simply, or be 
acceptably efficient.

Chuck
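A minimal sketch along those lines, expanding each wildcarded term into a
SpanOrQuery via WildcardTermEnum and combining the expansions in a
SpanNearQuery (names are made up; no attention is paid to the idf issue
mentioned above or to very large expansions):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.WildcardTermEnum;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanOrQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class WildcardSpanHelper {

    // Expand one wildcarded term (e.g. "luc*") into a SpanOrQuery over all
    // matching terms in the index.
    public static SpanQuery expand(IndexReader reader, Term wildcard)
            throws IOException {
        List clauses = new ArrayList();
        WildcardTermEnum terms = new WildcardTermEnum(reader, wildcard);
        try {
            do {
                Term t = terms.term();
                if (t != null) {
                    clauses.add(new SpanTermQuery(t));
                }
            } while (terms.next());
        } finally {
            terms.close();
        }
        return new SpanOrQuery(
            (SpanQuery[]) clauses.toArray(new SpanQuery[clauses.size()]));
    }

    // "apach? luc*"~10 becomes a SpanNearQuery over the expanded clauses.
    public static SpanQuery wildcardPhrase(IndexReader reader, String field,
                                           String[] words, int slop,
                                           boolean inOrder) throws IOException {
        SpanQuery[] parts = new SpanQuery[words.length];
        for (int i = 0; i < words.length; i++) {
            parts[i] = expand(reader, new Term(field, words[i]));
        }
        return new SpanNearQuery(parts, slop, inOrder);
    }
}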
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]