Re: n-gram word support

2009-06-19 Thread Otis Gospodnetic
Here it is: http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/analysis/shingle/ShingleMatrixFilter.html Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

Re: Synchronizing Lucene indexes across 2 application servers

2009-06-19 Thread Otis Gospodnetic
Hello, You may want to look at Lucene's younger brother named Solr: http://lucene.apache.org/solr/ Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

RE: caching an indexreader

2009-06-19 Thread Scott Smith
Thanks for the comments. Sounds like I will probably be ok.

Re: caching an indexreader

2009-06-19 Thread Jason Rutherglen
On the topic of RAM consumption, it seems like field caches could return estimated RAM usage (given they're arrays of standard Java types)? There are methods of calculating this per platform (I believe relatively accurately).
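
A rough sketch of the kind of estimate Jason means, assuming a field cache entry is one Java int (4 bytes) per document; the class and method names are made up:

    import org.apache.lucene.index.IndexReader;

    public final class FieldCacheEstimate {
      // One 4-byte int slot per document in the index. Real usage would be
      // somewhat higher (object headers; String caches cost far more).
      public static long intFieldCacheBytes(IndexReader reader) {
        return 4L * reader.maxDoc();
      }
    }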

Re: caching an indexreader

2009-06-19 Thread Michael McCandless
On Fri, Jun 19, 2009 at 2:40 PM, Scott Smith wrote: > In my environment, one of the concerns is that new documents are > constantly being added (and some documents may be deleted).  This means > that when a user does a search and pages through results, it is possible > that there are new items comi

Re: caching an indexreader

2009-06-19 Thread Jason Rutherglen
> As I understand it, the user won't see any changes to the index until a new Searcher is created. Correct. > How much memory will caching the searcher cost? Are there other tradeoffs I need to consider? If you're updating the index frequently (every N seconds) and the searcher/reader is closed
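
A minimal sketch of the reopen pattern under discussion, assuming Lucene 2.4's IndexReader.reopen(); the helper name is made up, and closing the old reader assumes no searches are still running against it:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;

    public final class ReaderRefresher {
      // reopen() returns the same instance when the index is unchanged and
      // shares unchanged segments otherwise, so calling this often is cheap.
      public static IndexReader refresh(IndexReader current) throws IOException {
        IndexReader latest = current.reopen();
        if (latest != current) {
          current.close();  // drop the old point-in-time snapshot
        }
        return latest;
      }
    }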

Filters vs Queries - revisited

2009-06-19 Thread Scott Smith
As I read about Filters, it seems to me that a filter is preferred for any portion of the query string where you are setting the boost to 0 (meaning you don't want it to contribute to the relevancy score). But relevancy is only interesting if you are displaying the documents in relevancy ord
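
A sketch of the split Scott is describing, with the non-scoring clause moved into a QueryWrapperFilter so it restricts results without contributing to relevancy; the field and term names are made up:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.*;

    public final class FilterVsQuery {
      public static TopDocs search(Searcher searcher) throws Exception {
        Query scored = new TermQuery(new Term("body", "lucene"));   // affects ranking
        Filter restrict = new QueryWrapperFilter(
            new TermQuery(new Term("status", "published")));        // no score contribution
        return searcher.search(scored, restrict, 10);
      }
    }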

caching an indexreader

2009-06-19 Thread Scott Smith
In my environment, one of the concerns is that new documents are constantly being added (and some documents may be deleted). This means that when a user does a search and pages through results, it is possible that there are new items coming in which affect the search, thus changing where items are

Re: getting all Lucene internal IDs

2009-06-19 Thread Michael McCandless
You're welcome! Mike

Re: getting all Lucene internal IDs

2009-06-19 Thread Dmitry Lizorkin
> Iterate over all ints from 0 .. IndexReader.maxDoc() (exclusive) and call IndexReader.isDeleted? Excellent, works perfectly for us! Michael, thank you very much for your help! Best regards, Dmitry

Re: getting all Lucene internal IDs

2009-06-19 Thread Michael McCandless
On Fri, Jun 19, 2009 at 12:43 PM, Dmitry Lizorkin wrote: > In the meantime, does there exist any workaround for the current version > 2.4.1 we're using? Iterate over all ints from 0 .. IndexReader.maxDoc() (exclusive) and call IndexReader.isDeleted? Open a read-only IndexReader if possible, so i
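
A minimal sketch of that 2.4.1 workaround, using a read-only reader as Mike suggests; the index path argument is a placeholder:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.store.FSDirectory;

    public final class AllLiveDocIds {
      public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.getDirectory(args[0]), true);
        for (int id = 0; id < reader.maxDoc(); id++) {
          if (!reader.isDeleted(id)) {
            System.out.println(id);  // internal id of a live document
          }
        }
        reader.close();
      }
    }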

Re: getting all Lucene internal IDs

2009-06-19 Thread Dmitry Lizorkin
> Assuming your goal is to exclude deleted docs Yes, precisely. > TermDocs td = IndexReader.termDocs(null); That looks like exactly what we need! We'll be looking forward to the release of v. 2.9. In the meantime, does there exist any workaround for the current version 2.4.1 we're using? Thank

Re: getting all Lucene internal IDs

2009-06-19 Thread Michael McCandless
Assuming your goal is to exclude deleted docs, in 2.9 (not yet released) you can do this: TermDocs td = IndexReader.termDocs(null); and then iterate through them. Mike
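
A sketch of iterating the ids that way, assuming a 2.9-level reader as Mike notes; termDocs(null) enumerates all non-deleted documents without per-document isDeleted checks:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermDocs;

    public final class LiveDocIds29 {
      public static void printIds(IndexReader reader) throws IOException {
        TermDocs td = reader.termDocs(null);  // null term: all non-deleted docs
        try {
          while (td.next()) {
            System.out.println(td.doc());     // internal document id
          }
        } finally {
          td.close();
        }
      }
    }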

getting all Lucene internal IDs

2009-06-19 Thread Dmitry Lizorkin
Hello! What is the appropriate way to obtain Lucene internal IDs for _all_ the tuples stored in a Lucene index? Thank you for your help Dmitry

Re: Collector Pagination

2009-06-19 Thread João Silva
Nice Uwe, I'll try this. Thanks, Galaio

RE: Collector Pagination

2009-06-19 Thread Uwe Schindler
To get the second page, take: int hitsPerPage = 10; int pageOffset = 10; TopDocCollector collector = new TopDocCollector(hitsPerPage + pageOffset); For the third page, take int pageOffset = 20; and so on. After that your results are in hits[], for the first page in [0] to [9], the second page in
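
Uwe's numbers assembled into a compilable sketch; the query is a placeholder and the class name is made up:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.*;

    public final class SecondPage {
      public static void print(Searcher searcher) throws Exception {
        int hitsPerPage = 10;
        int pageOffset = 10;  // 20 for the third page, and so on
        TopDocCollector collector = new TopDocCollector(hitsPerPage + pageOffset);
        searcher.search(new TermQuery(new Term("body", "lucene")), collector);
        ScoreDoc[] hits = collector.topDocs().scoreDocs;
        for (int i = pageOffset; i < hits.length; i++) {  // hits[10]..hits[19]
          System.out.println(hits[i].doc + " score=" + hits[i].score);
        }
      }
    }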

Re: Collector Pagination

2009-06-19 Thread João Silva
Well, I have something like this: int hitsPerPage = 10; IndexSearcher searcher = new IndexSearcher(this.indexPath); TopDocCollector collector = new TopDocCollector(hitsPerPage); Query query = new QueryParser("", this.analizer).parse(DocumentRepositoryEntry.Fiel

Re: n-gram word support

2009-06-19 Thread Grant Ingersoll
The contrib/analyzers package has several n-gram based tokenization and token filter options. On Jun 18, 2009, at 10:15 PM, Neha Gupta wrote: Hey, I was wondering if there is a way to read the index and generate n-grams of words for a document in Lucene? I am quite new to it and am using pylucen
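
For word n-grams in particular, one of those contrib options is ShingleFilter; a sketch using the pre-2.9 token API (the sample text is arbitrary, and 2 is the maximum shingle size):

    import java.io.StringReader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;
    import org.apache.lucene.analysis.shingle.ShingleFilter;

    public final class WordNGrams {
      public static void main(String[] args) throws Exception {
        TokenStream ts = new ShingleFilter(
            new WhitespaceTokenizer(new StringReader("please divide this sentence")), 2);
        for (Token t = ts.next(); t != null; t = ts.next()) {
          System.out.println(t.termText());  // unigrams plus bigrams like "please divide"
        }
      }
    }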

Re: Collector Pagination

2009-06-19 Thread João Silva
Thanks Uwe, I will look into that. Galaio

RE: Collector Pagination

2009-06-19 Thread Uwe Schindler
Hello, Just retrieve the TopDocs for the first n documents, where n = offset+count, offset is the first hit on the page (0-based), and count is the number of hits per page. To display the results you would then just start at offset in TopDocs and retrieve the stored fields from there to offset+count. Uwe

Collector Pagination

2009-06-19 Thread João Silva
Hi, is there any API for Hits pagination? For example, if I want to retrieve the hits within a given interval. -- Regards, João Carlos Galaio da Silva

Re: windows locking file problem

2009-06-19 Thread Michael McCandless
It's best to let IndexWriter manage the deletion of files (for exactly this reason). It turns out, it's perfectly fine to open an IndexWriter with "create=true" even when IndexReaders are reading that same index. Those open IndexReaders continue to search their point-in-time snapshot, and then whe
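
A sketch of that scenario, assuming a Lucene 2.4-level API; the old reader keeps searching its snapshot while the writer recreates the index, and IndexWriter retries deleting any files still held open (the path argument is a placeholder):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public final class RecreateWhileSearching {
      public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.getDirectory(args[0]);
        IndexReader reader = IndexReader.open(dir, true);  // point-in-time snapshot
        // create=true starts a new empty index; the open reader is unaffected
        // because files are only deleted once no snapshot references them.
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true,
            IndexWriter.MaxFieldLength.LIMITED);
        writer.close();
        reader.close();  // reopen afterwards to see the recreated index
      }
    }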

Re: update a specific document

2009-06-19 Thread Daan de Wit
Oops, didn't read the OP quite well...

Re: Synchronizing Lucene indexes across 2 application servers

2009-06-19 Thread Ian Lea
Or have a third master index, as Joel suggests: apply all updates to that index only, then at the end of each batch index update run, use rsync or equivalent to push the master index out to the 2 search servers and then tell them to reopen their indexes. -- Ian.

Re: update a specific document

2009-06-19 Thread João Silva
Hi, thank you all for your answers. I have already implemented the strategies you mentioned: delete first and re-add, or use the internal API updateDocument(term, document). If there were an updateDocument(internalId, theNewDocument) in the API, that would make the process clearer. Thanks again for you

Re: Synchronizing Lucene indexes across 2 application servers

2009-06-19 Thread Joel Halbert
Do they have to be kept in sync in real time? Does each server handle writes to its own index which then need to be propagated to the other server's index? From a simplicity point of view, to minimise the amount of self-consistency checking that needs to happen, I would suggest even having a thi

Re: Lucene performance: is search time linear to the index size?

2009-06-19 Thread Joel Halbert
Hi Kuro, How did you generate your second, larger test data set? Did you simply copy the original data set multiple times, or did you use new pseudo-random data (words)? If the first, then you would expect a linear increase in search time, as the number of indexed terms has not changed, just th

Re: update a specific document

2009-06-19 Thread Anshum
Exactly, it's cleaner but you wouldn't be able to delete on the basis of Lucene Document ID. -- Anshum Gupta Naukri Labs! http://ai-cafe.blogspot.com The facts expressed here belong to everybody, the opinions to me. The distinction is yours to draw

Re: update a specific document

2009-06-19 Thread Daan de Wit
There's also IndexWriter#updateDocument(Term, Document) now; to use it you need to be able to uniquely identify a document by a term (probably an application-specific id field or something). This method also deletes and re-adds the document, but it is a somewhat cleaner API. Daan
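
A sketch of that call, assuming each document carries a unique application-level id; the field names are made up:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    public final class UpdateById {
      public static void update(IndexWriter writer, String id, String body) throws Exception {
        Document doc = new Document();
        doc.add(new Field("id", id, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("body", body, Field.Store.NO, Field.Index.ANALYZED));
        // Deletes any document matching the "id" term, then adds the new one.
        writer.updateDocument(new Term("id", id), doc);
      }
    }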

windows locking file problem

2009-06-19 Thread Malo Pichot
Hi, I know a similar subject has been discussed on this list, and this is not a "windows file system" list ;-) But maybe someone has encountered the "thing"... and perhaps solved it! I have a web application that indexes many documents, so I have a quite large Lucene (2.2) index (~350 MB) managed