Re: How can we know if 2 lucene indexes are same?

2008-09-04 Thread Noble Paul നോബിള്‍ नोब्ळ्
I am not using the same index with different writers. These are two separate indexes; each has its own reader/writer. I just wanted to minimize the network load by avoiding the download of an optimized index if the contents are indeed the same. --noble On Thu, Sep 4, 2008 at 7:36 PM, Michael McCandl

Merging indexes - which is best option?

2008-09-04 Thread Antony Bowesman
I am creating several temporary batches of indexes to separate indices and periodically will merge those batches to a set of master indices. I'm using IndexWriter#addIndexesNoOptimize(), but the problem that gives me is that the master may already contain the index for that document and I get a dup

Re: Hits document offset information

2008-09-04 Thread Chris Hostetter
: Now, I would like to access the best fragments' offsets from each : document (hits.doc(i)). I seem to recall that the recommended method for doing this is to subclass your favorite Formatter and record the information from each TokenGroup before delegating to the super class. but there

Javadoc wording in IndexWriter.addIndexesNoOptimize()

2008-09-04 Thread Antony Bowesman
The Javadoc for this method has the following comment: "This requires this index not be among those to be added, and the upper bound* of those segment doc counts not exceed maxMergeDocs. " What does the second part of that mean, which is especially confusing given that MAX_MERGE_DOCS is depre

Re: Beginner: Specific indexing

2008-09-04 Thread Chris Hostetter
Honestly: your problem doesn't sound like a Lucene problem to me at all ... i would write custom code to check your files for the pattern you are looking for. if you find it *then* construct a Document object, and add your 3 fields. I probably wouldn't even use an analyzer. -Hoss

Re: QueryParser vs. BooleanQuery

2008-09-04 Thread 叶双明
Indeed, StandardAnalyzer removes the pluses, so it analyzes 'c++' to 'c'. QueryParser produces terms that have been analyzed, whereas a hand-built BooleanQuery contains terms that have not been analyzed. I think this is the difference between them. 2008/9/4 Ian Lea <[EMAIL PROTECTED]> > Have a look at the index with Luke to

Re: PhraseQuery issues - differences with SpanNearQuery

2008-09-04 Thread Paul Elschot
On Thursday 04 September 2008 20:39:13, Mark Miller wrote: > Sounds like it's more in line with what you are looking for. If I > remember correctly, the phrase query factors in the edit distance in > scoring, but the SpanNearQuery will just use the combined idf for > each of the terms in it, so dis

Re: Lucene debug logging?

2008-09-04 Thread Justin Grunau
Daniel, yes, please see my "Problem with lucene search starting to return 0 hits when a few seconds earlier it was returning hundreds" thread. - Original Message From: Daniel Naber <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Thursday, September 4, 2008 6:10:56 PM Subject:

Re: Problem with lucene search starting to return 0 hits when a few seconds earlier it was returning hundreds

2008-09-04 Thread Leonid M.
Anyway it is worth trying (to ensure docs aren't removed between searches). What about running MatchAllDocsQuery or something similar? Are you still getting different hit counts on query rerun? PS. I'm kinda newbie with Lucene and Lucene API. So don't take my notes too seriously :) On Fri, Sep 5, 2008 at 12:46 AM

Re: Lucene debug logging?

2008-09-04 Thread Michael McCandless
For IndexWriter there's setInfoStream, which logs details about when flushing & merging is happening. Mike Justin Grunau wrote: Is there a way to turn on debug logging / trace logging for Lucene? - To unsubscribe, e-

Re: Lucene debug logging?

2008-09-04 Thread Daniel Naber
On Thursday, 4 September 2008, Justin Grunau wrote: > Is there a way to turn on debug logging / trace logging for Lucene? You can use IndexWriter's setInfoStream(). Besides that, Lucene doesn't do any logging AFAIK. Are you experiencing any problems that you want to diagnose with debugging?

Re: Problem with lucene search starting to return 0 hits when a few seconds earlier it was returning hundreds

2008-09-04 Thread Justin Grunau
Sorry, I forgot to include the visibility filters: final BooleanQuery visibilityFilter = new BooleanQuery(); visibilityFilter.add(new TermQuery(new Term("isPublic", "true")), Occur.SHOULD); visibilityFilter.add(new TermQuery(

Re: Problem with lucene search starting to return 0 hits when a few seconds earlier it was returning hundreds

2008-09-04 Thread Leonid M.
* And what about the visibility filter? * Are you sure no one else accesses IndexReader and modifies the index? See reader.maxDoc() to be confident. On Fri, Sep 5, 2008 at 12:19 AM, Justin Grunau <[EMAIL PROTECTED]> wrote: > We have some code that uses lucene which has been working perfectly well > fo

Problem with lucene search starting to return 0 hits when a few seconds earlier it was returning hundreds

2008-09-04 Thread Justin Grunau
We have some code that uses lucene which has been working perfectly well for several months. Recently, a QA team in our organization has set up a server with a much larger data set than we have ever tested with in the past: the resulting lucene index is about 3G in size. On this particular se

Lucene debug logging?

2008-09-04 Thread Justin Grunau
Is there a way to turn on debug logging / trace logging for Lucene?

Re: Newbie question: using Lucene to index hierarchical information.

2008-09-04 Thread Leonid Maslov
Hi all, Thanks a lot for such a quick reply. Both scenarios sound very good to me. I would like to do my best and try to implement any of them (as the proof of the concept) and then incrementally improve, retest, investigate and rewrite then :) So, from the soap opera to the question part then:

Re: PhraseQuery issues - differences with SpanNearQuery

2008-09-04 Thread Mark Miller
Sounds like it's more in line with what you are looking for. If I remember correctly, the phrase query factors in the edit distance in scoring, but the SpanNearQuery will just use the combined idf for each of the terms in it, so distance shouldn't matter with spans (I'm sure Paul will correct me

PhraseQuery issues - differences with SpanNearQuery

2008-09-04 Thread Yannis Pavlidis
Hi, I am having an issue when using the PhraseQuery which is best illustrated with this example: I have created 2 documents to emulate URLs. One with a URL of: "http://www.airballoon.com" and title "air balloon" and the second one with URL "http://www.balloonair.com" and title: "balloon air".

ramdisks

2008-09-04 Thread Cam Bazz
hello, anyone using ramdisks for storage? there is ramsam and there is also fusion io. but they are kinda expensive. any other alternatives I wonder? Best.

Re: string similarity measures

2008-09-04 Thread mathieu
I submitted a patch to handle Aspell phonetic rules. You can find it in JIRA. On Thu, 4 Sep 2008 17:07:09 +0300, "Cam Bazz" <[EMAIL PROTECTED]> wrote: > let me rephrase the problem. I already have a set of bad words. I want to > avoid people inputting typos of the bad words. > for example 'shit'

Re: Realtime Search for Social Networks Collaboration

2008-09-04 Thread Jason Rutherglen
Hi Cam, Thanks! It has not been easy, probably has taken 3 years or so to get this far. At first I thought the new reopen code would be the solution. I used it, but then needed to modify it to do a clone instead of reference the old deleted docs. Then as I iterated, realized that just using re

Re: How can we know if 2 lucene indexes are same?

2008-09-04 Thread 叶双明
I see now, thanks Michael McCandless, good explanation!! 2008/9/4, Michael McCandless <[EMAIL PROTECTED]>: > > > Sorry, I should have said: you must always use the same writer, ie as of > 2.3, while IndexWriter.optimize (or normal segment merging) is running, > under one thread, another thread can use

lucene ram buffering

2008-09-04 Thread Cam Bazz
hello, I was reading the performance optimization guides then I found : writer.setRAMBufferSizeMB() combined with: writer.setMaxBufferedDocs(IndexWriter.DISABLE_AUTO_FLUSH); this can be used to flush automatically so if the ram buffer size is over a certain limit it will flush. now the question:

Re: string similarity measures

2008-09-04 Thread Cam Bazz
let me rephrase the problem. I already have a set of bad words. I want to avoid people inputting typos of the bad words. for example 'shit' is banned, but someone may enter sh1t. how can i flag those phonetically similar bad words to the marked bad words? Best. On Thu, Sep 4, 2008 at 5:02 PM, Ka

Re: How can we know if 2 lucene indexes are same?

2008-09-04 Thread Michael McCandless
Sorry, I should have said: you must always use the same writer, ie as of 2.3, while IndexWriter.optimize (or normal segment merging) is running, under one thread, another thread can use that *same* writer to add/delete/update documents, and both are free to make changes to the index. Be

Re: string similarity measures

2008-09-04 Thread Karl Wettin
On 4 Sep 2008 at 15:54, Cam Bazz wrote: yes, I already have a system for users reporting words. they fall on an operator screen and if operator approves, or if 3 other people marked it as curse, then it is filtered. in the other thread you wrote: I would create 1-5 ngram sized shingles and me

Re: string similarity measures

2008-09-04 Thread Cam Bazz
yes, I already have a system for users reporting words. they fall on an operator screen and if operator approves, or if 3 other people marked it as curse, then it is filtered. in the other thread you wrote: >I would create 1-5 ngram sized shingles and measure the distance using Tanimoto coefficien

Re: How can we know if 2 lucene indexes are same?

2008-09-04 Thread 叶双明
I don't agree with Michael McCandless. :) I know that after 2.3, add and delete can run in one IndexWriter at the same time, and Lucene also has an update method which deletes documents by term and then adds the new document. In my test, either LockObtainFailedException with thread sleep sentence: org.apac

Re: Similarity percentage between two Strings

2008-09-04 Thread Karl Wettin
I would create 1-5 ngram sized shingles and measure the distance using Tanimoto coefficient. That would probably work out just fine. You might want to add more weight the greater the size of the shingle. There are shingle filters in lucene/java/contrib/analyzers and there is a Tanimoto dist
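As a rough illustration of the shingle-plus-Tanimoto idea in the post above, here is a minimal plain-Java sketch. It does not use the Lucene contrib shingle filters mentioned there. The class and method names (ShingleSimilarity, shingles, tanimoto, similarity) are made up for this example, it assumes that "1-5 ngram sized shingles" means character n-grams of length 1 to 5, and it weights all shingles equally instead of favoring longer ones.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative only: character n-gram shingles compared with the Tanimoto coefficient. */
final class ShingleSimilarity {

    /** Collects every distinct character n-gram of length minN..maxN from the input. */
    static Set<String> shingles(String text, int minN, int maxN) {
        Set<String> result = new HashSet<>();
        for (int n = minN; n <= maxN; n++) {
            for (int i = 0; i + n <= text.length(); i++) {
                result.add(text.substring(i, i + n));
            }
        }
        return result;
    }

    /** Tanimoto coefficient on sets: size of intersection divided by size of union. */
    static double tanimoto(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // two empty inputs are treated as identical
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    /** Similarity in [0, 1] using unweighted shingles of length 1 to 5 (an assumed reading). */
    static double similarity(String x, String y) {
        return tanimoto(shingles(x, 1, 5), shingles(y, 1, 5));
    }

    public static void main(String[] args) {
        System.out.println(similarity("balloon", "baloon"));
        System.out.println(similarity("balloon", "airship"));
    }
}
```

The distance the post refers to would then be 1 minus this similarity. Giving longer shingles more weight, as suggested above, would mean replacing the plain set sizes with weighted sums.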

Re: string similarity measures

2008-09-04 Thread Karl Wettin
On 4 Sep 2008 at 14:38, Cam Bazz wrote: Hello, This came up before but - if we were to make a swear word filter, string edit distances are no good. for example words like `shot` are confused with `shit`. there is also a problem with words like hitchcock. apparently i need something like sound

Re: Realtime Search for Social Networks Collaboration

2008-09-04 Thread Cam Bazz
Hello Jason, I have been trying to do this for a long time on my own. keep up the good work. What I tried was a document cache using apache collections. and before an index write/delete i would sync the cache with the index. I am waiting for lucene 2.4 to proceed. (delete by query) Best. On Wed, Sep

string similarity measures

2008-09-04 Thread Cam Bazz
Hello, This came up before but - if we were to make a swear word filter, string edit distances are no good. for example words like `shot` are confused with `shit`. there is also a problem with words like hitchcock. apparently i need something like soundex or double metaphone. the thing is - these are

Re: delete/reset the index

2008-09-04 Thread 叶双明
Agree with Michael McCandless!! That way, it is handled gracefully. 2008/9/4 Michael McCandless <[EMAIL PROTECTED]> > > If you're on Windows, the safest way to do this in general, if there is any > possibility that readers are still using the index, is to create a new > IndexWriter with creat

Re: getTimestamp method in IndexCommit

2008-09-04 Thread Michael McCandless
Thanks for raising it! It's through requests like this that Lucene's API improves. Mike Noble Paul നോബിള്‍ नोब्ळ् wrote: YOU ARE FAST thanks. --Noble On Thu, Sep 4, 2008 at 2:54 PM, Michael McCandless <[EMAIL PROTECTED]> wrote: Noble Paul നോബിള്‍ नोब्ळ् wrote: On Wed, Sep 3, 2008 at 2:

Re: getTimestamp method in IndexCommit

2008-09-04 Thread Noble Paul നോബിള്‍ नोब्ळ्
YOU ARE FAST thanks. --Noble On Thu, Sep 4, 2008 at 2:54 PM, Michael McCandless <[EMAIL PROTECTED]> wrote: > > Noble Paul നോബിള്‍ नोब्ळ् wrote: > >> On Wed, Sep 3, 2008 at 2:06 PM, Michael McCandless >> <[EMAIL PROTECTED]> wrote: >>> >>> Noble Paul നോബിള്‍ नोब्ळ् wrote: >>> On Tue, Sep 2, 20

Re: getTimestamp method in IndexCommit

2008-09-04 Thread Michael McCandless
Noble Paul നോബിള്‍ नोब्ळ् wrote: On Wed, Sep 3, 2008 at 2:06 PM, Michael McCandless <[EMAIL PROTECTED]> wrote: Noble Paul നോബിള്‍ नोब्ळ् wrote: On Tue, Sep 2, 2008 at 1:56 PM, Michael McCandless <[EMAIL PROTECTED]> wrote: Are you thinking this would just fallback to Directory.fileModifi

Re: How can we know if 2 lucene indexes are same?

2008-09-04 Thread Michael McCandless
Actually, as of 2.3, this is no longer true: merges and optimizing run in the background, and allow add/update/delete documents to run at the same time. I think it's probably best to use application logic (outside of Lucene) to keep track of what updates happened to the master while the

Re: delete/reset the index

2008-09-04 Thread Michael McCandless
If you're on Windows, the safest way to do this in general, if there is any possibility that readers are still using the index, is to create a new IndexWriter with create=true. Windows does not let you remove open files. IndexWriter will gracefully handle failed deletes by retrying them

Re: QueryParser vs. BooleanQuery

2008-09-04 Thread Ian Lea
Have a look at the index with Luke to see what has actually been indexed. StandardAnalyzer may well be removing the pluses, or you may need to escape them. And watch out for case - Visual != visual in term query land. -- Ian. On Thu, Sep 4, 2008 at 9:46 AM, bogdan71 <[EMAIL PROTECTED]> wrote:
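To make the point about analysis concrete, here is a small plain-Java sketch of why a raw term can miss what was actually indexed. It is not Lucene code: ToyAnalysis and its methods are hypothetical names, and the toy analyze method (lowercase, split on anything that is not a letter or digit) is only an assumed stand-in for an analyzer that drops the pluses, not a reimplementation of StandardAnalyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Illustrative only: a toy "analyzer" showing how index-time analysis changes terms. */
final class ToyAnalysis {

    /** Lowercases the text and splits it on any run of characters that are not letters or digits. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    /** True if the raw, un-analyzed term is exactly one of the tokens produced at index time. */
    static boolean rawTermMatches(String indexedText, String rawTerm) {
        return analyze(indexedText).contains(rawTerm);
    }

    public static void main(String[] args) {
        System.out.println(analyze("Visual C++"));
        System.out.println(rawTermMatches("Visual C++", "C++"));
        System.out.println(rawTermMatches("Visual C++", "visual"));
    }
}
```

Under these assumptions, a query built from the raw strings "Visual" or "C++" finds nothing, while one built from the analyzed forms does. That matches the advice above to inspect the index with Luke before assuming what the terms look like.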

Re: How can we know if 2 lucene indexes are same?

2008-09-04 Thread 叶双明
No documents can be added to the index while the index is optimizing, and optimizing can't run while documents are being added to the index. So, barring other errors, I think we can believe the two indexes are indeed the same. :) 2008/9/4 Noble Paul നോബിള്‍ नोब्ळ् <[EMAIL PROTECTED]> > The use case is as follow

QueryParser vs. BooleanQuery

2008-09-04 Thread bogdan71
Hello, I am experiencing a strange behaviour when trying to query the same thing via BooleanQuery vs. via the know-it-all QueryParser class. Precisely, the index contains the document: "12,Visual C++,4.2" with the field layout: ID,name,version(thus, "12" is the ID field, "Visual C++" is th

Re: Pre-filtering for expensive query

2008-09-04 Thread Andrzej Bialecki
Grant Ingersoll wrote: On Aug 30, 2008, at 3:14 PM, Andrzej Bialecki wrote: I think you can use a FilteredQuery in a BooleanClause. This may be faster than the filtering code in the Searcher, because the evaluation is done during scoring and not afterwards. FilteredQuery internally makes

Re: Similarity percentage between two Strings

2008-09-04 Thread Ian Lea
Googling for "java string similarity" throws up some stuff you might find useful. -- Ian. On Wed, Sep 3, 2008 at 11:58 PM, Thiago Moreira <[EMAIL PROTECTED]> wrote: > > Well, the similar definition that I'm looking for is the number 2, maybe > the number 3, but to start the number 2 is enou

Re: delete/reset the index

2008-09-04 Thread 叶双明
Delete the index directory in the file system; I think this is the simplest!!! 2008/9/4 simon litwan <[EMAIL PROTECTED]> > hi all > > i would like to delete the index to allow to start reindexing from > scratch. > is there a way to delete all entries in an index? > > any hint is very appreciated.

delete/reset the index

2008-09-04 Thread simon litwan
hi all i would like to delete the index to allow to start reindexing from scratch. is there a way to delete all entries in an index? any hint is very appreciated. simon