I am not using the same index with different writers. These are two
separate indexes, each with its own reader/writer.
I just wanted to minimize the network load by avoiding the download of
an optimized index if the contents are indeed the same.
--noble
On Thu, Sep 4, 2008 at 7:36 PM, Michael McCandless
I am creating several temporary batches of indexes in separate indices and
periodically will merge those batches into a set of master indices. I'm using
IndexWriter#addIndexesNoOptimize(), but the problem that gives me is that the
master may already contain the index for that document, and I get a duplicate
: Now, I would like to access the best fragment offsets from each
: document (hits.doc(i)).
I seem to recall that the recommended method for doing this is to subclass
your favorite Formatter and record the information from each TokenGroup
before delegating to the super class.
but there
The Javadoc for this method has the following comment:
"This requires this index not be among those to be added, and the upper bound
of those segment doc counts not exceed maxMergeDocs."
What does the second part of that mean? It is especially confusing given that
MAX_MERGE_DOCS is deprecated
Honestly: your problem doesn't sound like a Lucene problem to me at all
... I would write custom code to check your files for the pattern you are
looking for. If you find it *then* construct a Document object, and add
your 3 fields. I probably wouldn't even use an analyzer.
-Hoss
Indeed, StandardAnalyzer is removing the pluses, so it analyzes 'c++' to 'c'.
QueryParser includes Terms that have been analyzed,
and BooleanQuery includes Terms that haven't been analyzed.
I think this is the difference between them.
2008/9/4 Ian Lea <[EMAIL PROTECTED]>
> Have a look at the index with Luke to
On Thursday 04 September 2008 20:39:13, Mark Miller wrote:
> Sounds like it's more in line with what you are looking for. If I
> remember correctly, the phrase query factors the edit distance into
> scoring, but the SpanNearQuery will just use the combined idf for
> each of the terms in it, so dis
Daniel, yes, please see my "Problem with lucene search starting to return 0
hits when a few seconds earlier it was returning hundreds" thread.
- Original Message
From: Daniel Naber <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Thursday, September 4, 2008 6:10:56 PM
Subject:
Anyway, it is worth trying (to ensure docs aren't removed between searches).
What if you run MatchAllDocsQuery or something similar? Are you still getting
a different hit count on query rerun?
PS. I'm kind of a newbie with Lucene and the Lucene API, so don't take my
notes too seriously :)
On Fri, Sep 5, 2008 at 12:46 AM
For IndexWriter there's setInfoStream, which logs details about when
flushing & merging is happening.
Mike
Justin Grunau wrote:
Is there a way to turn on debug logging / trace logging for Lucene?
On Thursday, 4 September 2008, Justin Grunau wrote:
> Is there a way to turn on debug logging / trace logging for Lucene?
You can use IndexWriter's setInfoStream(). Besides that, Lucene doesn't do
any logging AFAIK. Are you experiencing any problems that you want to
diagnose with debugging?
Sorry, I forgot to include the visibility filters:
final BooleanQuery visibilityFilter = new BooleanQuery();
visibilityFilter.add(new TermQuery(new Term("isPublic", "true")), Occur.SHOULD);
visibilityFilter.add(new TermQuery(
And what about the visibility filter? Are you sure no one else accesses the
IndexReader and modifies the index? Check reader.maxDoc() to be confident.
On Fri, Sep 5, 2008 at 12:19 AM, Justin Grunau <[EMAIL PROTECTED]> wrote:
> We have some code that uses lucene which has been working perfectly well
> fo
We have some code that uses lucene which has been working perfectly well for
several months.
Recently, a QA team in our organization has set up a server with a much larger
data set than we have ever tested with in the past: the resulting lucene index
is about 3G in size.
On this particular se
Is there a way to turn on debug logging / trace logging for Lucene?
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Hi all,
Thanks a lot for such a quick reply.
Both scenarios sound very good to me. I would like to do my best and try to
implement one of them (as a proof of concept) and then incrementally
improve, retest, investigate and rewrite :)
So, from the soap opera to the question part then:
Sounds like it's more in line with what you are looking for. If I
remember correctly, the phrase query factors the edit distance into
scoring, but the SpanNearQuery will just use the combined idf for each
of the terms in it, so distance shouldn't matter with spans (I'm sure
Paul will correct me
Hi,
I am having an issue when using the PhraseQuery which is best illustrated with
this example:
I have created 2 documents to emulate URLs. One with a URL of
"http://www.airballoon.com" and title "air balloon", and the second one with
URL "http://www.balloonair.com" and title "balloon air".
hello,
Anyone using ramdisks for storage? There is RamSan and there is also Fusion-io,
but they are kind of expensive. Any other alternatives, I wonder?
Best.
I submitted a patch to handle Aspell phonetic rules. You can find it in JIRA.
On Thu, 4 Sep 2008 17:07:09 +0300, "Cam Bazz" <[EMAIL PROTECTED]> wrote:
> let me rephrase the problem. I already have a set of bad words. I want to
> avoid people inputting typos of the bad words.
> for example 'shit'
Hi Cam,
Thanks! It has not been easy; it has probably taken 3 years or so to get
this far. At first I thought the new reopen code would be the
solution. I used it, but then needed to modify it to do a clone
instead of referencing the old deleted docs. Then as I iterated, I
realized that just using re
I see now, thanks Michael McCandless, good explanation!!
2008/9/4, Michael McCandless <[EMAIL PROTECTED]>:
>
>
> Sorry, I should have said: you must always use the same writer, ie as of
> 2.3, while IndexWriter.optimize (or normal segment merging) is running,
> under one thread, another thread can use
hello,
I was reading the performance optimization guides when I found:
writer.setRAMBufferSizeMB()
combined with: writer.setMaxBufferedDocs(IndexWriter.DISABLE_AUTO_FLUSH);
This can be used to flush automatically, so if the RAM buffer size is over a
certain limit it will flush.
Now the question:
let me rephrase the problem. I already have a set of bad words. I want to
avoid people inputting typos of the bad words.
for example 'shit' is banned, but someone may enter 'sh1t'.
How can I flag words that are phonetically similar to the marked bad words?
Best.
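One possible complement to the phonetic approaches discussed in this thread: normalize common digit-for-letter substitutions before checking the bad-word list, so 'sh1t' maps back to 'shit'. This is a standalone sketch under my own assumptions (the class name, substitution table, and word list are illustrative, not any Lucene API):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: undo common "leet" substitutions before the lookup.
public class LeetNormalizer {

    static final Set<String> BAD_WORDS = new HashSet<String>(Arrays.asList("shit"));

    // Map digits/symbols back to the letters they usually stand in for.
    static String normalize(String input) {
        return input.toLowerCase()
                .replace('1', 'i')
                .replace('3', 'e')
                .replace('4', 'a')
                .replace('0', 'o')
                .replace('5', 's')
                .replace('$', 's');
    }

    static boolean isBanned(String word) {
        return BAD_WORDS.contains(normalize(word));
    }
}
```

A real filter would still need the phonetic or n-gram matching suggested elsewhere in the thread, since substitutions alone won't catch genuine misspellings.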
On Thu, Sep 4, 2008 at 5:02 PM, Ka
Sorry, I should have said: you must always use the same writer, ie as
of 2.3, while IndexWriter.optimize (or normal segment merging) is
running, under one thread, another thread can use that *same* writer
to add/delete/update documents, and both are free to make changes to
the index.
Be
On 4 Sep 2008, at 15.54, Cam Bazz wrote:
yes, I already have a system for users reporting words. they fall on an
operator screen and if the operator approves, or if 3 other people
marked it as a curse, then it is filtered.
in the other thread you wrote:
I would create 1-5 ngram sized shingles and me
Yes, I already have a system for users reporting words. They fall on an
operator screen, and if the operator approves, or if 3 other people marked it
as a curse, then it is filtered.
in the other thread you wrote:
>I would create 1-5 ngram sized shingles and measure the distance using
>the Tanimoto coefficient
I don't agree with Michael McCandless. :)
I know that after 2.3, add and delete can run in one IndexWriter at one
time, and also Lucene has an update method which deletes documents by term
and then adds the new document.
In my test, I get either a LockObtainFailedException with a thread sleep statement:
org.apac
I would create 1-5 ngram sized shingles and measure the distance using
Tanimoto coefficient. That would probably work out just fine. You
might want to add more weight the greater the size of the shingle.
There are shingle filters in lucene/java/contrib/analyzers and there
is a Tanimoto dist
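To make the idea concrete, here is a rough standalone sketch of character shingles plus the Tanimoto (Jaccard) coefficient; it is my own illustration, not the contrib ShingleFilter or the Tanimoto distance mentioned above:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: 1-5 character n-gram shingles compared with
// the Tanimoto coefficient |A ∩ B| / |A ∪ B|.
public class ShingleSimilarity {

    // Collect all character n-grams of length 1..maxN from the word.
    static Set<String> shingles(String word, int maxN) {
        Set<String> grams = new HashSet<String>();
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= word.length(); i++) {
                grams.add(word.substring(i, i + n));
            }
        }
        return grams;
    }

    // Tanimoto coefficient of the two shingle sets: 1.0 means identical.
    static double tanimoto(String a, String b) {
        Set<String> sa = shingles(a, 5);
        Set<String> sb = shingles(b, 5);
        Set<String> intersection = new HashSet<String>(sa);
        intersection.retainAll(sb);
        Set<String> union = new HashSet<String>(sa);
        union.addAll(sb);
        return union.isEmpty() ? 1.0 : (double) intersection.size() / union.size();
    }
}
```

Weighting longer shingles more heavily, as suggested, would mean summing per-shingle weights instead of using plain set sizes.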
On 4 Sep 2008, at 14.38, Cam Bazz wrote:
Hello,
This came up before but - if we were to make a swear word filter,
string edit distances are no good. for example words like `shot` are
confused with `shit`. there is also a problem with words like
Hitchcock. apparently I need something like sound
Hello Jason,
I have been trying to do this for a long time on my own. Keep up the good
work.
What I tried was a document cache using Apache Collections, and before an
index write/delete I would sync the cache with the index.
I am waiting for Lucene 2.4 to proceed (delete by query).
Best.
On Wed, Sep
Hello,
This came up before but - if we were to make a swear word filter, string
edit distances are no good. For example, words like `shot` are confused with
`shit`. There is also a problem with words like Hitchcock. Apparently I need
something like Soundex or Double Metaphone. The thing is - these are
Agree with Michael McCandless!! That way, it is handled gracefully.
2008/9/4 Michael McCandless <[EMAIL PROTECTED]>
>
> If you're on Windows, the safest way to do this in general, if there is any
> possibility that readers are still using the index, is to create a new
> IndexWriter with creat
Thanks for raising it!
It's through requests like this that Lucene's API improves.
Mike
Noble Paul നോബിള് नोब्ळ् wrote:
YOU ARE FAST
thanks.
--Noble
On Thu, Sep 4, 2008 at 2:54 PM, Michael McCandless
<[EMAIL PROTECTED]> wrote:
Noble Paul നോബിള് नोब्ळ् wrote:
On Wed, Sep 3, 2008 at 2:
YOU ARE FAST
thanks.
--Noble
On Thu, Sep 4, 2008 at 2:54 PM, Michael McCandless
<[EMAIL PROTECTED]> wrote:
>
> Noble Paul നോബിള് नोब्ळ् wrote:
>
>> On Wed, Sep 3, 2008 at 2:06 PM, Michael McCandless
>> <[EMAIL PROTECTED]> wrote:
>>>
>>> Noble Paul നോബിള് नोब्ळ् wrote:
>>>
On Tue, Sep 2, 20
Noble Paul നോബിള് नोब्ळ् wrote:
On Wed, Sep 3, 2008 at 2:06 PM, Michael McCandless
<[EMAIL PROTECTED]> wrote:
Noble Paul നോബിള് नोब्ळ् wrote:
On Tue, Sep 2, 2008 at 1:56 PM, Michael McCandless
<[EMAIL PROTECTED]> wrote:
Are you thinking this would just fallback to
Directory.fileModifi
Actually, as of 2.3, this is no longer true: merges and optimizing run
in the background, and allow add/update/delete documents to run at the
same time.
I think it's probably best to use application logic (outside of
Lucene) to keep track of what updates happened to the master while the
If you're on Windows, the safest way to do this in general, if there
is any possibility that readers are still using the index, is to
create a new IndexWriter with create=true. Windows does not let you
remove open files. IndexWriter will gracefully handle failed deletes
by retrying them
Have a look at the index with Luke to see what has actually been
indexed. StandardAnalyzer may well be removing the pluses, or you may
need to escape them. And watch out for case - Visual != visual in
term query land.
--
Ian.
On Thu, Sep 4, 2008 at 9:46 AM, bogdan71 <[EMAIL PROTECTED]> wrote:
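The escaping suggestion above can be sketched as a small helper that backslash-prefixes Lucene's query syntax characters (a hypothetical standalone version; if I recall correctly, QueryParser also ships a static escape() method for this):

```java
// Minimal sketch: escape Lucene query syntax characters with a backslash,
// so a term like "c++" survives parsing instead of being treated as syntax.
public class QueryEscaper {

    // Characters that have special meaning in Lucene query syntax.
    static final String SPECIAL = "+-&|!(){}[]^\"~*?:\\";

    static String escape(String term) {
        StringBuilder sb = new StringBuilder();
        for (char c : term.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');   // prefix each special character
            }
            sb.append(c);
        }
        return sb.toString();
    }
}
```

Note that escaping only helps at parse time; if the analyzer strips the pluses during indexing, the escaped query still won't match.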
No documents can be added to the index while the index is optimizing, and
optimizing can't run while documents are being added to the index.
So, without any other error, I think we can believe the two indexes are
indeed the same.
:)
2008/9/4 Noble Paul നോബിള് नोब्ळ् <[EMAIL PROTECTED]>
> The use case is as follow
Hello,
I am experiencing a strange behaviour when trying to query the same thing
via BooleanQuery vs. via the know-it-all QueryParser class. Precisely, the
index contains the document:
"12,Visual C++,4.2" with the field layout ID,name,version (thus, "12" is
the ID field, "Visual C++"
is th
Grant Ingersoll wrote:
On Aug 30, 2008, at 3:14 PM, Andrzej Bialecki wrote:
I think you can use a FilteredQuery in a BooleanClause. This may be
faster than the filtering code in the Searcher, because the evaluation
is done during scoring and not afterwards. FilteredQuery internally makes
Googling for "java string similarity" throws up some stuff you might
find useful.
--
Ian.
On Wed, Sep 3, 2008 at 11:58 PM, Thiago Moreira <[EMAIL PROTECTED]> wrote:
>
> Well, the similar definition that I'm looking for is the number 2, maybe
> the number 3, but to start the number 2 is enou
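For reference, the classic Levenshtein edit distance is the sort of measure such a search turns up; a minimal dynamic-programming sketch:

```java
// Illustrative sketch: Levenshtein edit distance, the number of single
// character insertions, deletions, or substitutions between two strings.
public class EditDistance {

    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;   // deletions
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;   // insertions
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(
                        d[i - 1][j] + 1,          // delete from a
                        d[i][j - 1] + 1),         // insert into a
                        d[i - 1][j - 1] + cost);  // substitute
            }
        }
        return d[a.length()][b.length()];
    }
}
```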
Delete the index directory in the file system; I think this is the simplest!!!
2008/9/4 simon litwan <[EMAIL PROTECTED]>
> hi all
>
> I would like to delete the index to allow starting reindexing from
> scratch.
> Is there a way to delete all entries in an index?
>
> any hint is very appreciated.
hi all
I would like to delete the index to allow starting reindexing from
scratch.
Is there a way to delete all entries in an index?
any hint is very appreciated.
simon