Deleting from distributed index

2007-07-08 Thread Amadeous

Dear All
I want to use Lucene in a multi-server architecture. The built-in RMI
implementation in Lucene lets me search across my servers concurrently: I
have a gateway machine, and queries are handed to this machine. It then
forwards the query to the server machines and returns the aggregated results
to the requester.
Now I want to design a similar distributed mechanism, with a central gateway,
for something like "deleting from the indexes". For example, I want to delete
all documents matched by a particular query from all of the indexing machines.
As far as I know, Lucene does not offer a dedicated API for this purpose. Do
you have any ideas? Is there a pattern or common approach for this need? If
there is a related post on this forum (I didn't find one), please let me know.
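To make the idea concrete, here is a rough sketch of the fan-out I have in
mind. The DeletableIndex interface, its method, and the server URLs are
hypothetical (Lucene does not ship such an interface); only the
IndexWriter.deleteDocuments(Term) call on each server is real Lucene API:

    // Hypothetical remote interface each index server would expose.
    public interface DeletableIndex extends java.rmi.Remote {
        // delete every document whose given field contains exactly this term
        void deleteByTerm(String field, String text) throws java.rmi.RemoteException;
    }

    // On the gateway: fan a delete request out to every index server.
    public class DeleteGateway {
        public static void deleteEverywhere(String[] serverUrls, String field, String text)
                throws Exception {
            for (int i = 0; i < serverUrls.length; i++) {
                DeletableIndex index = (DeletableIndex) java.rmi.Naming.lookup(serverUrls[i]);
                index.deleteByTerm(field, text);
            }
        }
    }

    // On each server, the implementation could simply call, e.g.:
    //     writer.deleteDocuments(new Term(field, text));

Deleting "all documents returned by a query" could then be done by running the
distributed search first and issuing one such delete per matching document id.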
Thanks in advance and best regards
-- 
View this message in context: 
http://www.nabble.com/Deleting-from-distributed-index-tf4043814.html#a11486721
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Stop-words comparison in MoreLikeThis class in Lucene's contrib/queries project

2007-07-08 Thread Jong Kim
Hi,
 
The MoreLikeThis class in Lucene's contrib/queries project performs noise-word
filtering based on a case-sensitive comparison of the terms against the
user-supplied stop-word set.
 
I need this comparison to be case-insensitive, but I don't see any way of
achieving that by extending this class. I would have created a subclass of
MoreLikeThis and overridden the isNoiseWord() method. However, the problem is
that neither the isNoiseWord() method nor the instance variables referenced
inside it are declared protected; they are all private. Has anyone solved this
problem without modifying and rebuilding the MoreLikeThis class directly?
 
An alternative approach would be to supply a stop-word list containing every
possible mixed-case variant of each stop word. Needless to say, that isn't
likely to be a workable solution for many.
 
Ultimately it would be nice if those methods and variables had been made
protected so that applications could override some of the default behavior
without having to modify the class directly.
 
Any help would be appreciated.
 
Thanks
/Jong


Re: problems with deleteDocuments

2007-07-08 Thread Erick Erickson

First, let me say that I ran a few tests to determine the behavior, so
it's entirely possible someone who actually understands the code will
tell me I'm all wet.

The problem here is that for every scenario in which deleting on partial
field matches would be good, I can create one where it would be bad. Or
worse, disastrous...

The issue isn't deleting multiple documents at once, it's deleting on
partial term matches. In your example, you could have, say, two
IDs for the attachments, one the unique ID and one the parent ID. This
would allow you to remove all the attachments in one delete, and use
a separate delete for the original. Or you could have a "group" id
for an e-mail and its attachments, and delete them all with a single
delete on the "group" term. Or...
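
Just to illustrate (a sketch only -- the field names, mailId/attachmentId and
the writer are placeholders; the point is the ids go in UN_TOKENIZED fields so
deletes stay exact):

    // index time: each attachment carries its own unique id plus its parent's id
    Document attachment = new Document();
    attachment.add(new Field("uid", attachmentId, Field.Store.YES, Field.Index.UN_TOKENIZED));
    attachment.add(new Field("parent", mailId, Field.Store.YES, Field.Index.UN_TOKENIZED));
    writer.addDocument(attachment);

    // delete time: remove all attachments of the mail in one call...
    writer.deleteDocuments(new Term("parent", mailId));
    // ...and the mail itself with a separate, exact delete on its own id
    writer.deleteDocuments(new Term("uid", mailId));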

But I'd *much* rather have a system that required me to jump through
a hoop when I wanted this kind of behavior than have a system that
allowed me to shoot myself in the foot. Or blow off the whole leg. And
deleting on, say, a text field with a partial term match on "the" is just
too terrible to contemplate...

Best
Erick


On 7/8/07, Nadav Har'El <[EMAIL PROTECTED]> wrote:


On Wed, Jul 04, 2007, Erick Erickson wrote about "Re: problems with
deleteDocuments":
> Consider what would happen otherwise. Say you have documents
> with the following values for a field (call it blah).
> some data
> some data I put in the index
> lots of data
> data
>
> Then I don't want deleting on the term blah:data to remove all
> of them. Which seems to be what you're asking. Even if
> you restricted things to "phrases", then deleting on the term
> 'blah:some data' would remove two documents.
>
> So, while UN_TOKENIZED isn't a *requirement*, exact total term
> matches *is* the requirement. By that, I meant that whatever
> goes into the field must not be broken into pieces by the indexing
> tokenizer for deletes to work as you expect.

I disagree, and frankly, am very surprised that "exact total term matches"
is actually a requirement (I never tried it, so you may be absolutely right,
I just hope you aren't).

Let me give you just one example where id fields containing multiple words,
and the ability for a delete query to match several documents, are useful.

Consider an application for indexing emails with attachments. The email text,
and each document attachment, is indexed as a separate document. When an
email is deleted, we also need to delete its attachments. How shall we do
this? One simple implementation is to have an "id" field for each document;
the email text document will have a unique id, and the attachment document
will have two ids: its own unique id, and the containing email's id. When
we need to remove an email and all its attachments, we just remove all
documents that match the email's id - and this will include the main text
and the attachments.
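
In Lucene terms the sketch would be roughly the following (mailId,
attachmentId and the writer are of course placeholders; the "id" field is
untokenized so each value stays a single term):

    // the email text document carries a single id
    Document mail = new Document();
    mail.add(new Field("id", mailId, Field.Store.YES, Field.Index.UN_TOKENIZED));
    writer.addDocument(mail);

    // an attachment document carries two values in the same "id" field:
    // its own unique id and the containing email's id
    Document attachment = new Document();
    attachment.add(new Field("id", attachmentId, Field.Store.YES, Field.Index.UN_TOKENIZED));
    attachment.add(new Field("id", mailId, Field.Store.YES, Field.Index.UN_TOKENIZED));
    writer.addDocument(attachment);

    // removing the email and all of its attachments in one call:
    writer.deleteDocuments(new Term("id", mailId));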

By the way, the method is called "deleteDocuments" - doesn't that imply
that it's perfectly acceptable to delete many documents with one term?


--
Nadav Har'El                 |  Sunday, Jul  8 2007, 22 Tammuz 5767
IBM Haifa Research Lab       |-
                             |I am not a complete idiot - some parts
http://nadav.harel.org.il    |are missing.





Using KeywordAnalyzer with stop word filter

2007-07-08 Thread Kai Weber
Hello,

I want to parse my query string as follows:

* filter out stop words (from GermanAnalyzer)
* ignore any fielded clause of the form field:foo
* search words exactly as written (key_word, not "key word") against a certain
field

Example (using English words):
Query string: how cool is a crazyanalyzer -test -baz:foo bar
Resulting query: field:cool field:crazyanalyzer -field:test field:bar

Is there a way to stack Analyzers together and use them for query
parsing? How would I do this?
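
Something like the following is what I imagine -- a single Analyzer that
stacks a WhitespaceTokenizer (so key_word stays one token) with a
LowerCaseFilter and a StopFilter built from the GermanAnalyzer stop words.
This is only a sketch of the stacking; dropping the -baz:foo clauses would
still have to happen before or in the QueryParser:

    import java.io.Reader;
    import java.util.Set;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;
    import org.apache.lucene.analysis.de.GermanAnalyzer;

    public class StopWhitespaceAnalyzer extends Analyzer {
        private final Set stopSet = StopFilter.makeStopSet(GermanAnalyzer.GERMAN_STOP_WORDS);

        public TokenStream tokenStream(String fieldName, Reader reader) {
            // keep tokens exactly as written, lower-case them, drop German stop words
            return new StopFilter(new LowerCaseFilter(new WhitespaceTokenizer(reader)), stopSet);
        }
    }

Would an approach like this be the recommended way, or is there a ready-made
way to chain existing Analyzers?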

Regards, Kai




Re: Stop-words comparison in MoreLikeThis class in Lucene's contrib/queries project

2007-07-08 Thread Chris Hostetter

: I need this comparison to be case-insensitive, but I don't see any way of
: achieving it by extending this class. I would have created a subclass of
: MoreLikeThis and override the isNoiseWord() method. However, the problem is
: that, neither isNoiseWord() method nor the instance variables referenced
: inside that method are declared protected. They are all private. Has anyone

a more direct problem for you with this approach would be that the entire
class is final.

: solved this problem without modifying and building MoreLikeThis class
: directly?
:
: An alternative approach would be to supply a stopwords list containing all
: variants of the stop words with all possible mixed cases. Needless to say,

it looks like MLT doesn't do any processing of the Set you pass to
setStopWords ... the only method it calls is contains(String) so as long
as you only ever put lower case terms in your set you could easily do
something like...

   Set stopWords = new HashSet() {
       public boolean contains(Object o) {
           return super.contains(o.toString().toLowerCase());
       }
   };
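
and then just make sure you only ever add lower-cased terms and use the set
as usual, e.g. (a sketch; reader and docNum are whatever you already have):

   stopWords.add("the");
   stopWords.add("and");          // lower-cased entries only

   MoreLikeThis mlt = new MoreLikeThis(reader);
   mlt.setStopWords(stopWords);   // contains() now matches case-insensitively
   Query query = mlt.like(docNum);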

: Ultimately it would be nice if those methods and variables would have been
: made protected so that applications could override some of the default
: behaviors without having to modify the class directly.

agreed ... patches welcome :)



-Hoss





Re: product based term combination for BooleanQuery?

2007-07-08 Thread Chris Hostetter

: At index time, I used a per document boost (over all fields) and a per
: field boost (over all documents). I can certainly factor out the first
: into a query boost, but I was under the impression that if I ever wanted
: to combine fields (eg to index all "name" "alias" and "title" data in a
: single "head" field) then I had to pre-boost the data prior to combining

whoa, whoa, WHOA! ... not at ALL ... I'm not sure how you got that
impression, but when combining different pieces of source data into a single
field, Lucene has no idea where those different pieces came from --
boosting a "title" field has no impact whatsoever on a "head" field just
because you happen to put the same piece of text in both "title" and
"head".

furthermore, field boosts apply to the entire field value; if you are
making a "head" field containing some text you think of as "title" and some
text you think of as "name", you can't set a boost just on the "title" part
of the "head" field.

as I said -- lose those field boosts and you should see a *big*
improvement ... in general, I would advise against any attempt to combine
different ideas into a single field for the purpose of improving relevancy
... the only reason I would ever take something like a "title" and an
"author" and combine them into a single field is to make the querying
simpler/faster, not in an attempt to improve relevancy ... query lots of
separate fields using unique query-time boosts.

: it. I tend to believe that these (short) fields contain more relevant
: information than (long) wikipedia articles or other documents.

: Should idf and tf take care of that short/long quality distinction? It
: sounds like you feel they should.

tf/idf will take care of recognizing that the word "John" is really
common, so it's not as significant to the query as "Bush" ... the
lengthNorm function of Similarity is what will help score shorter fields
better than longer fields.

: I'll build an index without the per field boost and see if that produces
: improved results.

try the DisjunctionMaxQuery too ... particularly if you have multiword
queries.  the DisMaxQueryParser in Solr that I mentioned before can be
very handy.
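
a rough sketch of what I mean by separate fields with their own query-time
boosts, wrapped in a DisjunctionMaxQuery (field names and boost values are
purely illustrative):

   // one TermQuery per field, each with its own query-time boost
   TermQuery name  = new TermQuery(new Term("name", "bush"));
   name.setBoost(4.0f);
   TermQuery alias = new TermQuery(new Term("alias", "bush"));
   alias.setBoost(2.0f);
   TermQuery title = new TermQuery(new Term("title", "bush"));
   title.setBoost(1.5f);

   // DisjunctionMaxQuery scores a doc by its best matching field,
   // plus a small tie-breaker for the others
   DisjunctionMaxQuery dmq = new DisjunctionMaxQuery(0.1f);
   dmq.add(name);
   dmq.add(alias);
   dmq.add(title);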



-Hoss





RE: Too Many Open files Exception

2007-07-08 Thread Chris Hostetter

: Issuing a "limit descriptors", I see that I have it set to 1024

: In the directory that I'm getting this particular error: 3
: I have 24 different index directories... I think the most I saw at that
: particular time in any one index was 20

as I said ... it doesn't matter where in the code you encounter the
error, or what directory the line of code that throws the error happens
to be operating on; what matters is the total number of file handles in
use by the process.

since you didn't tell us the total number of files in each of your 24
indexes, let's assume it's 15 ... 15*24 is 360 ... assuming you open new
readers before closing the old ones, you might have as many as 720 files
open at once just considering the searching aspects of the index files ...
that doesn't even count the file handles involved in writing to the index
directories.
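
i.e. something like this pattern (just a sketch; currentReader is whatever
your searches are using) keeps the overlap bounded -- open the new reader,
swap, then close the old one promptly:

   IndexReader newReader = IndexReader.open("/path/to/index");  // new handles open here
   IndexReader old = currentReader;
   currentReader = newReader;             // searches now use the new reader
   old.close();                           // without this, handles pile up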

and we're still just talking about index files; there are all the jars/class
files your process has to open, plus any network connections.  Solr, for
example, on startup, while sitting idle and not counting the files in the
index directory, has 64 files open ... that's just jars and basic shared
libraries the JVM needs.

it wouldn't surprise me if 1024 was way too low for an application
maintaining 24 different indexes and dealing with concurrent network
connections.

: : have you tried running lsof on your processes to see all the
: : files you have open?
: I'm not too familiar with this command.  Do I need to issue this command

lsof -p [pid] | wc -l

The number that comes back is the one that will cause you problems once it
approaches your open-files limit.




-Hoss





RE: Too Many Open files Exception

2007-07-08 Thread Chris Hostetter

: Ok... after spending time looking at the code... I see that a method is
: not closing a TokenStream in one of the classes (a class that is
: instantiated quite often) - I would imagine this could quite possibly be
: the culprit?

can you be more specific about the code in question?

I'm not sure what class is responsible for closing TokenStreams when
documents get added ... but the only way I can imagine this might possibly
be causing you problems is if you are using Field instances built from a
Reader (because failure to close the TokenStream *might* mean failure to
close the Reader -- but I'm just guessing there).

if you aren't constructing your Documents using Fields built with
Readers, then even if there is a bug with closing TokenStreams, it isn't
causing your problem.
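
e.g. if you are doing something along these lines (just a sketch; writer and
file are placeholders), make sure the Reader itself gets closed once the
document has been added:

   Reader contentReader = new FileReader(file);
   try {
       Document doc = new Document();
       // a Field built from a Reader is consumed during addDocument()
       doc.add(new Field("contents", contentReader));
       writer.addDocument(doc);
   } finally {
       contentReader.close();   // don't rely on anything else to close it
   }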




-Hoss





Search that supports all valid characters in a Unix filename

2007-07-08 Thread Ed Murray


Could someone let me know the best Analyzer to use to get an exact match on a
Unix filename when it is inserted into an untokenized field.

Filenames obviously contain spaces and forward slashes along with other
characters. I am using a WhitespaceAnalyzer, but when the query is parsed it
is chopped into different keywords, as such:

Filename: /repository/Administration/780 IT Support/filegate.txt
Query: URL:/repository/Administration/780 URL:IT URL:Support/filegate.txt

I have tried several different Analyzers but I can't seem to get what I want.

I am assuming that this would be a common usage of Lucene, but there does not
seem to be an easy way to do it.
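
In other words, what I want to end up with is the equivalent of the following
(a sketch; writer and searcher are placeholders), i.e. the whole path treated
as one term at index time and at query time:

    // index time: the whole path stored as a single, untouched term
    Document doc = new Document();
    doc.add(new Field("URL", "/repository/Administration/780 IT Support/filegate.txt",
                      Field.Store.YES, Field.Index.UN_TOKENIZED));
    writer.addDocument(doc);

    // query time: look the exact term up directly, bypassing the QueryParser
    // (which would otherwise split the path on whitespace)
    Query q = new TermQuery(new Term("URL",
            "/repository/Administration/780 IT Support/filegate.txt"));
    Hits hits = searcher.search(q);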