Re: SpanQuery for Terms at same position

2009-11-22 Thread Adriano Crestani
You are right, Paul: 0 would not work; probably something less than zero, as you suggested. Give it a try and tell us if it worked ; ) On Sun, Nov 22, 2009 at 9:50 AM, Paul Elschot wrote: > Op zondag 22 november 2009 04:47:50 schreef Adriano Crestani: > > Hi, > > > > I didn't test, but you might

Re: did you mean issue

2009-11-22 Thread Grant Ingersoll
How are you invoking the spell checker? On Nov 19, 2009, at 1:22 AM, m.harig wrote: > > hello all > > I have a question about the spell checker: when I search for the keyword hoem > I am getting the spell results in the following order (in which I am > retrieving 4 suggested words) > > form > hol

RE: How to deal with Token in the new TS API

2009-11-22 Thread Uwe Schindler
To call clear(), you can always downcast to AttributeImpl. But be aware that it may also clear other attributes (e.g. if it is a Token). So setting termLength to 0 is the fastest approach if you only need the term att. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.the

Re: How to deal with Token in the new TS API

2009-11-22 Thread Shai Erera
Ok, I see you fixed it at the same time I sent the email :). I think I get it ... so far. So far I have had to cache just TermAttribute. I think it'll get messy when I need to cache more, like Type and PositionIncrement, but I haven't reached those yet. Perhaps instead of creating many types of clon

Re: How to deal with Token in the new TS API

2009-11-22 Thread Shai Erera
I assume termAtt is the input's TermAttribute, right? Therefore it has no copyTo ... What I've done so far is create a TermAttribute like you proposed (fixed from my previous TermAttributeImpl): TermAttribute clone = (TermAttribute) input.getAttributeFactory().createAttributeInstance(TermAttribut

RE: How to deal with Token in the new TS API

2009-11-22 Thread Uwe Schindler
Sorry small error: Class Initializer: private final AttributeSource lastState = cloneAttributes(); private final TermAttribute lastTermAtt = lastState.addAttribute(TermAttribute.class); incrementToken: if (input.incrementToken()) { if (lastTermAtt.checkSomethingAsYouProposed) {

RE: How to deal with Token in the new TS API

2009-11-22 Thread Uwe Schindler
The cast to TermAttributeImpl may not work if the factory creates a Token... So declare termBuf as TermAttribute (without impl). To clear, you can always downcast the interface to AttributeImpl. Or create a second variable. Alternatively use my second approach. - Uwe Schindler H.-H.-Meier-All

Re: How to deal with Token in the new TS API

2009-11-22 Thread Shai Erera
Did you mean something like: TermAttributeImpl termBuf = (TermAttributeImpl) input.getAttributeFactory().createAttributeInstance(TermAttribute.class); I need to use the methods on TermAttributeImpl like clear() ... Shai On Sun, Nov 22, 2009 at 9:03 PM, Uwe Schindler wrote: > I said, you *coul

RE: How to deal with Token in the new TS API

2009-11-22 Thread Uwe Schindler
Another idea: what you can also do is create an AttributeSource instance in your TokenStream one time, using the AttributeSource.cloneAttributes() call. You can use this copy of the attributes in parallel, and maybe update the TermAttribute there, and so on. If you want to look at the last token, jus
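The parallel-copy idea above can be sketched with a hypothetical stand-in for Lucene's attribute classes (MockTermAttribute and its copyTo are simplified mocks of TermAttribute/AttributeImpl, not the real API):

```java
// Simplified stand-in for Lucene's TermAttribute (a mock, not the real class).
class MockTermAttribute {
    private String term = "";
    String term() { return term; }
    void setTerm(String t) { term = t; }
    // Mirrors AttributeImpl.copyTo: copy this attribute's state into the target.
    void copyTo(MockTermAttribute target) { target.term = this.term; }
}

public class LastTokenDemo {
    public static void main(String[] args) {
        MockTermAttribute termAtt = new MockTermAttribute();     // "live" attribute of the stream
        MockTermAttribute lastTermAtt = new MockTermAttribute(); // parallel copy, updated per token

        String[] tokens = { "Mr", ".", "Smith" };
        StringBuilder log = new StringBuilder();
        for (String tok : tokens) {
            // Before overwriting the live attribute, remember the previous token's term.
            termAtt.copyTo(lastTermAtt);
            termAtt.setTerm(tok);
            log.append(lastTermAtt.term()).append("->").append(termAtt.term()).append(";");
        }
        System.out.println(log); // previous->current pairs
    }
}
```

The parallel copy never shares state with the live attribute, so the previous token survives the next incrementToken().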

RE: How to deal with Token in the new TS API

2009-11-22 Thread Uwe Schindler
I said you *could*, if it were exposed. But the State is a holder class without functionality, because the internals are impl-dependent; maybe we will add such a thing in the future. But: if the State contained a real map, it would be slow, because each captureState call would need to fill the map, wh

Re: How to deal with Token in the new TS API

2009-11-22 Thread Shai Erera
Yes, I can clone the term itself by instantiating a TermAttributeImpl, which is better than storing the String, because the latter always allocates a char[], while the former will reuse the char[] if it's big enough. What if State included a HashMap of all attributes, in addition to its "linked-list"
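The char[]-reuse argument can be illustrated with a hypothetical buffer class (ReuseDemo is a mock; the real TermAttributeImpl's growth policy differs in detail):

```java
// Sketch of the idea behind TermAttributeImpl.setTermBuffer: the internal
// buffer is only reallocated when the incoming term is longer than it.
public class ReuseDemo {
    private char[] buf = new char[8];
    private int len;

    void setTermBuffer(char[] src, int off, int n) {
        if (n > buf.length) buf = new char[n]; // grow only when needed
        System.arraycopy(src, off, buf, 0, n);
        len = n;
    }

    String term() { return new String(buf, 0, len); }

    public static void main(String[] args) {
        ReuseDemo att = new ReuseDemo();
        att.setTermBuffer("hello".toCharArray(), 0, 5);
        char[] first = att.buf;
        att.setTermBuffer("hi".toCharArray(), 0, 2); // shorter term: same array reused
        System.out.println(att.term() + " reused=" + (att.buf == first));
    }
}
```

Storing a String instead would allocate a fresh char[] for every token, which is the cost Shai is avoiding.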

Re: Efficient filtering advise

2009-11-22 Thread Erick Erickson
Hmmm, could you show us what you do in your collector? Because one of the gotchas about a collector is loading the documents in the inner loop. Quick test: comment out whatever you're doing in the underlying collector loop, and see if there's *any* noticeable difference in speed. That'll tell you w

RE: Top field count scoring across documents

2009-11-22 Thread Peter 4U
Hi Jake, Many thanks for your quick reply. I shall check these out. Thanks! Peter > Date: Sun, 22 Nov 2009 09:20:24 -0800 > Subject: Re: Top field count scoring across documents > From: jake.man...@gmail.com > To: java-user@lucene.apache.org > > Peter, > > You want to do a facet qu

Re: Efficient filtering advise

2009-11-22 Thread Paul Elschot
Op zondag 22 november 2009 17:23:53 schreef Eran Sevi: > Thanks for the tips. > > I'm still using version 2.4 so I can't use MultiTermQueryWrapperFilter but > I'll definitely try to re-group the terms that are not changing in order > to cache them. > How can I join several such filters togethe

Re: Top field count scoring across documents

2009-11-22 Thread Jake Mannix
Peter, You want to do a facet query. This kind of functionality is not in Lucene-core (sadly), but both Solr (the fully featured search application built on Lucene) and bobo-browse (just a library, like Lucene itself) are open-source and work with Lucene to provide faceting capabilities for yo

Top field count scoring across documents

2009-11-22 Thread Peter 4U
Hello Lucene Experts, I wonder if someone might be able to shed some insight on this interesting scoring question. The problem: build a search query that will return [ordered] hits by the top number of occurrences of field values across matched documents (or as close to this as possible). Th

Re: Efficient filtering advise

2009-11-22 Thread Eran Sevi
I think it shouldn't take 5x longer, since the number of results is only about 2x larger (and much smaller than the number of terms in the filter), but maybe I'm wrong here since I'm not familiar with the filter internals. Unfortunately, the time to construct the filter is mere millisec

Re: Efficient filtering advise

2009-11-22 Thread Erick Erickson
Hmmm, I'm not very clear here. Are you saying that you effectively form 10-50K filters and OR them all together? That would be consistent with the 50K case taking approx. 5X as long as the 10K case. Do you know where in your code the time is being spent? That'd be a big help in suggesting alter

Re: Efficient filtering advise

2009-11-22 Thread Eran Sevi
Thanks for the tips. I'm still using version 2.4 so I can't use MultiTermQueryWrapperFilter but I'll definitely try to re-group the terms that are not changing in order to cache them. How can I join several such filters together? Using FieldCacheTermsFilter sounds promising. Fortunately it is

RE: Efficient filtering advise

2009-11-22 Thread Uwe Schindler
Maybe this helps you, but read the docs; it will work only with single-value fields: http://lucene.apache.org/java/2_9_1/api/core/org/apache/lucene/search/FieldCacheTermsFilter.html Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > --

Re: Efficient filtering advise

2009-11-22 Thread Paul Elschot
Try a MultiTermQueryWrapperFilter instead of the QueryFilter. I'd expect a modest gain in performance. In case it is possible to form a few groups of terms that are reused, it could even be more efficient to also use a CachingWrapperFilter for each of these groups. Regards, Paul Elschot Op zonda
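Joining several cached per-group filters amounts to OR-ing their bitsets. A minimal sketch with java.util.BitSet (Lucene's filters actually produce DocIdSets/OpenBitSets; the plain BitSet here is an assumed simplification):

```java
import java.util.BitSet;

public class OrFiltersDemo {
    // Pretend each BitSet is the cached result of one term-group filter.
    static BitSet groupFilter(int numDocs, int... matchingDocs) {
        BitSet bits = new BitSet(numDocs);
        for (int doc : matchingDocs) bits.set(doc);
        return bits;
    }

    public static void main(String[] args) {
        BitSet groupA = groupFilter(10, 1, 3, 5);
        BitSet groupB = groupFilter(10, 2, 3, 7);
        BitSet combined = (BitSet) groupA.clone();
        combined.or(groupB); // union: documents allowed by either group
        System.out.println(combined);
    }
}
```

Since each group's bitset is cached, only the cheap OR is recomputed per query.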

Efficient filtering advise

2009-11-22 Thread Eran Sevi
Hi, I have a need to filter my queries using a rather large subset of terms (10K or even 50K). All these terms are sure to exist in the index, so the number of results can be about the same as the number of terms in the filter. The terms are numbers but are not consecutive, and are from a large set o

RE: How to deal with Token in the new TS API

2009-11-22 Thread Uwe Schindler
> Because that'd mean I'll check for abbreviations for every token. Which is > a > big performance loss. That way, I can just check abbr if I encountered a > "." > (not even all end-of-sentence tokens). OK, then simply copy the term to a String and store it. The cost is the same as cloning/copyi

Re: How to deal with Token in the new TS API

2009-11-22 Thread Shai Erera
Because that'd mean I'll check for abbreviations for every token. Which is a big performance loss. That way, I can just check abbr if I encountered a "." (not even all end-of-sentence tokens). Why can't State offer a "getAttribute" like AttributeSource? Shai On Sun, Nov 22, 2009 at 4:34 PM, Uwe

RE: How to deal with Token in the new TS API

2009-11-22 Thread Uwe Schindler
If you just want to look up whether "Mr" is an abbreviation, why not look it up when you handle that token and set a boolean variable in the TS (lastTokenWasAbbreviation)? When you process the ".", remove it if the boolean is set. Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.
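The boolean-flag approach can be sketched outside of Lucene with plain token strings (the lastTokenWasAbbreviation name follows the mail; the TokenFilter plumbing is omitted):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AbbrevFilterDemo {
    public static void main(String[] args) {
        Set<String> abbreviations = new HashSet<>(Arrays.asList("Mr", "Dr"));
        List<String> input = Arrays.asList("Mr", ".", "Smith", "arrived", ".");
        List<String> output = new ArrayList<>();
        boolean lastTokenWasAbbreviation = false;
        for (String tok : input) {
            if (tok.equals(".") && lastTokenWasAbbreviation) {
                lastTokenWasAbbreviation = false;
                continue; // drop the "." that terminates an abbreviation
            }
            lastTokenWasAbbreviation = abbreviations.contains(tok);
            output.add(tok);
        }
        System.out.println(output);
    }
}
```

No state capture is needed: the "." after "Mr" is dropped, while the real end-of-sentence "." survives.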

Re: How to deal with Token in the new TS API

2009-11-22 Thread Shai Erera
What I've done is: State state = in.captureState(); ... // Upon new call to incrementToken(). State tmp = in.captureState(); in.restoreState(state); // check if termAttribute is an abbreviation. If not : in.restoreState(tmp); But seems a lot of capturing/restoring to me ... how expensive is that?

Re: How to deal with Token in the new TS API

2009-11-22 Thread Shai Erera
Perhaps I misunderstand something. The current use case I'm trying to solve is - I have an abbreviations TokenFilter which reads a token and stores it. If the next token is end-of-sentence, it checks whether the previous one is in the abbreviations list, and discards the end-of-sentence token. I ne

RE: How to deal with Token in the new TS API

2009-11-22 Thread Uwe Schindler
Use captureState and save the state somewhere. You can restore the state with restoreState to the TokenStream. CachingTokenFilter does this. So the new API uses the State object to put away tokens for later reference. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de
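A minimal model of the captureState/restoreState contract, assuming a State behaves like a full snapshot of the attribute values (a mock, not Lucene's AttributeSource internals):

```java
import java.util.HashMap;
import java.util.Map;

public class StateDemo {
    // Current attribute values of the (mock) stream.
    private final Map<String, String> atts = new HashMap<>();

    // captureState: snapshot every attribute (a full copy, which is why it has a cost).
    Map<String, String> captureState() { return new HashMap<>(atts); }

    // restoreState: overwrite the current values from the snapshot.
    void restoreState(Map<String, String> state) { atts.clear(); atts.putAll(state); }

    public static void main(String[] args) {
        StateDemo ts = new StateDemo();
        ts.atts.put("term", "Mr");
        Map<String, String> saved = ts.captureState();
        ts.atts.put("term", ".");  // the stream moved on to the next token
        ts.restoreState(saved);    // bring the previous token back
        System.out.println(ts.atts.get("term"));
    }
}
```

This is the pattern CachingTokenFilter uses: capture a State per token, then replay the states later.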

Re: How to deal with Token in the new TS API

2009-11-22 Thread Shai Erera
ok so from what I understand, I should stop working w/ Token, and move to working w/ the Attributes. addAttribute indeed does not work. Even though it does not throw an exception, if I call in.addAttribute(Token.class), I get a new instance of Token and not the one that was added by in. So this

RE: How to deal with Token in the new TS API

2009-11-22 Thread Uwe Schindler
> But I do use addAttribute(Token.class), so I don't understand why you say > it's not possible. And I completely don't understand why the new API > allows > me to just work w/ interfaces and not impls ... A while ago I got the > impression that we're trying to get rid of interfaces because they're

Re: SpanQuery for Terms at same position

2009-11-22 Thread Paul Elschot
Op zondag 22 november 2009 04:47:50 schreef Adriano Crestani: > Hi, > > I didn't test, but you might want to try SpanNearQuery and set slop to zero. > Give it a try and let me know if it worked. The slop is the number of positions "in between", so zero would still be too much to only match at the
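Paul's definition of slop as "the number of positions in between" can be modeled for two single-term clauses (a simplification; real SpanNearQuery matching is more involved):

```java
public class SlopDemo {
    // Positions strictly between two single-term spans at positions a and b.
    static int inBetween(int a, int b) { return Math.abs(a - b) - 1; }

    // The spans match when the in-between count does not exceed the slop.
    static boolean matches(int a, int b, int slop) { return inBetween(a, b) <= slop; }

    public static void main(String[] args) {
        System.out.println(matches(4, 5, 0));   // adjacent terms, slop 0: matches, so 0 is too loose
        System.out.println(matches(4, 4, -1));  // same position, slop -1: matches
        System.out.println(matches(4, 5, -1));  // adjacent terms, slop -1: no match
    }
}
```

Under this model, slop 0 still admits adjacent terms, which is why a negative slop is needed to match only terms at the same position.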

Re: How to deal with Token in the new TS API

2009-11-22 Thread Shai Erera
But I do use addAttribute(Token.class), so I don't understand why you say it's not possible. And I completely don't understand why the new API allows me to just work w/ interfaces and not impls ... A while ago I got the impression that we're trying to get rid of interfaces because they're not easy

RE: How to deal with Token in the new TS API

2009-11-22 Thread Uwe Schindler
> > I want to add Token.class, and then work w/ Token. Not TermAttribute, > PosIncrAttribute, OffsetAttribute, PayloadAttribute and TypeAttribute > (these > are the five attributes I'm using from Token). Why can't the code add > Token > to the attributes map? If all of these are anyway mapped to t

RE: How to deal with Token in the new TS API

2009-11-22 Thread Uwe Schindler
> I started to migrate my Analyzers, Tokenizer, TokenStreams and > TokenFilters > to the new API. Since the entire set of classes handled Token before, I > decided to not change it for now, and was happy to discover that Token > extends AttributeImpl, which makes the migration easier. > > So I sta

Re: How to deal with Token in the new TS API

2009-11-22 Thread Shai Erera
Thanks Uwe for the response, however that doesn't get me anywhere. I already know that Token is added once, and that after I add Token I cannot add more of them. And I understand why the double printing. I want to add Token.class, and then work w/ Token. Not TermAttribute, PosIncrAttribute, Offset

RE: How to deal with Token in the new TS API

2009-11-22 Thread Uwe Schindler
> To add to my previous email, if I do the following: > > StringReader sr = new StringReader("hello world"); > TokenStream ts = new WhitespaceTokenizer(Token.TOKEN_ATTRIBUTE_FACTORY, > sr); > > for (Iterator<Class<? extends Attribute>> iter = > ts.getAttributeClassesIterator(); iter.hasNext();) { > Class<? extends Attri

Re: How to deal with Token in the new TS API

2009-11-22 Thread Shai Erera
To add to my previous email, if I do the following: StringReader sr = new StringReader("hello world"); TokenStream ts = new WhitespaceTokenizer(Token.TOKEN_ATTRIBUTE_FACTORY, sr); for (Iterator<Class<? extends Attribute>> iter = ts.getAttributeClassesIterator(); iter.hasNext();) { Class<? extends Attribute> type = iter.
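The repeated listing can be explained by one impl object serving several attribute interfaces. A simplified model (MockToken, TermAtt, and OffsetAtt are hypothetical stand-ins for Token and Lucene's attribute interfaces, and the map stands in for AttributeSource's internal registry):

```java
import java.util.LinkedHashMap;
import java.util.Map;

interface TermAtt { }
interface OffsetAtt { }

// One impl providing several attribute interfaces, like Lucene's Token does.
class MockToken implements TermAtt, OffsetAtt { }

public class SharedImplDemo {
    public static void main(String[] args) {
        MockToken token = new MockToken();
        Map<Class<?>, Object> byInterface = new LinkedHashMap<>();
        // Register the same impl under every interface it implements.
        byInterface.put(TermAtt.class, token);
        byInterface.put(OffsetAtt.class, token);

        // Iterating the attribute classes yields two entries backed by one object,
        // which is why the same instance appears once per registered interface.
        System.out.println(byInterface.get(TermAtt.class) == byInterface.get(OffsetAtt.class));
    }
}
```

So the iterator lists one class per interface, even though a single Token instance backs them all.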

How to deal with Token in the new TS API

2009-11-22 Thread Shai Erera
Hi I started to migrate my Analyzers, Tokenizer, TokenStreams and TokenFilters to the new API. Since the entire set of classes handled Token before, I decided to not change it for now, and was happy to discover that Token extends AttributeImpl, which makes the migration easier. So I started w/ my