Re: Unable to find: org.apache.lucene.index.memory.AnalyzerUtil

2009-07-16 Thread Adriano Crestani
Hi, The package org.apache.lucene.index.memory belongs to a contrib jar. Try to add lucene-memory-.jar to your classpath. Regards, Adriano Crestani On Thu, Jul 16, 2009 at 9:23 PM, prashant ullegaddi < prashullega...@gmail.com> wrote: > Hi > > I'm unable to find this class in lucene-core-2.4.1.
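
One quick way to confirm whether the contrib jar is actually visible is to ask the class loader directly. The following is a minimal sketch using only the JDK; the class name ClasspathCheck and the method isOnClasspath are hypothetical names chosen for illustration, and nothing here is part of Lucene.

```java
// Minimal sketch: report whether a class can be found on the current classpath.
// ClasspathCheck and isOnClasspath are illustrative names, not Lucene API.
final class ClasspathCheck {

    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate and load the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Not found, or found but not loadable (e.g. a missing dependency)
            return false;
        }
    }

    public static void main(String[] args) {
        String name = "org.apache.lucene.index.memory.AnalyzerUtil";
        System.out.println(name + " on classpath: " + isOnClasspath(name));
    }
}
```

If this prints false while lucene-core is present, the contrib jar mentioned above is probably missing from the classpath used to launch the program.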

Unable to find: org.apache.lucene.index.memory.AnalyzerUtil

2009-07-16 Thread prashant ullegaddi
Hi, I'm unable to find this class in lucene-core-2.4.1.jar. Is there another jar file I need to download to get this? Regards, Prashant.

Re: speed of BooleanQueries on 2.9

2009-07-16 Thread Erick Erickson
OK, I'm feeling old today. But do any of you kids out there have any idea how miraculous this thread is? In "the bad old days", or "when I was your age", getting to the bottom of a problem like this would have involved on-site consultants at $150/hour and about 6 months. Assuming that the product

Re: Unable to do exact search with Lucene.

2009-07-16 Thread Erick Erickson
The first thing I'd do is get a copy of Luke and look in my index to see exactly what's there. Nothing in your e-mails indicates that you *should* get any hits. Although I admit not getting jakarta lucene in 50M pages seems unlikely. But Ian's suggestion that you start with a smaller index is spot

Re: .net lucene doubt

2009-07-16 Thread Erick Erickson
Well, if the .net port mimics the java library, look at the Analyzer class. There you'll see a bunch of different language analyzers. Also, look in the contrib section for others. The trick is that you must know what language you're using. Indexing multiple languages in a single index is difficult.

RE: How to get rid of unused fields?

2009-07-16 Thread Chris Hostetter
: The same here, even with trunk from yesterday. If you create a field, it : stays there forever, even after deleting *all* documents from index, : reindexing without the field and optimizing. Uwe: if you have a quick test case already written can you try it against 2.4 (and maybe 2.3) because i

RE: How to get rid of unused fields?

2009-07-16 Thread Uwe Schindler
The same here, even with trunk from yesterday. If you create a field, it stays there forever, even after deleting *all* documents from index, reindexing without the field and optimizing. Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de >

Re: How to get rid of unused fields?

2009-07-16 Thread Chris Hostetter
: After deleting documents from the index it can happen that fields become : unused (i.e. no document has this field anymore). And : IndexReader.getFieldNames() still returns these unused fields, even : after optimizing the index. Is there any chance to get rid of these : unused fields? that's od

Re: speed of BooleanQueries on 2.9

2009-07-16 Thread eks dev
> > How do you handle stop words in phrase queries? ok, good point! You found another item for list of BADs... but not for me as we do not use phrase Qs to be honest, I do not even know how they are implemented... but no, there are no positions in such cache... well, they remain slowe

Re: speed of BooleanQueries on 2.9

2009-07-16 Thread Jason Rutherglen
> caching them (as OpenBitSet) How do you handle stop words in phrase queries? On Thu, Jul 16, 2009 at 11:30 AM, eks dev wrote: > > Sure, If you have enough memory to do postings caching, with or without P4... > I see P4 as a generally faster postings format, with stopwords or not. > > I wouldn'

Re: speed of BooleanQueries on 2.9

2009-07-16 Thread eks dev
Sure, If you have enough memory to do postings caching, with or without P4... I see P4 as a generally faster postings format, with stopwords or not. I wouldn't blow Term dictionary, that just moves the problem to another place. What I am thinking of is quite simple, probably not the most elegan

RE: Search in non-linguistic text

2009-07-16 Thread Digy
Another approach could be splitting the text into chars and returning each char as a token(in a custom analyzer). For ex: for the document [some text] Tokens would be [s] [o] [m] [e] [t] [e] [x] [t] and searches such as [ome] or [ex] would get hits. Sample code written in C# is below: http
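
The character-per-token idea can be illustrated without any Lucene classes. The sketch below is in plain Java rather than the C# the message links to, and it is not the linked sample: CharTokenSketch, charTokens and matches are hypothetical names, and the consecutive-token check only stands in for what a phrase-style query over single-character tokens would do.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Minimal sketch of "one token per character" matching, assuming whitespace is dropped
// as in the [s] [o] [m] [e] [t] [e] [x] [t] example. Names are illustrative only.
final class CharTokenSketch {

    // Emit one token per non-whitespace character.
    static List<String> charTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (char c : text.toCharArray()) {
            if (!Character.isWhitespace(c)) {
                tokens.add(String.valueOf(c));
            }
        }
        return tokens;
    }

    // A query matches when its character tokens occur consecutively in the document's tokens.
    static boolean matches(String document, String query) {
        List<String> queryTokens = charTokens(query);
        return !queryTokens.isEmpty()
                && Collections.indexOfSubList(charTokens(document), queryTokens) >= 0;
    }
}
```

With this sketch, matches("some text", "ome") and matches("some text", "ex") both succeed. Because whitespace is discarded, a query such as "et" also matches across the word boundary, which may or may not be wanted for product codes.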

Re: speed of BooleanQueries on 2.9

2009-07-16 Thread Jason Rutherglen
Do we think that we'll be able to support indexing stop words using PFOR (with relaxation on the compression to gain performance?) Today it seems like the best approach to indexing stop words is to use shingles? However this blows up the term dict because shingles concatenates phrases together. On

Re: searching for c++, c#, etc...

2009-07-16 Thread Chris Salem
I figured "c++." would be a problem. Here's what I did to get around it: value.toLowerCase().replaceAll("\\.( ?\t?\n?\r?)+", " ") I'm not escaping +'s from the query so I should be good there. Thanks a lot. Sincerely, Chris Salem Development Team Main Sequence Technologies, Inc. PCRecruiter.net -
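
To see what that expression does to the text before it is indexed, it can be wrapped in a small helper. This is a minimal sketch: the expression is copied from the message, while TrailingDotSketch and normalize are hypothetical names added for illustration.

```java
// Minimal sketch wrapping the replaceAll expression quoted above.
// TrailingDotSketch and normalize are illustrative names, not part of any library.
final class TrailingDotSketch {

    static String normalize(String value) {
        // Lowercase, then replace each period (plus any spaces, tabs or line breaks
        // that directly follow it) with a single space.
        return value.toLowerCase().replaceAll("\\.( ?\t?\n?\r?)+", " ");
    }

    public static void main(String[] args) {
        System.out.println("[" + normalize("I dislike c++.") + "]");
    }
}
```

Note that the pattern replaces every period, not only sentence-final ones, so values such as version numbers or names containing dots are split as well. Also, toLowerCase() without an explicit Locale depends on the default locale of the JVM.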

Re: searching for c++, c#, etc...

2009-07-16 Thread John Wang
If you escape the character + or #, the sentence: "I know java + c++" would not skip +; furthermore, it breaks query parsing, where + is reserved. -John On Thu, Jul 16, 2009 at 9:04 AM, John Wang wrote: > This runs into problems when you have a sentence such as the following: > "I dislike c++." > > If y

Re: searching for c++, c#, etc...

2009-07-16 Thread John Wang
This runs into problems when you have a sentence such as the following: "I dislike c++." If you use WSA (WhitespaceAnalyzer), then the last token is "c++.", not "c++", so the query would not find this document. -John On Thu, Jul 16, 2009 at 8:29 AM, Chris Salem wrote: > That seems to be working. You don't have to escape the plus
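
The trailing-period problem is easy to reproduce with a plain whitespace split. The sketch below only approximates whitespace tokenization with String.split and does not use Lucene; WhitespaceSplitSketch and tokens are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

// Minimal sketch: approximate whitespace-only tokenization with String.split.
// This stands in for a whitespace analyzer for illustration; it is not Lucene code.
final class WhitespaceSplitSketch {

    static List<String> tokens(String text) {
        // Split on runs of whitespace; punctuation stays attached to its word.
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(tokens("I dislike c++."));
    }
}
```

The last token keeps its period, so an exact-term lookup for "c++" finds nothing in that list unless the period is stripped before indexing.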

Re: searching for c++, c#, etc...

2009-07-16 Thread Chris Salem
That seems to be working. You don't have to escape the pluses though. Also, it appears that the WhitespaceAnalyzer is case sensitive, but I guess I could lowercase everything that gets indexed. Thanks a lot for your help. Sincerely, Chris Salem Development Team Main Sequence Technologies, In

Re: speed of BooleanQueries on 2.9

2009-07-16 Thread eks dev
We did it for us, gave something back to community... all happy... open source works just fine here in lucene land :) Re, 10% I did not expect that much, but our index is quite dense, a lot of documents and not too many unique terms, omitTf ... so it is really hard pressure on DocIDSetIterato

Re: searching for c++, c#, etc...

2009-07-16 Thread Danil ŢORIN
Try WhitespaceAnalyzer for both indexing and searching. On search-time you may also need to escape "+", "(", ")" with "\". "#" shouldn't need escaping. On Thu, Jul 16, 2009 at 17:23, Chris Salem wrote: > I'm using the StandardAnalyzer for both searching and indexing. > Here's the code to parse the
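
The escaping step can be sketched in a few lines of plain Java. This is an illustration only: EscapeSketch and escape are hypothetical names, and the method handles just the three characters named in the message ("+", "(", ")"). The query syntax reserves more characters than these, and QueryParser.escape(), mentioned elsewhere in this thread, is the library's own helper for that.

```java
// Minimal sketch: backslash-escape only the characters named above ("+", "(", ")").
// EscapeSketch is an illustrative name; it is not a replacement for the library's own escaping.
final class EscapeSketch {

    static String escape(String term) {
        StringBuilder escaped = new StringBuilder();
        for (char c : term.toCharArray()) {
            if (c == '+' || c == '(' || c == ')') {
                escaped.append('\\'); // prefix the reserved character with a backslash
            }
            escaped.append(c);
        }
        return escaped.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("c++"));
        System.out.println(escape("c#"));
    }
}
```

"#" passes through unchanged here, in line with the remark that it should not need escaping. A backslash already present in the input is not handled by this sketch.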

Re: speed of BooleanQueries on 2.9

2009-07-16 Thread Michael McCandless
Super, thanks for testing! And, the 10% speedup overall is good progress... Mike On Thu, Jul 16, 2009 at 9:16 AM, eks dev wrote: > > and one final touch, 4X slow down does not exist with new Lucene... > I did not verify it again on the old one, but hey, who cares. Trunk is clean > and, at least

Re: searching for c++, c#, etc...

2009-07-16 Thread Chris Salem
I'm using the StandardAnalyzer for both searching and indexing. Here's the code to parse the query: Searcher searcher = new IndexSearcher(reader); Analyzer analyzer = new StandardAnalyzer(stopwords); System.out.println(queryString); QueryParser qp = new QueryParser(searchField,analyzer); Query quer

Re: Anyone used org.apache.lucene.analysis.compound.hyphenation.TernaryTree?

2009-07-16 Thread Grant Ingersoll
No, but I recall some discussion to move it up out of Analysis into a more generally useful place, as it can be appropriate for autosuggest and other things. On Jul 14, 2009, at 7:27 PM, Jason Rutherglen wrote: Just wondering if it works and if it's a good fit for autosuggest?

Re: Ugh

2009-07-16 Thread Matthew Hall
They are upgrading our mail servers here, so if you are seeing.. many MANY duplicates of things I posted.. I'm really sorry about that. T_T Matt -- Matthew Hall Software Engineer Mouse Genome Informatics mh...@informatics.jax.org

Re: Search in non-linguistic text

2009-07-16 Thread Matthew Hall
Assuming your dataset isn't incredibly large, I think you could.. cheat here, and optimize your data for searching. Am I correct in assuming that BC, should also match on ABCD? If so, then yes your current thoughts on the problems that you face are correct, and everything you do will be turnin

Re: Search in non-linguistic text

2009-07-16 Thread Robert Muir
take a look at WordDelimiterFilter from Solr [you can use it in your lucene app too] On Thu, Jul 16, 2009 at 9:04 AM, JesL wrote: > > Hello, > Are there any suggestions / best practices for using Lucene for searching > non-linguistic text?  What I mean by non-linguistic is that it's not English >

Re: Search in non-linguistic text

2009-07-16 Thread Anshum
Hi Jes, Good to see you here. You could try something like an n-gram analyzer. You'd have to explore, though I'm assuming it'd be helpful for you. -- Anshum Gupta Naukri Labs! http://ai-cafe.blogspot.com The facts expressed here belong to everybody, the opinions to me. The distinction is yours to d
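
To make the n-gram suggestion concrete, here is a minimal sketch of character n-gram generation in plain Java. It is not an analyzer and uses no Lucene classes; NGramSketch and ngrams are hypothetical names, and the ABCD / BC values are borrowed from another reply in this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch: produce all character n-grams of a fixed size from a string.
// Illustrative only; a real analyzer would emit these as tokens during indexing.
final class NGramSketch {

    static List<String> ngrams(String text, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<String> grams = new ArrayList<>();
        // Slide a window of length n across the text, one character at a time.
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("ABCD", 2));
    }
}
```

Indexing the bigrams of a product code such as ABCD means a query for BC can be answered as an ordinary term lookup instead of a leading-wildcard search, at the cost of a larger index.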

Re: speed of BooleanQueries on 2.9

2009-07-16 Thread eks dev
and one final touch, 4X slow down does not exist with new Lucene... I did not verify it again on the old one, but hey, who cares. Trunk is clean and, at least so far, our favourite QA team has nothing to complain about ... They will keep it under stress for a while... so if something comes up

Search in non-linguistic text

2009-07-16 Thread JesL
Hello, Are there any suggestions / best practices for using Lucene for searching non-linguistic text? What I mean by non-linguistic is that it's not English or any other language, but rather product codes. This is presenting some interesting challenges. Among them are the need for pretty lax wi

Re: Unable to do exact search with Lucene.

2009-07-16 Thread Ian Lea
You might like to start with a smaller index ... There are many suggestions in the "Why am I getting no hits / incorrect hits?" of the Lucene FAQ at http://wiki.apache.org/lucene-java/LuceneFAQ. Maybe if you work through those you'll find the problem. -- Ian. On Thu, Jul 16, 2009 at 1:42 PM,

Re: Unable to do exact search with Lucene.

2009-07-16 Thread prashant ullegaddi
50 million HTML pages (part of clueweb09 dataset for TREC) were indexed using Hadoop into 56 indexes. 56 indexes were merged into a single index. Analyzer is the StandardAnalyzer. On Thu, Jul 16, 2009 at 6:07 PM, Anshum wrote: > Hi Prashant, > > What did you index? how did you index? what anal

Re: speed of BooleanQueries on 2.9

2009-07-16 Thread eks dev
ok new facts, less chaos :) - LUCENE-1744 fixed it definitely; I have it confirmed Also, we found another example of the Query that was stuck (t1 t2 t3)~2 ... this is also fixed with LUCENE-1744 Re: "some queries are 4X slower than before". Was that a different issue? (Because this issu

Re: Unable to do exact search with Lucene.

2009-07-16 Thread Anshum
Hi Prashant, What did you index? how did you index? what analyzer did you use? without all of these, perhaps it'd be difficult to figure out the issue. -- Anshum Gupta Naukri Labs! http://ai-cafe.blogspot.com The facts expressed here belong to everybody, the opinions to me. The distinction is yo

Re: Unable to do exact search with Lucene.

2009-07-16 Thread prashant ullegaddi
Sorry, subject should have been: Unable to do proximity search. Also, how to do exact search in Lucene? ~ Prashant On Thu, Jul 16, 2009 at 6:04 PM, prashant ullegaddi < prashullega...@gmail.com> wrote: > Hi, > > I tried searching: > "Apache Jakarta"~10 > > Nothing was returned. What might be w

Unable to do exact search with Lucene.

2009-07-16 Thread prashant ullegaddi
Hi, I tried searching: "Apache Jakarta"~10 Nothing was returned. What might be wrong? Regards, Prashant.

Re: speed of BooleanQueries on 2.9

2009-07-16 Thread Michael McCandless
On Thu, Jul 16, 2009 at 6:38 AM, eks dev wrote: > and this String has exactly that form > (x OR y OR z) OR (a OR b OR c), > That is exactly how I construct the Query, have a look at brackets on this > toString result . Duh! OK, I had missed that your large query actually had 2 clauses at the to

Re: speed of BooleanQueries on 2.9

2009-07-16 Thread eks dev
I am getting lost as well, maybe I managed to confuse myself and everybody else here. But all agree, it would be good to know why it works now Re. Query rewriting. This Query gets printed with /// BooleanQuery q; q.toString() search(q, null, 200): /// => this is the Query that enters

Re: Lucene problem:No document handler defined for the name "test"

2009-07-16 Thread Pablo Mosquera Saenz
Ok, thanks, I will try in the spring users mailing list. 2009/7/16 Simon Willnauer > I guess you will get much more help on the spring mailing list than you > will get from java-users. > your problem is related to your configuration and not to lucene as far > as I can tell. > > simon > > On Thu, Ju

Re: Lucene problem:No document handler defined for the name "test"

2009-07-16 Thread Simon Willnauer
I guess you will get much more help on the spring mailing list than you will get from java-users. Your problem is related to your configuration and not to lucene as far as I can tell. simon On Thu, Jul 16, 2009 at 12:20 PM, Pablo Mosquera Saenz wrote: > Hi, I have downloaded the springmodule for lu

Lucene problem:No document handler defined for the name "test"

2009-07-16 Thread Pablo Mosquera Saenz
Hi, I have downloaded the springmodule for lucene, version 0.9, and tried to test the sample. I have used the lucene core library 2.4.1. The first problem I found is that with the initial configuration, with SingleSearcherFactory, in the startup I have an error because ther

Re: speed of BooleanQueries on 2.9

2009-07-16 Thread Michael McCandless
On Thu, Jul 16, 2009 at 5:21 AM, eks dev wrote: > Trace taken on trunk version (with fixed Yonik's bug and LUCENE-1744 that > fixed the problem somehow) Whoa, so LUCENE-1744 did in fact fix the problem? (I thought you had accidentally failed to setAllowDocsOutOfOrder(true) and that made us false

Re: speed of BooleanQueries on 2.9

2009-07-16 Thread eks dev
Trace taken on trunk version (with fixed Yonik's bug and LUCENE-1744 that fixed the problem somehow) full trace is too big (3.5Mb for this list), therefore only beginning and end: Query: +(((NAME:maria NAME:marae^0.25171682 NAME:marai^0.2365632 NAME:marao^0.2365632 NAME:marau^0.2365632 NAME:mar

Re: searching for c++, c#, etc...

2009-07-16 Thread Ian Lea
Hi Escaping should work. See http://lucene.apache.org/java/2_4_1/queryparsersyntax.html and QueryParser.escape(). And you need to be sure that your analyzer isn't removing the plus signs and that you use the same analyzer for indexing and searching. Googling for something like "lucene escape"