Re: Redundant fields Token class?

2009-11-13 Thread Babak Farhang
> i think you want to adjust the end offset of the first output token,
> and the start offset of the second.

Makes sense. Thanks so much. After thinking about this a bit more, it seems I should think of the contents of a Token's termBuffer simply as an index (or key) into the region of text defin

Re: Redundant fields Token class?

2009-11-13 Thread Robert Muir
Babak, if your filter splits a token into two output tokens, I think you want to adjust the end offset of the first output token, and the start offset of the second. Babak, for a fairly simple example of this, you can look at the ThaiWordFilter in the lucene contrib-analyzers package. It has to br
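
For illustration (this sketch is not from the mail, and it is not ThaiWordFilter itself): a minimal Lucene 2.9-style TokenFilter that splits each token in half, adjusting the end offset of the first output token and the start offset of the second. The class name and split rule are made up, and the offset arithmetic assumes the term chars map one-to-one onto the original text.

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;

    public final class HalvingFilter extends TokenFilter {
        private final TermAttribute termAtt = addAttribute(TermAttribute.class);
        private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
        private String secondHalf;          // buffered second output token
        private int secondStart, secondEnd;

        public HalvingFilter(TokenStream input) {
            super(input);
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (secondHalf != null) {
                // emit the buffered second half: shifted start, original end
                clearAttributes();
                termAtt.setTermBuffer(secondHalf);
                offsetAtt.setOffset(secondStart, secondEnd);
                secondHalf = null;
                return true;
            }
            if (!input.incrementToken()) {
                return false;
            }
            String term = termAtt.term();
            int mid = term.length() / 2;
            if (mid == 0) {
                return true; // single-char token, nothing to split
            }
            int start = offsetAtt.startOffset();
            secondHalf = term.substring(mid);
            secondStart = start + mid;
            secondEnd = offsetAtt.endOffset();
            // first half keeps the original start offset, shortened end
            termAtt.setTermBuffer(term.substring(0, mid));
            offsetAtt.setOffset(start, start + mid);
            return true;
        }
    }

A production filter would also set the position increment of the second token and handle reset(); this only shows the offset bookkeeping.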

Re: Redundant fields Token class?

2009-11-13 Thread Babak Farhang
Thanks for your explanations. I think I have a basic understanding now. What I'm not so sure about now is how to decide on the start and end offsets when the TokenFilter implementation wants to break an input token into subtokens. Should the offsets of the emitted subtokens be the same as the

Re: Prefix Query for autocomplete - TooManyClauses

2009-11-13 Thread Otis Gospodnetic
Hello, Also keep in mind prefix queries are not the cheapest. Plug: We've seen people use this successfully: http://www.sematext.com/products/autocomplete/index.html I believe somebody is trying this out with a set of 1B suggestions. The demo at http://www.sematext.com/demo/ac/index.html search

Re: OutofMemory in large index

2009-11-13 Thread Otis Gospodnetic
Hello, Comments inlined.

- Original Message -
> From: vsevel
> To: java-user@lucene.apache.org
> Sent: Fri, November 13, 2009 11:32:02 AM
> Subject: Re: OutofMemory in large index
>
> Hi, I am jumping into the thread because I have got a similar issue.
> My index is 30Gb large and

Re: Term Boost Threshold

2009-11-13 Thread Jake Mannix
On Fri, Nov 13, 2009 at 4:21 PM, Max Lynch wrote:
> > Well already, without doing any boosting, documents matching more of the
> > terms in your query will score higher. If you really want to make this
> > effect more pronounced, yes, you can boost the more important query
> > terms higher.
>

Re: Term Boost Threshold

2009-11-13 Thread Max Lynch
> Well already, without doing any boosting, documents matching more of the
> terms in your query will score higher. If you really want to make this
> effect more pronounced, yes, you can boost the more important query terms
> higher.
>
> -jake

But there isn't a way to determine exactly what bo

Re: Term Boost Threshold

2009-11-13 Thread Jake Mannix
On Fri, Nov 13, 2009 at 4:02 PM, Max Lynch wrote:
> > > Now, I would like to know exactly what term was found. For example, if a
> > > result comes back from the query above, how do I know whether John Smith
> > > was found, or both John Smith and his company, or just John Smith
> > Ma
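
Jake's reply is truncated above; one way to see which clauses of a BooleanQuery matched a particular hit (not necessarily what he goes on to suggest) is Searcher.explain(), since the Explanation tree only carries scoring detail for subqueries that matched. A minimal sketch, assuming an already-open IndexSearcher:

    import org.apache.lucene.search.Explanation;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class WhichTermMatched {
        static void printMatchDetail(IndexSearcher searcher, Query query, int docId)
                throws Exception {
            // the nested explanation lists score contributions per clause,
            // so non-matching optional clauses simply don't appear
            Explanation exp = searcher.explain(query, docId);
            System.out.println(exp.toString());
        }
    }

Note that explain() is expensive, so it suits debugging a handful of hits rather than every result.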

Re: Term Boost Threshold

2009-11-13 Thread Max Lynch
> > Now, I would like to know exactly what term was found. For example, if a
> > result comes back from the query above, how do I know whether John Smith
> > was found, or both John Smith and his company, or just John Smith
> > Manufacturing was found?
>
> In general, this is actually very

Re: Term Boost Threshold

2009-11-13 Thread Jake Mannix
On Fri, Nov 13, 2009 at 3:35 PM, Max Lynch wrote:
> > query: "San Francisco" "California" +("John Smith" "John Smith
> > Manufacturing")
> >
> > Here the San Fran and CA clauses are optional, and the ("John Smith" OR
> > "John Smith Manufacturing") is required.
>
> Thanks Jake, that works nic

Re: Term Boost Threshold

2009-11-13 Thread Max Lynch
> query: "San Francisco" "California" +("John Smith" "John Smith > Manufacturing") > > Here the San Fran and CA clauses are optional, and the ("John Smith" OR > "John Smith Manufacturing") is required. > Thanks Jake, that works nicely. Now, I would like to know exactly what term was found. For e

Re: Redundant fields Token class?

2009-11-13 Thread Robert Muir
Another example is if you used a stemmer, it might change the termLength: (walking -> walk), but the offsets of the original unstemmed word (walking) stay the same.

On Fri, Nov 13, 2009 at 6:01 PM, Uwe Schindler wrote:
> This is not coupled because:
>
> termLength() is the number of chars in the
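
A minimal sketch of that point using the Lucene 2.9-era attribute API (the tokenizer and stemmer choice here is only illustrative): the stemmer rewrites the term buffer, but the offsets still span the original word.

    import java.io.StringReader;
    import org.apache.lucene.analysis.PorterStemFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;

    public class StemOffsetDemo {
        public static void main(String[] args) throws Exception {
            TokenStream ts = new PorterStemFilter(
                new WhitespaceTokenizer(new StringReader("walking")));
            TermAttribute term = ts.addAttribute(TermAttribute.class);
            OffsetAttribute off = ts.addAttribute(OffsetAttribute.class);
            while (ts.incrementToken()) {
                // prints "walk 0 7": the term is now 4 chars long, but the
                // offsets still cover the original text "walking"
                System.out.println(term.term() + " " + off.startOffset()
                    + " " + off.endOffset());
            }
        }
    }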

RE: Redundant fields Token class?

2009-11-13 Thread Uwe Schindler
This is not coupled because: termLength() is the number of chars in the term buffer, whereas the offsets give positions in the original char stream. If you use a CharFilter to e.g. remove chars, the termLength will get shorter, but the offsets are still the original ones. Also, both things are indexe

Redundant fields Token class?

2009-11-13 Thread Babak Farhang
I'm writing a TokenFilter and am confused about why class Token has both an *endOffset* and a *termLength* field. It would appear that the following invariant should always hold for a Token instance:

  termLength() == endOffset() - startOffset()

If so, then
1) Why 2 fields, instead of 1?
2) W

Re: Term Boost Threshold

2009-11-13 Thread Jake Mannix
Did I do that wrong? I always mess up the AND/OR human-readable form of this - it's clearer when you use +/- unary operators instead:

  query: "San Francisco" "California" +("John Smith" "John Smith Manufacturing")

Here the San Fran and CA clauses are optional, and the ("John Smith" OR "John Smith
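
For reference, a sketch of building that query programmatically; the field name "contents" is an assumption, and the phrase terms are written pre-lowercased as an analyzer would typically produce them. This is roughly the structure QueryParser yields for the query string above with the default OR operator.

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause.Occur;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.PhraseQuery;

    public class NameQueryBuilder {
        static PhraseQuery phrase(String field, String... words) {
            PhraseQuery pq = new PhraseQuery();
            for (String w : words) {
                pq.add(new Term(field, w));
            }
            return pq;
        }

        public static void main(String[] args) {
            // +("John Smith" "John Smith Manufacturing"): one must match
            BooleanQuery names = new BooleanQuery();
            names.add(phrase("contents", "john", "smith"), Occur.SHOULD);
            names.add(phrase("contents", "john", "smith", "manufacturing"), Occur.SHOULD);

            // the location phrases are optional and only raise the score
            BooleanQuery query = new BooleanQuery();
            query.add(phrase("contents", "san", "francisco"), Occur.SHOULD);
            query.add(phrase("contents", "california"), Occur.SHOULD);
            query.add(names, Occur.MUST);
            System.out.println(query);
        }
    }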

Re: Term Boost Threshold

2009-11-13 Thread Max Lynch
> You want a query like
>
> ("San Francisco" OR "California") AND ("John Smith" OR "John Smith
> Manufacturing")

Won't this require San Francisco or California to be present? I do not require them to be; I only require "John Smith" OR "John Smith Manufacturing", but I want to get a bigger scor

Re: Term Boost Threshold

2009-11-13 Thread Jake Mannix
Hi Max, You want a query like

  ("San Francisco" OR "California") AND ("John Smith" OR "John Smith Manufacturing")

essentially? You can give Lucene exactly this query and it will require that either "John Smith" or "John Smith Manufacturing" be present, but will score results which have these

Term Boost Threshold

2009-11-13 Thread Max Lynch
Hi, I am trying to move from a system where I counted the frequency of terms by hand in a highlighter to determine if a result was useful to me. In an earlier post on this list someone suggested I could boost the terms that are useful to me and only accept hits above a certain threshold. However,
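
As a hedged sketch of the thresholding idea (the cutoff is arbitrary here, and absolute Lucene scores are not comparable across different queries, which is part of what makes this approach fragile):

    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;

    public class ThresholdSearch {
        static void search(IndexSearcher searcher, Query query, float minScore)
                throws Exception {
            TopDocs hits = searcher.search(query, 100);
            for (ScoreDoc sd : hits.scoreDocs) {
                // keep only hits whose raw score clears the cutoff
                if (sd.score >= minScore) {
                    System.out.println("doc=" + sd.doc + " score=" + sd.score);
                }
            }
        }
    }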

Re: listing all fields used in any documents

2009-11-13 Thread Erick Erickson
Ooooh, that'll teach me.

On Fri, Nov 13, 2009 at 1:30 PM, Uwe Schindler wrote:
> List IndexReader.getFieldNames() ?
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
> > -Original Message-
> > From: vsevel [mailto:v.se.

RE: listing all fields used in any documents

2009-11-13 Thread Uwe Schindler
List IndexReader.getFieldNames() ?

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -Original Message-
> From: vsevel [mailto:v.se...@lombardodier.com]
> Sent: Friday, November 13, 2009 5:44 PM
> To: java-user@lucene.apache.org
> Su
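
A minimal sketch of this suggestion (Lucene 2.9-era API; the index path is a placeholder):

    import java.io.File;
    import java.util.Collection;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.store.FSDirectory;

    public class ListFields {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open(
                FSDirectory.open(new File("/path/to/index")), true);
            // every field name that appears anywhere in the index
            Collection<String> names =
                reader.getFieldNames(IndexReader.FieldOption.ALL);
            for (String name : names) {
                System.out.println(name);
            }
            reader.close();
        }
    }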

Re: listing all fields used in any documents

2009-11-13 Thread Erick Erickson
Does TermEnum work in your situation? Best Erick

On Fri, Nov 13, 2009 at 11:44 AM, vsevel wrote:
> Hi,
>
> I am indexing log4j/logback/JUL logging events. my documents includes
> regular fields (eg: logger, message, date, ...) and custom fields that
> applications choose to use (eg: MDC).
> I
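
A sketch of the TermEnum idea - walking every term and collecting the distinct field names (assumes an already-open IndexReader; unlike getFieldNames() this only sees indexed fields):

    import java.util.HashSet;
    import java.util.Set;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermEnum;

    public class FieldsViaTermEnum {
        static Set<String> fieldNames(IndexReader reader) throws Exception {
            Set<String> fields = new HashSet<String>();
            TermEnum terms = reader.terms();
            try {
                // terms are sorted by field, then text; collect each field once
                while (terms.next()) {
                    fields.add(terms.term().field());
                }
            } finally {
                terms.close();
            }
            return fields;
        }
    }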

Re: Custom scoring algorithm

2009-11-13 Thread Alberto Gimeno
Hi again. I've made a proof of concept using the boost factor. I have done the following: add a field for each feature and put the field boost factor as the feature value.

  private static void addDocument(String id, Map features, IndexWriter writer)
      throws IOException {
    Doc
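
A hedged reconstruction of what the truncated method above appears to do - the field names, the generic types, and the constant field value are assumptions:

    import java.io.IOException;
    import java.util.Map;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class FeatureIndexer {
        static void addDocument(String id, Map<String, Float> features,
                IndexWriter writer) throws IOException {
            Document doc = new Document();
            doc.add(new Field("id", id, Field.Store.YES, Field.Index.NOT_ANALYZED));
            for (Map.Entry<String, Float> e : features.entrySet()) {
                // the field value is a constant marker; the feature's numeric
                // value is folded into scoring via the field boost
                Field f = new Field(e.getKey(), "1",
                    Field.Store.NO, Field.Index.NOT_ANALYZED);
                f.setBoost(e.getValue());
                doc.add(f);
            }
            writer.addDocument(doc);
        }
    }

One caveat with this scheme: the field boost is multiplied into the field norm, which Lucene quantizes to a single byte, so feature values carried this way lose a lot of precision.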

listing all fields used in any documents

2009-11-13 Thread vsevel
Hi, I am indexing log4j/logback/JUL logging events. My documents include regular fields (e.g.: logger, message, date, ...) and custom fields that applications choose to use (e.g.: MDC). I would like to do full text searches on those fields just as I do on regular fields; I just need to know about th

Re: OutofMemory in large index

2009-11-13 Thread vsevel
Hi, I am jumping into the thread because I have got a similar issue. My index is 30Gb large and contains 21M docs. I was able to stay with 1Gb of RAM on the server for a while. Recently I started to simulate parallel searches. Just 2 parallel searches would get the server to crash with out of memo
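
Not from the thread itself, but a common fix for exactly this symptom is sharing one IndexSearcher across request threads (it is thread-safe) instead of opening one per search, since every open reader carries its own norms and field caches. A minimal sketch with a placeholder path:

    import java.io.File;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.FSDirectory;

    public class SharedSearcher {
        // one read-only searcher for the whole application
        private static final IndexSearcher SEARCHER;
        static {
            try {
                SEARCHER = new IndexSearcher(
                    FSDirectory.open(new File("/path/to/index")), true);
            } catch (Exception e) {
                throw new ExceptionInInitializerError(e);
            }
        }
        public static IndexSearcher get() {
            return SEARCHER;
        }
    }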

Re: Prefix Query for autocomplete - TooManyClauses

2009-11-13 Thread Anjana Sarkar
Hi Simon, Thank you very much for your reply. Maybe an example will help clarify my use case. Say I have the following two indexed columns with this data:

  data          boost field
  african ant   10
  alligator     50
  anthem        20
  antelope      30
  another       5

Custom scoring algorithm

2009-11-13 Thread Alberto Gimeno
Hi. I am developing an application and I would like to add searching capabilities. I have a database with items. Each item has a number of "features" with a numeric value. Example: feature_x=100, feature_y=200. Items can have common or different "features". And they can have a variable number of "

Re: Prefix Query for autocomplete - TooManyClauses

2009-11-13 Thread Simon Willnauer
Anjana, maybe I don't understand your question correctly, but what you want to do is a spell-suggestion kind of thing on terms in the index, right? You try to use a prefix query to display those terms as an auto-completion?! So I assume that what you do is run a query and then get the possible terms f

Prefix Query for autocomplete - TooManyClauses

2009-11-13 Thread Anjana Sarkar
We are using Lucene for one of our projects here and it has been working very well for the last 2 years. The new requirement is to use it for autocomplete. Here, queries like a* or ab* pose a problem. I have set BooleanQuery.setMaxClauseCount(Integer.MAX_VALUE) to get around the TooManyClausesException. T
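
Raising maxClauseCount works, but the prefix is still rewritten into one clause per matching term. A common alternative (not from the original mail) is a filter-backed constant-score query, sketched below; note it discards per-term scoring, so any boost-based ranking has to come from somewhere else. The field name is hypothetical.

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.ConstantScoreQuery;
    import org.apache.lucene.search.PrefixFilter;
    import org.apache.lucene.search.Query;

    public class PrefixNoBoolean {
        static Query prefixQuery(String field, String prefix) {
            // matches every term with the given prefix without enumerating
            // them into boolean clauses, so TooManyClauses cannot occur
            return new ConstantScoreQuery(new PrefixFilter(new Term(field, prefix)));
        }
    }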

Re: IndexWriter.close() no longer seems to close everything

2009-11-13 Thread Albert Juhe
I don't know how, because the problem is with JBOSS in a production environment; on localhost this doesn't happen. The JBOSS server is in a production environment and contains a lot of projects; I don't know if Lucene is fighting with other libraries. I don't have control of this computer, I

Re: IndexWriter.close() no longer seems to close everything

2009-11-13 Thread Michael McCandless
Alas, I can't repro this problem ("leaking file descriptors with NRT"), either. I've got a decent stress test setup -- start with a 5M Wikipedia index, update (delete & add) @ 1000 docs/sec (using 2 threads), reopen 10X per second, searching at redline (using 9 threads), and the open file descript

Re: OutofMemory in large index

2009-11-13 Thread Simon Willnauer
On Fri, Nov 13, 2009 at 12:01 PM, Wenbo Zhao wrote:
> Thank you all... I think I need to read more docs
>
> A little question: how to add more memory over 1G?
> When I specify more than -Xmx1450M, the JVM gives an error:
> >java -Xmx1450m asdf
> Exception in thread "main" java.lang.NoClassDefFoundError

Re: OutofMemory in large index

2009-11-13 Thread Wenbo Zhao
Thank you all... I think I need to read more docs

A little question: how to add more memory over 1G? When I specify more than -Xmx1450M, the JVM gives an error:

  >java -Xmx1450m asdf
  Exception in thread "main" java.lang.NoClassDefFoundError: asdf
  >java -Xmx1451m asdf
  Error occurred during initializati

Re: docBase Parameter in Collector.setNextReader

2009-11-13 Thread Michael McCandless
Phew :) Thanks for bringing closure! Mike

On Fri, Nov 13, 2009 at 5:22 AM, Benjamin Heilbrunn wrote:
> Hello,
>
> sorry for causing inconvenience.
> It was my mistake and I wasn't able to reproduce it completely this morning.
>
> My testcase was a little too complex and there were two or three b

Re: OutofMemory in large index

2009-11-13 Thread Michael McCandless
Interrupting optimize shouldn't cause any problems. It should have no effect on the index, except possibly the partially created files might be orphan'd (left on disk but not referenced by the index), in which case they'll be cleaned up the next time you open a writer on the index. Still, running

Re: OutofMemory in large index

2009-11-13 Thread Simon Willnauer
On Fri, Nov 13, 2009 at 11:17 AM, Ian Lea wrote:
>> I got OutOfMemoryError at
>> org.apache.lucene.search.Searcher.search(Searcher.java:183)
>> My index is 43G bytes. Is that too big for Lucene?
>> Luke can see the index has over 1800M docs, but the search is also out
>> of memory.
>> I use -Xmx

Re: IndexWriter.close() no longer seems to close everything

2009-11-13 Thread Michael McCandless
Any luck narrowing this to a standalone test case that shows the problem? That new exception appears to be inside the Java code created by the app server compiling your JSP -- it's not very helpful since it doesn't "enter" Lucene. Can you try to narrow this to a standalone test case, too? Thanks

Re: docBase Parameter in Collector.setNextReader

2009-11-13 Thread Benjamin Heilbrunn
Hello, sorry for causing inconvenience. It was my mistake and I wasn't able to reproduce it completely this morning. My testcase was a little too complex and there were two or three bugs / false assumptions which made it look to me like I explained above. Benjamin

Re: OutofMemory in large index

2009-11-13 Thread Ian Lea
> I got OutOfMemoryError at
> org.apache.lucene.search.Searcher.search(Searcher.java:183)
> My index is 43G bytes. Is that too big for Lucene?
> Luke can see the index has over 1800M docs, but the search is also out
> of memory.
> I use -Xmx1024M to specify 1G java heap space.

43Gb is not too bi

Re: IndexWriter.close() no longer seems to close everything

2009-11-13 Thread Albert Juhe
Hi, About this problem I did a test yesterday: I did a downgrade, changing version 2.9.1 to 2.4.1, and the problem has been solved; all the files are closed correctly and JBOSS isn't unstable anymore. Another problem that we have observed is: sometimes, at random, when you try to make a search the