Queries not derived from the text index

2006-02-06 Thread Daniel Noll
Hi. I've got an unusual (if not crazy) question about implementing custom queries. Basically we have a UI where a user can enter a query and then select a bunch of filters to be applied to the query. These filters are currently implemented using a fairly simple wrapper around Lucene's own

Memory Increasing when the optimize is called

2006-02-06 Thread Ravi
Hi, I have an index file of around 2 GB, and when I optimize the index, Tomcat takes more memory than normal; even after the optimization completes, it is still using more memory than usual. Is this how it works, or do I need to change anything to reduce the memory

Fw: Sorting by Score

2006-02-06 Thread Daniel . Clark
~ Daniel Clark, Senior Consultant Sybase Federal Professional Services 6550 Rock Spring Drive, Suite 800 Bethesda, MD 20817 Office - (301) 896-1103 Office Fax - (301) 896-1604 Mobile - (703) 403-0340 ~ - Forward

Re: use Lucene to index sentences

2006-02-06 Thread Marc Hadfield
Hi AJ - Performance would depend on the kind of queries you are going to perform against sentences. If you are going to be querying for phrases (multi-token), or want to make use of stemming or any kind of term expansion (wildcards, synonyms, etc.), I imagine Lucene would be much superior, but I

Re: Inappropriate content detection

2006-02-06 Thread Daniel Noll
Jason Polites wrote: There is also an open source Java anti-spam API which does a Bayesian scan of email content (plus other stuff). You could retrofit it to work with raw text. There is also Classifier4J, which is more geared toward pure classification (comes with a Bayesian classifier but oth

Re: use Lucene to index sentences

2006-02-06 Thread AJ Chen
Hi Marc, thanks for your suggestions. Marking sentences in documents and using span queries is a good approach. How does its performance compare to a database approach? For example, sentences could be stored in MySQL, one sentence per row, and searched with MySQL's full-text search feature

How to get mapping of query terms to number of their occurrences in a doc?

2006-02-06 Thread Dmitry Goldenberg
Given a query, I want to be able to get, for each query term, the number of occurrences of that term. I have tried what I'm including below, and it does not seem to provide reliable results. It seems to work fine with exact matching, but as soon as stemming kicks in, all bets are off as to the value of
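The stemming pitfall Dmitry describes can be illustrated outside Lucene: if raw query terms are counted against analyzed index tokens, the counts are right for exact matches but wrong once stemming is involved, because counting has to happen in the same analyzed space on both sides. A minimal Python sketch of that idea (the `stem` function is a toy suffix-stripper invented for illustration, not a real Porter stemmer, and none of these names come from the thread):

```python
import re

def stem(token):
    # Toy suffix-stripper for illustration only -- NOT a real Porter stemmer.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def term_counts(query_terms, doc_text):
    # Analyze BOTH sides the same way (here: lowercase + toy stemming),
    # then count occurrences of each analyzed query term.
    doc_tokens = [stem(t) for t in re.findall(r"[a-z]+", doc_text.lower())]
    return {q: doc_tokens.count(stem(q.lower())) for q in query_terms}

print(term_counts(["index", "searching"],
                  "Indexing and searching: the index searches indexes"))
```

Counting the surface form "searching" directly would find zero hits here; counting its analyzed form finds both "searching" and "searches".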

Re: Inappropriate content detection

2006-02-06 Thread Jason Polites
There is also an open source Java anti-spam API which does a Bayesian scan of email content (plus other stuff). You could retrofit it to work with raw text. www.jasen.org (get the latest HEAD from CVS as the current release is a bit old... a new version is imminent) - Original Message - From:

Re: use Lucene to index sentences

2006-02-06 Thread Marc Hadfield
Hi AJ - Depending on your needs, you could create a Lucene document for each sentence (in which case searching for and returning sentences is trivial), or create a Lucene document for each of your documents, with embedded sentence start/stop markers (as a special symbol). Or, instead of a special
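Marc's first option, one document per sentence, can be sketched without Lucene at all: split each file into sentences, give each sentence its own "document" carrying a field that points back at the source file, and index those. A minimal Python sketch under those assumptions (the naive regex splitter and the field names are illustrative, not anything from the thread):

```python
import re

def sentences(text):
    # Naive splitter for illustration; real text needs a proper sentence
    # segmenter (abbreviations, quotes, etc. defeat this simple rule).
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def build_docs(files):
    # One "document" per sentence, keeping a pointer back to the source
    # file so a hit can be traced to its original context.
    docs = []
    for path, text in files.items():
        for i, s in enumerate(sentences(text)):
            docs.append({"file": path, "sentence_no": i, "text": s})
    return docs

docs = build_docs({"a.txt": "Lucene indexes text. It searches it too!"})
print(docs)
```

The cost of this route is many small documents; the marker/SpanQuery route Marc goes on to describe trades that for more complex queries over fewer documents.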

use Lucene to index sentences

2006-02-06 Thread AJ Chen
I'd appreciate any advice on whether Lucene is appropriate for indexing/searching sentences. I have millions of documents broken down into millions of sentences. No sentence exists as its own document; all these sentences are in a small number of big files. How can I use Lucene to index/search the

Re: Manually create a term freq vector

2006-02-06 Thread Grant Ingersoll
You may find this useful: http://mail-archives.apache.org/mod_mbox/lucene-java-user/200511.mbox/[EMAIL PROTECTED] Johan Oskarsson wrote: Hi. I'm trying to speed up my indexing process, and since I already know how many times I want a specific word to occur in the term frequency vector, I'd like

Manually create a term freq vector

2006-02-06 Thread Johan Oskarsson
Hi. I'm trying to speed up my indexing process, and since I already know how many times I want a specific word to occur in the term frequency vector, I'd like to be able to create the vector myself. This would speed things up because I wouldn't have to take the extra step of creating a string with
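The saving Johan is after, skipping the "build a string by repeating each word n times, then re-tokenize and re-count it" detour, can be shown concretely. A hedged Python sketch of the two routes (this models only the idea; Lucene's actual TermFreqVector API is not represented here):

```python
from collections import Counter

def vector_via_string(word_counts):
    # The roundabout route: materialize a string in which each word is
    # repeated n times, then tokenize and count it all over again.
    text = " ".join(w for w, n in word_counts.items() for _ in range(n))
    return Counter(text.split())

def vector_direct(word_counts):
    # The direct route: the known counts already ARE the frequency vector.
    return Counter(word_counts)

wc = {"lucene": 3, "index": 2}
print(vector_via_string(wc) == vector_direct(wc))
```

Both routes yield the same vector; the direct one just skips building and re-analyzing an O(sum of counts) intermediate string.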

Re: understand the queryNorm and the fieldNorm.

2006-02-06 Thread jason
Hi, thanks. I think I forgot the ^0.5. Cheers, Jason On 2/6/06, Yonik Seeley <[EMAIL PROTECTED]> wrote: > > Hi Jason, > I get the same thing for the queryNorm when I calculate it by hand: > 1/((1.7613963**2 + 1.326625**2)**.5) = 0.45349488111693986 > > -Yonik > > On 2/6/06, jason <[EMAIL PROTECTED]
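Yonik's hand calculation, queryNorm = 1 / sqrt(sum of the squared term weights), can be checked directly; the missing `^0.5` (the square root) was the whole discrepancy. The two weights below are the ones quoted in the thread:

```python
# The two term weights quoted in the thread.
weights = [1.7613963, 1.326625]

# queryNorm = 1 / sqrt(sum of squared term weights); the "**0.5" is the
# square-root step that was left out in the original hand calculation.
query_norm = 1.0 / sum(w ** 2 for w in weights) ** 0.5
print(query_norm)  # ~0.4534948811169...
```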

Re: Hit Highlighting: To store or not to store?

2006-02-06 Thread Erik Hatcher
Hugh, Both approaches are certainly in use in various projects. I typically opt for option #1, but that is because it is feasible given the data I work with and how it is managed. However, the decision is really based on the size of the text to be highlighted and whether it makes sense

Re: time of search for an index with the file .FDT much large

2006-02-06 Thread Yonik Seeley
20 seconds does seem like a long time to retrieve the stored fields of the 3000 documents. However, you should also step back and determine if you really need to do that, or if there is another way to narrow the number of documents that need to be read from disk. -Yonik On 2/6/06, Antonio Bruno

RE: Field search problem(only single word query works)

2006-02-06 Thread Xin Herbert Wu
The Luke search worked on the index files, but my query client may not have been built correctly. Upon further testing, I supplied an UnStored field in library B with a guaranteed value, white space (previously it sometimes had an empty new StringBuffer().toString() value). This makes my query client work for

time of search for an index with the file .FDT much large

2006-02-06 Thread Antonio Bruno
I have an index with 2.5 million documents. A document is formed in this way: 15 indexed fields, plus 1 field stored but not indexed whose value is a 500-byte string. A search returns on average 3000 documents. Although the 3000 document ids are returned very quickly, the 3000 documents inst

Hit Highlighting: To store or not to store?

2006-02-06 Thread Hugh Ross
We have a project with approximately 20,000 documents which require searching with hit highlighting on the content. The content is of variable size. My question is which option to take to support hit highlighting: 1. Store the content as a field in the Lucene document and highlight hit
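Whichever storage option is chosen, the highlighting step itself needs the raw text at query time; the options only differ in where that text comes from (the Lucene stored field vs. the original source). A toy sketch of the step itself (real projects would typically use Lucene's contrib Highlighter; the `<b>` markup and function names here are arbitrary illustrative choices):

```python
import re

def highlight(text, terms, tag="b"):
    # Wrap each case-insensitive whole-word match of any query term
    # in <tag>...</tag>, preserving the original casing of the match.
    pattern = r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b"
    return re.sub(pattern, rf"<{tag}>\1</{tag}>", text, flags=re.IGNORECASE)

print(highlight("Searching 20,000 documents with Lucene.",
                ["lucene", "documents"]))
```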

Re: understand the queryNorm and the fieldNorm.

2006-02-06 Thread Yonik Seeley
Hi Jason, I get the same thing for the queryNorm when I calculate it by hand: 1/((1.7613963**2 + 1.326625**2)**.5) = 0.45349488111693986 -Yonik On 2/6/06, jason <[EMAIL PROTECTED]> wrote: > Hi, > > I have a problem understanding the queryNorm and fieldNorm. > > The following is an example. I

Re: index merging

2006-02-06 Thread Yonik Seeley
On 2/6/06, Vanlerberghe, Luc <[EMAIL PROTECTED]> wrote: > Sorry to contradict you Yonik, but I'm pretty sure the commit lock is > *not* locked during a merge, only while the "segments" file is being > updated. Oops, you're right. Good thing too... if the commit lock was held during merges, one co

RE: Inappropriate content detection

2006-02-06 Thread Gwyn Carwardine
The good bit about Bayesian is that it continuously learns. The downside is that you have to teach it. Not quite as simple as a list of rude words. There's an open source Bayesian mail filter called spambayes (http://spambayes.sourceforge.net) which may lead you to interesting places. -Gwyn -
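Gwyn's point, that a Bayesian filter learns from labeled examples rather than from a fixed list of rude words, can be shown with a minimal naive Bayes sketch (toy tokenization and add-one smoothing; real filters such as SpamBayes are far more careful, and every name below is invented for illustration):

```python
import math
from collections import Counter

class NaiveBayes:
    def __init__(self):
        self.counts = {"ok": Counter(), "bad": Counter()}
        self.docs = {"ok": 0, "bad": 0}

    def teach(self, label, text):
        # The "you have to teach it" part: each labeled example
        # updates the per-class word counts.
        self.docs[label] += 1
        self.counts[label].update(text.lower().split())

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.docs.values())
        vocab = len(set().union(*self.counts.values()))
        best, best_lp = None, None
        for label, c in self.counts.items():
            n = sum(c.values())
            # Log prior + log likelihood with add-one smoothing.
            lp = math.log(self.docs[label] / total_docs)
            for w in words:
                lp += math.log((c[w] + 1) / (n + vocab))
            if best_lp is None or lp > best_lp:
                best, best_lp = label, lp
        return best

nb = NaiveBayes()
nb.teach("ok", "great post thanks for sharing")
nb.teach("bad", "buy cheap pills now")
print(nb.classify("cheap pills"))
```

The classifier has never been told that "pills" is objectionable; it infers it from which class the word was seen in, which is exactly what a keyword list cannot do.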

RE: Inappropriate content detection

2006-02-06 Thread Jeff Thorne
The site will have a million+ posts. I am not familiar with Bayesian algorithms. Is there an off-the-shelf API that can provide this type of capability? As for performance, would Bayesian be the way to go over Lucene? Thanks for the help, Jeff -Original Message- From: gekkokid [mailto:[EMAIL

To understand the queryNorm and fieldNorm

2006-02-06 Thread jason
Hi, I have a problem understanding the queryNorm and fieldNorm. The following is an example. I tried to follow what is said in the Javadoc: "Computes the normalization value for a query given the sum of the squared weights of each of the query terms". But the result is different. ID:0 C:/PDF2Text/S

Re: two problems of using the lucene.

2006-02-06 Thread Erik Hatcher
On Feb 6, 2006, at 1:37 AM, jason wrote: The source code of QueryParser.java is hard to read. Look at QueryParser.jj instead. QueryParser.java is generated using JavaCC and is thus not "source" code at all. Erik

understand the queryNorm and the fieldNorm.

2006-02-06 Thread jason
Hi, I have a problem understanding the queryNorm and fieldNorm. The following is an example. I tried to follow what is said in the Javadoc: "Computes the normalization value for a query given the sum of the squared weights of each of the query terms". But the result is different. ID:0 C:/PDF2Text/S

Request for feedback: CBIR for Lucene

2006-02-06 Thread Mathias Lux
Hi all! I've put up some classes for storing content-based MPEG-7 image descriptors in a Lucene index and querying the stored descriptors to get "similar" images. In other words: I've put up a simple library for content-based image retrieval powered by Lucene. The performance tests are quite prom
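The "similar images" query Mathias describes reduces, at its core, to comparing stored descriptor vectors with a distance function and ranking by proximity. A hedged sketch of just that core (plain Euclidean distance over made-up feature vectors; the actual MPEG-7 descriptors define their own, more elaborate distance metrics, and nothing here reflects Mathias's classes):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def most_similar(query_vec, index):
    # Rank stored images by ascending distance to the query descriptor;
    # the closest image comes first.
    return sorted(index, key=lambda name: euclidean(query_vec, index[name]))

index = {"sunset.jpg": [0.9, 0.1, 0.2], "forest.jpg": [0.1, 0.8, 0.3]}
print(most_similar([0.85, 0.15, 0.25], index))
```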