Hi.
I've got an unusual (if not crazy) question about implementing custom
queries.
Basically we have a UI where a user can enter a query and then select a
bunch of filters to be applied to the query. These filters are
currently implemented using a fairly simple wrapper around Lucene's own
Hi,
I have an index of around 2GB, and when I optimize it, Tomcat takes more
memory than normal. Even after the optimization completes, it is still using
more memory than usual. Is this expected, or do I need to change anything to
reduce the memory
Hi AJ -
Performance would depend on the kind of queries you are going to perform
against sentences. If you are going to be querying for phrases
(multi-token), want to make use of stemming, or any kind of term
expansion (wildcard, synonyms, etc.), I imagine Lucene would be much
superior, but I
Jason Polites wrote:
There is also an open source Java anti-spam API which does a Bayesian
scan of
email content (plus other stuff).
You could retro-fit to work with raw text.
There is also Classifier4J, which is more geared toward pure
classification (comes with a Bayesian classifier but oth
Hi Marc,
Thanks for your suggestions. Marking sentences in documents and using span
query is a good approach. How does its performance compare to a database
approach? For example, sentences could be stored in MySQL, one sentence per
row, and searched with MySQL's full-text search feature
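If sentences are marked with boundary tokens as Marc suggests, a span query
can keep phrase matches inside a single sentence. A minimal sketch, assuming
an illustrative "text" field and a "sentence_sep" marker token indexed
between sentences (both names are invented here):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanNotQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class SentenceSpanQuery {
    // Build a query for "quick ... fox" that must fall inside one sentence.
    public static SpanQuery build() {
        SpanQuery phrase = new SpanNearQuery(new SpanQuery[] {
                new SpanTermQuery(new Term("text", "quick")),
                new SpanTermQuery(new Term("text", "fox"))
            }, 5, true); // within 5 positions of each other, in order
        SpanQuery boundary = new SpanTermQuery(new Term("text", "sentence_sep"));
        // SpanNotQuery drops any match whose span overlaps a boundary token,
        // so hits cannot cross a sentence break.
        return new SpanNotQuery(phrase, boundary);
    }
}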
Given a query, I want to be able to, for each query term, get the number of
occurrences of the term. I have tried what I'm including below and it does not
seem to provide reliable results. It seems to work fine with exact matching, but
as soon as stemming kicks in, all bets are off as to the value of
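(The snippet the poster refers to was cut off.) For comparison, a hedged
sketch of one way to count per-document occurrences of query terms with the
Lucene 1.9/2.x-era API, assuming the field was indexed with
Field.TermVector.YES. Extracting terms from the rewritten query means the
stemmed query terms are compared against the stemmed terms stored in the
vector, which is what usually goes wrong once stemming is involved:

import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.search.Query;

public class TermCounter {
    public static void printCounts(IndexReader reader, Query query,
                                   int docId, String field) throws Exception {
        // Rewrite first so wildcard/prefix queries expand to concrete terms.
        Set terms = new HashSet();
        query.rewrite(reader).extractTerms(terms);

        TermFreqVector tfv = reader.getTermFreqVector(docId, field);
        if (tfv == null) return; // no term vector stored for this field

        for (Object o : terms) {
            Term t = (Term) o;
            int idx = tfv.indexOf(t.text());
            int freq = (idx >= 0) ? tfv.getTermFrequencies()[idx] : 0;
            System.out.println(t.text() + " -> " + freq);
        }
    }
}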
There is also an open source Java anti-spam API which does a Bayesian scan of
email content (plus other stuff).
You could retro-fit to work with raw text.
www.jasen.org
(get the latest HEAD from CVS as the current release is a bit old... new
version imminent)
Hi AJ -
Depending on your need, you could create a Lucene document for each
sentence (in which case searching and returning sentences is trivial),
or create a Lucene document for each of your documents, with embedded
sentence start/stop markers (as a special symbol). Or, instead of a
special
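A minimal sketch of the first option, one Lucene document per sentence,
using the 1.9/2.x-style Field API (the "docId" and "sentence" field names
are made up for illustration):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class SentenceIndexer {
    public static void index(IndexWriter writer, String docId, String[] sentences)
            throws Exception {
        for (int i = 0; i < sentences.length; i++) {
            Document doc = new Document();
            // Remember which source document the sentence came from.
            doc.add(new Field("docId", docId,
                              Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.add(new Field("sentence", sentences[i],
                              Field.Store.YES, Field.Index.TOKENIZED));
            writer.addDocument(doc);
        }
    }
}

Returning a matching sentence is then just reading its stored field from the hit.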
I'd appreciate any advice on whether Lucene is appropriate for indexing and
searching sentences. I have millions of documents broken down into millions of
sentences. Each sentence does not exist as a document. All these sentences
are in a small number of big files. How can I use Lucene to index/search the
You may find this useful:
http://mail-archives.apache.org/mod_mbox/lucene-java-user/200511.mbox/[EMAIL PROTECTED]
Johan Oskarsson wrote:
Hi.
I'm trying to speed up my indexing process and
since I already know how many times I want a specific
word to occur in the term frequency vector I'd like
Hi.
I'm trying to speed up my indexing process and
since I already know how many times I want a specific
word to occur in the term frequency vector, I'd like to be
able to create the vector myself.
This would speed things up because I wouldn't have to
take the extra step of creating a string with
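Lucene doesn't let you hand it a prebuilt vector, but a custom TokenStream
against the classic (pre-attribute) API that emits each word as many times
as you want it counted gets the same effect, without ever building the
repeated string. An untested sketch; the class name and the use of a Map of
word -> count are assumptions:

import java.io.IOException;
import java.util.Iterator;
import java.util.Map;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class RepeatingTokenStream extends TokenStream {
    private final Iterator entries; // entries of a Map: word -> Integer count
    private String term;
    private int remaining = 0;

    public RepeatingTokenStream(Map termCounts) {
        this.entries = termCounts.entrySet().iterator();
    }

    public Token next() throws IOException {
        while (remaining == 0) {
            if (!entries.hasNext()) return null; // stream exhausted
            Map.Entry e = (Map.Entry) entries.next();
            term = (String) e.getKey();
            remaining = ((Integer) e.getValue()).intValue();
        }
        remaining--;
        return new Token(term, 0, term.length()); // offsets are dummies here
    }
}

Return this from your Analyzer's tokenStream(field, reader), ignoring the Reader.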
Hi, thanks.
I think I forgot the ^0.5.
Cheers,
Jason
On 2/6/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
>
> Hi Jason,
> I get the same thing for the queryNorm when I calculate it by hand:
> 1/((1.7613963**2 + 1.326625**2)**.5) = 0.45349488111693986
>
> -Yonik
>
> On 2/6/06, jason <[EMAIL PROTECTED]
Hugh,
Both approaches are certainly in use in various projects. I
typically opt for option #1, but that is because it is feasible
given the data I work with, and how it is managed.
However, the decision is really based on the size of the text to be
highlighted and whether it makes sense
20 seconds does seem like a long time to retrieve the stored fields of
the 3000 documents. However, you should also step back and determine
if you really need to do that, or if there is another way to narrow
the number of documents that need to be read from disk.
-Yonik
On 2/6/06, Antonio Bruno
The Luke search worked on the index files, but my query client may not be
built correctly. Upon further testing, I supplied an UnStored field in library
B with a guaranteed value, a single white space (previously it was sometimes
the empty string from new StringBuffer().toString()). This makes my query
client work for
I have an index with 2.5 million documents.
A document is formed in this way:
- 15 indexed fields
- 1 field stored but not indexed, whose value is a string of 500 bytes.
A search on average gives back 3000 documents. While the 3000 document ids
come back very fast, the 3000 documents inst
We have a project with approximately 20,000 documents which require
searching with hit highlighting on the content. The content is of variable
size. My question is which option to take to support hit highlighting:
1. To store the content as a field in the Lucene document and to highlight
hit
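For option 1, the contrib Highlighter can work directly on the stored content
at search time. A hedged sketch (the "content" field name just follows the
wording above; error handling omitted):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;

public class HitHighlighter {
    public static String highlight(Query query, Analyzer analyzer, String content)
            throws Exception {
        Highlighter highlighter = new Highlighter(new QueryScorer(query));
        // Returns the best-scoring fragment with matches wrapped in <B>..</B>,
        // or null if no query term occurs in the text.
        return highlighter.getBestFragment(analyzer, "content", content);
    }
}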
Hi Jason,
I get the same thing for the queryNorm when I calculate it by hand:
1/((1.7613963**2 + 1.326625**2)**.5) = 0.45349488111693986
-Yonik
On 2/6/06, jason <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I have a problem of understanding the queryNorm and fieldNorm.
>
> The following is an example. I
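For reference, DefaultSimilarity implements queryNorm as
1/sqrt(sumOfSquaredWeights), which is exactly the hand calculation above:
1.7613963^2 + 1.326625^2 = ~4.8625, and 1/sqrt(4.8625) = ~0.4535.

// From DefaultSimilarity (Lucene 1.9/2.x era):
public float queryNorm(float sumOfSquaredWeights) {
    return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
}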
On 2/6/06, Vanlerberghe, Luc <[EMAIL PROTECTED]> wrote:
> Sorry to contradict you Yonik, but I'm pretty sure the commit lock is
> *not* locked during a merge, only while the "segments" file is being
> updated.
Oops, you're right. Good thing too... if the commit lock was held
during merges, one co
The good bit about Bayesian is that it continuously learns.
The downside is that you have to teach it.
Not quite as simple as a list of rude words.
There's an open source Bayesian mail filter called spambayes
(http://spambayes.sourceforge.net) which may lead you to interesting places.
-Gwyn
The site will have a million+ posts. I am not familiar with Bayesian
algorithms. Is there an off-the-shelf API that can provide this type of
capability? As for performance, would Bayesian be the way to go over Lucene?
Thanks for the help,
Jeff
Hi,
I have a problem of understanding the queryNorm and fieldNorm.
The following is an example. I tried to follow what is said in the Javadoc:
"Computes the normalization value for a query given the sum of the squared
weights of each of the query terms". But the result is different.
ID:0 C:/PDF2Text/S
On Feb 6, 2006, at 1:37 AM, jason wrote:
The source code of QueryParser.java is hard to read.
Look at QueryParser.jj instead. QueryParser.java is generated using
JavaCC and is thus not "source" code at all.
Erik
Hi all!
I've put up some classes for storing content-based MPEG-7 image
descriptors in a Lucene index and querying the stored descriptors to get
"similar" images. In other words: I've put up a simple library for
content-based image retrieval powered by Lucene.
The performance tests are quite prom