Re: Lucene Error : java.io.FileNotFoundException

2008-07-03 Thread Michael McCandless


It looks like under JBoss you are accidentally using Lucene 1.4, not  
2.3.2.
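
One way to check which Lucene jar the portal actually picks up is to ask the JVM where a class was loaded from. The sketch below uses only standard `java.lang.Class` and `java.security` calls. The class name `JarLocator` and the helper `locationOf` are made up for illustration, and it assumes the lookup is run from inside the portlet (for example from `SearchManager.search`), so that the portal's classloader is the one consulted.

```java
import java.security.CodeSource;

public class JarLocator {

    /** Returns the jar or directory a class was loaded from, or a note if the JVM does not expose it. */
    static String locationOf(Class<?> cls) {
        CodeSource source = cls.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            // Typical for classes that ship with the JVM itself.
            return "(no code source: bootstrap or runtime class)";
        }
        return source.getLocation().toString();
    }

    public static void main(String[] args) {
        // Assumption: the class to inspect is given by name, so this compiles without Lucene.
        String name = args.length > 0 ? args[0] : "org.apache.lucene.index.IndexReader";
        try {
            System.out.println(name + " loaded from " + locationOf(Class.forName(name)));
        } catch (ClassNotFoundException e) {
            System.out.println(name + " is not visible to this classloader");
        }
    }
}
```

If the printed location is not the 2.3.2 jar you deployed, an older Lucene jar elsewhere on the server's classpath is probably winning.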


Mike

yugana wrote:



Hi,

I am indexing content and searching it with Lucene. This works fine when I
use a simple servlet and JSP setup: I am able to search the indexed
content. I then tried to implement the same thing in JBoss Portal. When I run
the search there, I get the error below. Please help me resolve it. I am
using Lucene 2.3.2.

09:43:42,671 ERROR [STDERR] java.io.FileNotFoundException: D:\indexDir\segments (The system cannot find the file specified)
09:43:42,671 ERROR [STDERR] at java.io.RandomAccessFile.open(Native Method)
09:43:42,671 ERROR [STDERR] at java.io.RandomAccessFile.<init>(RandomAccessFile.java:212)
09:43:42,671 ERROR [STDERR] at org.apache.lucene.store.FSInputStream$Descriptor.<init>(FSDirectory.java:376)
09:43:42,671 ERROR [STDERR] at org.apache.lucene.store.FSInputStream.<init>(FSDirectory.java:405)
09:43:42,671 ERROR [STDERR] at org.apache.lucene.store.FSDirectory.openFile(FSDirectory.java:268)
09:43:42,671 ERROR [STDERR] at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:40)
09:43:42,671 ERROR [STDERR] at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:116)
09:43:42,671 ERROR [STDERR] at org.apache.lucene.store.Lock$With.run(Lock.java:109)
09:43:42,671 ERROR [STDERR] at org.apache.lucene.index.IndexReader.open(IndexReader.java:111)
09:43:42,671 ERROR [STDERR] at org.apache.lucene.index.IndexReader.open(IndexReader.java:95)
09:43:42,671 ERROR [STDERR] at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:38)
09:43:42,671 ERROR [STDERR] at com.xerox.mywebboard.search.SearchManager.search(SearchManager.java:53)
09:43:42,671 ERROR [STDERR] at com.xerox.mywebboard.xeroxArticleSearchPortlet.search(xeroxArticleSearchPortlet.java:45)
09:43:42,671 ERROR [STDERR] at com.xerox.mywebboard.xeroxArticleSearchPortlet.processAction(xeroxArticleSearchPortlet.java:27)
09:43:42,671 ERROR [STDERR] at org.jboss.portal.portlet.impl.jsr168.PortletContainerImpl.invokeAction(PortletContainerImpl.java
09:43:42,687 ERROR [STDERR] at org.jboss.portal.portlet.impl.jsr168.PortletContainerImpl.dispatch(PortletContainerImpl.java:401
09:43:42,687 ERROR [STDERR] at org.jboss.portal.portlet.container.PortletContainerInvoker$1.invoke(PortletContainerInvoker.java
09:43:42,687 ERROR [STDERR] at org.jboss.portal.common.invocation.Invocation.invokeNext(Invocation.java:131)
09:43:42,687 ERROR [STDERR] at org.jboss.portal.core.aspects.portlet.TransactionInterceptor.org$jboss$portal$core$aspects$portl
09:43:42,687 ERROR [STDERR] at org.jboss.portal.core.aspects.portlet.TransactionInterceptor$invokeNotSupported_N454727078796479
09:43:42,687 ERROR [STDERR] at org.jboss.aspects.tx.TxPolicy.invokeInNoTx(TxPolicy.java:66)
09:43:42,687 ERROR [STDERR] at org.jboss.aspects.tx.TxInterceptor$NotSupported.invoke(TxInterceptor.java:112)
09:43:42,687 ERROR [STDERR] at org.jboss.portal.core.aspects.portlet.TransactionInterceptor$invokeNotSupported_N454727078796479
09:43:42,687 ERROR [STDERR] at org.jboss.aspects.tx.TxPolicy.invokeInNoTx(TxPolicy.java:66)
09:43:42,687 ERROR [STDERR] at org.jboss.aspects.tx.TxInterceptor$NotSupported.invoke(TxInterceptor.java:102)
09:43:42,687 ERROR [STDERR] at org.jboss.portal.core.aspects.portlet.TransactionInterceptor$invokeNotSupported_N454727078796479
09:43:42,687 ERROR [STDERR] at org.jboss.portal.core.aspects.portlet.TransactionInterceptor.invokeNotSupported(TransactionInter
09:43:42,687 ERROR [STDERR] at org.jboss.portal.core.aspects.portlet.TransactionInterceptor.invoke(TransactionInterceptor.java:
09:43:42,687 ERROR [STDERR] at org.jboss.portal.portlet.invocation.PortletInterceptor.invoke(PortletInterceptor.java:38)
09:43:42,687 ERROR [STDERR] at org.jboss.portal.common.invocation.Invocation.invokeNext(Invocation.java:115)
09:43:42,687 ERROR [STDERR] at org.jboss.portal.core.aspects.portlet.HeaderInterceptor.invoke(HeaderInterceptor.java:50)
09:43:42,687 ERROR [STDERR] at org.jboss.portal.portlet.invocation.PortletInterceptor.invoke(PortletInterceptor.java:38)
09:43:42,687 ERROR [STDERR] at org.jboss.portal.common.invocation.Invocation.invokeNext(Invocation.java:115)
09:43:42,687 ERROR [STDERR] at org.jboss.portal.portlet.aspects.portlet.ProducerCacheInterceptor.invoke(ProducerCacheIntercepto
09:43:42,687 ERROR [STDERR] at org.jboss.portal.portlet.invocation.PortletInterceptor.invoke(PortletInterceptor.java:38)
09:43:42,687 ERROR [STDERR] at org.jboss.portal.common.invocation.Invocation.invokeNext(Invocation.java:115)
09:43:42,687 ERROR [STDERR] at org.jboss.portal.core.aspects.portlet.AjaxInterceptor.invoke(AjaxInterceptor.java:51)
09:43:42,687 ERROR [STDERR] at

Store/Index Email Address in Lucene

2008-07-03 Thread miztaken

Hi there,
I want to index email addresses in such a way that I can run wildcard, phrase
and simple searches on them.

For each document I will have a string of email addresses, as in the CC and
TO fields of a mail, e.g.
"[EMAIL PROTECTED]; [EMAIL PROTECTED]; john hopkings; [EMAIL PROTECTED]"

What is the best way to store them so that I can run these kinds of
searches on them?

Do I need to split the string into individual addresses first, then split each
address further, and store the pieces in multiple fields?

What is the best way to handle such a case?

Any help is much appreciated.

Thank You
miztaken
-- 
View this message in context: 
http://www.nabble.com/Store-Index-Email-Address-in-Lucene-tp18257247p18257247.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Search question (newbie)

2008-07-03 Thread Chris Bamford

Hi,

Can someone point me in the right direction please?
How can I trap this situation correctly?  I receive user queries like 
this (quotes included):


   /from:"fred flintston*"/

Which produces a query string of

   /+from:fred  body:flintston/   (where /body/ is the default field)

What I want is:

/+from:fred +from:flintston*/

In other words, I want quoted expressions to be treated as single units.
Thanks for any pointers,

- Chris



Enhancing phrase searching in Lucene

2008-07-03 Thread Asbjørn A. Fellinghaug

Hi.

I've just finished my master's thesis on how to enhance phrase searching in
search engines. The thesis experiments with a new approach that focuses on
pairs of words (bigrams). It can be freely downloaded here [1].

Specifically, I've experimented with bigrams based on stopwords and their
characteristics. For the experiment I created an Analyzer that emits bigram
Tokens composed of pairs of words. We start from a predefined list of
stopwords and examine each token in the Analyzer. When a stopword token is
identified, we create two new bigram tokens:
1) previous token + stopword token
2) stopword token + next token

The identified stopword token itself is discarded, since it would produce a
huge posting list in the inverted index.
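
The two rules above can be sketched on a plain token list, outside of any Lucene Analyzer. This is only an illustration under stated assumptions and not the thesis code: the class name `StopwordBigrams`, the `_` separator, and the choice to pass non-stopword tokens through unchanged are all assumptions, and runs of consecutive stopwords are not given any special treatment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopwordBigrams {

    /** Replaces each stopword with (previous + stopword) and (stopword + next) bigrams. */
    static List<String> bigramize(List<String> tokens, Set<String> stopwords) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String tok = tokens.get(i);
            if (!stopwords.contains(tok)) {
                out.add(tok); // assumption: ordinary tokens are kept as they are
                continue;
            }
            // The stopword itself is dropped; only its bigrams are emitted.
            if (i > 0) {
                out.add(tokens.get(i - 1) + "_" + tok);
            }
            if (i + 1 < tokens.size()) {
                out.add(tok + "_" + tokens.get(i + 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stopwords = new HashSet<>(Arrays.asList("to", "the"));
        System.out.println(bigramize(Arrays.asList("flight", "to", "london"), stopwords));
    }
}
```

The point of the transformation is that a term like `to` never gets a posting list of its own; only the much rarer pairs do.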

The main goal is to drastically reduce the length of the posting lists, and
thereby save I/O and processing in Apache Lucene. Based on the experiments
performed, this new phrase searching approach yields some performance gains
in Lucene.

The code created for the experiment will be made available shortly. I just
need to write some Javadoc and tidy it up. There is nothing revolutionary in
the code, and I've noticed on this mailing list that others have also looked
into this subject.

I hope someone finds some of the aspects discussed in my master's thesis
useful. I've also, to some extent, tried to describe Apache Lucene and how it
works.

[1] http://asbjorn.fellinghaug.com/filer/master/Master_thesis.pdf

-- 
Asbjørn A. Fellinghaug
[EMAIL PROTECTED]




Term Frequency for more complex terms

2008-07-03 Thread Matthew Hall
I have a quick question: could someone point me to the part of the API I
should investigate in order to work out the term frequencies of more complex
terms?


For example, I want to know the tf of "kit ligand" treated as a phrase.
I see that Luke has access to this information in its explain method,
but the API call is currently eluding me.


Thanks,

Matt

--
Matthew Hall
Software Engineer
Mouse Genome Informatics
[EMAIL PROTECTED]
(207) 288-6012






Memory Usage

2008-07-03 Thread Keith Watson

Hello All,

I have something that's not exactly causing me a major problem, but I  
would appreciate help in understanding the behaviour here:


I have an internet message board, and I soon hope to revamp the code  
to be using Lucene for searching the threads and posts, as it's far  
better than the database's fulltext capability. However, one of the  
sort of things I want to be able to do is for a user to be able to  
request a list of posts, written by user x, ordered by the newest  
first (and it's this sorting of the items by date that is the issue  
here).


To do this, I have a timestamp in the index, along with each post,  
user etc.


I find that if I use the Java SimpleDateFormat class to encode the  
timestamp like this: yyMMdd (let's not worry about the year 2100  
problem for now!), then I can measure the index cache (which is fully  
loaded, since I need to sort the results) as taking somewhere in the  
region of 30M of memory.


Now, I noticed that obviously if I index like the above, I won't get  
the correct sort order for several posts having been posted on the  
same day, so I changed it to index yyMMddHHmmss to index down to the  
second, rather than just the day. I didn't pay much attention to  
memory usage until I started getting out of heap space errors... When  
I looked into the usage I found:


(there are around 6,000,000 posts on the message board database)

Date encoded as yyMMdd: appears to be using around 30M
Date encoded as yyMMddHHmmss:  appears to be using more than 400M!

I guess I would have understood if I was seeing the usage double for  
sure, or even a little more; no idea how you guys encode the indexes,  
if at all, but it's gone up over tenfold, which I can't explain.


For now, I have just moved it back to do it on a per day basis, as  
it's not a huge deal, but can anyone help with this? Is there  
something I might be doing wrong? That's all I changed between the two  
runs, and it certainly seems to be repeatable. I tried upgrading from  
the previous version of Lucene to the latest one, but no difference.


Many thanks,

Keith.




RE: Search question (newbie)

2008-07-03 Thread John Griffin
Chris,

I've had similar requirements in the past. First strip the quotes, then
create a BooleanQuery consisting of two separate queries:

1. TermQuery for the first term - fred
2. PrefixQuery for the second term - flintston

When you add each individual query to the BooleanQuery, make sure the
BooleanClause.Occur parameter is set to MUST (look at the BooleanQuery API
docs).

Use the toString() method on the BooleanQuery after it's created to make
sure you did it correctly.
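
A different route from building the BooleanQuery by hand is to rewrite the user's text before it reaches the query parser, so that each word of a quoted expression becomes its own required clause on the same field. The sketch below does only that string rewrite with `java.util.regex`; it is not a Lucene API. The class name `QuotedFieldRewriter` and the assumption that field names are plain word characters are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedFieldRewriter {

    // Matches field:"some words", capturing the field name and the quoted text.
    private static final Pattern FIELD_PHRASE = Pattern.compile("(\\w+):\"([^\"]*)\"");

    /** Turns from:"fred flintston*" into +from:fred +from:flintston* and leaves other text alone. */
    static String rewrite(String userQuery) {
        Matcher m = FIELD_PHRASE.matcher(userQuery);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String field = m.group(1);
            StringBuilder clauses = new StringBuilder();
            for (String word : m.group(2).trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // empty quotes produce no clause
                }
                if (clauses.length() > 0) {
                    clauses.append(' ');
                }
                clauses.append('+').append(field).append(':').append(word);
            }
            m.appendReplacement(out, Matcher.quoteReplacement(clauses.toString()));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(rewrite("from:\"fred flintston*\""));
    }
}
```

Note that this drops the word-order requirement a real phrase query would have, which matches the query Chris asked for but may not suit every field.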

John G.




RE: Term Frequency for more complex terms

2008-07-03 Thread John Griffin
Matthew,

I'm not totally sure what you are asking, but if it's 'where do I call the
explain method from?', it looks like you want to call it from the
IndexSearcher class. Look at the API docs for Searcher (IndexSearcher's
superclass).

John G.

P.S.
If that's not it, look for explain in the API docs by clicking on Index at
the top of the docs. They're all there.





RE: Store/Index Email Address in Lucene

2008-07-03 Thread John Griffin
Miz,

The StandardAnalyzer recognizes email addresses as they are. That is, it pays
attention to the '@' symbol. Just store each email address in a field and
search on it normally.

This assumes you are going to store the different emails in separate fields.
There is an alternative strategy if you need it: create a string consisting
of all the emails separated by whitespace. Make sure the field is tokenized,
and then you only have to search one field for any of the emails.
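
Either strategy starts from the same preprocessing step: breaking the semicolon-separated recipient string into entries. The sketch below shows that step with plain Java strings. The class name `RecipientField` and the `example.com` addresses are placeholders for illustration, and the Lucene field creation itself is left out.

```java
import java.util.ArrayList;
import java.util.List;

public class RecipientField {

    /** Splits "a; b; c" into trimmed, non-empty entries (one per separate field). */
    static List<String> splitEntries(String raw) {
        List<String> entries = new ArrayList<>();
        for (String part : raw.split(";")) {
            String entry = part.trim();
            if (!entry.isEmpty()) {
                entries.add(entry);
            }
        }
        return entries;
    }

    /** Joins the entries with single spaces, ready for one tokenized field. */
    static String toFieldValue(String raw) {
        StringBuilder sb = new StringBuilder();
        for (String entry : splitEntries(raw)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(entry);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String raw = "alice@example.com; bob@example.com; john hopkings; carol@example.com";
        System.out.println(splitEntries(raw));
        System.out.println(toFieldValue(raw));
    }
}
```

With the single-field variant, a display name such as "john hopkings" ends up as two ordinary tokens next to the addresses.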

Your call.

John G.




Re: Memory Usage

2008-07-03 Thread Paul Smith



> (there are around 6,000,000 posts on the message board database)
>
> Date encoded as yyMMdd: appears to be using around 30M
> Date encoded as yyMMddHHmmss: appears to be using more than 400M!
>
> I guess I would have understood if I was seeing the usage double for
> sure, or even a little more; no idea how you guys encode the
> indexes, if at all, but it's gone up over tenfold, which I can't
> explain.


Sort memory cost is based on the total number of unique terms for the given
field (multiplied by the number of locales involved if you have to do
that too, but for temporal sorting you don't).


This is easier than you think: just use 2 fields (date, time) and sort
by both. This means the Date field's unique term count grows by only 1
term per day. The Time field can be set to minute resolution (if you can get
away with that), meaning that you only have a fairly insignificant total
term count for the time field. We use this at Aconex and have
indexes with millions of records (a weekly 'work' searcher refreshed
every 5 seconds, an archive searcher held in memory, and a
MultiSearcher over the two), and it works a treat. We regularly
need to return million+ results from a search (don't ask) using this
sort of sorting, and the overall search time is only a few seconds.
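
To see why the split helps, it is enough to count distinct terms for each encoding over the same set of timestamps. The sketch below does that with `SimpleDateFormat` on synthetic data. The class name `SortTermCount`, the fixed 37-second spacing between posts, and the UTC time zone are assumptions chosen only to make the run repeatable; real post times will give different counts.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.HashSet;
import java.util.Set;
import java.util.TimeZone;

public class SortTermCount {

    /** Counts distinct terms produced by a date pattern for evenly spaced synthetic posts. */
    static int uniqueTerms(String pattern, int posts, int stepSeconds) {
        SimpleDateFormat fmt = new SimpleDateFormat(pattern);
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        Set<String> terms = new HashSet<>();
        for (int i = 0; i < posts; i++) {
            // Post i is placed i * stepSeconds after the epoch.
            terms.add(fmt.format(new Date(i * (long) stepSeconds * 1000L)));
        }
        return terms.size();
    }

    public static void main(String[] args) {
        int posts = 70000;   // roughly 30 days of posts at one every 37 seconds
        int step = 37;
        System.out.println("yyMMddHHmmss: " + uniqueTerms("yyMMddHHmmss", posts, step));
        System.out.println("yyMMdd:       " + uniqueTerms("yyMMdd", posts, step));
        System.out.println("HHmm:         " + uniqueTerms("HHmm", posts, step));
    }
}
```

The single second-resolution field produces one term per post, while the date field grows by one term per day and the minute-resolution time field can never exceed 1440 terms.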


On a related note, work hard to avoid Locale-sensitive sorting on any other
fields if you can. For large result sets the CPU penalty is horrific (even
once you get past the synchronization bottleneck in the CollationKey stuff).


cheers,

Paul Smith

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene Error : java.io.FileNotFoundException

2008-07-03 Thread yugana

I have checked all the jars and tried replacing with the same versions. Still
I get the same error. Please let me know what else to check.

yug


Michael McCandless-2 wrote:
> 
> 
> It looks like under JBoss you are accidentally using Lucene 1.4, not  
> 2.3.2.
> 
> Mike
> 

Re: Store/Index Email Address in Lucene

2008-07-03 Thread Jamie

Hi miztaken

Check out:

http://openmailarchiva.svn.sourceforge.net/viewvc/openmailarchiva/Server/trunk/src/com/stimulus/archiva/search/EmailFilter.java?view=markup

I think it's what you want.

  


Regards,

Jamie

--
Stimulus Software - MailArchiva
Email Archiving And Compliance






too many clauses exception

2008-07-03 Thread Gaurav Sharma


Hi,

I am stuck with one more exception.
When I use a wildcard query such as a*, I get a "too many clauses"
exception. It says the maximum clause count is set to 1024. Is there any way
to increase this count?
Could you please help me overcome this?

Thanks in advance.
-Gaurav



-- 
View this message in context: 
http://www.nabble.com/indexing-unsupported-mime-types-using-Lucene-tp17983491p18273569.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





too many clauses exception

2008-07-03 Thread Gaurav Sharma

Hi,

I am stuck with an exception in Lucene (too many clauses).
When I use a wildcard query such as a*, I get a "too many clauses"
exception. It says the maximum clause count is set to 1024. Is there any way
to increase this count?
Could you please help me overcome this?

Thanks in advance.
-Gaurav

-- 
View this message in context: 
http://www.nabble.com/too-many-clauses-exception-tp18273582p18273582.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: too many clauses exception

2008-07-03 Thread Chris Lu
This is easy, use:
BooleanQuery.setMaxClauseCount(4096);

-- 
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request) got
2.6 Million Euro funding!



Multifield Search with OR and AND on different doc Fields

2008-07-03 Thread RanjithStar

My requirement is to search on seven fields, say F1, F2, F3, F4, F5, F6 and F7,
where F1, F2, F3 and F4 are in one document index
and F5, F6 and F7 are in a different document index.


I need to perform a search with ((F1=9 AND F2=4) AND (F3=keyword OR
F4=keyword)) OR (F5=9 AND F6=4 AND F7=keyword)

For a normal search I was doing this:
String[] sFields = { ID1, ID2, TITLE, CONTENT };
String[] sQuery = { id1, id2, sKeyword, sKeyword };
Occur[] flag = { BooleanClause.Occur.MUST, BooleanClause.Occur.MUST,
BooleanClause.Occur.MUST, BooleanClause.Occur.MUST }; 

Query oQuery = oMultiParser.parse(sQuery, sFields, flag, oAnalyzer) ;
Hits hits = indexSearcher.search(oQuery);


How can I modify the above query so that it searches across the
different document indexes?
-- 
View this message in context: 
http://www.nabble.com/Multifield-Search-with-OR-and-AND-on-different-doc-Fields-tp18273644p18273644.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: Store/Index Email Address in Lucene

2008-07-03 Thread miztaken

Hi there,
Thanks for the comment.
So basically it would be lame to add a new field for each email address,
wouldn't it?

How about extracting the unique tokens from the string of email addresses with
the EmailFilter.java class and storing them in a single field?





-- 
View this message in context: 
http://www.nabble.com/Store-Index-Email-Address-in-Lucene-tp18257247p18273786.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: Memory Usage

2008-07-03 Thread Keith Watson


Thanks very much for this; I'll give it a shot.

Keith.

