Re: Security filtering from external DB

2008-02-27 Thread Gabriel Landais

h t wrote:

I guess you can implement createBitSet() more efficiently by using
Filer, but not BooleanQuery

Hi,
thanks for the advice, but did you mean Filter or Filer? And even if I
should use a Filter, I don't really understand how to replace the
BooleanQuery :(
The BooleanQuery is already quite efficient, so if a Filter can do better,
that would be great!

Regards,
Gabriel
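For the archives: a Filter in this situation might look roughly like the sketch below (Lucene 2.x API). The "acl" field name and the idea of storing allowed group IDs per document are illustrative assumptions, not something from the original thread.

```java
import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.Filter;

// Hypothetical sketch: allow only documents whose "acl" field contains
// one of the group IDs loaded from the external security DB.
public class SecurityFilter extends Filter {
    private final String[] allowedGroups; // fetched from the external DB

    public SecurityFilter(String[] allowedGroups) {
        this.allowedGroups = allowedGroups;
    }

    public BitSet bits(IndexReader reader) throws IOException {
        BitSet bits = new BitSet(reader.maxDoc());
        for (int i = 0; i < allowedGroups.length; i++) {
            TermDocs td = reader.termDocs(new Term("acl", allowedGroups[i]));
            try {
                while (td.next()) {
                    bits.set(td.doc()); // mark each matching doc as visible
                }
            } finally {
                td.close();
            }
        }
        return bits;
    }
}
```

The filter can then be passed to IndexSearcher.search(query, filter), and wrapped in a CachingWrapperFilter it can be reused across searches — which is what can make it cheaper than rebuilding an equivalent BooleanQuery every time.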

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Inconsistent Search Speed

2008-02-27 Thread Grant Ingersoll
You could also look at FieldSelector when getting the Document, so
that you only load the one field you need.


-Grant
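A minimal sketch of what Grant suggests, assuming Lucene 2.3 and a stored field named "id" (the field name is illustrative):

```java
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.FieldSelector;
import org.apache.lucene.document.MapFieldSelector;
import org.apache.lucene.index.IndexReader;

public class LoadOneField {
    // Load only the "id" field instead of all stored fields of the document
    public static String loadId(IndexReader reader, int docId) throws IOException {
        FieldSelector onlyId = new MapFieldSelector(new String[] { "id" });
        Document doc = reader.document(docId, onlyId);
        return doc.get("id");
    }
}
```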

On Feb 26, 2008, at 10:13 PM, Mark Miller wrote:

The Lucene prime directive: don't iterate through all of Hits! It's
horribly inefficient. You must use a HitCollector. Even then,
getting your field values will be slow no matter what if you fetch
them for every hit. You don't want to do this for every hit in a
search. But don't loop through Hits.


fangz wrote:

Thank you for the info. It makes sense.
My search will return more than 1 documents and I have to loop through
all documents to build a list with unique field values. It seems that the
looping of the hits takes the longest time in the initial run but
afterwards it becomes much faster. If the hits are relatively small, I do
not see the same behavior.






--
Grant Ingersoll
http://www.lucenebootcamp.com
Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ









lucene java OOM while sorting more than one field

2008-02-27 Thread GURUPRASAD MS
The Lucene index contains 2.1 million records (indexed from 2.1 million records
in a SQL Server DB).
Lucene index file size: 256MB
Lucene version: 2.3
Searching works fine when we sort the results on a single field. However, if
the search results are sorted on more than one field we get an Out of Memory
exception.
We restrict the number of search results to 250.

The Out of Memory exception is quite consistent in 2.3. We recently moved from
2.0 to 2.3.
Version 2.0 also used to give the OOM, but not this frequently.


*Code Snippet:*

StandardAnalyzer analyzer = new StandardAnalyzer();
String sSearchQuery = "MUSIC";
public static final String G2KEYFIELDS = "G2KEYFIELDS";
Sort oSort = new Sort(new SortField[] {
    new SortField(GroupsConstants.LAST_MODIFIED, SortField.INT, true),
    new SortField(GroupsConstants.GROUPNAME) });

QueryParser parser = new QueryParser(G2KEYFIELDS, analyzer);
final Vector ids = new Vector();
FSDirectory dir = null;
IndexSearcher searcher = null;
try {
    dir = FSDirectory.getDirectory(index);
    searcher = new IndexSearcher(dir);
    Query query = parser.parse(sSearchQuery);
    Hits hits = searcher.search(query, oSort);
    for (int i = 0; i < hits.length() && i < 250; i++) {
        final Document doc = hits.doc(i);
        Integer oiGroupId = new Integer(doc.getField(GroupsConstants.IDENTITY).stringValue());
        if (!ids.contains(oiGroupId)) { ids.addElement(oiGroupId); }
    }
} finally {
    if (searcher != null) searcher.close();
    if (dir != null) dir.close();
}


Thanks in Advance


Atomicity and AutoCommit

2008-02-27 Thread Simon Wistow
I currently have a set up that indexes into RAM and then periodically 
merges that into a disk based index. 

Searches are done from the disk based index and deletes are handled by 
keeping a list of deleted documents, filtering out search results and 
applying the deletes to the index at merge time.

All this was done to make sure that we didn't corrupt the index (which 
we'd seen happen a few times when the indexing machine failed for 
whatever reason). With this scheme if the machine fails then all that's 
lost is the RAM index and the list of deletes. We then just simply play 
back all actions since the last merge and we're back to where we 
started.

However it occurred to me that this might all be redundant now with
Lucene 2.3 (it's possible it might always have been redundant, come to
think of it). Should I just open a disk-based index with
autoCommit=false and then periodically commit the changes by close()ing
and then re-open()ing the disk index? Is that atomic? I.e. is there a
situation using this whereby the index could become corrupted?

Thanks,

Simon






Re: Lucene Search Performance

2008-02-27 Thread Michael Prichard
I'm wondering if your date field's precision may be a little too
much? What I mean is that you are going all the way down to
seconds. Whenever you do a range query you are essentially spawning
a BooleanQuery with a representation of that range. Do you really
need to be that precise? I usually stick with YYYYMMDD for search
date fields and it works pretty well. So you know, I have a 13 GB
index with 3 million records and my search time is very low.
Definitely under 1 second.


Just a thought.

On Feb 27, 2008, at 6:14 AM, Jamie wrote:


Hi Michael & Others

Ok. I've gathered some more statistics from a different machine for
your analysis.
(I had to switch machines because the original one was in production
and my tests were interfering).


Here are the statistics from the new machine:

Total Documents: 1.2 million
Results Returned:  900k
Store Size 238G (size of original documents)
Index Size 1.2G (lucene index size)
Index / Store Ratio 0.5%

The search query is as follows:

archivedate:[d2007122901 TO d20080228235900]

As you can see, I am using a range query to search between specific
dates.
Question: should this query be moved to a filter instead? I did not
do this as I needed to have the option to sort on date.

There are no other specific filters applied, and in this example
sorting is turned off.

On this particular machine the search time varies between 2.64
seconds and about 5 seconds.

The limitation of this machine is that it uses a normal IDE
drive to house the index, not a SATA drive.


IOStat Statistics

Linux 2.6.20-15-server 27/02/2008

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          20.25    0.00    3.23    0.34    0.00   76.19

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               7.12        50.67       186.41   38936841  143240688

See attached for hardware info and the CPU call tree (taken from
YourKit).


I would appreciate your recommendations.

Jamie

h t wrote:
Hi Michael,
I guess the hotspot of lucene is
org.apache.lucene.search.IndexSearcher.search()

Hi Jamie,
What's the original text size of a million emails?
I estimate the size of an email is around 100k; is this true?
When you do a search, what kind of keywords do you input, words or a
short sentence?
How many results are returned?
Did you use a filter to shrink the result size?

2008/2/27, Michael Stoppelman <[EMAIL PROTECTED]>:
So you're saying searches are taking 10 seconds on a 5G index? If so,
that seems ungodly slow.
If you're on *nix, have you watched your iostat statistics? Maybe
something is hammering your hds.
Something seems amiss.

What lucene methods were pointed to as hotspots by YourKit?







Re: Inconsistent Search Speed

2008-02-27 Thread Erick Erickson
To reinforce Grant's comment, lazy loading improved one situation for me
on the order of 10X. I wrote it up and it's somewhere in the Wiki. Your
results will vary, and unless you have a LOT of stored fields I wouldn't
necessarily expect a similar speedup, but it's sure worth looking at.

And don't iterate through the Hits object for more than 100 or so hits. Like
Mark said. Really. Really don't ...

Best
Erick
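For reference, the HitCollector pattern Mark and Erick are advocating looks roughly like the sketch below (Lucene 2.3 API; the helper class is made up for illustration):

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class CollectDocIds {
    // Collect matching doc ids without materializing a Hits object;
    // fetch stored fields later, and only for the docs you actually need.
    public static Set collect(IndexSearcher searcher, Query query) throws IOException {
        final Set docIds = new HashSet();
        searcher.search(query, new HitCollector() {
            public void collect(int doc, float score) {
                docIds.add(new Integer(doc));
            }
        });
        return docIds;
    }
}
```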



Re: lucene java OOM while sorting more than one field

2008-02-27 Thread Erick Erickson
The first question is always "how much memory are you giving
your JVM?".

A 256MB index is pretty small; I wouldn't be surprised if your JVM is
using some very small default.

Best
Erick



Re: Atomicity and AutoCommit

2008-02-27 Thread Michael McCandless


When you previously saw corruption was it due to an OS or machine
crash (or power cord got pulled)?  If so, you were likely hitting
LUCENE-1044, which is fixed on the trunk version of Lucene (to be 2.4
at some point) but is not fixed in 2.3.

If that is what you were hitting, then unfortunately neither buffering
updates into RAM nor using autoCommit=false in 2.3 will fully protect
you from this issue.  Though, both of these approaches should reduce
your chance of hitting LUCENE-1044 since they both reduce frequency of
commits to the index.

Mike




explain() - fieldnorm

2008-02-27 Thread JensBurkhardt

Hey everybody,

As my subject says, I have a little problem analyzing the
explain() output.
I know that the fieldNorm value is the product of the document boost,
the field boost and the lengthNorm.
Is it possible to retrieve the individual values? I know that they are
multiplied at indexing time, but can they be stored so that I can read
them when I analyze my search?
The problem is that I have two documents I want to compare, but the only
difference is the fieldNorm value, and I don't know which factor exactly
makes the difference.

Best regards
Jens Burkhardt
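For the archives: the individual factors are not stored; only their product, quantized to a single byte, is kept in the index. If no document or field boosts are set, the fieldNorm is just DefaultSimilarity's lengthNorm, so a difference usually comes from the two fields having different term counts. A sketch of the arithmetic (the term counts and the helper class are made-up examples, not from the thread):

```java
public class FieldNormSketch {
    // DefaultSimilarity.lengthNorm(field, numTerms) is 1 / sqrt(numTerms)
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        float docBoost = 1.0f;   // Document.setBoost default
        float fieldBoost = 1.0f; // Field.setBoost default
        // fieldNorm = docBoost * fieldBoost * lengthNorm(numTerms),
        // then encoded into one byte (so nearby values collapse together)
        System.out.println(docBoost * fieldBoost * lengthNorm(4));  // 0.5 for a 4-term field
        System.out.println(docBoost * fieldBoost * lengthNorm(16)); // 0.25 for a 16-term field
    }
}
```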
-- 
View this message in context: 
http://www.nabble.com/explain%28%29---fieldnorm-tp15717182p15717182.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: Inconsistent Search Speed

2008-02-27 Thread fangz

I implemented a HitCollector as you suggested. It improved the initial run
significantly. However, it showed only slight improvement in the subsequent
runs. I don't know how to implement FieldSelector in my situation. My code
looks like this:

public void collect(int doc, float score) {

    TermFreqVector vector = searcher.getIndexReader().getTermFreqVector(doc, "field");
    ...

Thank you again!

fangz
-- 
View this message in context: 
http://www.nabble.com/Inconsistent-Search-Speed-tp15698325p15719770.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: Atomicity and AutoCommit

2008-02-27 Thread Simon Wistow
On Wed, Feb 27, 2008 at 09:38:55AM -0500, Michael McCandless said:
> 
> When you previously saw corruption was it due to an OS or machine
> crash (or power cord got pulled)?  If so, you were likely hitting
> LUCENE-1044, which is fixed on the trunk version of Lucene (to be 2.4
> at some point) but is not fixed in 2.3.

Yes - it's power outages and other unnatural events (sysadmins 
accidentally kill -9ing the process) that caused it.

What are the chances of me backporting the fix to 2.3, or should I just
wait for 2.4?

Come 2.4, is my buffering to RAM redundant?

Thanks,

Simon






Re: Atomicity and AutoCommit

2008-02-27 Thread Mark Miller
You need to make sure your storage does not lie in response to an fsync
command. If it does (most commercial stuff does), you cannot guarantee no
corruption. Search Google for "your hard drive lies to you" or something
similar.

It shouldn't be that hard to take the patch from the issue and apply it
to a checked-out copy of 2.3, right? I don't think it relies on other
2.4 stuff, as there isn't much of it yet.





Re: Atomicity and AutoCommit

2008-02-27 Thread Michael McCandless


Simon Wistow wrote:


On Wed, Feb 27, 2008 at 09:38:55AM -0500, Michael McCandless said:


When you previously saw corruption was it due to an OS or machine
crash (or power cord got pulled)?  If so, you were likely hitting
LUCENE-1044, which is fixed on the trunk version of Lucene (to be 2.4
at some point) but is not fixed in 2.3.


Yes - it's power outages and other unnatural events (sysadmins
accidentally kill -9ing the process) that caused it.


OK, a power outage can definitely cause corruption. This is a
long-standing but only recently uncovered issue (LUCENE-1044), now
fixed in 2.4. But I believe kill -9 should not cause corruption.

BTW hot backups, as of 2.3, are now very easy.  Just use
SnapshotDeletionPolicy when creating your writer.  Making frequent
backups is a good safeguard too...


What are the chances of me backporting the fix to 2.3, or should I just
wait for 2.4?


It unfortunately was a fairly large change; I'm not sure how cleanly
the patch will apply to 2.3.  Maybe try trunk (but beware: the index
format changed with LUCENE-1044 to add an integrity checksum to
the end of the segments_N file)...


Come 2.4, is my buffering to RAM redundant?


Well, as Mark said, if your IO system does not lie on fsync, then
buffering to RAM is redundant. If it does lie, you still have an open
risk of corruption, and so buffering to RAM probably reduces (but
doesn't eliminate) the risk.

Also, as of 2.3, manually buffering to a RAMDirectory should no longer
give a big performance win over just giving that RAM to the
IndexWriter as its buffer instead.

Mike
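For the archives, the 2.3-style setup Mike describes might look like the sketch below (the directory path and buffer size are arbitrary placeholders):

```java
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class BufferedWriterSetup {
    public static IndexWriter open() throws IOException {
        FSDirectory dir = FSDirectory.getDirectory("/path/to/index");
        // autoCommit=false: readers see nothing until close() commits,
        // so a half-finished batch is never visible
        IndexWriter writer = new IndexWriter(dir, false, new StandardAnalyzer());
        // Give the writer the RAM previously spent on a manual RAMDirectory
        writer.setRAMBufferSizeMB(48.0);
        return writer;
    }
}
```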




Document.setBoost() doesn't work

2008-02-27 Thread Soeren Pekrul

I work with Lucene 2.0. I boost some documents:

Document doc = new Document();
// adding fields
doc.setBoost(2.0f);
indexwriter.addDocument(doc);

If I look at my index with Luke (0.6), the boost value of all documents
is still 1.0.

How can I boost documents?

Thanks. Sören




filter issue

2008-02-27 Thread Rong Shen
Hi List,

I have a situation similar to indexing a mailing list, with each mail
indexed as a Doc. Mails from a same thread share a same thread ID, which is
indexed in a separate field. Now I want to search through all the mails
using some keywords, and list all the unique thread IDs which I can pass to
the database calls.

I tried DuplicateFilter, which didn't work well - it misses some results. I
went through the code and found that all the filters are basically
pre-filters; in other words, they generate the bitsets based on the index
and filter the duplicates out (in the case of DuplicateFilter) before being
applied to the result collector. This causes a problem when some mails
contain the search keywords but were filtered out because they were already
set to false in the bitset.

Any solutions for this? Does any sort of post-filtering mechanism exist,
i.e. one that filters records in the search result (which could be slow)
rather than in the whole collection? Thanks.


Re: Inconsistent Search Speed

2008-02-27 Thread Grant Ingersoll
Ah, you didn't mention term vectors. What do you need them for?
Perhaps a bit more background could help here.


-Grant







How do i get a text summary

2008-02-27 Thread Ravinder.Teepiredddy
Hi All,

 

Is there a way to get a text summary of an indexed document to display
along with the search result?

Please let me know the technical changes.

 

Thanks,

Ravinder

 



DISCLAIMER:
This message contains privileged and confidential information and is intended 
only for an individual named. If you are not the intended recipient, you should 
not disseminate, distribute, store, print, copy or deliver this message. Please 
notify the sender immediately by e-mail if you have received this e-mail by 
mistake and delete this e-mail from your system. E-mail transmission cannot be 
guaranteed to be secure or error-free as information could be intercepted, 
corrupted, lost, destroyed, arrive late or incomplete or contain viruses. The 
sender, therefore,  does not accept liability for any errors or omissions in 
the contents of this message which arise as a result of e-mail transmission. If 
verification is required, please request a hard-copy version.


Re: How do i get a text summary

2008-02-27 Thread Seth Call
Am I missing something?  Isn't this exactly what Lucene does?

Put in a value when you create your Document, get it back out when it comes
back from a search, right?

Want a text summary? Put it in to the document...

I just started playing with Lucene so maybe I'm missing something, but this
question seems quite fundamental to what Lucene is all about.




-- 
The poor have to labour in the face of the majestic equality of the law,
which forbids the rich as well as the poor to sleep under bridges, to beg in
the streets, and to steal bread.


RE: Document.setBoost() doesn't work

2008-02-27 Thread John Griffin
Soren,

Your documents are being boosted. Because document boost values
immediately go through some calculations and are then stored in the index,
Luke will always show 1.0 as the boost value. There has been some talk in
the recent past that this display should be removed from Luke, since it is
actually misleading.

If you query your index and use the explain method on the results, you
will see that they are being boosted.

Hope this helps.

John G.
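A quick way to run the check John describes; a sketch only (the query and doc id come from wherever your search runs):

```java
import java.io.IOException;
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class ShowBoost {
    // Print the full scoring breakdown for one hit; the fieldNorm line
    // reflects the document boost folded in at index time.
    public static void print(IndexSearcher searcher, Query query, int docId)
            throws IOException {
        Explanation ex = searcher.explain(query, docId);
        System.out.println(ex.toString());
    }
}
```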




RE: How do i get a text summary

2008-02-27 Thread John Griffin
Ravinder,

If you want something from an index it has to be IN the index. So, store a
summary field in each document and make sure that field is part of the
query.

John G.
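A sketch of indexing a stored summary alongside the searched text (field names are examples, not from the thread):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class SummaryDoc {
    public static Document build(String body, String summary) {
        Document doc = new Document();
        // Searched but not stored: keeps the index small
        doc.add(new Field("contents", body, Field.Store.NO, Field.Index.TOKENIZED));
        // Stored so it can be displayed, tokenized so it can also be searched
        doc.add(new Field("summary", summary, Field.Store.YES, Field.Index.TOKENIZED));
        return doc;
    }
}
```

At search time, doc.get("summary") on each result returns the stored text verbatim for display.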




RE: How do i get a text summary

2008-02-27 Thread Ravinder.Teepiredddy
Hi John,

I am getting a null summary value in my results.jsp page, and I need a
"snippet" or "fragment" to be highlighted.
I have gone through the related Lucene FAQs but it's not clear. I would
appreciate it if you could help me find the list of (Java) files to be modified.

Thanks in advance.
Ravinder
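For snippet highlighting specifically, the contrib Highlighter is the usual route. A sketch, assuming Lucene 2.3 with lucene-highlighter on the classpath (the field name and helper class are examples):

```java
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;

public class Snippets {
    // Returns a fragment of 'text' with query terms wrapped in <B>..</B>
    // (the default formatter); returns null if nothing matched.
    public static String bestFragment(Query query, Analyzer analyzer, String text)
            throws IOException {
        Highlighter highlighter = new Highlighter(new QueryScorer(query));
        return highlighter.getBestFragment(analyzer, "contents", text);
    }
}
```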




Re: Lucene Search Performance

2008-02-27 Thread h t
1. Redefine the archivedate field in YYYYMMDD format.
2. Add another field using a timestamp for sorting.
3. Use a RangeFilter to get the results and then sort by the timestamp field.
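The day-precision conversion in step 1 can be sketched with plain JDK classes (the formats and helper class are examples):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class DayPrecision {
    // Collapse a second-precision archive date down to day precision,
    // shrinking the term count a date range query has to enumerate.
    public static String toDay(String secondPrecision) {
        try {
            SimpleDateFormat full = new SimpleDateFormat("yyyyMMddHHmmss");
            SimpleDateFormat day = new SimpleDateFormat("yyyyMMdd");
            Date parsed = full.parse(secondPrecision);
            return day.format(parsed);
        } catch (ParseException e) {
            throw new IllegalArgumentException("bad date: " + secondPrecision);
        }
    }

    public static void main(String[] args) {
        System.out.println(toDay("20080228235900")); // prints 20080228
    }
}
```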

2008/2/27, Jamie <[EMAIL PROTECTED]>:
>
> The search query is as follows:
>
> archivedate:[d2007122901 TO d20080228235900]
> ~~~ why is there an extra 'd'?


Re: Lucene Search Performance

2008-02-27 Thread Jamie

Hi

Thanks for the suggestions. This would require us to change the index, 
and right now we literally have millions of documents stored in the 
current index format. I'll bear it in mind, but I am not entirely sure 
how I would go about implementing the change at this point.


Much appreciated

Jamie


h t wrote:

1. Redefine the archivedate field in yyyyMMdd format.
2. Add another field holding a timestamp, for sorting.
3. Use a RangeFilter to narrow the results, then sort by the timestamp field.

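The field scheme in steps 1 and 2 above can be sketched in plain Java (no Lucene dependency; the field names "archivedate" and "timestamp" are assumptions, not from the thread). The range field stores a fixed-width yyyyMMdd term, so lexicographic term order matches chronological order, which is exactly what RangeFilter relies on; a separate numeric value backs the sort.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class ArchiveDateFields {
    // Term for a hypothetical "archivedate" field: fixed-width yyyyMMdd,
    // so lexicographic order == chronological order (needed by RangeFilter).
    static String rangeTerm(Date d) {
        SimpleDateFormat f = new SimpleDateFormat("yyyyMMdd");
        f.setTimeZone(TimeZone.getTimeZone("UTC"));
        return f.format(d);
    }

    // Value for a hypothetical "timestamp" field, used only for sorting
    // the filtered results (step 3).
    static long sortValue(Date d) {
        return d.getTime();
    }

    public static void main(String[] args) {
        Date d = new Date(1204070400000L); // 2008-02-27T00:00:00Z
        System.out.println(rangeTerm(d));  // prints "20080227"
        System.out.println(sortValue(d));  // prints 1204070400000
    }
}
```

At search time this would pair, roughly, with Lucene 2.x's IndexSearcher.search(query, new RangeFilter("archivedate", from, to, true, true), sort), where sort is built on the timestamp field.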
2008/2/27, Jamie <[EMAIL PROTECTED]>:
  

Hi Michael & Others

Ok. I've gathered some more statistics from a different machine for your
analysis.
(I had to switch machines because the original one was in production and
my tests were interfering).

Here are the statistics from the new machine:

Total Documents: 1.2 million
Results Returned:  900k
Store Size 238G (size of original documents)
Index Size 1.2G (lucene index size)
Index / Store Ratio 0.5%

The search query is as follows:

archivedate:[d2007122901 TO d20080228235900]
~~~ why is there an extra 'd'?
As you can see, I am using a range query to search between specific dates.
Question: should this query rather be moved to a filter? I did not do this,
as I needed to retain the option to sort on date.

There are no other specific filters applied and in this example sorting
is turned off.

On this particular machine the search time varies between 2.64 seconds
and about 5 seconds.

One limitation of this machine is that it uses a normal IDE drive
to house the index, rather than a SATA drive.

IOStat Statistics

Linux 2.6.20-15-server 27/02/2008

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          20.25    0.00    3.23    0.34    0.00   76.19

Device:tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda   7.1250.67   186.41   38936841  143240688

See attached for hardware info and the CPU call tree (taken from YourKit).

I would appreciate your recommendations.


Jamie


h t wrote:
Hi Michael,
I guess the hotspot of lucene is
org.apache.lucene.search.IndexSearcher.search()

Hi Jamie,
What's the original text size of a million emails?
I estimate the size of an email at around 100k; is this true?
When you do a search, what kind of keywords do you input: single words
or short sentences?
How many results are returned?
Did you use a filter to shrink the result set?

2008/2/27, Michael Stoppelman <[EMAIL PROTECTED]>:
So you're saying searches are taking 10 seconds on a 5G index? If so,
that seems ungodly slow.
If you're on *nix, have you watched your iostat statistics? Maybe
something is hammering your HDs.
Something seems amiss.

What lucene methods were pointed to as hotspots by YourKit?




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





  



--
Stimulus Software - MailArchiva
Email Archiving And Compliance
USA Tel: +1-713-366-8072 ext 3
UK Tel: +44-20-80991035 ext 3
Email: [EMAIL PROTECTED]
Web: http://www.mailarchiva.com

To receive MailArchiva Enterprise Edition product announcements, send a message to: <[EMAIL PROTECTED]> 



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Query regarding usage of Lucene - Filtering folders

2008-02-27 Thread Mohammad.Ahmed
Hi

I would like to join the java-user mailing list.

I have a query regarding the usage of Lucene.

I have indexed files kept in a root folder -> subfolder -> subfolder
structure.

When I search for a particular word, it returns the list of matching
files across the folder structure, from the root down to the last
subfolder.

I want to restrict the search to a specific folder only.

Is this possible with Lucene?

If yes, please suggest the steps to follow.

Thanks,

Mohammad
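One common approach (a sketch under assumptions, not an answer given in this thread) is to index each file's parent folder as an untokenized field and add it to the query as a required clause. Restricting results to a folder then reduces to a path-prefix check, as in this hypothetical helper:

```java
public class FolderFilter {
    // Hypothetical helper: true if 'path' lies under 'folder'.
    // Normalizing the prefix with a trailing '/' avoids matching
    // sibling folders that merely share a name prefix (sub1 vs sub10).
    static boolean underFolder(String path, String folder) {
        String prefix = folder.endsWith("/") ? folder : folder + "/";
        return path.startsWith(prefix);
    }

    public static void main(String[] args) {
        System.out.println(underFolder("/root/sub1/doc.txt", "/root/sub1"));  // true
        System.out.println(underFolder("/root/sub10/doc.txt", "/root/sub1")); // false
    }
}
```

In Lucene terms this would translate to something like adding new TermQuery(new Term("folder", "/root/sub1")) as a required clause of a BooleanQuery, assuming a "folder" field was indexed untokenized at indexing time.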

 

 



DISCLAIMER:
This message contains privileged and confidential information and is intended 
only for an individual named. If you are not the intended recipient, you should 
not disseminate, distribute, store, print, copy or deliver this message. Please 
notify the sender immediately by e-mail if you have received this e-mail by 
mistake and delete this e-mail from your system. E-mail transmission cannot be 
guaranteed to be secure or error-free as information could be intercepted, 
corrupted, lost, destroyed, arrive late or incomplete or contain viruses. The 
sender, therefore,  does not accept liability for any errors or omissions in 
the contents of this message which arise as a result of e-mail transmission. If 
verification is required, please request a hard-copy version.


Query regarding usage of Lucene (Filtering folder)

2008-02-27 Thread Ravinder.Teepiredddy
Hi All,

I have a query regarding the usage of Lucene.

I have indexed files kept in a root folder -> subfolder -> subfolder
structure.

When I search for a particular word, it returns the list of matching
files across the folder structure, from the root down to the last
subfolder.

I want to restrict the search to a specific folder only.

Is this possible with Lucene?

If yes, please suggest the steps to follow.

Thanks,

Mohammad


