Re: Text storing design and performance question

Jason Pump Thu, 11 Jan 2007 13:02:34 -0800

Say my query is "apple banana orange". The word "apple" is near the start of
the document, "banana" and "orange" at the end. Wouldn't your optimization
stop at the word "apple" and just return this word highlighted?


Yes

Or do you know of a way to quantify the match?

I guess you could count how many <em> tags there are; I'm not that familiar with the highlighter. If there aren't enough highlighted words, you could always fall back to scanning the whole document. A good search result is one which has most of the terms together, near the start of the document. Most queries should have results that meet those criteria.
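
As a rough illustration of the counting idea above, here is a minimal Java sketch. It does not use Lucene's Highlighter API: the `highlighter` argument is a hypothetical stand-in function that returns text with matches wrapped in `<em>` tags, and the class name, method names, and `minHits` threshold are invented for this example. The actual tag depends on the formatter you configure, so treat `<em>` as an assumption.

```java
import java.util.function.Function;

/** Sketch: judge a highlighted fragment by counting its opening highlight tags. */
final class HighlightQuality {
    /** Assumed opening tag; adjust to whatever your formatter emits. */
    static final String OPEN_TAG = "<em>";

    /** Counts non-overlapping occurrences of openTag in fragment. */
    static int countTags(String fragment, String openTag) {
        if (openTag.isEmpty()) {
            throw new IllegalArgumentException("openTag must not be empty");
        }
        int count = 0;
        int from = 0;
        while ((from = fragment.indexOf(openTag, from)) != -1) {
            count++;
            from += openTag.length();
        }
        return count;
    }

    /**
     * Highlights only the first prefixLength characters. If that yields fewer
     * than minHits tags (and the prefix was not already the whole text),
     * falls back to highlighting the whole text.
     */
    static String highlightWithFallback(String text, int prefixLength, int minHits,
                                        Function<String, String> highlighter) {
        int cut = Math.max(0, Math.min(prefixLength, text.length()));
        String fragment = highlighter.apply(text.substring(0, cut));
        if (cut == text.length() || countTags(fragment, OPEN_TAG) >= minHits) {
            return fragment;
        }
        return highlighter.apply(text);
    }
}
```

For a three-term query such as "apple banana orange", a `minHits` of 2 or 3 would reject a prefix in which only "apple" was found, at the cost of a second, full pass over the document.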





Renaud Waldura wrote:

Jason:

Interesting idea, thanks. But how do you know whether the highlighting is
any good? I thought highlighter implemented some kind of strategy to find
the best fragment.
Say my query is "apple banana orange". The word "apple" is near the start of
the document, "banana" and "orange" at the end. Wouldn't your optimization
stop at the word "apple" and just return this word highlighted? Or do you
know of a way to quantify the match?
-----Original Message-----
From: Jason Pump [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 10, 2007 1:49 PM
To: java-user@lucene.apache.org
Subject: Re: Text storing design and performance question

Renaud, one optimization you can do on this is to try the first 10kb and see
if it finds text worth highlighting. If not, with a slight overlap, try the
next window (9.9kb - 19.9kb), or just 9.9kb -> end if you're feeling lazy.
This assumes that most good matches are at the start of the document, and
that the files on disk are not compressed.
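
The windowing scheme described above could be sketched roughly as follows. This is an illustration under stated assumptions, not Lucene code: `WindowedHighlighter`, `firstHighlightedWindow`, and the `highlighter` function (which returns an empty `Optional` when a window has nothing worth highlighting) are hypothetical names, and sizes are counted in characters rather than bytes for simplicity.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Optional;
import java.util.function.Function;

/** Sketch: scan a document in overlapping windows, stopping at the first hit. */
final class WindowedHighlighter {
    /**
     * Reads windowSize characters at a time. Each new window starts with the
     * last `overlap` characters of the previous one, so a match straddling a
     * window boundary can still be seen. Returns the first non-empty result.
     */
    static Optional<String> firstHighlightedWindow(
            Reader in, int windowSize, int overlap,
            Function<String, Optional<String>> highlighter) {
        if (windowSize <= 0 || overlap < 0 || overlap >= windowSize) {
            throw new IllegalArgumentException("need 0 <= overlap < windowSize");
        }
        char[] buf = new char[windowSize];
        int carried = 0; // characters kept from the previous window
        try {
            while (true) {
                int filled = carried;
                // read() may return fewer characters than requested, so loop.
                while (filled < windowSize) {
                    int n = in.read(buf, filled, windowSize - filled);
                    if (n == -1) break;
                    filled += n;
                }
                if (filled == carried) return Optional.empty(); // no new text
                Optional<String> hit = highlighter.apply(new String(buf, 0, filled));
                if (hit.isPresent()) return hit;
                if (filled < windowSize) return Optional.empty(); // end of input
                // Carry the tail of this window into the next one.
                System.arraycopy(buf, windowSize - overlap, buf, 0, overlap);
                carried = overlap;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a window of about 10,000 characters and an overlap of about 100, this matches the 0 - 10kb, 9.9kb - 19.9kb progression. The sketch stops at the first window that produces any highlight, so it shares the assumption that good matches cluster near the start of the document.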

moraleslos wrote:
Maybe keeping the data in the DB would make it quicker? Seems like the I/O
performance would cause most of the performance issues you're seeing.
-los


Renaud Waldura-5 wrote:
We used to store a big text field for highlighting purposes too, and it
proved a big pain. The index was gigantic, it took forever to build, and the
search performance would sometimes suffer from it (just a hunch).
Now we keep this big text field on disk (in a file), and feed it to the
highlighter. Unfortunately the highlighter has to read the file, parse it,
etc... It's slooow, sometimes over a second on a large document.
I vaguely remember reading somewhere newer versions of the highlighter are
able to leverage term vectors to avoid re-parsing the field. (I could be
mistaken.) Maybe just storing term vectors would keep the index lean and
allow for fast highlighting?
--Renaud
-----Original Message-----
From: Mark Miller [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 10, 2007 9:54 AM
To: java-user@lucene.apache.org
Subject: Re: Text storing design and performance question
Being stateless should not be much of an issue. As Erick mentioned, the
highlighter just expects you to pass it the query again and the text to be
highlighted. So when you show the pagination you just need to keep around
what query generated the current page...then shove each piece of relevant
text from the database through the highlighter (with the query) before
displaying it.
- Mark
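
A minimal Java sketch of that stateless flow, assuming each request carries the query string and a page number. Everything named here is hypothetical: `StatelessPage`, `render`, the `loadText` function (standing in for a database lookup by id), and the `highlight` function (standing in for a call to the highlighter with the query and the text). Only the hits on the requested page are loaded and highlighted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Function;

/** Sketch: render one page of highlighted hits without server-side state. */
final class StatelessPage {
    /**
     * @param query     the query that produced hitIds, resent with each request
     * @param hitIds    ids of all hits, in ranked order
     * @param page      zero-based page number
     * @param pageSize  number of hits per page
     * @param loadText  hypothetical lookup of the full text for an id
     * @param highlight hypothetical highlighter taking (query, text)
     */
    static List<String> render(String query, List<Integer> hitIds,
                               int page, int pageSize,
                               Function<Integer, String> loadText,
                               BiFunction<String, String, String> highlight) {
        if (page < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("page must be >= 0, pageSize > 0");
        }
        // Clamp to the hit list so an out-of-range page is simply empty.
        int start = (int) Math.min((long) page * pageSize, hitIds.size());
        int end = (int) Math.min((long) start + pageSize, hitIds.size());
        List<String> out = new ArrayList<>();
        for (Integer id : hitIds.subList(start, end)) {
            // Load and highlight only the hits that will actually be shown.
            out.add(highlight.apply(query, loadText.apply(id)));
        }
        return out;
    }
}
```

With 50,000 hits and a page size of 10, a request for one page triggers 10 lookups and 10 highlighter calls, not 50,000.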

On 1/10/07, moraleslos <[EMAIL PROTECTED]> wrote:
Hi Mark,

Looks like I've got to implement some sort of pagination for my clients.
Problem is everything is stateless so looks like there's some work I need to
do on my end. Thanks.
-los



markrmiller wrote:
Usually a user cannot easily browse 50,000 on a single display, and so you
would only highlight the docs as they became visible to the user.
This is generally a small amount...often one at a time.

- Mark

moraleslos wrote:
Hi Erik,
Would that slow performance a bit? For example, say I receive 50,000 hits
from a search. From your explanation, I have to retrieve the DB id from each
hit, perform a query to the DB using the id to retrieve the full contents for
each hit, run highlighter on each content, and then return?
Although I'll give this a shot, it will seem to slow performance on the
searching side of things, wouldn't it? Thanks for the reply.
-los



Erik Hatcher wrote:
You don't have to store a field to highlight text. If you've got it in your
database, retrieve it from there and pass that string to the highlighter
instead.
    Erik


On Jan 10, 2007, at 10:45 AM, moraleslos wrote:
I'm running into a little dilemma with Lucene highlighting and indexing. I
currently index anything and everything that gets inserted into a database.
This database includes all the content that is searched. Now I'll have lots
and lots of content, thinking of the range of 50GB+, all stored in the DB.
Using Lucene, I index all of this. But since I'm using highlighting
features, I'll also need to store the content into the index. Not sure what
the performance implications are during a search but I know that indexing
performance should be slower as well as the index size being enormous.
Because I have duplicated data, one in the index and the other in the db,
are there other ways of handling this situation in a more efficient and
performant way? Thanks in advance.
-los
--
View this message in context: http://www.nabble.com/Text-storing-design-and-performance-question-tf2953201.html#a8259883
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
--
View this message in context: http://www.nabble.com/Text-storing-design-and-performance-question-tf2953201.html#a8261739
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


