Re: HTML pages highlighter

Erik Hatcher Mon, 04 Apr 2005 17:45:37 -0700

On Apr 4, 2005, at 5:35 PM, Yagnesh Shah wrote:

I end up purchasing your book "Lucene in Action". I have downloaded your code samples. I am able to retrieve "result" only some time. Below is the code I have taken from Search.jhtml in lucene demo. I have 2 problem

a) I am unable to display "result" using b) When I click on the title to retrieve document I do not see my query highlighted.

First things first.... get something very very simple working and expand from there. Here is the simple code from our HighlightIt.java:

    TermQuery query = new TermQuery(new Term("f", "ipsum"));
    QueryScorer scorer = new QueryScorer(query);
    SimpleHTMLFormatter formatter =
        new SimpleHTMLFormatter("<span class=\"highlight\">",
            "</span>");
    Highlighter highlighter = new Highlighter(formatter, scorer);
    Fragmenter fragmenter = new SimpleFragmenter(50);
    highlighter.setTextFragmenter(fragmenter);

    TokenStream tokenStream = new StandardAnalyzer()
        .tokenStream("f", new StringReader(text));

    String result =
        highlighter.getBestFragments(tokenStream, text, 5, "...");

One trick is that you must ensure the query you are passing to QueryScorer has been rewritten. In our simple TermQuery case, that is not necessary, but in a general application it is. You can call query.rewrite(reader) where reader is your IndexReader instance. This ensures that range, fuzzy, and wildcard queries are expanded and highlightable.

I'm not sure what is wrong with the code you are trying. But again, start simple, just try out our HighlightIt or our HighlightTest. If those work fine for you then move on to integrating further with your index. Besides the Query.rewrite() trick, you have to be sure that the text you want to highlight is available. If you're pulling it from the index, it must be in a stored field, otherwise you need to retrieve it from elsewhere.

        Erik

<java>
 Searcher searcher = new IndexSearcher(getReader(indexName));
 // get query from request
 String queryString = request.getParameter("query");
query = QueryParser.parse(queryString, "contents", analyzer); Hits hits = searcher.search(query); SimpleHTMLFormatter formatter = new SimpleHTMLFormatter(); Highlighter highlighter = new Highlighter(formatter, new QueryScorer(query)); highlighter.setTextFragmenter(new SimpleFragmenter(50)); String FIELD_NAME = "contents";

for (int i = start; i < end; i++) { // display the hits Document doc = hits.doc(i); String text = hits.doc(i).get(FIELD_NAME); int maxNumFragmentsRequired = 5; String fragmentSeparator = "..."; if ( text != null){ TokenStream tokenStream = new StandardAnalyzer().tokenStream(FIELD_NAME, new java.io.StringReader(text)); String result = highlighter.getBestFragments (tokenStream,text,maxNumFragmentsRequired,fragmentSeparator); System.out.println("result=" +result); }

String title = doc.get("title"); if (title.equals("")) // use url for docs w/o title title = doc.get("path"); </java> <java type=print>(int)(hits.score(i) * 100.0f)</java>% <a href="`doc.get("path")`"> <java type=print>Entities.encode(title)</java> </a> <java> if (showSummaries) { // maybe show summary </java> <ul>Summary: <java type=print>Entities.encode(doc.get("summary"))</java> </ul> <java> } } </java>
-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Thursday, March 31, 2005 8:04 PM
To: [email protected]
Subject: Re: HTML pages highlighter
On Mar 31, 2005, at 6:36 PM, Yagnesh Shah wrote:
 try {
 fis = new FileInputStream(f);
 HTMLParser parser = new HTMLParser(fis);
 // Add the tag-stripped contents as a Reader-valued Text field
so it will
 // get tokenized and indexed.
// doc.add(new Field("contents", parser.getReader()));
 LineNumberReader reader = new
LineNumberReader(parser.getReader());
 for (String l = reader.readLine(); l != null; l =
reader.readLine())
// System.out.println(l);
 doc.add(Field.Text("contents", l));
Notice that your loop here is adding a "contents" field for *every*
line read since that is where the first semi-colon is.
Look at using Luke to explore your index. Try indexing just a dummy
String:
 doc.add(Field.Text("contents", "some dummy text"));
to show that it works. Always always always simplify a complicated
situation by doing the most obvious thing that _should_ work.
Also, the demo Lucene code is not really designed to be used in a
production application (sadly), so you're better off borrowing code
from the many articles or our book to begin with.
 Erik
 // Add the summary as a field that is stored and returned with
 // hit documents for display.
 doc.add(new Field("summary", parser.getSummary(),
Field.Store.YES, Field.Index.NO));
 // Add the title as a field that it can be searched and that is
stored.
 doc.add(new Field("title", parser.getTitle(), Field.Store.YES,
Field.Index.TOKENIZED));
 }
-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Wednesday, March 30, 2005 7:38 PM
To: [email protected]
Subject: Re: HTML pages highlighter
On Mar 30, 2005, at 4:46 PM, Yagnesh Shah wrote:
Hi! Eric,
Erik - with a 'k' - Sorry, I let it slide once though :)
I try to modified that with this but I get compile error. Do you have any code snippet of highlighting code to pull the contents from the original source?
I have a whole book full of code examples :)
http://www.lucenebook.com - Grab the source code and look in
src/lia/tools at Highlight*.java
 or Do you know how I can do field store?
 doc.add(new Field("contents", parser.getReader(),
Field.Store.YES, Field.Index.NO));
You cannot store it with a Reader. You need to use Field.Text(String,
String), or one of the other variations.
 Erik
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: HTML pages highlighter

Reply via email to