Erik,

Thanks for your reply. I guess I didn't describe the problem very well. I had already parsed the HTML and extracted the links; I had them readily available in an ArrayList. I wanted to place the data from that array (which happened to be links) into the index so that I could match documents according to where they link.
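For readers following along, that extraction step might look something like the sketch below. This is only an illustration, assuming a naive regex-based parse; the class name, pattern, and method are hypothetical, not the code from this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    // Naive href pattern for illustration; a real HTML parser
    // (or XPath, as suggested later in this thread) is more robust.
    private static final Pattern HREF =
            Pattern.compile("<a\\s+[^>]*href=\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    // Collect every href value found in the page, in document order.
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }
}
```

Each entry in the resulting list is a link to be indexed verbatim, untouched by analysis.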
However, that's rather irrelevant now. I took the day off today, which allowed my mind to mull this over a bit. As a result, I realized I had the logic in my LinebreakAnalyzer backwards. Rather than tokenizing the string based on line breaks, I was tokenizing it on everything EXCEPT line breaks... which explains why nothing matched. Oops. So, it works now.

This might be somewhat presumptuous of me, but it might be useful for Lucene to include a DelimitedTextAnalyzer and Tokenizer. Their constructors might accept an array of characters to be used as delimiters between terms, which would then be indexed into a particular field. I've all but written this, if anyone's interested.

Doug

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]]
Sent: Sunday, May 29, 2005 7:39 PM
To: java-user@lucene.apache.org
Subject: Re: Indexing multiple keywords in one field?

On May 29, 2005, at 8:29 AM, Doug Hughes wrote:

> Hi,
>
> I'm working on a pretty typical web page search system based on Lucene.
> Pretty much everything works great. However, I'm having one problem.
> I want to have a feature in this system where I can find all pages
> which link to another page. So, for instance, I might search for all
> the pages linked to http://www.foobar.com/index.html. The search term
> does not need to be fuzzy in any way. http://www.foobar.com would not
> match http://www.foobar.com/. The thing is that for any given
> document I could have any number of associated links.
>
> I think that each page's links could be treated as an array of keywords.
> However, I don't know the best practice for indexing this data or how
> to find matches for specific links.

One possibility is to extract the links (XPath could do this with a "//a" pattern) during a parsing phase, not during an analyzer. Build a list of links and index each one as a separate Field.Keyword() field for a single document.
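The core splitting logic of the DelimitedTextTokenizer proposed above might look like the following sketch. The class and method names are hypothetical (this is not Lucene API, and not Doug's actual code); a real version would subclass Lucene's Tokenizer and emit Token objects rather than strings.

```java
import java.util.ArrayList;
import java.util.List;

public class DelimitedTextTokenizer {
    private final String delimiters;

    // The constructor accepts the delimiter characters, as proposed above.
    public DelimitedTextTokenizer(char[] delimiters) {
        this.delimiters = new String(delimiters);
    }

    // Split on the delimiter characters (not on everything EXCEPT them,
    // which was the bug described above), dropping empty tokens.
    public List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (delimiters.indexOf(c) >= 0) {    // hit a delimiter: close the token
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);               // part of a term: keep it verbatim
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

Constructed with new char[] {'\n', '\r'}, this reproduces the intended LinebreakAnalyzer behavior: each line becomes one untouched term.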
> I tried creating a LinebreakAnalyzer which (I think) tokenized phrases
> based on CRs and LFs. I converted the array of links to a list of
> links delimited by LFs. When indexing I used the
> PerFieldAnalyzerWrapper and set the links field to use the
> LinebreakAnalyzer. My understanding is that the Lucene index should
> now have each of the links indexed as separate terms or keywords
> (sorry if my vocabulary is wrong!)

Links are broken per line? For general HTML parsing you certainly cannot assume that, but maybe in your documents you can? I'd be surprised at that, though.

> Now, all that seems to work fine. However, when I search I build a
> query using this code:
>
> QueryParser.parse(link, "links", new LinebreakAnalyzer())
>
> The link is the link I'm searching for, and "links" is the field I'm
> searching. I'm using the same analyzer I used to index the links.
> The problem is I don't get any matches at all when I execute the
> search.
>
> Does anyone know of any better techniques for this? Or does anyone
> see anything I'm doing wrong?

The first thing to do is ensure what you think was indexed really was. I highly recommend you get Luke - http://www.getopt.org/luke/ - and explore the index you've built to see what terms were indexed in your links field. Then experiment using the TermQuery API with the _exact_ terms indexed. Only then move up to QueryParser if you need that kind of thing, using Query.toString() to dump the generated query instance and see what it is made of. QueryParser introduces a level of complexity that can be confusing because query expression operators, parsing, and analysis are all mixed together - and some characters in a URL within a QueryParser expression will need to be escaped to be interpreted properly (like the ":" in "http://").
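Erik's last point, escaping a URL before handing it to QueryParser, could be handled with a small helper along the lines below. The special-character set is taken from Lucene's query syntax (the multi-character operators && and || would need separate handling); the helper itself is a hypothetical sketch, not a Lucene API.

```java
public class QueryEscaper {
    // Single characters with special meaning in Lucene's query syntax.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?";

    // Prefix each special character with a backslash so QueryParser
    // treats the URL as literal text rather than query operators.
    public static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0) {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }
}
```

That said, for exact-match link lookups the simpler route, per the advice above, is to bypass QueryParser entirely with new TermQuery(new Term("links", url)), which takes the term verbatim with no parsing or analysis at all.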
Erik

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]