Hi,
 
I'm working on a pretty typical web page search system based on Lucene.
Pretty much everything works great.  However, I'm having one problem.  I
want to have a feature in this system where I can find all pages which link
to a given page.  So, for instance, I might search for all the pages that
link to http://www.foobar.com/index.html.  The search term does not need to
be fuzzy in any way: http://www.foobar.com would not match
http://www.foobar.com/.  The thing is that for any given document I could
have any number of associated links.
 
I think that each page's links could be treated as an array of keywords.
However, I don't know the best practice for indexing this data or how to
find matches for specific links.  
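 
To make the "array of keywords" idea concrete, here's roughly what I have
in mind (just a sketch; I don't know whether adding repeated Field.Keyword
fields like this is actually the right way to do it):
 
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
 
public class LinkFields {
    // Add each link as its own untokenized "keyword" field.  As far
    // as I can tell, Lucene allows several fields with the same name
    // on one document.
    static Document makeDocument(String pageText, String[] links) {
        Document doc = new Document();
        doc.add(Field.Text("contents", pageText));
        for (int i = 0; i < links.length; i++) {
            doc.add(Field.Keyword("links", links[i]));
        }
        return doc;
    }
}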
 
I tried creating a LinebreakAnalyzer which (I think) tokenizes text on CRs
and LFs.  I converted each page's array of links to a single string of
links delimited by LFs.  When indexing, I used a PerFieldAnalyzerWrapper
and set the links field to use the LinebreakAnalyzer.  My understanding is
that the Lucene index should now have each of the links indexed as a
separate term or keyword (sorry if my vocabulary is wrong!).
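 
The relevant pieces boil down to something like this (a simplified sketch,
not my exact code):
 
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharTokenizer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
 
public class LinebreakAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Every character except CR and LF is a token character, so
        // each LF-delimited link comes out as exactly one term.
        return new CharTokenizer(reader) {
            protected boolean isTokenChar(char c) {
                return c != '\r' && c != '\n';
            }
        };
    }
 
    // Index one page, applying the LinebreakAnalyzer only to "links".
    public static void indexPage(String dir, String body, String[] links)
            throws java.io.IOException {
        PerFieldAnalyzerWrapper wrapper =
            new PerFieldAnalyzerWrapper(new StandardAnalyzer());
        wrapper.addAnalyzer("links", new LinebreakAnalyzer());
 
        IndexWriter writer = new IndexWriter(dir, wrapper, true);
        Document doc = new Document();
        doc.add(Field.Text("contents", body));
 
        // Collapse the array of links into one LF-delimited string.
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < links.length; i++) {
            sb.append(links[i]).append('\n');
        }
        doc.add(Field.Text("links", sb.toString()));
 
        writer.addDocument(doc);
        writer.close();
    }
}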
 
Now, all that seems to work fine.  However, when I search I build a query
using this code:
 
QueryParser.parse(link, "links", new LinebreakAnalyzer())
 
Here, link is the URL I'm searching for and "links" is the field I'm
searching, using the same analyzer I used to index the links.  The problem
is that I don't get any matches at all when I execute the search.
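 
In case it helps, the full search path is essentially this (again
simplified; the index path and link value are just examples):
 
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
 
public class LinkSearch {
    public static void main(String[] args) throws Exception {
        String link = "http://www.foobar.com/index.html";
 
        IndexSearcher searcher = new IndexSearcher("/path/to/index");
        Query query = QueryParser.parse(link, "links", new LinebreakAnalyzer());
 
        // Printing the parsed query shows what Lucene will actually
        // search for, which may help explain the zero hits.
        System.out.println("parsed: " + query.toString("links"));
 
        Hits hits = searcher.search(query);
        System.out.println(hits.length() + " matching pages");
        searcher.close();
    }
}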
 
Does anyone know of any better techniques for this?  Or does anyone see
anything I'm doing wrong?
 
Thanks in advance for the help.
 
Doug Hughes
[EMAIL PROTECTED]
