Re: Removing similar documents from search results

Chris Hostetter Sun, 20 Mar 2005 00:49:22 -0800

: At the moment I need something quite simple. To identify a page that
: appears in many forms, e.g.:
:
: - Normal version
: - Split across several pages
: - Print version
: - From a different section (different styling and navigation elements)
:
: Basically identical content, presented in different ways.


Actually, your "Split across several pages" comment implies that you want
a system which can tell that page 1 of a multipage article should be
grouped with page 2 -- which may be radically different content.  Most
multipage documents have very different text on subsequent pages, so i'm
not sure that a programmatic solution is going to be able to spot that.

I may also be reading too much into your message, but it sounds like you
aren't trying to index generic content -- it sounds like you are trying to
index content under your control (ie: content on your own web site).

if that's the case, then presumably you know something about the
source data and the URL structure -- maybe you could solve this problem
when you build your index.

for example, if i look at a site like perl.com, i can see a pattern in the
way the article URLs look...

page 1...
http://www.perl.com/pub/a/2005/02/17/3d_engine.html
page 2, etc...
http://www.perl.com/pub/a/2005/02/17/3d_engine.html?page=2
printable...
http://www.perl.com/lpt/a/2005/02/17/3d_engine.html


So instead of putting all of those URLs in the index as separate docs, why
not create a single doc, with all of those URLs?
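
As a rough sketch of that idea (not tested against any real crawler), the
grouping could be done before indexing by mapping every URL to a canonical
key. The two rules below are assumptions read off the three perl.com
examples above: the printable version swaps /pub/ for /lpt/, and the query
string only carries the page number. The names canonical_key and group_urls
are made up for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def canonical_key(url):
    """Map a page-N or printable URL to the URL of page 1 (assumed rules)."""
    parts = urlsplit(url)
    path = parts.path
    # Assumption: the printable version lives under /lpt/ instead of /pub/.
    if path.startswith("/lpt/"):
        path = "/pub/" + path[len("/lpt/"):]
    # Assumption: the query string only holds the page number, so drop it.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


def group_urls(urls):
    """Collect every URL variant under its canonical key, in input order."""
    groups = defaultdict(list)
    for url in urls:
        groups[canonical_key(url)].append(url)
    return dict(groups)


urls = [
    "http://www.perl.com/pub/a/2005/02/17/3d_engine.html",
    "http://www.perl.com/pub/a/2005/02/17/3d_engine.html?page=2",
    "http://www.perl.com/lpt/a/2005/02/17/3d_engine.html",
]
groups = group_urls(urls)
```

Each key in groups would then become one indexed doc, with the text of all
its pages concatenated and the list of variant URLs kept alongside it. A
site with a different URL scheme would need its own rules in canonical_key.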




-Hoss


