Erik Hatcher wrote:

I'm pretty sure this has been done, I'm just not 100% sure where. Does
Nutch index link text?

Nutch does do this sort of thing, but I'm not quite sure how. It isn't doing any operations to the Lucene index beyond what plain ol' Lucene does.


Nutch maintains a set of separate DBs (built on Hadoop MapFile/SequenceFile) that store inlinks together with their anchor text. During indexing this data is pulled in from the DBs piece by piece, using the URLs as "primary keys".
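
To make the keyed lookup concrete, here is a minimal sketch in plain Python. It is not Nutch code and uses no Hadoop classes: an in-memory dict sorted by URL stands in for the MapFile, and the names build_inlink_db and make_index_doc, as well as the field names, are made up for illustration.

```python
from collections import defaultdict


def build_inlink_db(links):
    """Group (from_url, to_url, anchor_text) triples by target URL.

    The result is ordered by key, loosely mimicking a sorted key/value
    file. This is an illustrative stand-in, not the Nutch data format.
    """
    db = defaultdict(list)
    for from_url, to_url, anchor in links:
        db[to_url].append((from_url, anchor))
    return dict(sorted(db.items()))


def make_index_doc(url, content, inlink_db):
    """Build the document to be indexed for one page.

    The page URL is the "primary key" used to pull the anchor text of
    its inlinks out of the separate inlink DB.
    """
    anchors = [anchor for _from_url, anchor in inlink_db.get(url, [])]
    return {"url": url, "content": content, "anchors": anchors}


links = [
    ("http://a.example/", "http://c.example/", "great search library"),
    ("http://b.example/", "http://c.example/", "Lucene home"),
    ("http://a.example/", "http://b.example/", "my other site"),
]
inlink_db = build_inlink_db(links)
doc = make_index_doc("http://c.example/", "page body text", inlink_db)
```

The point is only that anchor text lives outside the page content and is joined in by URL at indexing time, so a page can be found by words that appear only in links pointing to it.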

Nutch doesn't update _any_ data structures in place: all "update" operations involve creating new data files and optionally deleting old ones. This applies to indexes too: new indexes are created from newly updated pages, and then only individual Lucene documents are deleted from older indexes to get rid of duplicates. After a while, really old indexes are removed completely, because their content is likely to be present in one of the newer indexes.
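
The update cycle described above can be sketched roughly as follows. Again this is plain Python rather than Nutch or Lucene code; the add_index and live_docs helpers are hypothetical, and dropping old indexes by count (max_indexes) is my simplification of "after a while".

```python
def add_index(indexes, new_docs, max_indexes=3):
    """Return a new list of indexes with one more index appended.

    indexes  : list of {"docs": {url: text}, "deleted": frozenset of urls},
               oldest first
    new_docs : {url: text} for freshly updated pages
    Older indexes keep their documents; URLs that reappear in the new
    index are only marked as deleted there (duplicate removal).
    """
    new_urls = set(new_docs)
    updated = []
    for idx in indexes:
        duplicates = new_urls & idx["docs"].keys()
        updated.append({"docs": idx["docs"],
                        "deleted": idx["deleted"] | duplicates})
    updated.append({"docs": dict(new_docs), "deleted": frozenset()})
    # Assumption: retire the oldest indexes once there are too many.
    return updated[-max_indexes:]


def live_docs(indexes):
    """Collect every document that has not been marked as deleted."""
    live = {}
    for idx in indexes:
        for url, text in idx["docs"].items():
            if url not in idx["deleted"]:
                live[url] = text
    return live


indexes = add_index([], {"http://a.example/": "v1", "http://b.example/": "v1"})
indexes = add_index(indexes, {"http://a.example/": "v2"})
current = live_docs(indexes)
```

After the second call the older index still holds both pages, but the stale copy of http://a.example/ is marked deleted, so a search across all indexes sees only the newer version.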

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


