Erik Hatcher wrote:
I'm pretty sure this has been done, I'm just not 100% sure where. Does
Nutch index link text?
Nutch does do this sort of thing, but I'm not quite sure how. It
isn't doing any operations to the Lucene index beyond what plain ol'
Lucene does.
Nutch maintains a set of separate DBs (using Hadoop
MapFile/SequenceFile), where inlinks are stored (together with their
anchor text). During indexing this data is pulled in from the DBs piece
by piece using the URLs as "primary keys".
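To illustrate the lookup described above, here is a minimal sketch in plain Python, not Nutch code. It assumes the inlink DB can be modelled as a dictionary keyed by target URL; the names `build_inlink_db` and `build_index_doc`, the field names, and the example URLs are all hypothetical.

```python
from collections import defaultdict


def build_inlink_db(links):
    """Group (source_url, target_url, anchor_text) triples by target URL.

    Stands in for the separate inlink DB: the target URL acts as the
    "primary key", and each entry keeps the anchor text of one inlink.
    """
    db = defaultdict(list)
    for source, target, anchor in links:
        db[target].append((source, anchor))
    return dict(db)


def build_index_doc(url, page_text, inlink_db):
    """Assemble the fields for one document at indexing time.

    The anchor texts are pulled from the inlink DB by URL, so the index
    itself needs nothing beyond ordinary document fields.
    """
    anchors = [anchor for _source, anchor in inlink_db.get(url, [])]
    return {"url": url, "content": page_text, "anchor": anchors}


# Hypothetical link data: two pages link to the same target.
links = [
    ("http://a.example/", "http://c.example/", "great search library"),
    ("http://b.example/", "http://c.example/", "Lucene home"),
]
inlink_db = build_inlink_db(links)
doc = build_index_doc("http://c.example/", "Welcome", inlink_db)
```

In this sketch the anchor text ends up as just another field on the document, which is consistent with the point that nothing beyond plain Lucene indexing is required.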
Nutch doesn't update _any_ data structures in place: every "update"
operation creates new data files and optionally deletes old ones.
This applies to indexes as well. New indexes are created from newly
updated pages, and only individual Lucene documents are then deleted
from older indexes to get rid of duplicates. After a while, really old
indexes are removed completely, because their content is likely to be
present in one of the newer indexes.
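The index lifecycle above can be sketched roughly as follows. This is a toy model in plain Python, not Nutch or Lucene code: each "index" is assumed to be a dictionary of documents plus a set of deleted URLs, and the names `add_index`, `live_docs`, and `retire_oldest` are hypothetical.

```python
def add_index(indexes, new_docs):
    """Append a new index built from freshly updated pages.

    Older indexes are not rewritten; documents that are superseded by
    the new index are only marked as deleted there.
    """
    for index in indexes:
        for url in new_docs:
            if url in index["docs"]:
                index["deleted"].add(url)
    indexes.append({"docs": dict(new_docs), "deleted": set()})


def live_docs(indexes):
    """Return the documents visible across all indexes (url -> doc)."""
    live = {}
    for index in indexes:
        for url, doc in index["docs"].items():
            if url not in index["deleted"]:
                live[url] = doc
    return live


def retire_oldest(indexes, keep):
    """Drop all but the newest `keep` indexes (keep must be >= 1)."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    del indexes[:-keep]


indexes = []
add_index(indexes, {"u1": "v1", "u2": "v1"})  # first crawl
add_index(indexes, {"u1": "v2"})              # u1 was re-fetched
visible = live_docs(indexes)                  # u1 from new index, u2 from old
```

Note that retiring the oldest index in this toy example would drop `u2`, which only the first index holds. That is why removal is reserved for really old indexes, whose content is likely, though not guaranteed, to be present in a newer one.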
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com