Re: document field updates

Andrzej Bialecki Thu, 01 Mar 2007 08:41:22 -0800

Erik Hatcher wrote:

I'm pretty sure this has been done, I'm just not 100% sure where. Does
Nutch index link text?
Nutch does do this sort of thing, but I'm not quite sure how. It isn't doing any operations to the Lucene index beyond what plain ol' Lucene does.

Nutch maintains a set of separate DBs (using Hadoop MapFile/SequenceFile), where inlinks are stored (together with their anchor text). During indexing this data is pulled in from the DBs piece by piece using the URLs as "primary keys".
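
As a rough illustration of the lookup described above, here is a minimal sketch in Python. It is not Nutch code: plain in-memory dicts stand in for the MapFile/SequenceFile DBs, and the names build_linkdb and make_index_doc, the field names, and the example URLs are all hypothetical.

```python
from collections import defaultdict


def build_linkdb(outlinks):
    """Invert (source_url, target_url, anchor_text) triples into
    a mapping of target_url -> list of (source_url, anchor_text)."""
    linkdb = defaultdict(list)
    for source, target, anchor in outlinks:
        linkdb[target].append((source, anchor))
    return dict(linkdb)


def make_index_doc(url, content, linkdb):
    """Assemble the fields of one document at indexing time.
    The page URL is the key used to pull its inlink anchors from the link DB."""
    anchors = [anchor for _source, anchor in linkdb.get(url, [])]
    return {"url": url, "content": content, "anchor": anchors}


# Hypothetical crawl data: two pages link to the same target.
outlinks = [
    ("http://a.example/", "http://c.example/", "great search library"),
    ("http://b.example/", "http://c.example/", "Lucene home"),
]
linkdb = build_linkdb(outlinks)
doc = make_index_doc("http://c.example/", "page body text", linkdb)
```

The point of the sketch is only that the anchor text lives outside the index and is joined in by URL when the document is built, so the index itself never needs a field-level update.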

Nutch doesn't update _any_ data structures in-place - all "update" operations involve creating new data files and optionally deleting old data files. This includes also indexes - new indexes are being created from newly updated pages, and then only individual Lucene documents are deleted from older indexes to get rid of duplicates. After a while, really old indexes are removed completely, because their content is likely to be present in one of the newer indexes.
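
The update model above can be sketched roughly as follows. This is a toy model in Python, not Nutch or Lucene code: each "index" is a dict with a set of deleted URLs, and add_index, live_docs and drop_oldest are hypothetical names chosen for the example.

```python
def add_index(indexes, new_docs):
    """Append a fresh index built from newly updated pages.
    Older indexes keep their data; duplicates are only marked as deleted."""
    new_index = {"docs": dict(new_docs), "deleted": set()}
    updated = []
    for old in indexes:
        # URLs that now also exist in the new index are duplicates.
        dupes = {url for url in old["docs"] if url in new_index["docs"]}
        updated.append({"docs": old["docs"], "deleted": old["deleted"] | dupes})
    return updated + [new_index]


def live_docs(indexes):
    """Return the documents a search would see across all indexes."""
    result = {}
    for index in indexes:
        for url, content in index["docs"].items():
            if url not in index["deleted"]:
                result[url] = content
    return result


def drop_oldest(indexes, keep):
    """Remove really old indexes completely, keeping only the newest `keep`."""
    return indexes[max(len(indexes) - keep, 0):]


indexes = add_index([], {"u1": "version 1", "u2": "version 1"})
indexes = add_index(indexes, {"u1": "version 2"})
visible = live_docs(indexes)
```

Here the old copy of "u1" is still physically present in the first index and is merely flagged as deleted, while the visible view contains the new "u1" and the untouched "u2".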


--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


