Re: lucene link database

Erick Erickson Sun, 08 Oct 2006 13:32:52 -0700

A quick word of caution about doc IDs. Lucene assigns a document id at index
time, but that ID is *not* guaranteed to remain the same for a given
document. For instance... you index docs A, B, and C. They get Lucene IDs 1,
2, 3. Then you remove doc B and optimize the index. As I understand it, doc
C will get re-assigned ID 2, and ID 3 won't exist.


In reality, I don't think that the algorithm is quite as simplistic as that,
but that's the idea. So be sure to assign your own unique identifiers that
you add to your docs as a field value.
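
To make the renumbering concrete, here is a minimal sketch in plain Java. It is not Lucene code: ToyIndex and its methods are hypothetical names, and the "internal ID" is simply a list position (zero-based here, unlike the 1, 2, 3 in the example above). It only mimics the idea that internal IDs can shift after a delete, while a key you assign yourself stays usable.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for an index: the internal ID is just the list position.
// This is NOT the Lucene API; it only imitates the renumbering idea.
class ToyIndex {
    // Application-assigned keys, one per document, in insertion order.
    private final List<String> keys = new ArrayList<>();

    // Adds a document and returns the internal ID it holds right now.
    int add(String key) {
        keys.add(key);
        return keys.size() - 1;
    }

    // Deletes a document and compacts, so later documents shift down by one.
    void deleteAndCompact(String key) {
        keys.remove(key);
    }

    // Looks a document up by its stable key; returns -1 when absent.
    int internalIdOf(String key) {
        return keys.indexOf(key);
    }
}

class ToyIndexDemo {
    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add("A");
        index.add("B");
        int before = index.add("C");
        index.deleteAndCompact("B");
        int after = index.internalIdOf("C");
        // The internal ID of C changed, but the key "C" still finds it.
        System.out.println("C moved from " + before + " to " + after);
    }
}
```

The point of the sketch is only that anything you store outside the index should refer to the key, not to the internal position.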

Others on this list have talked about a hybrid solution. That is, have *both*
Lucene and a database, each doing what they do best. It's more complicated,
especially keeping the two in sync. Some tools have been mentioned; I think
if you search the archive for "database" you'll get a bunch of threads. But I
thought I'd mention it.

Best of luck
Erick

On 10/8/06, Cam Bazz <[EMAIL PROTECTED]> wrote:


Dear Erick;

Thank you for your detailed insight. I have been trying to code a graph
object database for some time.
I have prototyped on relational as well as object-oriented databases,
including open-source and commercial implementations
(so far, I have tried Hibernate, Objectivity/DB, and db4o). While object
databases excel at traversing links, they are poor at searching.

Lucene so far solves the problem of searching. I think of a document
as a list of tuples (a sequence of fields), and I can do searches with
Lucene; it is really nice.

Now I have to solve the problem of linking. If I keep the nodes in a
Lucene index, and I can fetch documents by doc_id or some sort of
surrogate identifier, and
use those identifiers as node_id in an object graph, that will be what I
want. But in order to do that I need to be able to query the Lucene
index by document_id.

I was referring to the link db of Nutch. They do have some sort of
link db implementation that runs with Hadoop, but I have not understood
the full code.
I am trying to understand the structure of this link database. I was
thinking of using documents with src and dst fields that have document
IDs as values (one idea; I will try it tomorrow).
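
A minimal sketch of the src/dst idea in plain Java, with two in-memory maps standing in for an index over the two fields. The names LinkStore, addLink, outgoingLinksOf and incomingLinksOf are hypothetical; this is not Nutch's link db and not Lucene code, only the shape of the lookups.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Each link is one (src, dst) pair. The two maps play the role of
// "search on the src field" and "search on the dst field".
class LinkStore {
    private final Map<String, List<String>> bySrc = new HashMap<>();
    private final Map<String, List<String>> byDst = new HashMap<>();

    // Records a link from src to dst under both fields.
    void addLink(String src, String dst) {
        bySrc.computeIfAbsent(src, k -> new ArrayList<>()).add(dst);
        byDst.computeIfAbsent(dst, k -> new ArrayList<>()).add(src);
    }

    // All targets that id links to (empty list if none).
    List<String> outgoingLinksOf(String id) {
        return bySrc.getOrDefault(id, Collections.emptyList());
    }

    // All sources that link to id (empty list if none).
    List<String> incomingLinksOf(String id) {
        return byDst.getOrDefault(id, Collections.emptyList());
    }
}
```

The identifiers passed in are assumed to be the application's own surrogate IDs, not index-internal document numbers.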

Again thanks a bunch.

Best Regards,
C.B.

Erick Erickson wrote:
> Approach it in whatever way you want as long as it solves your problem
> <G>.
>
> My first question is why use Lucene? Would a database suit your needs
> better? Of course, I can't say. Lucene shines at full-text searching, so
> it's a closer call if you aren't searching on parts of text. By that I
> mean that if you're not searching on *parts* of your links, you may want
> to consider a DB solution.
>
> That said, and if I understand your requirement, you have a pretty simple
> design. Each document has two fields, incominglinks and outgoinglinks. But
> see the note below. Lucene indexes what you give it, so the fact that some
> of the links aren't hypertext links is immaterial to Lucene. Since you
> control both the indexer and searcher, these conform to whatever your
> requirements are. It's up to you to map semantics onto these entities.
>
> One common trap DB-savvy people fall into is thinking of documents as
> entries in a table, all with the same fields. There is nothing requiring
> you to have the *same* fields in each document in an index. You could have
> an index in which no two documents share *any* common field if you choose.
>
> So, if you want to find out which documents have link X as an
> incoming link, just search on incominglinks:X. If you wanted to find the
> documents that had any incoming links X, Y, Z that matched an outgoing
> link in another document, just search the OR of these in outgoinglinks.
>
> If you want some kind of map of the whole web of links, you'll have to
> write some iterative loop and keep track. There's nothing built in that I
> know of that lets you answer "Given link X, show me all the documents no
> more than 3 hops away". Lucene is an *engine*, designed to have apps built
> on top of it. Lucene doesn't deal with relations between documents, just
> searching what you've indexed.
>
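
The iterative loop described above could look roughly like this. It is a minimal sketch in plain Java, assuming the links are already available as an in-memory map from a node to its outgoing links. LinkHops, within and the map are hypothetical names; in a real application each map lookup would presumably be replaced by a search against the index.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class LinkHops {
    // Returns every node reachable from start in at most maxHops steps,
    // not counting start itself. Breadth-first, one hop per loop pass.
    static Set<String> within(Map<String, List<String>> outgoing,
                              String start, int maxHops) {
        Set<String> seen = new HashSet<>();
        seen.add(start);
        List<String> frontier = Collections.singletonList(start);
        for (int hop = 0; hop < maxHops && !frontier.isEmpty(); hop++) {
            List<String> next = new ArrayList<>();
            for (String node : frontier) {
                // Stand-in for "search for the outgoing links of node".
                List<String> targets =
                        outgoing.getOrDefault(node, Collections.<String>emptyList());
                for (String target : targets) {
                    // Set.add returns false for nodes already visited,
                    // which is how cycles are kept from looping forever.
                    if (seen.add(target)) {
                        next.add(target);
                    }
                }
            }
            frontier = next;
        }
        seen.remove(start);
        return seen;
    }
}
```

The "keep track" part is the seen set: without it, a cycle of links would be revisited on every pass.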
> It's easy enough to store a variable number of links in your incominglinks
> or outgoinglinks field. Just be sure they're tokenized appropriately. You
> can add them any way you choose: either concatenate them all into a big
> string and index that, or index them into the same field, e.g.
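> Document doc = new Document();
> // Field constructor as in Lucene 2.0; the Store/Index flags are one
> // possible choice, TOKENIZED passes the value through the analyzer.
> doc.add(new Field("incoming", "link1", Field.Store.YES, Field.Index.TOKENIZED));
> doc.add(new Field("incoming", "link2", Field.Store.YES, Field.Index.TOKENIZED));
> .
> .
> .
> writer.addDocument(doc);
>
> According to a discussion from a while ago, this is the same as
> doc.add(new Field("incoming", "link1 link2", Field.Store.YES, Field.Index.TOKENIZED));
> in terms of how it all gets handled internally.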
>
>
> NOTE: I'm skipping most of the question of which Analyzer you use. This
> will almost surely trip you up sometime. I'd suggest starting with
> WhitespaceAnalyzer as that's more intuitive. Some of the other analyzers
> will break your links up in ways you don't expect. Really, really, really
> get a copy of Luke to see what's actually *in* your index and how searches
> work, and how the analyzer you choose changes what's searched for as well
> as what's indexed. Google "lucene luke" and you'll find it.
>
> Anyway, hope this all helps.
> Erick
>
> On 10/8/06, Cam Bazz <[EMAIL PROTECTED]> wrote:
>>
>> Hello,
>>
>> I would like to make a link database using Lucene, similar to the one
>> that Nutch uses. I have read the basic documentation and understood how
>> document indexing, search, and scoring work. But what I would like is
>> different documents having different kinds of links (semantic links) to
>> each other. I would like to be able to search the database with queries
>> like incominglinksofdocument(id) and outgoinglinksofdocument(id). The
>> links I am talking about might not necessarily be hypertext links.
>>
>> How would I approach a problem like this?
>>
>> Best Regards,
>> -C.B.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>>
>


