Re: lucene link database

mark harwood Mon, 09 Oct 2006 01:45:12 -0700

>>if you search the archive for database you'll get a bunch of threads


This was a hybrid implementation I did which worked with HSQLDB and Derby:

http://www.mail-archive.com/java-user@lucene.apache.org/msg02953.html


Cheers
Mark

----- Original Message ----
From: Erick Erickson <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Sunday, 8 October, 2006 8:33:59 PM
Subject: Re: lucene link database

A quick word of caution about doc IDs. Lucene assigns a document id at index
time, but that ID is *not* guaranteed to remain the same for a given
document. For instance... you index docs A, B, and C. They get Lucene IDs 1,
2, 3. Then you remove doc B and optimize the index. As I understand it, doc
C will get re-assigned ID 2, and ID 3 won't exist.

In reality, I don't think that the algorithm is quite as simplistic as that,
but that's the idea. So be sure to assign your own unique identifiers that
you add to your docs as a field value.
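
To make the renumbering concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene's actual implementation: the class ToyIndex and its methods are made up for illustration, the internal doc ID is simply a position in a list (Lucene numbers documents from 0, so the sketch does too), and optimize is modeled as squeezing out deleted slots. It only aims to show why an application-assigned key stays valid when the internal ID does not.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: ToyIndex is hypothetical and does not use any Lucene classes.
public class ToyIndex {
    // Each slot holds the application's own key; null marks a deleted document.
    private final List<String> slots = new ArrayList<>();

    // Adds a document and returns its internal doc ID (its position in the list).
    public int add(String ownKey) {
        slots.add(ownKey);
        return slots.size() - 1;
    }

    // Marks a document as deleted without renumbering anything yet.
    public void delete(String ownKey) {
        int i = slots.indexOf(ownKey);
        if (i >= 0) {
            slots.set(i, null);
        }
    }

    // Models optimize: deleted slots are squeezed out, so later docs shift down.
    public void optimize() {
        slots.removeIf(k -> k == null);
    }

    // Looks up the current internal doc ID by the stable key, or -1 if absent.
    public int docIdOf(String ownKey) {
        return slots.indexOf(ownKey);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add("A");
        index.add("B");
        int before = index.add("C");
        index.delete("B");
        index.optimize();
        int after = index.docIdOf("C");
        System.out.println("C: internal ID before=" + before + ", after=" + after);
    }
}
```

In this model the key "C" still finds the document after the optimize step, while the internal ID captured earlier no longer points at it.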

Others on this list have talked about a hybrid solution. That is, have *both*
Lucene and a database, each doing what it does best. It's more complicated,
especially keeping the two in sync. Some tools have been mentioned; I think
if you search the archive for "database" you'll get a bunch of threads. But I
thought I'd mention it.

Best of luck
Erick

On 10/8/06, Cam Bazz <[EMAIL PROTECTED]> wrote:
>
> Dear Erick;
>
> Thank you for your detailed insight. I have been trying to code a graph
> object database for some time.
> I have prototyped on relational as well as object-oriented databases,
> including open-source and commercial implementations
> (so far, I have tried Hibernate, Objectivity/DB, and db4o). While object
> databases excel at traversing links, they are poor at searching.
>
> Lucene so far solves the problem of searching. I am thinking of a document
> as a list of tuples (a sequence of fields), and I can do searches with
> Lucene; it is really nice.
>
> Now I have to solve the problem of linking. If I keep the nodes in a
> Lucene index, and I can fetch documents by a doc_id or some sort of
> surrogate identifier, and
> use those identifiers as node_id in an object graph, that will be what I
> want. But in order to do that I need to be able to query the Lucene
> index by document_id.
>
> I was referring to the link db of Nutch. They do have some sort of
> link db implementation that runs on Hadoop, but I have not understood
> the full code.
> I am trying to understand the structure of this link database. I was
> thinking of using documents with src and dst fields that have document
> IDs as values (one idea; I will try it tomorrow).
>
> Again thanks a bunch.
>
> Best Regards,
> C.B.
>
> Erick Erickson wrote:
> > Approach it in whatever way you want as long as it solves your problem
> > <G>.
> >
> > My first question is why use Lucene? Would a database suit your needs
> > better? Of course, I can't say. Lucene shines at full-text searching, so
> > it's a closer call if you aren't searching on parts of text. By that I
> > mean that if you're not searching on *parts* of your links, you may want
> > to consider a DB solution.
> >
> > That said, and if I understand your requirement, you have a pretty simple
> > design. Each document has two fields, incominglinks and outgoinglinks. But
> > see the note below. Lucene indexes what you give it, so the fact that some
> > of the links aren't hypertext links is immaterial to Lucene. Since you
> > control both the indexer and searcher, these conform to whatever your
> > requirements are. It's up to you to map semantics onto these entities.
> >
> > One common trap DB-savvy people fall into is thinking of documents as
> > entries in a table, all with the same fields. There is nothing requiring
> > you to have the *same* fields in each document in an index. You could have
> > an index for which no two documents shared *any* common field if you choose.
> >
> > So, if you want to find out which documents have link X as an
> > incoming link, just search on incominglinks:X. If you wanted to find the
> > documents that had any incoming links X, Y, Z that matched an outgoing
> > link in another document, just search the OR of these in outgoinglinks.
> >
> > If you want some kind of map of the whole web of links, you'll have to
> > write some iterative loop and keep track. There's nothing built in that I
> > know of that lets you answer "Given link X, show me all the documents no
> > more than 3 hops away". Lucene is an *engine*, designed to have apps built
> > on top of it. Lucene doesn't deal with relations between documents, just
> > searching what you've indexed.
> >
> > It's easy enough to store a variable number of links in your incominglinks
> > or outgoinglinks field. Just be sure they're tokenized appropriately. You
> > can add them any way you choose, either concatenate them all into a big
> > string and index that, or index them into the same field, e.g.
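> > Document doc = new Document();
> > // Lucene 2.0-style Field constructor; the Store/Index flags are illustrative
> > doc.add(new Field("incoming", "link1", Field.Store.YES, Field.Index.TOKENIZED));
> > doc.add(new Field("incoming", "link2", Field.Store.YES, Field.Index.TOKENIZED));
> > .
> > .
> > .
> > writer.addDocument(doc);
> >
> > According to a discussion from a while ago, this is the same as
> > doc.add(new Field("incoming", "link1 link2", Field.Store.YES, Field.Index.TOKENIZED));
> > in terms of how it all gets handled internally.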
> >
> >
> > NOTE: I'm skipping most of the question of which Analyzer you use. This
> > will almost surely trip you up sometime. I'd suggest starting with
> > WhitespaceAnalyzer as that's more intuitive. Some of the other analyzers
> > will break your links up in ways you don't expect. Really, really, really
> > get a copy of Luke to see what's actually *in* your index and how searches
> > work. And how the analyzer you choose changes what's searched for, as well
> > as what's indexed. Google "lucene luke" and you'll find it.
> >
> > Anyway, hope this all helps.
> > Erick
> >
> > On 10/8/06, Cam Bazz <[EMAIL PROTECTED]> wrote:
> >>
> >> Hello,
> >>
> >> I would like to make a link database using Lucene, similar to the one that
> >> Nutch uses. I have read the basic documentation and understood how
> >> document indexing, search, and scoring work. But what I would like is
> >> different documents having different kinds of links (semantic links) to
> >> each other. I would like to be able to search the database with queries like
> >> incominglinksofdocument(id), outgoinglinksofdocument(id). The links I am
> >> talking about might not necessarily be hypertext links.
> >>
> >> How would I approach a problem like this?
> >>
> >> Best Regards,
> >> -C.B.
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [EMAIL PROTECTED]
> >> For additional commands, e-mail: [EMAIL PROTECTED]
> >>
> >>
> >