A quick word of caution about doc IDs. Lucene assigns a document id at index time, but that ID is *not* guaranteed to remain the same for a given document. For instance... you index docs A, B, and C. They get Lucene IDs 1, 2, 3. Then you remove doc B and optimize the index. As I understand it, doc C will get re-assigned ID 2, and ID 3 won't exist.
In reality, I don't think that the algorithm is quite as simplistic as that, but that's the idea. So be sure to assign your own unique identifiers that you add to your docs as a field value. Others on this list have talked abouta hybrid solution. That is, have *both* lucene and a database, each doing what they do best. It's more complicated, especially keeping the two in synch. some tools have been mentioned, I think if you search the archive for database you'll bet a bunch of threads. But I thought I'd mention it.. Best of luck Erick On 10/8/06, Cam Bazz <[EMAIL PROTECTED]> wrote:
Dear Erick; Thank you for your detailed insight. I have been trying to code a graph object database for sometime. I have prototyped on relational as well as object oriented databases, including opensource and commercial implementations. (so far, I have tried hibernate, objectivity/db, db4o) while object databases excel in traversing links, they are poor when searching. lucene so far solves the problem of solving. I am thinking of a document as a list of tuples. (sequence of fields) and I can do searches with lucene, it is really nice. now I have to solve the problem of linking. if I keep the nodes with a lucene index, and I can fetch documents with a doc_id, or some sort of surrogate identifier, and use those identifiers as node_id in an object graph, that will be what I want. but in order to do that I need to be able to query the lucene index by document_id. I was referring to the link db of the nutch. They do have some sort of link db implementation, that runs with hadoop, but I have not understood the full code. I am trying to understand the structure of this link database. I was thinking of using documents with src and dst fields, that have document id's as values. (one idea, I will try it tomorrow) Again thanks a bunch. Best Regards, C.B. Erick Erickson wrote: > Aproach it in whatever way you want as long as it solves your problem > <G>. > > My first question is why use lucene? Would a database suit your needs > better? Of course, I can't say. Lucene shines at full-text searching, so > it's a closer call if you aren't searching on parts of text. By that I > mean > that if you're not searching on *parts* of your links, you may want to > consider a DB solution. > > That said, and if I understand your requirement, you have a pretty simple > design. Each document has two fields, incominglinks and outgoing > links. But > see the note below. Lucene indexes what you give it, so the fact that > some > of the links aren't hypertext links is immaterial to Lucene. Since you > control both the indexer and searcher, these confrom to whatever your > requirements are. It's up to you to map semantics onto these entities. > > One common trap DB-savvy people have is that they think of documents as > entries in a table, all with the same fields. There is nothing > requiring you > to have the *same* fields in each document in an index. You could have an > index for which no two documents shared *any* common field if you choose. > > So, if you want to find out what, say, which documents have link X as an > incoming link, just search on incominglinks:X. If you wanted to find the > documents that had any incoming links X, Y, Z that matched an outgoing > link > in another document, just search the OR of these in outgoinglinks. > > If you want some kind of map of the whole web of links, you'll have to > write > some iterative loop and keep track. There's nothing built in that I > know of > that lets you answer "Given link X, show me all the documents no more > than 3 > hops away". Lucene is an *engine*, designed to have apps built on top > of it. > Lucene doesn't deal with relations between documents, just searching what > you've indexed. > > It's easy enough to store a variable number of links in your > incominglinks > or outgoinglinks field. Just be sure they're tokenized appropriately. You > can add them any way you choose, either concatenate them all into a big > string and index that, or index them into the same field, e.g. > Document doc = new Document(); > doc.add("incoming", "link1"); > doc.add("incoming", "link2"); > . > . > . > writer.add(doc); > > According to a discussion from a while ago, this is the same as > doc.add("incoming", "link1 link2"); > in terms of how it all gets handled internally. > > > NOTE: I'm skipping most of the question of which Analyzer you use. > This will > almost surely trip you up sometime. I'd suggest starting with > WhitespaceAnalyzer as that's more intuitive. Some of the other analyzers > will break your links up in ways you don't expect. Really, really, really > get a copy of Luke to see what's actually *in* your index and how > searches > work. And how the analyzer you choose changes what's searched for, as > well > as what's indexec. Google lucene luke and you'll find it. > > Anyway, hope this all helps. > Erick > > On 10/8/06, Cam Bazz <[EMAIL PROTECTED]> wrote: >> >> Hello, >> >> I would like to make a link database using lucene. Similar to one that >> nutch uses. I have read the basic documentation and understood how >> document indexing, search, and scoring works. But what I like is >> different documents having different kind of links (semantic links) to >> each other. I would like to be able to search in the database like >> incominglinksofdocument(id), outgoinglinksofdocument(id). the links I am >> talking about, might not necessarily be hypertext links. >> >> How would I approach to a problem like this? >> >> Best Regards, >> -C.B. >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: [EMAIL PROTECTED] >> For additional commands, e-mail: [EMAIL PROTECTED] >> >> > --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]