>>if you search the archive for database you'll get a bunch of threads
This was a hybrid implementation I did which worked with HSQLDB and Derby:
http://www.mail-archive.com/java-user@lucene.apache.org/msg02953.html

Cheers
Mark

----- Original Message ----
From: Erick Erickson <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Sunday, 8 October, 2006 8:33:59 PM
Subject: Re: lucene link database

A quick word of caution about doc IDs. Lucene assigns a document ID at
index time, but that ID is *not* guaranteed to remain the same for a given
document. For instance, you index docs A, B, and C. They get Lucene IDs 1,
2, 3. Then you remove doc B and optimize the index. As I understand it, doc
C will get re-assigned ID 2, and ID 3 won't exist. In reality, I don't
think the algorithm is quite as simplistic as that, but that's the idea.
So be sure to assign your own unique identifiers and add them to your docs
as a field value.

Others on this list have talked about a hybrid solution. That is, have
*both* Lucene and a database, each doing what it does best. It's more
complicated, especially keeping the two in sync. Some tools have been
mentioned; I think if you search the archive for "database" you'll get a
bunch of threads. But I thought I'd mention it.

Best of luck
Erick

On 10/8/06, Cam Bazz <[EMAIL PROTECTED]> wrote:
>
> Dear Erick;
>
> Thank you for your detailed insight. I have been trying to code a graph
> object database for some time.
> I have prototyped on relational as well as object-oriented databases,
> including open-source and commercial implementations (so far, I have
> tried Hibernate, Objectivity/DB, and db4o). While object databases excel
> at traversing links, they are poor at searching.
>
> Lucene so far solves the problem of searching. I think of a document
> as a list of tuples (a sequence of fields), and I can do searches with
> Lucene; it is really nice.
>
> Now I have to solve the problem of linking.
> If I keep the nodes in a Lucene index, and I can fetch documents by a
> doc_id or some sort of surrogate identifier, and use those identifiers
> as node_id in an object graph, that will be what I want. But in order
> to do that I need to be able to query the Lucene index by document_id.
>
> I was referring to the link db of Nutch. They do have some sort of
> link db implementation that runs with Hadoop, but I have not understood
> the full code.
> I am trying to understand the structure of this link database. I was
> thinking of using documents with src and dst fields that have document
> IDs as values (one idea, I will try it tomorrow).
>
> Again, thanks a bunch.
>
> Best Regards,
> C.B.
>
> Erick Erickson wrote:
> > Approach it in whatever way you want as long as it solves your problem
> > <G>.
> >
> > My first question is why use Lucene? Would a database suit your needs
> > better? Of course, I can't say. Lucene shines at full-text searching, so
> > it's a closer call if you aren't searching on parts of text. By that I
> > mean that if you're not searching on *parts* of your links, you may
> > want to consider a DB solution.
> >
> > That said, and if I understand your requirement, you have a pretty
> > simple design. Each document has two fields, incominglinks and
> > outgoinglinks. But see the note below. Lucene indexes what you give it,
> > so the fact that some of the links aren't hypertext links is immaterial
> > to Lucene. Since you control both the indexer and searcher, these
> > conform to whatever your requirements are. It's up to you to map
> > semantics onto these entities.
> >
> > One common trap for DB-savvy people is that they think of documents as
> > entries in a table, all with the same fields. There is nothing
> > requiring you to have the *same* fields in each document in an index.
> > You could have an index in which no two documents shared *any* common
> > field if you choose.
> >
> > So, if you want to find out, say, which documents have link X as an
> > incoming link, just search on incominglinks:X. If you wanted to find
> > the documents with an outgoing link matching any of the incoming links
> > X, Y, Z, just search the OR of these in outgoinglinks.
> >
> > If you want some kind of map of the whole web of links, you'll have to
> > write some iterative loop and keep track. There's nothing built in that
> > I know of that lets you answer "Given link X, show me all the documents
> > no more than 3 hops away". Lucene is an *engine*, designed to have apps
> > built on top of it. Lucene doesn't deal with relations between
> > documents, just searching what you've indexed.
> >
> > It's easy enough to store a variable number of links in your
> > incominglinks or outgoinglinks field. Just be sure they're tokenized
> > appropriately. You can add them any way you choose: either concatenate
> > them all into a big string and index that, or index them into the same
> > field, e.g.
> > // Lucene 2.0-era API; each add() appends another value to the field
> > Document doc = new Document();
> > doc.add(new Field("incoming", "link1", Field.Store.YES, Field.Index.TOKENIZED));
> > doc.add(new Field("incoming", "link2", Field.Store.YES, Field.Index.TOKENIZED));
> > .
> > .
> > .
> > writer.addDocument(doc);
> >
> > According to a discussion from a while ago, this is the same as
> > doc.add(new Field("incoming", "link1 link2", Field.Store.YES, Field.Index.TOKENIZED));
> > in terms of how it all gets handled internally.
> >
> >
> > NOTE: I'm skipping most of the question of which Analyzer you use.
> > This will almost surely trip you up sometime. I'd suggest starting with
> > WhitespaceAnalyzer as that's more intuitive. Some of the other analyzers
> > will break your links up in ways you don't expect. Really, really,
> > really get a copy of Luke to see what's actually *in* your index and
> > how searches work, and how the analyzer you choose changes what's
> > searched for as well as what's indexed. Google "lucene luke" and you'll
> > find it.
> >
> > Anyway, hope this all helps.
> > Erick
> >
> > On 10/8/06, Cam Bazz <[EMAIL PROTECTED]> wrote:
> >>
> >> Hello,
> >>
> >> I would like to make a link database using Lucene, similar to the one
> >> that Nutch uses. I have read the basic documentation and understood how
> >> document indexing, search, and scoring work. But what I would like is
> >> for different documents to have different kinds of links (semantic
> >> links) to each other. I would like to be able to search the database
> >> with queries like incominglinksofdocument(id) and
> >> outgoinglinksofdocument(id). The links I am talking about might not
> >> necessarily be hypertext links.
> >>
> >> How would I approach a problem like this?
> >>
> >> Best Regards,
> >> -C.B.
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [EMAIL PROTECTED]
> >> For additional commands, e-mail: [EMAIL PROTECTED]
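
For what it's worth, the "iterative loop" for the "no more than 3 hops
away" question discussed in this thread could look roughly like the sketch
below. It is only an illustration under some assumptions: the class and
method names (LinkHops, addLink, within) are made up for this example, the
in-memory map stands in for whatever lookup you would actually run against
the index (for instance a search on a src field), and the node IDs are
meant to be your own application-assigned identifiers rather than Lucene
doc IDs, since those can change.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: bounded breadth-first walk over src -> dst links. */
class LinkHops {
    // Stand-in for an index lookup (e.g. a search on src:<id>).
    // Keys and values are application-assigned IDs, not Lucene doc IDs.
    private final Map<String, List<String>> outgoing = new HashMap<>();

    /** Records a directed link from src to dst. */
    void addLink(String src, String dst) {
        outgoing.computeIfAbsent(src, k -> new ArrayList<>()).add(dst);
    }

    /** Returns every node reachable from start in at most maxHops links. */
    Set<String> within(String start, int maxHops) {
        Set<String> seen = new LinkedHashSet<>();
        seen.add(start);
        List<String> frontier = Collections.singletonList(start);
        for (int hop = 0; hop < maxHops && !frontier.isEmpty(); hop++) {
            List<String> next = new ArrayList<>();
            for (String node : frontier) {
                List<String> targets =
                        outgoing.getOrDefault(node, Collections.<String>emptyList());
                for (String dst : targets) {
                    // "Keep track": only expand nodes not visited before,
                    // so cycles in the link graph do not loop forever.
                    if (seen.add(dst)) {
                        next.add(dst);
                    }
                }
            }
            frontier = next;
        }
        seen.remove(start); // report neighbours only, not the start node
        return seen;
    }
}
```

With links A->B, B->C, C->D, D->E, calling within("A", 3) would collect B,
C and D but not E. Against a real index, each getOrDefault call would be
replaced by one search per node, so the cost grows with the size of the
frontier at each hop.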