Re: lucene as a graph store

Cam Bazz Wed, 16 Jan 2008 00:07:05 -0800

Hello,

Not exactly, a document represents an edge, having src and dst node its.
Nodes can be kept in another index or the same one. I can find number of
edges by running a boolean term query.
Currently I am looking for a way to distribute indexes, but in such a way
that when querying you know which data store to look at.


I am testing the % operator on edges.

Consider the following.

edge A: 001 to 002
edge B: 002 to 005

lets say I am using %2. (I have two indices)

edge A is stored in index 1 (001%2=1) as well as index 0 (002%2=0)
edge B is stored in index 0 and 1.

in this scheme, when we are looking for an edge by its source node, we know
which index to look at by %. (the data is replicated - pointless when using
2 indexes, but for 4 it would work nicely)

so instead of running the same query on all of the boxes, we know which
boxes to run by the characteristics of the node.

Any ideas?

Best,
-C.B.





On Jan 15, 2008 6:54 PM, Otis Gospodnetic <[EMAIL PROTECTED]>
wrote:

> Aha, I see, a Document represents a node and all other nodes connected to
> it.  So you can really only find 2 nodes connected with an edge with a
> single query, but not, say, the number of edges (degrees?) between any 2
> nodes?
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
> From: Cam Bazz <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Tuesday, January 15, 2008 11:34:20 AM
> Subject: Re: lucene as a graph store
>
> well lets say I have a list representation of a graph like
>
> src:1 dst:2
> src:2 dst:3
> src:1 dst 3
>
> outgoingEdgesOf(1) returns 2 and 3.
> incomingEdgesOf(3) returns 1 and 2.
>
> in a lucene index it does work out nice with term queries. I can search
>  for
> incoming outgoing or edgeExist with a boolean term query.
>
> Best.
>
>
> On Jan 15, 2008 6:22 PM, Otis Gospodnetic <[EMAIL PROTECTED]>
> wrote:
>
> > Hi,
> >
> >
> > ----- Original Message ----
> > From: Cam Bazz <[EMAIL PROTECTED]>
> > To: java-user@lucene.apache.org
> > Sent: Tuesday, January 15, 2008 8:50:07 AM
> > Subject: Re: lucene as a graph store
> >
> > Usually for implementing things like page rank, or doing centrality
> >  metric
> > calculations or maybe dijkstras shortest term, this kind of (list of
> >  edges)
> > graph is not best at performance.
> > I like to use lucene for simple operations like neighboors of this
> >  node, or
> > 2 degree neighboors of this node.
> >
> > OG: Neighbours of a node?  Can you tell us more about what you use
>  and how
> > you use Lucene in this context?
> >
> > is updating an index costly operation in lucene? I dont think there
> >  will too
> > much updates, but rather deletes.
> >
> > when I delete documents from an index what things should I be careful
> >  about.
> >
> > OG: Note that deletes are not removed from the index/disk
>  immediately.
> >  Rather, their doc IDs are stored and those docs are skipped during
> > searches.  If you have a large number of deletes, you may want to
>  optimize
> > the index, so that this list of doc IDs to skip doesn't cost you.
> >
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> >
> >
> > On Jan 15, 2008 3:31 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> >
> > > I guess the question comes down to what kind of things are you
>  going
> > > to do w/ this graph?  How often are you updating links, etc?  I
>  can't
> > > say Lucene was designed for this kind of thing, but I am constantly
> > > amazed at what people use Lucene for, so I won't say it can't be
> > > done.  I don't know how efficient it would be for doing things like
> > > PageRank or other graph algorithms, but I would be interesting in
> > > hearing more about what you have in mind.
> > >
> > > Lucene doesn't do much caching at the document level, but that is
> > > fairly easy to implement, but it does bring some terms, etc. into
> > > memory, and you may have a look at the FieldCache.
> > >
> > > -Grant
> > >
> > > On Jan 15, 2008, at 7:17 AM, Cam Bazz wrote:
> > >
> > > > Hello;
> > > >
> > > > I like to use lucene as a graph store. The graph representation
>  is
> >  a
> > > > list of
> > > > edges. Consider the code below:
> > > >
> > > >        final int commitCount = 16 * 1024;
> > > >        final int numObj =  1024 * 1024;
> > > >
> > > >        Analyzer analyzer = new KeywordAnalyzer();
> > > >        FSDirectory directory = FSDirectory.getDirectory("c:\
> > > > \LuceneAdd");
> > > >        IndexWriter writer = new IndexWriter(directory, analyzer,
> > > > true);
> > > >
> > > >        Document doc;
> > > >        long start = System.currentTimeMillis();
> > > >
> > > >        Random r = new Random(System.currentTimeMillis());
> > > >
> > > >        for(int i=0; i<numObj; i++) {
> > > >            doc = new Document();
> > > >            doc.add(new Field("srcKey",
>  NumberTools.longToString(i),
> > > > Field.Store.YES, Field.Index.UN_TOKENIZED));
> > > >            doc.add(new Field("dstKey",
> > > > NumberTools.longToString(r.nextInt(numObj)),
> > > > Field.Store.YES, Field.Index.UN_TOKENIZED));
> > > >            doc.add(new Field("linkKey",
> > > > NumberTools.longToString(r.nextInt(16)),
> > > > Field.Store.YES, Field.Index.UN_TOKENIZED));
> > > >            doc.add(new Field("linkValue",
>  NumberTools.longToString(
> > > > r.nextInt(256)), Field.Store.YES, Field.Index.UN_TOKENIZED));
> > > >
> > > >           writer.addDocument(doc);
> > > >
> > > >            if(i%commitCount==0) {
> > > >                 long now = System.currentTimeMillis();
> > > >                 System.out.println(i + ":" + (now-start));
> > > >                 start = now;
> > > >            }
> > > >
> > > >        }
> > > >
> > > >        writer.optimize();
> > > >        writer.close();
> > > >        directory.close();
> > > >
> > > >
> > > > Basically I am adding a large number of documents from srcKey = i
> >  to
> > > > dstKey
> > > > = random and two other string fields - linkKey and linkValue.
> > > >
> > > > Compared to a normal database store, or an oodbms such as perst
>  or
> > > > db4o -
> > > > lucene takes longer to index.
> > > > However, it is much faster in searching, finding, retrieving
> >  records.
> > > >
> > > > I can make 16384 random lookups over 1Million entries in 0.8
> > > > seconds. This
> > > > is excellent time. (I have been benchmarking for a long time)
> > > >
> > > > Typically, when number of objects in BTree based structure in an
> > > > oodbms for
> > > > example increase, the search and add times also increase.
> > > >
> > > > Will lucene have the same problem and how can I overcome it if it
> > > > does.
> > > > Looking at the above code - does anyone has any recomendations
> > > > to improve index performance. (also what can I do to improve
>  search
> > > > performance)
> > > >
> > > > While searching with an indexsearcher - does lucene do any
>  caching?
> > > > usually
> > > > MRU caches are used to accomplish this.
> > > >
> > > > Any ideas,help,recomendations greatly appreciated.
> > > >
> > > > Best,
> > > > -C.B.
> > >
> > > --------------------------
> > > Grant Ingersoll
> > > http://lucene.grantingersoll.com
> > > http://www.lucenebootcamp.com
> > >
> > > Lucene Helpful Hints:
> > > http://wiki.apache.org/lucene-java/BasicsOfPerformance
> > > http://wiki.apache.org/lucene-java/LuceneFAQ
> > >
> > >
> > >
> > >
> > >
> > >
>  ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > For additional commands, e-mail: [EMAIL PROTECTED]
> > >
> > >
> >
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>

Re: lucene as a graph store

Reply via email to