Re: Ivy Indexer

Nicolas Lalevée Wed, 18 Nov 2009 10:05:15 -0800

On Tuesday 17 November 2009 16:55:21 Jon Schneider wrote:
> > When you say "anywhere you choose", is it limited to a location on the
> > filesystem? Or do you intend to make use of ivy repositories
> > access/publish mechanism to store the index remotely? With filesystem
> > only the usage sounds rather limited. With ivy repository mechanism you
> > can store your index on the same kind of store as where you put your
> > modules, but you will need a more advanced syntax to configure it, and
> > more advanced implementation.
>
> Right now it is limited to locations on the filesystem.  I agree, the
> repository mechanism would be more flexible, but I do need to evaluate the
> performance of storing/reading the index across different storage mediums.
>
> > So you will have to deal with index locking during updates, which may
> > become a contention point, and be difficult to implement if you want to
> > allow using any repo to store the index.
>
> Thanks for bringing this point up.  Lucene offers a write lock contention
> mechanism, but I do need to tread carefully here.
>
> > If the index grows, accessing the index from a remote box may become
> > long. If you think big, you will have to find a way to transfer index
> > updates to the clients which is optimizing the network, such as
> > transferring diffs or something similar. But this becomes difficult to
> > implement, unless you want to rely on existing technology for that (such
> > as a SCM).
>
> I am having trouble trying to manufacture a scalability problem here (with
> my unscientific approach).  I am up to 1,149 jars containing class types
> with over 28,700 types in my test repository and the index is at 39 mb.


Did you try compressing it? I would expect the index to be transferred
compressed over the network.

>  I've pushed the index out on a remote filesystem, and the quick search
> opens the index in 219 ms.  After the index reader is opened, subsequent
> searches return in the microsecond range until the reader becomes stale
> from a commit and is reopened.
>
> Interesting point about the growth of the index based on the topology of
> the repository:  modules with hundreds or thousands of revisions (e.g.
> nightly builds) do not add much bulk to the index because there is so much
> overlap in type names across the builds.  The duplicate type names get
> optimized down.
>
> > The last time I worked with Lucene we implemented such a diff and publish
> > mechanism for Lucene indexes, and it was working quite well. Solr does
> > have a mechanism for such things too, but the last time I checked it was
> > just relying on rsync. If somebody is interested I can take some time to
> > explain it here.
>
> Not totally convinced that a scalability problem is out of the question,
> I'm interested in what you have to offer on this point, Nicolas.

We used a Lucene feature that merges two indexes by adding every Lucene
document from one index into another [1].
The issue is that Lucene has no notion of replacing a document. So for us an
index update consisted of both a Lucene index holding the newly indexed data
and a list of the ids of the documents to delete or update. Applying an index
update to a "full" index then means deleting the listed documents and merging
the update index into the full index.
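
To make the delete-then-merge step concrete, here is a minimal sketch in plain
Python. It does not use the Lucene API: an index is modelled as a dict from
document id to document, and the names IndexUpdate and apply_update, as well
as the org.example ids, are made up for the illustration.

```python
from dataclasses import dataclass, field


@dataclass
class IndexUpdate:
    """An index update: newly indexed documents plus the ids they supersede."""
    new_docs: dict = field(default_factory=dict)   # id -> document
    stale_ids: set = field(default_factory=set)    # ids to delete or update


def apply_update(full_index: dict, update: IndexUpdate) -> dict:
    """Delete the listed documents, then merge the update into the full index."""
    kept = {doc_id: doc for doc_id, doc in full_index.items()
            if doc_id not in update.stale_ids}
    kept.update(update.new_docs)
    return kept


# Hypothetical data: ids follow the org#module;revision form.
full = {
    "org.example#core;1.0": ["org.example.Core"],
    "org.example#util;1.0": ["org.example.Util"],
}
update = IndexUpdate(
    new_docs={
        "org.example#core;1.0": ["org.example.Core", "org.example.Helper"],
        "org.example#core;1.1": ["org.example.Core"],
    },
    stale_ids={"org.example#core;1.0", "org.example#util;1.0"},
)
merged = apply_update(full, update)
```

Here core;1.0 is re-indexed, core;1.1 is new and util;1.0 is dropped. The
deletion has to come first, otherwise the merge would leave two documents
with the same id.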

Note that this only works if a Lucene document can be uniquely identified. I
think this fits the Ivy use case, as the unique id would be
org#module;revision.

To track the version of the index, Lucene itself provides a version number
[2]. I don't remember well whether we can safely rely on it. I think we did,
but it might only work if the exact same version of Lucene is used
everywhere, so that the segment merging algorithm is the same. At least the
Lucene API doesn't guarantee that merging two indexes produces the same
version; it only guarantees that the version increases on each "commit".

In our use case there was one indexer and several search slaves. The indexer
was responsible for publishing a full index and a set of index updates, so
that a search slave starting empty just gets the full index. Over time, a
slave asks the indexer only for the updates that follow its own version. A
slave that is a little late gets several updates to apply, and one that is
too far behind gets a full index instead, as the indexer only keeps a finite
set of index updates.
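
The protocol above can be sketched roughly as follows. This is a toy model in
plain Python, not what we actually ran and not Lucene code: the classes
Indexer and Searcher, the integer version counter and the max_updates
parameter are all assumptions made for the illustration.

```python
from collections import deque


class Indexer:
    """Publishes a full index plus a bounded history of index updates."""

    def __init__(self, max_updates: int = 3):
        self.version = 0
        self.full_index = {}
        # Each entry is (version_after_update, new_docs, stale_ids).
        self.updates = deque(maxlen=max_updates)

    def publish(self, new_docs: dict, stale_ids=()):
        self.version += 1
        stale = set(stale_ids) | set(new_docs)   # re-indexed ids are replaced
        for doc_id in stale:
            self.full_index.pop(doc_id, None)
        self.full_index.update(new_docs)
        self.updates.append((self.version, dict(new_docs), stale))

    def fetch(self, client_version: int):
        """Return ("updates", list) if the history covers the gap, else ("full", index)."""
        if client_version == self.version:
            return "updates", []
        if self.updates and client_version >= self.updates[0][0] - 1:
            return "updates", [u for u in self.updates if u[0] > client_version]
        return "full", dict(self.full_index)


class Searcher:
    """A search node that keeps a local copy of the index."""

    def __init__(self):
        self.version = 0
        self.index = {}

    def sync(self, indexer: Indexer) -> str:
        kind, payload = indexer.fetch(self.version)
        if kind == "full":
            self.index = payload
        else:
            for _, new_docs, stale in payload:
                for doc_id in stale:
                    self.index.pop(doc_id, None)
                self.index.update(new_docs)
        self.version = indexer.version
        return kind


indexer = Indexer(max_updates=2)
for rev in (1, 2, 3):
    indexer.publish({f"org#mod;{rev}": f"rev {rev}"})

searcher = Searcher()
first = searcher.sync(indexer)     # too far behind the kept updates: full index
indexer.publish({"org#mod;4": "rev 4"})
second = searcher.sync(indexer)    # only one version behind: a single update
```

The bounded deque plays the role of the finite set of updates: once a
searcher's version falls out of that window, the only safe answer is the full
index.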

That scenario corresponds quite well to one of the cases described, where an
Ivy repository is managed on the server side.

Managing it from the client side may be more complex, as Lucene supports
only one writer at a time. But we can imagine that each publish would also
"publish" an "index update". The complexity then shifts to periodically
purging old updates and building a full index: which client would be
"elected" to do it? And how do we deal with simultaneous publications?

A few words on Solr's [3] index replication mechanism. As I wrote previously,
the files are transported by rsync. This is actually quite smart, knowing how
the Lucene indexer works with files.
First, it never modifies a file and never appends data to one. Once written,
a file doesn't change (see the API it relies on [4]). So a diff between two
versions of an index is some deleted files and some new files.
Second, when we index new data into an already filled index, Lucene doesn't
modify any file, so it creates an internal "segment" containing the newly
indexed data. Opening a new IndexReader on the new version of the index then
takes the added segment into account. As we index data, there are more and
more segments. To avoid having too many files, Lucene sometimes decides to
merge several small segments into a bigger one [5]. So we can say that a
Lucene index is composed of big old files and small new ones.
So this is quite perfect for rsync. The more often you rsync, the less you
have to transport on each run.
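
Because files are written once and never changed, comparing two versions of
an index directory comes down to comparing file names. A minimal sketch of
that idea in plain Python, with a hypothetical plan_sync helper and
illustrative file names (this is neither rsync nor Solr code):

```python
def plan_sync(local_files: set, remote_files: set):
    """Files are write-once, so comparing names is enough: no binary diff."""
    to_fetch = sorted(remote_files - local_files)    # new segments to download
    to_delete = sorted(local_files - remote_files)   # segments merged away
    return to_fetch, to_delete


# Illustrative names: _0 is a big old segment, _1 and _2 were merged into _3.
local = {"segments_2", "_0.cfs", "_1.cfs", "_2.cfs"}
remote = {"segments_3", "_0.cfs", "_3.cfs"}
to_fetch, to_delete = plan_sync(local, remote)
```

The big old file _0.cfs is never transferred again, only the small new files
are, which is why frequent syncs stay cheap.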

We didn't like relying on a platform-dependent tool, and we liked the idea
that an index update is just a zip of files (we actually had some other data
to update, so one file for everything). What we implemented is actually quite
similar to how Lucene works with its internal "segments": each update
contains just the newly indexed data, the oldest update being the full index
itself.

I don't think Ivy should rely on rsync either, but we could probably use the
same kind of mechanism rsync uses. It would be quite easy to implement, as
there would be no binary diff. It doesn't solve the critical case of
simultaneous publications though.

I am starting to have ideas, but I think that this mail is already too long, 
let's take a breath :)

Nicolas

[1] http://lucene.apache.org/java/2_9_1/api/all/org/apache/lucene/index/IndexWriter.html#addIndexes%28org.apache.lucene.index.IndexReader[]%29
[2] http://lucene.apache.org/java/2_9_1/api/all/org/apache/lucene/index/IndexReader.html#getVersion%28%29
[3] http://lucene.apache.org/solr/
[4] http://lucene.apache.org/java/2_9_1/api/all/org/apache/lucene/store/Directory.html
[5] http://lucene.apache.org/java/2_9_1/api/all/org/apache/lucene/index/MergePolicy.html

