On Tuesday 17 November 2009 16:55:21 Jon Schneider wrote:

> > When you say "anywhere you choose", is it limited to a location on the
> > filesystem? Or do you intend to make use of Ivy's repository
> > access/publish mechanism to store the index remotely? With the
> > filesystem only, the usage sounds rather limited. With the Ivy
> > repository mechanism you can store your index on the same kind of
> > store as where you put your modules, but you will need a more advanced
> > syntax to configure it, and a more advanced implementation.
>
> Right now it is limited to locations on the filesystem. I agree, the
> repository mechanism would be more flexible, but I do need to evaluate
> the performance of storing/reading the index across different storage
> mediums.
>
> > So you will have to deal with index locking during updates, which may
> > become a contention point, and be difficult to implement if you want
> > to allow using any repo to store the index.
>
> Thanks for bringing this point up. Lucene offers a write-lock contention
> mechanism, but I do need to tread carefully here.
>
> > If the index grows, accessing the index from a remote box may become
> > slow. If you think big, you will have to find a way to transfer index
> > updates to the clients which optimizes network usage, such as
> > transferring diffs or something similar. But this becomes difficult to
> > implement, unless you want to rely on existing technology for that
> > (such as an SCM).
>
> I am having trouble trying to manufacture a scalability problem here
> (with my unscientific approach). I am up to 1,149 jars containing class
> types, with over 28,700 types in my test repository, and the index is
> at 39 MB.
Did you try to compress it? I would expect the index to be transferred
compressed over the network.

> I've pushed the index out on a remote filesystem, and the quick search
> opens the index in 219 ms. After the index reader is opened, subsequent
> searches return in the microsecond range until the reader becomes stale
> from a commit and is reopened.
>
> Interesting point about the growth of the index based on the topology
> of the repository: modules with hundreds or thousands of revisions
> (e.g. nightly builds) do not add much bulk to the index because there
> is so much overlap in type names across the builds. The duplicate type
> names get optimized down.
>
> > The last time I worked with Lucene we implemented such a diff and
> > publish mechanism for Lucene indexes, and it was working quite well.
> > Solr does have a mechanism for such things too, but the last time I
> > checked it was just relying on rsync. If somebody is interested I can
> > take some time to explain it here.
>
> Not totally convinced that a scalability problem is out of the question,
> I'm interested in what you have to offer on this point, Nicolas.

We used a feature of Lucene which allows merging two indexes, adding
every Lucene document from one index into another [1]. The issue here is
that Lucene has no notion of replacing a document. So a Lucene index
update was both a Lucene index holding the newly indexed data and a list
of the ids of the documents to delete or update. Applying an index
update on a "full" index then means deleting the specified list of
documents and merging the index update into the full index.

Note that this only works if a Lucene document can be uniquely
identified. For the Ivy use case I think this fits, as the unique id
would be org#module;revision.

To track the version of the index, Lucene itself provides a version
number [2]. I don't remember exactly whether we could rely on it safely.
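To make the delete-then-merge idea concrete, here is a minimal toy model
in plain Java. It only simulates the scheme with a map keyed by the
unique id; in real Lucene the merge step would be done with
IndexWriter.addIndexes(...) [1] and the deletes with deleteDocuments on
an id term. All class and method names here are made up for
illustration, not part of any Ivy or Lucene API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/**
 * Toy model of the delete-then-merge update scheme: an "index" is just
 * a map from a unique id (e.g. "org#module;revision") to its indexed
 * data. Illustrative only; a real implementation would use Lucene.
 */
public class IndexUpdateDemo {

    /** An index update: the newly indexed documents plus ids to delete. */
    static final class IndexUpdate {
        final Map<String, String> newDocs;
        final Set<String> idsToDelete;
        IndexUpdate(Map<String, String> newDocs, Set<String> idsToDelete) {
            this.newDocs = newDocs;
            this.idsToDelete = idsToDelete;
        }
    }

    /** Apply an update: first delete the listed ids, then merge new docs. */
    static void apply(Map<String, String> fullIndex, IndexUpdate update) {
        fullIndex.keySet().removeAll(update.idsToDelete);
        fullIndex.putAll(update.newDocs);
    }

    public static void main(String[] args) {
        Map<String, String> full = new HashMap<>();
        full.put("org#mod;1.0", "old data");
        full.put("org#mod;1.1", "data v1.1");

        // Re-publish 1.0 (delete + re-add, since Lucene cannot replace
        // in place) and publish a brand new 2.0.
        Map<String, String> fresh = new HashMap<>();
        fresh.put("org#mod;1.0", "reindexed data");
        fresh.put("org#mod;2.0", "data v2.0");
        apply(full, new IndexUpdate(fresh, Set.of("org#mod;1.0")));

        System.out.println(full.size());             // 3
        System.out.println(full.get("org#mod;1.0")); // reindexed data
    }
}
```

The key point the toy captures is that an "update" must carry the delete
list alongside the new data, because a plain merge would leave stale
duplicates behind.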
I think we did, but it might work only if the exact same version of
Lucene is used everywhere, so that the segment merging algorithm is the
same. At least the Lucene API doesn't guarantee that a merge of two
indexes produces the same version; the API only guarantees that the
version will increase on each "commit".

In our use case we had one indexer and several search slaves. The
indexer was responsible for publishing a full index and a set of index
updates, so that a search slave starting empty would just get the full
index. Over time, a slave asks the indexer only for the updates
corresponding to its version. So there may be situations where the slave
is a little late and will get several updates to apply, and sometimes it
is so late that it will get a full index, as the indexer maintains only
a finite set of index updates.

That scenario corresponds quite well to one of the cases described,
where an Ivy repository is managed on the server side. Managing it from
the client side may be more complex, as Lucene supports only one writer
at a time. But we can imagine that each time there is a publish, there
would also be a "publication" of an "index update". The complexity then
moves to the periodic purge of old updates and the build of a full
index: which client would be "elected" to do it? And how do we deal with
simultaneous publications?

A few words on Solr's [3] index replication mechanism. As written
previously, the transport of the files is done by rsync. This is
actually quite smart, knowing how the Lucene indexer works with files.
First, it never modifies a file and never appends data to one: once
written, a file doesn't change (see the API it is relying on [4]). So a
diff between two versions of an index is just some deleted files and
some new files. Secondly, when we index new data on an already filled
index, since Lucene doesn't modify any file, it actually creates an
internal "segment" containing the newly indexed data.
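Because of that write-once property, computing the delta between two
index versions needs no binary diffing at all: it is just two set
differences over file names. A small sketch in plain Java; the file
names are made up to resemble Lucene segment files, but this is not
Lucene code.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Sketch of the rsync-style delta described above, relying on the
 * assumption (true for Lucene's Directory) that index files are
 * write-once: a file with a given name never changes, so a
 * version-to-version delta is fully described by two file-name sets.
 */
public class SegmentDiff {

    /** Files present in the old snapshot but gone in the new one. */
    static Set<String> toDelete(Set<String> oldFiles, Set<String> newFiles) {
        Set<String> d = new HashSet<>(oldFiles);
        d.removeAll(newFiles);
        return d;
    }

    /** Files the client must fetch: present only in the new snapshot. */
    static Set<String> toFetch(Set<String> oldFiles, Set<String> newFiles) {
        Set<String> f = new HashSet<>(newFiles);
        f.removeAll(oldFiles);
        return f;
    }

    public static void main(String[] args) {
        // Hypothetical scenario: three small segments were merged into a
        // bigger segment _3, and a new segment _4 holds fresh data.
        Set<String> v1 = Set.of("_0.cfs", "_1.cfs", "_2.cfs", "segments_2");
        Set<String> v2 = Set.of("_3.cfs", "_4.cfs", "segments_3");

        System.out.println(toDelete(v1, v2)); // the obsolete small segments
        System.out.println(toFetch(v1, v2));  // the merged + new segments
    }
}
```

A client at version v1 deletes `toDelete` and downloads `toFetch`; the
big old segments it already has are never re-transferred, which is the
"the more you sync, the less you transport" effect.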
Opening a new IndexReader on the new version of the index then takes the
added segment into consideration. As we index data, there are more and
more segments. To avoid having too many files, Lucene sometimes decides
to merge several little segments into a bigger one [5]. So we can say
that a Lucene index is composed of big old files and little new ones,
which is quite perfect for rsync: the more often you rsync, the less you
have to transport on each run.

We didn't like relying on a platform-dependent tool, and we liked the
idea that an index update is just a zip of files (we actually had some
other data to update, so one file for everything). What we implemented
is actually quite similar to how Lucene works with its internal
"segments": an update contains just the newly indexed data, the oldest
update being the full index itself.

I don't think Ivy should rely on rsync either, but we could probably use
the same kind of mechanism rsync uses. It would be quite easy to
implement, as there would be no binary diff. It doesn't solve the
critical case where there is a simultaneous publication, though.

I am starting to have ideas, but I think that this mail is already too
long. Let's take a breath :)

Nicolas

[1] http://lucene.apache.org/java/2_9_1/api/all/org/apache/lucene/index/IndexWriter.html#addIndexes%28org.apache.lucene.index.IndexReader[]%29
[2] http://lucene.apache.org/java/2_9_1/api/all/org/apache/lucene/index/IndexReader.html#getVersion%28%29
[3] http://lucene.apache.org/solr/
[4] http://lucene.apache.org/java/2_9_1/api/all/org/apache/lucene/store/Directory.html
[5] http://lucene.apache.org/java/2_9_1/api/all/org/apache/lucene/index/MergePolicy.html

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@ant.apache.org
For additional commands, e-mail: dev-h...@ant.apache.org