> When you say "anywhere you choose", is it limited to a location on the
> filesystem? Or do you intend to make use of ivy repositories access/publish
> mechanism to store the index remotely? With filesystem only the usage
> sounds rather limited. With ivy repository mechanism you can store your
> index on the same kind of store as where you put your modules, but you will
> need a more advanced syntax to configure it, and more advanced
> implementation.

Right now it is limited to locations on the filesystem. I agree that the repository mechanism would be more flexible, but I need to evaluate the performance of storing and reading the index across different storage media.

> So you will have to deal with index locking during updates, which may
> become a contention point, and be difficult to implement if you want to
> allow using any repo to store the index.

Thanks for bringing this point up. Lucene offers a write-lock mechanism for handling contention, but I do need to tread carefully here.

> If the index grows, accessing the index from a remote box may become long.
> If you think big, you will have to find a way to transfer index updates to
> the clients which is optimizing the network, such as transferring diffs or
> something similar. But this becomes difficult to implement, unless you want
> to rely on existing technology for that (such as a SCM).

I am having trouble trying to manufacture a scalability problem here (with my unscientific approach). My test repository is up to 1,149 jars containing over 28,700 class types, and the index is at 39 MB. I've pushed the index out to a remote filesystem, and the quick search opens the index in 219 ms. After the index reader is opened, subsequent searches return in the microsecond range until the reader becomes stale from a commit and is reopened.

An interesting point about how the index grows with the topology of the repository: modules with hundreds or thousands of revisions (e.g. nightly builds) do not add much bulk to the index, because there is so much overlap in type names across the builds. The duplicate type names get optimized down.

> The last time I worked with Lucene we implemented such a diff and publish
> mechanism for Lucene indexes, and it was working quite well. Solr does have
> a mechanism for such things too, but the last time I checked it was just
> relying on rsync. If somebody is interested I can take some time to explain
> it here.

I'm not totally convinced that a scalability problem is out of the question, so I'm interested in what you have to offer on this point, Nicolas.
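
To illustrate why heavily overlapping revisions add so little bulk, here is a rough sketch of a toy inverted index. It is not how Lucene or my indexer is actually implemented; the `build_index` helper, the `org.example.Type*` names and the `nightly-*` revision labels are all made up for the example. The assumption is simply that each distinct type name is stored once as a term, and every additional revision only contributes one more entry to that term's posting list.

```python
from collections import defaultdict


def build_index(revisions):
    """Toy inverted index: map each type name to the revisions containing it."""
    index = defaultdict(set)
    for revision, type_names in revisions.items():
        for name in type_names:
            # The type name is stored once as a key; each revision only
            # adds a small posting entry under that key.
            index[name].add(revision)
    return index


# Hypothetical module: 1,000 nightly builds that all ship the same 50 types.
types = [f"org.example.Type{i}" for i in range(50)]
revisions = {f"nightly-{n}": types for n in range(1000)}

index = build_index(revisions)
unique_terms = len(index)
total_postings = sum(len(postings) for postings in index.values())

print(unique_terms)    # 50
print(total_postings)  # 50000
```

The number of stored type names stays at 50 no matter how many nightlies are published. The postings do still grow with every revision, so this only suggests why the growth is slow, not that it is free.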