> When you say "anywhere you choose", is it limited to a location on the
> filesystem? Or do you intend to make use of ivy repositories access/publish
> mechanism to store the index remotely? With filesystem only the usage
> sounds rather limited. With ivy repository mechanism you can store your
> index on the same kind of store as where you put your modules, but you will
> need a more advanced syntax to configure it, and more advanced
> implementation.

Right now it is limited to locations on the filesystem.  I agree, the
repository mechanism would be more flexible, but I first need to evaluate
the performance of storing/reading the index across different storage media.

> So you will have to deal with index locking during updates, which may
> become a contention point, and be difficult to implement if you want to
> allow using any repo to store the index.

Thanks for bringing this point up.  Lucene offers a write lock contention
mechanism, but I do need to tread carefully here.
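
To make the locking concern concrete, here is a minimal sketch of a
single-writer lock built on plain java.nio file locks. It is not Lucene's
API: the class name IndexWriteLock and its methods are hypothetical, and only
the lock file name write.lock follows Lucene's convention. The in-JVM set is
there because OS-level file locks are held per process, so a second attempt
from the same JVM has to be refused before a second channel is opened on the
file. Note that locks of this kind are often unreliable on network
filesystems, which is exactly where a remotely stored index would live.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical single-writer lock for an index directory (illustration only). */
final class IndexWriteLock implements AutoCloseable {
    private static final String LOCK_NAME = "write.lock";
    // Lock files currently held by this JVM; OS file locks are per process.
    private static final Set<Path> HELD_IN_THIS_JVM = ConcurrentHashMap.newKeySet();

    private final Path lockFile;
    private final FileChannel channel;
    private final FileLock lock;

    private IndexWriteLock(Path lockFile, FileChannel channel, FileLock lock) {
        this.lockFile = lockFile;
        this.channel = channel;
        this.lock = lock;
    }

    /** Returns a held lock, or null if another writer already holds it. */
    static IndexWriteLock tryAcquire(Path indexDir) throws IOException {
        Path lockFile = indexDir.toAbsolutePath().normalize().resolve(LOCK_NAME);
        if (!HELD_IN_THIS_JVM.add(lockFile)) {
            return null; // already held by a writer in this JVM
        }
        boolean acquired = false;
        FileChannel channel = null;
        try {
            channel = FileChannel.open(lockFile,
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE);
            FileLock lock = channel.tryLock(); // null if another process holds it
            if (lock == null) {
                return null;
            }
            acquired = true;
            return new IndexWriteLock(lockFile, channel, lock);
        } finally {
            if (!acquired) {
                HELD_IN_THIS_JVM.remove(lockFile);
                if (channel != null) {
                    channel.close();
                }
            }
        }
    }

    /** Releases the OS lock, closes the channel, and frees the in-JVM entry. */
    @Override
    public void close() throws IOException {
        try (FileChannel c = channel) {
            lock.release();
        } finally {
            HELD_IN_THIS_JVM.remove(lockFile);
        }
    }
}
```

A writer would wrap its commit in try-with-resources around tryAcquire and
back off or retry when it gets null; how long to wait is the contention
policy that still needs to be decided.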

> If the index grows, accessing the index from a remote box may become long.
> If you think big, you will have to find a way to transfer index updates to
> the clients which is optimizing the network, such as transferring diffs or
> something similar. But this becomes difficult to implement, unless you want
> to rely on existing technology for that (such as a SCM).

I am having trouble trying to manufacture a scalability problem here (with
my unscientific approach).  I am up to 1,149 jars containing over 28,700
types in my test repository, and the index is at 39 MB.  I've pushed the
index out on a remote filesystem, and the quick search opens the index in
219 ms.  After the index reader is opened, subsequent searches return in the
microsecond range until the reader becomes stale from a commit and is
reopened.

Interesting point about the growth of the index based on the topology of the
repository:  modules with hundreds or thousands of revisions (e.g. nightly
builds) do not add much bulk to the index because there is so much overlap
in type names across the builds.  The duplicate type names get optimized
down.
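
The overlap effect can be illustrated with a toy inverted index. This is a
simplified model and not Lucene's actual on-disk format; the class
TypeIndexSketch and its methods are hypothetical. Each distinct type name is
stored once as a term, and every additional revision only appends a small
posting entry to the existing term.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy inverted index: type name -> revisions containing it (illustration only). */
final class TypeIndexSketch {
    // One map entry per distinct type name, however many revisions contain it.
    private final Map<String, List<String>> postings = new TreeMap<>();

    /** Records that the given revision contains the given type names. */
    void addRevision(String revision, Collection<String> typeNames) {
        for (String typeName : typeNames) {
            postings.computeIfAbsent(typeName, k -> new ArrayList<>()).add(revision);
        }
    }

    /** Number of distinct type names (the size of the term dictionary). */
    int uniqueTerms() {
        return postings.size();
    }

    /** Total number of (type name, revision) pairs. */
    int postingCount() {
        int total = 0;
        for (List<String> revisions : postings.values()) {
            total += revisions.size();
        }
        return total;
    }
}
```

Adding a hundred nightly builds with identical type names leaves
uniqueTerms() unchanged and only grows postingCount(), which in this model is
the cheap part of the index.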


> The last time I worked with Lucene we implemented such a diff and publish
> mechanism for Lucene indexes, and it was working quite well. Solr does have
> a mechanism for such things too, but the last time I checked it was just
> relying on rsync. If somebody is interested I can take some time to explain
> it here.

I'm not totally convinced that a scalability problem is out of the question,
so I'm interested in what you have to offer on this point, Nicolas.
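
As a starting point for that discussion, here is one possible shape for a
diff-style transfer. It is only a guess and not the mechanism Nicolas
describes. It assumes that index segment files are written once and never
modified in place, so a client only needs to fetch files it does not already
have and drop files the publisher no longer lists. The class IndexSyncPlan,
its fields, and the name-to-size listings are all hypothetical.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical plan for bringing a local index copy up to date (illustration only). */
final class IndexSyncPlan {
    final Set<String> toDownload;
    final Set<String> toDelete;

    private IndexSyncPlan(Set<String> toDownload, Set<String> toDelete) {
        this.toDownload = toDownload;
        this.toDelete = toDelete;
    }

    /**
     * Compares file listings (file name -> size in bytes) of the published
     * index and the local copy. Assumes files are write-once, so a matching
     * name and size is treated as "already up to date".
     */
    static IndexSyncPlan between(Map<String, Long> remote, Map<String, Long> local) {
        Set<String> download = new TreeSet<>();
        for (Map.Entry<String, Long> entry : remote.entrySet()) {
            Long localSize = local.get(entry.getKey());
            if (!entry.getValue().equals(localSize)) {
                download.add(entry.getKey()); // missing locally or size differs
            }
        }
        Set<String> delete = new TreeSet<>(local.keySet());
        delete.removeAll(remote.keySet()); // no longer part of the published index
        return new IndexSyncPlan(download, delete);
    }
}
```

With this approach the network cost of an update is proportional to the new
segments rather than to the whole index, which is roughly what an rsync of
the index directory achieves as well.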
