Hi,

I've started thinking about how to implement IVY-658, and it's not that easy
to get the information necessary to clean a repository intelligently.

To do so, we need information about the dependers of a module. While Ivy
shines at finding dependees, it provides no way to access dependers, so this
requires some work.

To start with, I've developed a small prototype of a
RepositoryManagementEngine, which actually loads all repository metadata in
memory. At first I didn't even consider that idea, because it doesn't scale.
But since it was the easiest thing to start with, I gave it a try, and I now
have something able to load a whole repository in memory, and able to return
information such as the list of all modules with no dependers (the information
needed for IVY-658). It could also be extended to answer other queries
quickly: once everything is in memory, that's pretty easy.
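
To illustrate the idea (this is not the prototype's code), here is a minimal
sketch of how dependers can be derived once all metadata is in memory: invert
the "module -> dependencies" map. The class DependerIndex, its method names,
and the use of plain strings as module identifiers are all hypothetical, and
the sketch uses a more recent Java than the 1.4.2 JVM mentioned below.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical sketch: inverts a "module -> dependencies" map. */
final class DependerIndex {
    // module -> modules that declare a dependency on it
    private final Map<String, Set<String>> dependers = new HashMap<>();

    DependerIndex(Map<String, List<String>> dependencies) {
        for (Map.Entry<String, List<String>> e : dependencies.entrySet()) {
            // Register the module itself, even if nothing depends on it.
            dependers.computeIfAbsent(e.getKey(), k -> new TreeSet<>());
            for (String dependee : e.getValue()) {
                dependers.computeIfAbsent(dependee, k -> new TreeSet<>())
                         .add(e.getKey());
            }
        }
    }

    /** Modules that depend directly on the given module. */
    Set<String> dependersOf(String module) {
        return dependers.getOrDefault(module, Collections.<String>emptySet());
    }

    /** Modules that no other module depends on (cleaning candidates). */
    SortedSet<String> modulesWithoutDependers() {
        SortedSet<String> result = new TreeSet<>();
        for (Map.Entry<String, Set<String>> e : dependers.entrySet()) {
            if (e.getValue().isEmpty()) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

Building the index is a single pass over all dependency descriptors, which is
why the query itself is cheap once the repository has been loaded.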

I'm currently running my tests on a Linux box with the Sun JVM 1.4.2,
accessing a repository on a NetApp filesystem (with very good performance). On
this box, loading a repository of 1200 modules and 3000 module revisions takes
40s and 60MB (memory usage is approximate; I've used a utility class based on
[1]). If I extrapolate these results linearly, here's what I get:
revs    time    memory
3k      40"     60MB
25k     6'      500MB
100k    22'     2GB

I'm pretty happy with the time results (the environment is well suited for
that, but since it's a repository maintenance task, I guess most people could
run it very close to their repository data, overnight or over a weekend).

As expected, memory usage can become an issue more quickly. So I've done some
investigation, and it appears that ModuleRevisionId instances have a
significant impact on memory usage. Indeed, these objects are used not only to
identify the module revisions loaded, but also in each dependency descriptor
to store the requested module revision.

I've found that in my use case Ivy was creating around 50k instances of
ModuleRevisionId. Since these objects are immutable, I've tried a strategy
similar to String#intern() to reuse the same instance whenever possible. This
brought the number down to 6k instances, with a total memory usage for the
in-memory repository information of 43MB (around 28% better).

Then I thought another area for improvement might be the dependency
descriptors themselves (around 46k instances in my test case). In
DefaultDependencyDescriptor, we create the LinkedHashMap instances used to
store information as soon as we create the object. The ones for exclude rules,
include rules and dependency artifacts are very frequently not used at all
(never in my test case). So I've changed DefaultDependencyDescriptor to
initialize these attributes only when needed, and ended up with a 31MB
footprint for the whole repository. So my new extrapolation is now:
revs    time    memory
3k      40"     31MB
25k     6'      260MB
100k    22'     1.1GB
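
For reference, the lazy initialization described above follows this general
pattern. This is only a sketch, not the actual DefaultDependencyDescriptor
change: the class LazyDescriptor, its methods, and the use of strings for
configurations and rules are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a descriptor that allocates its maps lazily. */
final class LazyDescriptor {
    // Stays null until the first exclude rule is added, so descriptors
    // without exclude rules never pay for an empty LinkedHashMap.
    private Map<String, List<String>> excludeRules;

    void addExcludeRule(String conf, String rule) {
        if (excludeRules == null) {
            excludeRules = new LinkedHashMap<>();
        }
        excludeRules.computeIfAbsent(conf, k -> new ArrayList<>()).add(rule);
    }

    /** Never returns null, so callers do not see the lazy allocation. */
    List<String> getExcludeRules(String conf) {
        if (excludeRules == null) {
            return Collections.<String>emptyList();
        }
        List<String> rules = excludeRules.get(conf);
        if (rules == null) {
            return Collections.<String>emptyList();
        }
        return Collections.unmodifiableList(rules);
    }

    /** Exposed only to make the lazy behaviour observable. */
    boolean isExcludeMapAllocated() {
        return excludeRules != null;
    }
}
```

The cost is a null check in every accessor, which is the readability
trade-off of this approach.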

So I plan to commit these changes to Ivy trunk. The changes to
DefaultDependencyDescriptor just make the code slightly less readable, so I
don't think it's an issue. For ModuleRevisionId, it introduces a very simple
cache of instances based on a WeakHashMap. It means we do a get in a Map
whenever we create a new ModuleRevisionId. I don't think it will impact
performance much, and it may even decrease the memory footprint for regular
Ivy usage.
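
To make the discussion concrete, here is a rough sketch of what such an
instance cache can look like. It is not the code I intend to commit: the class
RevisionId is a simplified, hypothetical stand-in for ModuleRevisionId, and
wrapping the values in WeakReference and synchronizing on the map are
assumptions of the sketch.

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.Objects;
import java.util.WeakHashMap;

/** Hypothetical, simplified stand-in for an immutable revision id. */
final class RevisionId {
    // Keys are weakly held, so unused ids can still be garbage collected.
    // Values are wrapped in WeakReference because a strong value reference
    // to the key itself would keep the entry alive forever.
    private static final Map<RevisionId, WeakReference<RevisionId>> CACHE =
            new WeakHashMap<>();

    private final String organisation;
    private final String name;
    private final String revision;

    private RevisionId(String organisation, String name, String revision) {
        this.organisation = organisation;
        this.name = name;
        this.revision = revision;
    }

    /** Returns a shared instance, in the spirit of String#intern(). */
    static RevisionId newInstance(String organisation, String name,
                                  String revision) {
        RevisionId candidate = new RevisionId(organisation, name, revision);
        synchronized (CACHE) {
            WeakReference<RevisionId> ref = CACHE.get(candidate);
            RevisionId cached = (ref == null) ? null : ref.get();
            if (cached != null) {
                return cached;
            }
            CACHE.put(candidate, new WeakReference<>(candidate));
            return candidate;
        }
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof RevisionId)) {
            return false;
        }
        RevisionId other = (RevisionId) o;
        return organisation.equals(other.organisation)
                && name.equals(other.name)
                && revision.equals(other.revision);
    }

    @Override
    public int hashCode() {
        return Objects.hash(organisation, name, revision);
    }
}
```

Each creation still allocates a short-lived candidate object and does one map
lookup, which is the overhead discussed above.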

If you see any problem with that, feel free to let me know and we'll see how to 
address that differently.

BTW, the repository cleaning task is not done yet, just repository loading and 
basic analysis.

Xavier

 [1] 
http://java.sun.com/docs/books/performance/1st_edition/html/JPRAMFootprint.fm.html
