First, are you sure your free-memory calculation is OK? Why not just use freeMemory()?
I _think_ my calculation is ok - my reasoning:
Runtime.maxMemory()   - Amount of memory that can be given to the JVM, based on -Xmx<value>
Runtime.totalMemory() - Amount of memory currently owned by the JVM
Runtime.freeMemory()  - Amount of unused memory currently owned by the JVM

The amount of memory currently in use (inUse) = totalMemory() - freeMemory()
The amount we can still get (before hitting -Xmx<value>) = maxMemory() - inUse

And in the absence of -Xms, nothing says we will actually be given that much.
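In code, that reasoning looks something like this (just a sketch; MemCheck, inUse, and stillAvailable are names I've made up for illustration):

public class MemCheck {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();

        long max   = rt.maxMemory();    // ceiling, set by -Xmx
        long total = rt.totalMemory();  // heap the JVM currently owns
        long free  = rt.freeMemory();   // unused portion of that owned heap

        long inUse          = total - free;  // memory actually consumed
        long stillAvailable = max - inUse;   // headroom before hitting -Xmx

        System.out.println("in use: " + inUse
                + ", still available: " + stillAvailable);
    }
}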
Perhaps also calling the gc if the available amount isn't enough? Although I confess I don't know the innards of the interplay between the various memory calls...
I do call the gc - but sparingly. If I've done a flush to reclaim memory in hopes of having enough for a pending document, then I'll call the gc before checking whether I now have enough memory available. However, I too know little of the gc's workings. On the assumption that the JRE is smarter about how and when to run the gc than I am, I operate on the premise that routinely calling the gc is not good practice.
The approach I've been using is to gather some data as I'm indexing to decide whether or not to flush the IndexWriter. That is, I record what ramSizeInBytes() returns before I start to index a document and again after, and keep the worst growth ratio around. This got easier when I subclassed IndexWriter and overrode the add methods - roughly as in the sketch below.
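Roughly like this - written against the older (pre-4.0) Lucene API, where IndexWriter still exposed ramSizeInBytes() and addDocument() could be overridden. TrackingIndexWriter, worstGrowth, and getWorstGrowth are names I've made up for illustration; tracking a true growth/raw-size ratio would also need the raw document size from the caller:

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;

public class TrackingIndexWriter extends IndexWriter {

    private long worstGrowth = 0;  // largest RAM delta seen for one document

    public TrackingIndexWriter(Directory dir, Analyzer analyzer, boolean create)
            throws IOException {
        super(dir, analyzer, create);
    }

    public void addDocument(Document doc) throws IOException {
        long before = ramSizeInBytes();  // buffered RAM before this document
        super.addDocument(doc);
        long growth = ramSizeInBytes() - before;
        if (growth > worstGrowth) {
            worstGrowth = growth;
        }
    }

    public long getWorstGrowth() {
        return worstGrowth;
    }
}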
I agree this gives a conservative measure of the worst-case memory consumed by an indexed document. But it measures memory _after_ indexing. My observation is that peak memory usage occurs _during_ indexing - so if the process is low on memory, that is when the problem (an OutOfMemoryError) will hit. In my mind it is the peak usage that really matters. If there were a way to record and retrieve the peak usage for each document, we could see whether there is a relationship between the peak during indexing and ramSizeInBytes() after indexing. If that relationship were (somewhat) predictable, then I think we'd have a more accurate basis for choosing a safety factor to avoid OutOfMemoryErrors.
Then I'm requiring that I have 2x the worst case I've seen available for the incoming document, and flushing (perhaps gc-ing) if I don't have enough - roughly as sketched below.
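Something like this, purely as a sketch - hasHeadroomFor() and flushWriter() are made-up names, with flushWriter() standing in for whatever flush/close step actually reclaims the writer's buffered RAM:

// Illustrative sketch only; not a drop-in implementation.
static boolean hasHeadroomFor(long neededBytes) {
    Runtime rt = Runtime.getRuntime();
    long inUse = rt.totalMemory() - rt.freeMemory();
    return (rt.maxMemory() - inUse) >= neededBytes;
}

void indexWithHeadroomCheck(long worstCaseBytesForDoc) throws IOException {
    long needed = 2 * worstCaseBytesForDoc;  // the 2x safety factor from above
    if (!hasHeadroomFor(needed)) {
        flushWriter();  // hypothetical: reclaim the writer's buffered RAM
        System.gc();    // called sparingly: only after a flush, before re-check
        if (!hasHeadroomFor(needed)) {
            // still not enough headroom: defer or reject the document
        }
    }
    // ... proceed to add the document ...
}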
Based on the data I've collected, we've been using 1.5x - 2.0x of document size as our value (and made it a configuration parameter).
And I think that this is "good enough". What it allows (as does your approach) is letting the usual case - files much smaller than 20M - accumulate and flush reasonably efficiently, without penalizing my speed by, say, always keeping 250M free or some such.
Agreed... To me it is a balancing act: avoiding OutOfMemoryErrors without unnecessarily throttling throughput just to keep that 250M (or whatever) of memory available for what we think is the unusual document - one that arrives for indexing while available memory is relatively low. If it arrives when the indexer isn't busy with other documents, it's likely not a problem anyway.
Keep me posted if you come up with anything really cool!
Ditto. Thanks, david.