I can share the data, but it would be quicker for you to just pull some random text from anywhere you like.
The issue is that the text was in an email, which was one of about 2,000, and I don't know which one. I got the 4.5MB figure from the number of bytes in the byte array reported in the debugger, and didn't bother to record which email file it came from. I think it was text extracted from a PDF extracted from a ZIP, so it would take me a while to locate!

It's worth noting that the time I quoted is somewhat misleading. I killed my process after 10 minutes because I realised there was a problem and any further time was irrelevant. The length of time is also partly down to the load on the process: I am processing multiple files concurrently, and in doing so am performing a bunch of CPU-intensive tasks (text extraction, encryption, etc.). Most of this happens in separate threads, but they are all competing for CPU time. The only way to really benchmark the compression is to vary both the compression level and the number of threads and see how it scales (something along the lines of the sketch below).

I'm confident the compression mechanism used in Lucene is fine (I had a look at the code and it all seems pretty good), so I would guess Lucene's performance is comparable to "vanilla" compression using the native Java libraries. I'm betting you get non-linear scalability no matter what the compression level (due to the maximum throughput of the CPU, bus speed, etc.), but you may find the scaling tends towards linear the lower the compression level. That is really what I am looking for.

Also, upon reflection, I'm not certain compression inside the index is really valuable without lazy loading anyway. The time cost of decompression when iterating over hits reduces the overall effectiveness of the index. Lazy loading obviously solves this (for searches), and I am excited about that feature being added. It clearly depends on the use case, but in mine I realised that storing large amounts of data in the index is just not the right way to do things. So I changed my architecture so that the larger chunks of data are stored (and compressed) elsewhere, then brought back in when I need to update a document. Of course, all my problems would be solved if I had lazy loading AND field updating :)
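To be concrete, the kind of benchmark I have in mind would look something like this (a rough sketch against java.util.zip.Deflater directly rather than Lucene's field compression; the thread counts, task count and filler text are placeholders you'd swap for real documents):

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.zip.Deflater;

public class CompressionScalingTest {

    // Deflate one buffer at the given level; returns the compressed size.
    static int deflate(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buf);
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) throws Exception {
        final byte[] sample = makeSampleText();   // stand-in for real extracted text
        final int tasks = 32;                     // "documents" compressed per run

        int[] threadCounts = {1, 2, 4, 8};
        int[] levels = {Deflater.BEST_SPEED, 6, Deflater.BEST_COMPRESSION};

        for (final int threads : threadCounts) {
            for (final int level : levels) {
                ExecutorService pool = Executors.newFixedThreadPool(threads);
                final CountDownLatch done = new CountDownLatch(tasks);
                long start = System.currentTimeMillis();
                for (int i = 0; i < tasks; i++) {
                    pool.submit(new Runnable() {
                        public void run() {
                            deflate(sample, level);
                            done.countDown();
                        }
                    });
                }
                done.await();
                pool.shutdown();
                long elapsed = System.currentTimeMillis() - start;
                System.out.println("threads=" + threads + " level=" + level
                        + " -> " + elapsed + " ms for " + tasks + " docs");
            }
        }
    }

    static byte[] makeSampleText() {
        // Placeholder: substitute the real text (PDF extraction output, etc.).
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 150000; i++) {
            sb.append("arbitrary filler text to compress ");
        }
        return sb.toString().getBytes();
    }
}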
On 8/11/06, Michael McCandless <[EMAIL PROTECTED]> wrote:
> > I have a sample document which has about 4.5MB of text to be stored as
> > compressed data within the field, and the indexing of this document
> > seems to take an inordinate amount of time (over 10 minutes!). When
> > debugging I can see that it's stuck on the deflate() calls of the
> > Deflater used by Lucene.
>
> Would it be possible to get a copy of this document's text (only if
> you're able to share it)?  I'd like to run some tests to work out the
> tradeoff (time taken vs % deflated) of the different levels we can pass
> to the zip library.
>
> If not that's fine, I'll just run on various random text sources I can
> find.
>
> Thanks.
>
> Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
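(For reference, the per-level sweep Mike describes could be as simple as the following; again just a sketch against java.util.zip.Deflater, single-threaded, with filler text standing in for the real document's bytes:)

import java.util.zip.Deflater;

public class DeflateLevelSweep {

    public static void main(String[] args) {
        byte[] text = loadText();   // placeholder for the sample document's bytes

        // Try every zip level from fastest (1) to best compression (9).
        for (int level = Deflater.BEST_SPEED; level <= Deflater.BEST_COMPRESSION; level++) {
            Deflater deflater = new Deflater(level);
            deflater.setInput(text);
            deflater.finish();

            byte[] buf = new byte[8192];
            int compressed = 0;
            long start = System.currentTimeMillis();
            while (!deflater.finished()) {
                compressed += deflater.deflate(buf);
            }
            long elapsed = System.currentTimeMillis() - start;
            deflater.end();

            System.out.println("level=" + level
                    + " time=" + elapsed + "ms"
                    + " size=" + (100.0 * compressed / text.length) + "% of original");
        }
    }

    static byte[] loadText() {
        // Placeholder: load whatever large text source is being tested.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 150000; i++) {
            sb.append("arbitrary filler text to compress ");
        }
        return sb.toString().getBytes();
    }
}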