semi-infinite loop during merging
Hello all,

I have a very peculiar problem that is driving me crazy: on some of our datasets, at some point during indexing, the merge operation runs into a (semi-)infinite loop and keeps adding files to the index until it runs out of free disk space.

The situation: I have an indexing application that uses Lucene 2.4.1. Only one IndexWriter is involved, operating on an FSDirectory and using the compound file format. The index is created from scratch. No IndexReaders or IndexSearchers are open during indexing (double-checked by adding explicit log statements where they are created). For reasons unrelated to Lucene, the application is compiled with JET, a commercial Java Windows compiler. A regular Java build has produced the problem only once. The JET build does it every time - unless I keep pressing F5 continuously in Windows Explorer on the index dir.

Here is what I see happening in the index dir:

- At first, it builds _0.cfs to _9.cfs without problems. The files vary in size between 12 MB and 49 MB and add up to about 250 MB.
- Then, .del files are generated for some of these .cfs files. The number of .del files and the .cfs files they correspond with differs from time to time. I don't understand why these are created, as no IndexReaders exist at this time.
- Then, it generates files called _a.fdt, _a.fdx, _a.fnm, _a.frq, _a.nrm, _a.prx, _a.tii, _a.tis, _a.tvd, _a.tvf, _a.tvx. Together these files add up to 219 MB. I assume this is the start of the merge of the 10 .cfs files and that this is all still correct.
- Then, it generates _b files with those same extensions, then _c, _d, etc. It only keeps generating new files; I never see files disappear. The original .cfs files are still there.
- This continues until my hard drive is out of free space. In one test I was at _8n and the index had grown from 250 MB to 64 GB. Then the application just hangs. Interestingly, after killing the application in this test, there were _8k.cfs and _8m.cfs files of 20 MB and 27 MB respectively. No other .cfs files existed.

In some older threads on this list (e.g. http://marc.info/?l=lucene-user&m=108300530413241&w=2) I read that "Win32 seems to sometimes not permit one to delete a file immediately after it has been closed". Could this explain the problem? Perhaps the JET-compiled app gets to delete the file quicker than when the code is running inside a Java VM and therefore runs into this issue? This would also explain why pressing F5 during indexing lets the application continue: external activity introducing some delay.

At the end of this mail I have added the output of the InfoStream installed on the IndexWriter, showing everything from the start to the first few problematic merges. Lines starting with "===" are println's in my own code to make sure that indeed only IndexWriters are generated and no IndexSearchers/-Readers. The fun starts almost at the bottom, at the first line containing "LMP: findMerges: 10 segments". The next lines then get repeated over and over again, with different segment names. I cannot explain why it mentions "1 deleted docIDs" a couple of lines below the first "findMerges: 10", as no IndexWriter.deleteDocuments takes place. As you can see in this output, I am setting a SerialMergeScheduler to rule out concurrency issues and make debugging easier. Both SerialMergeScheduler and ConcurrentMergeScheduler give this problem though.

I would be grateful if anyone could shed some light on this or advise me on what I can try next.
I even considered hacking the FSDirectory code and adding some delay in the deleteFile operation, to see if the above-mentioned Win32 issue is the problem, but unless you know what you're doing, such a hack could just as easily introduce problems of its own.

Kind regards,

Chris
--

=== Creating new FSDirectory
=== Opening indexWriter (create=true)
IFD [AWT-EventQueue-0]: setInfoStream deletionpolicy=org.apache.lucene.index.keeponlylastcommitdeletionpol...@9fb0688
IW 0 [AWT-EventQueue-0]: setInfoStream: dir=org.apache.lucene.store.fsdirect...@d:\index autoCommit=false mergepolicy=org.apache.lucene.index.logbytesizemergepol...@a11d0d8 mergescheduler=org.apache.lucene.index.concurrentmergeschedu...@a3fa9d8 ramBufferSizeMB=16.0 maxBufferedDocs=-1 maxBuffereDeleteTerms=-1 maxFieldLength=2147483647 index=
IW 0 [AWT-EventQueue-0]: setMergeScheduler org.apache.lucene.index.serialmergeschedu...@a3d1fe8
IW 0 [CrawlThread]: DW: RAM: now balance allocations: usedMB=46.015 vs trigger=16 allocMB=46.015 vs trigger=16.8 byteBlockFree=0 charBlockFree=0
IW 0 [CrawlThread]: DW: nothing to free; now set bufferIsFull
IW 0 [CrawlThread]: DW: after free: freedMB=0 usedMB=46.015 allocMB=46.015
IW 0 [CrawlThread]: flush: segment=_0 docStoreSegment=_0 docStoreOffset=0 flushDocs=true flushDeletes=false flushDocStores=false numDocs=2193 numBufDelTerms=0
IW 0 [CrawlThread]: index before flush
IW 0
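[Editor's note: for readers following along, here is a minimal sketch of roughly how a writer like the one in the log above would be configured with the Lucene 2.4 API. The path, analyzer and log file name are placeholders, not the application's actual code.]

    import java.io.PrintStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.SerialMergeScheduler;
    import org.apache.lucene.store.FSDirectory;

    public class WriterSetup {
        public static void main(String[] args) throws Exception {
            // One IndexWriter on an FSDirectory, index created from scratch.
            FSDirectory dir = FSDirectory.getDirectory("D:\\index");   // hypothetical path
            IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(),
                    true /* create */, IndexWriter.MaxFieldLength.UNLIMITED);
            writer.setUseCompoundFile(true);                           // compound file format (.cfs)
            writer.setMergeScheduler(new SerialMergeScheduler());      // rule out concurrency issues
            writer.setInfoStream(new PrintStream("infostream.log"));   // hypothetical log file
            // ... addDocument() calls ...
            writer.close();
        }
    }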
Re: semi-infinite loop during merging
I spent a lot of time on getting the stack traces, but JET seems to make this impossible. Ctrl-Break, connecting with JConsole, even a "Dump Threads" button in my UI that uses Thread.getAllStackTraces were not able to produce a dump of all threads.

I just got an additional confirmation that the problem also occurs with the Java build, but unfortunately the client's data is too sensitive to share with me.

One option is to hack the Lucene 2.4.1 code to print out some additional debug info. Do you know of some println's that would help? Also, I suspect that JET-compiled code is able to do Thread.dumpStack (but not Thread.getAllStackTraces), so what are good locations for doing that? E.g. IndexWriter.merge, etc.

Regards,

Chris
--

Michael McCandless wrote:

Hmmm, very very odd.

First off, your "1 deleted docID" is because one document hit an exception during indexing, likely in enumerating tokens from the TokenStream; I see this line:

    IW 0 [CrawlThread]: hit exception adding document

But I think that's fine (certainly should not cause what you are seeing).

Indeed, it looks like IndexWriter decides it's time to merge the first 10 segments, and it starts that merge, but for some reason before that merge completes it seems to "recurse" and re-start the merge, to a different destination segment (_a then _b then _c, etc.). Ie the first merge never completes (we don't see lines with "commitMerge: merge=... index=...").

I don't think "not being able to immediately delete on windows" should cause this; IW simply retries the delete periodically. I think something more sinister is at play...

On Unix, one can "kill -SIGQUIT" to get a thread stack trace dump for all threads; do you know how to do this on Windows? If so, can you do that at the end when IW starts doing this infinite merging? That would be very helpful towards understanding why this recursion is happening (though it is spooky that this is all happening under JET...)

Mike

On Tue, Apr 14, 2009 at 5:28 AM, Christiaan Fluit wrote:
[original message quoted in full - snipped; see the first message in this thread]
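[Editor's note: for reference, a minimal sketch of what a "Dump Threads" helper like the one mentioned above typically looks like with plain JDK APIs; whether a JET-compiled binary honours it is another matter. Class and output stream are placeholders.]

    import java.util.Map;

    public class ThreadDumper {
        // Print a stack trace for every live thread, roughly what a
        // "kill -QUIT" / Ctrl-Break dump would show on a stock JVM.
        public static void dumpAllThreads() {
            Map<Thread, StackTraceElement[]> traces = Thread.getAllStackTraces();
            for (Map.Entry<Thread, StackTraceElement[]> entry : traces.entrySet()) {
                System.err.println(entry.getKey());
                for (StackTraceElement frame : entry.getValue()) {
                    System.err.println("    at " + frame);
                }
            }
        }
    }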
Re: MergeException
I have experienced similar problems (see the "semi-infinite loop during merging" thread - still working out the problem): the merger gets into an infinite loop and causes my drive to be filled with temporary files that are not deleted, until it runs out of space. Sometimes it exits with a MergeException wrapping one of a variety of IOExceptions (e.g. a FileNotFoundException), sometimes it just keeps on consuming all available CPU time.

I think the "_k0z.fnm" file name indicates that a lot of segments have already been created, as segment names start at _0, _1, ... I don't want to jump to conclusions immediately, but this is consistent with a runaway merge. Was your drive full as well afterwards?

Regards,

Chris
--

Martine Woudstra wrote:

Hi all,

I'm using Lucene 2.4.1 for building an ngram index. Indexing works well until I try to open the index built so far with Luke. A MergeException is thrown, see below. Opening an index with Luke during indexing never caused problems with Lucene 2.3. Anyone familiar with this problem?

Thanks in advance,

Martine van der Heijden

Exception in thread "Lucene Merge Thread #3067" org.apache.lucene.index.MergePolicy$MergeException: java.io.FileNotFoundException: D:\indexngram\_k0z.fnm (Het systeem kan het opgegeven bestand niet vinden)
    at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:309)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:286)
Caused by: java.io.FileNotFoundException: D:\indexngram\_k0z.fnm (Het systeem kan het opgegeven bestand niet vinden)
    at java.io.RandomAccessFile.open(Native Method)
    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:212)
    at org.apache.lucene.store.FSDirectory$FSIndexInput$Descriptor.<init>(FSDirectory.java:552)
    at org.apache.lucene.store.FSDirectory$FSIndexInput.<init>(FSDirectory.java:582)
    at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:488)
    at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:482)
    at org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java:221)
    at org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java:184)
    at org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java:204)
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4263)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3884)
    at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:205)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:260)
Re: MergeException
Michael McCandless wrote:

On Tue, Apr 21, 2009 at 4:26 PM, Christiaan Fluit wrote:
I have experienced similar problems (see the "semi-infinite loop during merging" thread - still working out the problem): the merger gets into an infinite loop and causes my drive to be filled with temporary files that are not deleted, until it runs out of space.

Christiaan, I responded on that thread on where to sprinkle prints... any update?

I have some new results. I will post them in the original thread.

Regards,

Chris
--
Re: semi-infinite loop during merging
Michael McCandless wrote:

One question: are you using IndexWriter.close(false)? I wonder if there's some path whereby the merges fail to abort (and simply keep retrying) if you do that...

No, I don't. More inlined below...

On Thu, Apr 16, 2009 at 5:42 AM, Christiaan Fluit wrote:
I spent a lot of time on getting the stack traces, but JET seems to make this impossible. Ctrl-Break, connecting with JConsole, even a "Dump Threads" button in my UI that uses Thread.getAllStackTraces were not able to produce a dump of all threads.

Sigh.

I just got an additional confirmation that the problem also occurs with the Java build, but unfortunately the client's data is too sensitive to share with me.

Could they "kill -QUIT" or Ctrl-Break when it's happening?

The confirmation was received recently, but the problem occurred a while ago and can't be reproduced now.

One option is to hack the Lucene 2.4.1 code to print out some additional debug info. Do you know of some println's that would help?

How about adding this at the top of IndexWriter.handleMergeException:

    if (infoStream != null) {
      message("handleMergeException: merge=" + merge.segString(directory) + " exc=" + t);
    }

? You could also sprinkle prints in between where message("merge: total"...) occurs and the call to commitMerge, in IndexWriter.mergeMiddle.

The mystery here is why you see these two lines in a row:

    IW 0 [CrawlThread]: merge: total 18586 docs
    IW 0 [CrawlThread]: LMP: findMerges: 10 segments

That first line is supposed to be followed by a "commitMerge: ..." line. I suspect some exception (maybe a MergeAbortedException) is being hit, leading to the commit not happening.

Also, I suspect that JET-compiled code is able to do Thread.dumpStack (but not Thread.getAllStackTraces), so what are good locations for doing that? E.g. IndexWriter.merge, etc.

This would be great -- I would add that in handleMergeException as well.

A test run with the modified code just completed minutes ago. I modified IndexWriter in two places. IW.handleMergeException starts with the following code:

    if (infoStream != null) {
      message("handleMergeException: merge=" + merge.segString(directory) + " exc=" + t);
      message("dumping current stack");
      new Exception("Stack trace").printStackTrace(infoStream);
      message("dumping throwable");
      t.printStackTrace(infoStream);
    }

To IW.mergeMiddle I added eight message(String) invocations:

    [...]
    try {
      [...]
      message("IW: checkAborted");
      merge.checkAborted(directory);
      message("IW: checkAborted done");

      // This is where all the work happens:
      message("IW: mergedDocCount");
      mergedDocCount = merge.info.docCount = merger.merge(merge.mergeDocStores);
      message("IW: mergedDocCount done");

      assert mergedDocCount == totDocCount;
    } finally {
      // close readers before we attempt to delete
      // now-obsolete segments
      if (merger != null) {
        message("IW: closeReaders");
        merger.closeReaders();
        message("IW: closeReaders done");
      }
    }

    message("IW: commitMerge");
    if (!commitMerge(merge, merger, mergedDocCount))
      // commitMerge will return false if this merge was aborted
      return 0;
    message("IW: commitMerge done");
    [...]

This gives the following output:

[...]
IFD [CrawlThread]: delete "_9.fnm"
IFD [CrawlThread]: delete "_9.frq"
IFD [CrawlThread]: delete "_9.prx"
IFD [CrawlThread]: delete "_9.tis"
IFD [CrawlThread]: delete "_9.tii"
IFD [CrawlThread]: delete "_9.nrm"
IFD [CrawlThread]: delete "_9.tvx"
IFD [CrawlThread]: delete "_9.tvf"
IFD [CrawlThread]: delete "_9.tvd"
IFD [CrawlThread]: delete "_9.fdx"
IFD [CrawlThread]: delete "_9.fdt"
IW 1 [CrawlThread]: LMP: findMerges: 10 segments
IW 1 [CrawlThread]: LMP: level 6.9529195 to 7.7029195: 10 segments
IW 1 [CrawlThread]: LMP: 0 to 10: add this merge
IW 1 [CrawlThread]: add merge to pendingMerges: _0:c2201 _1:c1806 _2:c1023 _3:c430 _4:c1166 _5:c812 _6:c1737 _7:c2946 _8:c3129 _9:c3429 [total 1 pending]
IW 1 [CrawlThread]: now merge merge=_0:c2201 _1:c1806 _2:c1023 _3:c430 _4:c1166 _5:c812 _6:c1737 _7:c2946 _8:c3129 _9:c3429 into _a merge=org.apache.lucene.index.mergepolicy$oneme...@14171768 index=_0:c2201 _1:c1806 _2:c1023 _3:c430 _4:c1166 _5:c812 _6:c1737 _7:c2946 _8:c3129 _9:c3429
IW 1 [CrawlThread]: merging _0:c2201 _1:c1806 _2:c1023 _3:c430 _4:c1166 _5:c812 _6:c1737 _7:c2946 _8:c3129 _9:c3429 into _a
IW 1 [CrawlThread]: merge: total 18678 docs
IW 1 [CrawlThread]: IW: checkAborted
IW 1 [CrawlThread]: IW: checkAborted done
IW 1 [CrawlThread]: IW: mergedDo
Re: semi-infinite loop during merging
Christiaan Fluit wrote:

It seems that it gets up to the point of committing, but the "IW: commitMerge done" message is never reached. Furthermore, no exceptions are printed to the output, so handleMergeException does not seem to have been invoked. Should I add more debug statements elsewhere?

I may be on to something already. I just looked at the commitMerge code and was surprised to see that the commitMerge message almost at the beginning wasn't printed. Then I saw the "if (hitOOM) return false;" part that takes place before that. I think this can only mean that an OOME was encountered at some point in time.

Now, the fact is that in my indexing code I do a catch(Throwable) in several places. I do this particularly because JET handles OOMEs in a very, very nasty way: often you just get an error dialog and then it quits the entire application. Therefore, my client code catches, logs and swallows the OOME before the JET runtime can intercept it. *Usually*, the application can then recover gracefully and continue processing the rest of the information.

Catching an OOME that results from the operation of a text extraction library is one thing (and a fact of life, really), but perhaps there are also OOMEs that occur during Lucene processing. I remember seeing those in the past with the original Java code, when very large Strings were being tokenized and I got an OOME with a deep Lucene stack trace. I have copied one such saved stack trace at the end of this mail. I see some caught and swallowed OOMEs in my log file, but unfortunately they are without a stack trace - probably again a JET issue. I can run the normal Java build though, to see if such OOMEs occur on this dataset.

Now, I wonder:

- When the IW is in auto-commit mode, can the failed processing of a Document due to an OOME have an impact on the processing of subsequent Documents or on the merge/optimize operations? Can the index(writer) become corrupt and result in problems such as these?
- Even though commitMerge returns false, it should probably not get into an infinite loop. Is this an internal Lucene problem or is there something I can/should do about it myself?
- More generally, what is the recommended behavior when I get an OOME during Lucene processing, particularly IW.addDocument? Should the IW be able to recover by itself or is there some sort of rollback I need to perform? Again, note that my index is in auto-commit mode (though I had hoped to let go of that too; it's only there for historic reasons).

Regards,

Chris
--

java.lang.OutOfMemoryError: Java heap space
    at org.apache.lucene.index.DocumentsWriter.getPostings(DocumentsWriter.java:3069)
    at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.addPosition(DocumentsWriter.java:1696)
    at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.invertField(DocumentsWriter.java:1525)
    at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.processField(DocumentsWriter.java:1412)
    at org.apache.lucene.index.DocumentsWriter$ThreadState.processDocument(DocumentsWriter.java:1121)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:2442)
    at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:2424)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1464)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1442)
    at info.aduna...
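[Editor's note: to make the catch-and-swallow pattern described above concrete, here is a minimal sketch of the kind of per-document guard meant. The class and logging are placeholders, not the actual application code, and whether swallowing OOME is wise at all is exactly what is being questioned in this thread.]

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    // Sketch of a per-document guard: log and swallow any Throwable
    // (including OutOfMemoryError) so the JET runtime never sees it.
    public class GuardedIndexer {
        private final IndexWriter writer;

        public GuardedIndexer(IndexWriter writer) {
            this.writer = writer;
        }

        public void indexDocument(Document doc) {
            try {
                writer.addDocument(doc);
            } catch (Throwable t) {  // deliberately broad: also catches OOME
                // hypothetical policy: log and continue with the next document
                System.err.println("addDocument failed: " + t);
            }
        }
    }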
Re: semi-infinite loop during merging
Michael McCandless wrote:

- Even though commitMerge returns false, it should probably not get into an infinite loop. Is this an internal Lucene problem or is there something I can/should do about it myself?

Yes, something is wrong with Lucene's handling of OOME. It certainly should not lead to infinite merge attempts. I'll dig (once back from vacation) to see if I can find this path. Likely we need to prevent launching of new merges after an OOME. I think you must've happened to hit OOME when a merge was running.

I have some more info. I added message(String) invocations in all places where the IW.hitOOM flag is set, to see which method turns it on. It turned out to be addDocument (twice).

These OOMEs only happen with the JET build, which explains why the Java build does not show the exploding index behavior: the hitOOM flag is simply never set and the merge is allowed to proceed normally. The flag is definitely not set while the IW is merging, nor do any OOMEs appear in my log files during merging. Therefore, there must be a problem in how the merge operation responds to the flag being set.

Rollback does not work for me, as my IW is in auto-commit mode: it gives an IllegalStateException when I invoke it. A workaround that does work for me is to close and reopen the IndexWriter immediately after an OOME occurs.

Let me know if I can be of any more help.

Regards,

Chris
--
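[Editor's note: a minimal sketch of what that close-and-reopen workaround could look like; the directory, analyzer and error handling are assumptions, not the code actually used.]

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;

    public class WriterRecovery {
        // After an OOME has been swallowed, discard the possibly poisoned writer
        // (its internal hitOOM flag is set) and open a fresh one on the same directory.
        public static IndexWriter reopenAfterOOM(IndexWriter oldWriter, Directory dir,
                                                 Analyzer analyzer) throws Exception {
            try {
                oldWriter.close();
            } catch (Throwable t) {
                // closing may itself fail after an OOME; ignore and move on
            }
            // create=false: keep the existing segments, just start a fresh writer
            return new IndexWriter(dir, analyzer, false, IndexWriter.MaxFieldLength.UNLIMITED);
        }
    }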
Re: Exchange/PST/Mail parsing
Hello Grant (cc-ing aperture-devel),

I am one of the Aperture admins, so I can tell you a bit more about Aperture's mail facilities.

Short intro: Aperture is a framework for crawling and for full-text and metadata extraction of a growing number of sources and file formats. We try to select the best of breed of the large number of open source libraries that tackle a specific source or format (e.g. PDFBox, POI, JavaMail) and write some glue code around it so that they can be invoked in a uniform way. It's currently used in a number of desktop and enterprise search applications, both research and production systems.

At the moment we support a number of mail systems. We can crawl IMAP mail boxes through JavaMail. In general it seems to work well; problems are usually caused by IMAP servers not conforming to the IMAP specs. Some people have used the ImapCrawler to crawl MS Exchange as well. Some succeeded, some didn't. I don't really know whether the fault is in Aperture's code or in the Exchange configuration, but I would be happy to take a look at it when someone runs into problems.

Outlook can also be crawled by connecting to a running Outlook process through jacob.dll. Others on aperture-devel can tell you more about its current status. Besides this crawler, I would also be very interested in having a crawler that directly processes .pst files, so as to steer clear of communicating with other processes outside your own control.

People have been working on crawling Thunderbird mailboxes, but I don't know what the current status is.

Ultimately, we try to support any major mail system. In practice, effort usually depends on knowledge and experience as well as customer demand. We are happy to help you out with trying to get Aperture working in your domain and to look into the problems that you may encounter.

Kind regards,

Chris
--

Grant Ingersoll wrote:

Anyone have any recommendations on a decent, open (doesn't have to be Apache license, but would prefer non-GPL if possible) extractor for MS Exchange and/or PST files? The Zoe link on the FAQ [1] seems dead. For mbox, I think mstor will suffice for me, and I think tropo (from the FAQ) should work for IMAP. Does anyone have experience with http://aperture.sourceforge.net/ ?

[1] http://wiki.apache.org/lucene-java/LuceneFAQ#head-bcba2effabe224d5fb8c1761e4da1fedceb9800e

Cheers,
Grant
Re: Word files & Build vs. Buy?
Hello all,

I'm replying to two threads at once, as what I have to say relates to both.

My company recently started an open source project called Aperture (http://sourceforge.net/projects/aperture), together with the German DFKI institute. The project is still very much in alpha stage, but I do believe we already have some code parts that could help people here. Basically, it's a framework for crawling information sources (file systems, mail folders, websites, ...) and extracting as much information from them as possible. Besides full-text extraction, we also put a lot of effort into extraction and modeling of the metadata occurring in these sources and document formats. Both parties have some proprietary code lying on the shelf that is being open sourced and ported to the Aperture architecture.

Now on to the raised questions:

[EMAIL PROTECTED] wrote:
WordDocument wd = new WordDocument(is);

[EMAIL PROTECTED] wrote:
MS Word - I know that POI exists, but development on the Word portion seems to have stopped, and there are a lot of nasty looking bugs in their DB. Since we're involved in dealing with contracts, many of our Word files are large and complicated. How has everyone's experience with POI's Word parsing been?

My experience is that the WordDocument class crashes on about 25% of the documents, i.e. it throws some sort of Exception. I've tested POI 2.5.1-final as well as the current code in CVS, but both produce this result. I even suspect the output to be 100% the same, but I haven't verified this. Another reason I don't like this class is that it operates on an InputStream and internally creates a POIFSFileSystem which you cannot access, so it becomes hard to extract document metadata as well (for which you need the POIFSFileSystem) without buffering the entire InputStream. The same applies to TextMining's WordExtractor, which also operates on top of lower-level POI components.

I've recently committed a WordExtractor to Aperture that uses its own code operating on these lower-level POI data structures, which works a lot better, failing on only 5% of my 300 test docs. I don't pretend to understand all the internals of the POI APIs, but it Works For Me. When POI throws an exception, the WordExtractor will revert to applying a heuristic string extraction algorithm to extract as much human-readable text as possible from the binary stream, which works quite well on MS Office files, i.e. the output is reasonably good for indexing purposes. Be sure to check out Aperture from CVS, as this code isn't part of the alpha 1 release. A next official release is expected in a month.

[EMAIL PROTECTED] wrote:
RTF - javax.swing looks fine, we use those classes already.

Swing's RTFEditorKit does indeed work surprisingly well. "Surprisingly" because in the past I had many issues with it, typically throwing exceptions on 25-50% of my test documents. Recently I haven't seen a single one (using Java 1.5.0), so either I am now feeding it a more optimal document set or the Swing people have worked on the implementation. In that case people using Java 1.4.x may see different results.

Word Perfect - There doesn't seem to be any converters for this format?

I'm actively working on this :) We have some proprietary code that will become part of Aperture. Right now I cannot say how well it performs in practice, although we've never had complaints with our proprietary apps. The code uses a heuristic string extraction algorithm tuned for WordPerfect documents. This may be an issue, e.g.
when you also want to display the extraction results to end users.

If you're interested: one way you can help me get the most out of it is by sending me some example WordPerfect documents, because I hardly have any on my hard drive. Fake documents made with very new or old WordPerfect versions are also most welcome.

Regards,

Chris
http://aduna.biz
--
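[Editor's note: since RTFEditorKit came up above, here is a minimal sketch of using it for plain-text extraction. This is the standard Swing API, not Aperture code; the file name is a placeholder.]

    import java.io.FileInputStream;
    import java.io.InputStream;
    import javax.swing.text.DefaultStyledDocument;
    import javax.swing.text.Document;
    import javax.swing.text.rtf.RTFEditorKit;

    public class RtfTextExtractor {
        public static void main(String[] args) throws Exception {
            InputStream in = new FileInputStream("example.rtf");   // hypothetical file
            Document doc = new DefaultStyledDocument();
            new RTFEditorKit().read(in, doc, 0);                   // parse the RTF into the document
            in.close();
            String text = doc.getText(0, doc.getLength());         // plain text for indexing
            System.out.println(text);
        }
    }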
Re: Word files & Build vs. Buy?
Nick Burch wrote:

You could try using org.apache.poi.hwpf.HWPFDocument, and getting the range, then the paragraphs, and grab the text from each paragraph. If there's interest, I could probably commit an extractor that does this to poi.

Yes, that's exactly what I'm doing. Having this in POI would benefit me a lot though, as I hardly understand the POI basics, to be honest (my fault, not POI's). This is my current code (adapted from Aperture code in CVS):

    HWPFDocument doc = new HWPFDocument(poiFileSystem);
    StringBuffer buffer = new StringBuffer(4096);

    Iterator textPieces = doc.getTextTable().getTextPieces().iterator();
    while (textPieces.hasNext()) {
        TextPiece piece = (TextPiece) textPieces.next();

        // the following is derived from
        // http://article.gmane.org/gmane.comp.jakarta.poi.devel/7406
        String encoding = "Cp1252";
        if (piece.usesUnicode()) {
            encoding = "UTF-16LE";
        }
        buffer.append(new String(piece.getRawBytes(), encoding));
    }

    // normalize end-of-line characters and remove any lines
    // containing macros
    BufferedReader reader = new BufferedReader(new StringReader(buffer.toString()));
    buffer.setLength(0);

    String line;
    while ((line = reader.readLine()) != null) {
        if (line.indexOf("DOCPROPERTY") == -1) {
            buffer.append(line);
            buffer.append(END_OF_LINE);
        }
    }

    // fetch the extracted full-text
    String text = buffer.toString();

Regards,

Chris
--
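[Editor's note: for completeness, a sketch of how the poiFileSystem variable used above is typically obtained with POI's standard API. The file name is a placeholder, and the surrounding class is illustrative only.]

    import java.io.FileInputStream;
    import org.apache.poi.hwpf.HWPFDocument;
    import org.apache.poi.poifs.filesystem.POIFSFileSystem;

    public class OpenWordFile {
        public static void main(String[] args) throws Exception {
            // Open the OLE2 container first; the same POIFSFileSystem can then
            // be reused for text extraction (HWPFDocument) and for metadata.
            FileInputStream in = new FileInputStream("contract.doc"); // hypothetical file
            POIFSFileSystem poiFileSystem = new POIFSFileSystem(in);
            HWPFDocument doc = new HWPFDocument(poiFileSystem);
            // ... iterate over doc.getTextTable().getTextPieces() as shown above ...
            in.close();
        }
    }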
Re: Word files & Build vs. Buy?
Dmitry Goldenberg wrote:

Awesome stuff. A few questions: is your Excel extractor somehow better than POI's? And what do you see as the timeframe for adding WordPerfect support? Are you considering supporting any other sources such as MS Project, Framemaker, etc.?

I just committed a WordPerfectExtractor ;) It's based on code developed in-house at Aduna and it seems to work quite well on my test collection of WordPerfect documents. Only sometimes words are split in the middle; I'm still looking into that. The test set has a bias towards older WordPerfect documents though, so I'm trying to get my hands on a recent copy of WordPerfect to see if the latest format is also supported and to create unit tests for it.

To interactively test the extractor(s) yourselves:

- check out Aperture from CVS (see http://sourceforge.net/cvs/?group_id=150969)
- do "ant release"
- go to build\release\bin and execute fileinspector.bat
- drag any file (WordPerfect or any other format) onto it to see what MIME type Aperture thinks it is and to execute the corresponding Extractor, if available. The two tabs show the extracted full-text and an RDF dump of the metadata.

For WordPerfect, only full-text extraction is currently supported.

Our ExcelExtractor is basically nothing more than glue code between POI and the rest of our framework, meaning that an application using the framework can request an Extractor implementation for "application/vnd.ms-excel", feed it an InputStream and get the text and metadata back. The only advantage of our ExcelExtractor over direct use of POI is that, when POI throws an Exception on a particular document, it reverts to a heuristic string extraction algorithm which is often able to extract full-text from a document with reasonable quality, i.e. suited for indexing.

We are certainly considering supporting more formats. Which ones we will work on depends on a number of factors, e.g. availability of open source libs for that format, complexity of the file format (we did WordPerfect ourselves), customer demand, code contributions from others, etc. In any case, if you need support for format XYZ, you can always send me some example files and I'll take a look at how hard it is to add support for it.

Chris
--
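[Editor's note: the heuristic string-extraction fallback mentioned above is essentially a scan for runs of printable characters in the raw bytes. A minimal, generic sketch of that idea follows; the threshold and character test are assumptions, not Aperture's actual algorithm.]

    public class PrintableStringScanner {
        // Collect runs of at least MIN_RUN printable characters from raw bytes,
        // which yields rough but indexable text from many binary office files.
        private static final int MIN_RUN = 4;

        public static String extract(byte[] data) {
            StringBuilder out = new StringBuilder();
            StringBuilder run = new StringBuilder();
            for (byte b : data) {
                char c = (char) (b & 0xFF);
                if (c >= 0x20 && c < 0x7F) {        // crude "printable ASCII" test
                    run.append(c);
                } else {
                    if (run.length() >= MIN_RUN) {
                        out.append(run).append(' ');
                    }
                    run.setLength(0);
                }
            }
            if (run.length() >= MIN_RUN) {
                out.append(run);
            }
            return out.toString();
        }
    }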
Aperture 2006.1 alpha 2 released
A little while ago I announced the existence of the Aperture project, founded by my company together with the DFKI institute. We just released Aperture 2006.1 alpha 2, which may be of interest to all Lucene users dealing with crawling and text extraction. The project page is located at:

http://sourceforge.net/projects/aperture

To summarize, Aperture now has code for the following tasks:

- Crawling of file systems, websites and IMAP folders. An Outlook mailbox crawler is also in the works; any help is welcome.
- Text and metadata extraction of a large and growing number of document formats, e.g. MS Office files, MS Works, OpenOffice, OpenDocument, RTF, PDF, WordPerfect, Quattro, Presentations, HTML, XML, plain text...
- A robust magic number-based MIME type identifier, a must for choosing the right extractor for a given document.
- Security-related classes for handling self-signed certificates when communicating using SSL.

Most of the code is already in good shape. The reason that it is still labeled "alpha" is that we only recently started applying Aperture in our own software, which may still lead to certain (probably minor) API changes.

Future plans include continuously extending the set of extractors, e.g. by including extractors for mp3, images, videos, etc., adding support for Thunderbird and other mail clients, and support for expanding and crawling archives, address books, ...

Furthermore, we are working on metadata storage facilities that build upon Lucene and Sesame, an RDF storage and query engine (see www.openrdf.org). This should combine the expressiveness of RDF and the performance and scalability of Sesame with Lucene's full-text indexing capabilities.

For questions, please consider joining the aperture-devel mailing list.

Regards,

Christiaan Fluit
--
[EMAIL PROTECTED]
Aduna
Prinses Julianaplein 14-b
3817 CS Amersfoort
The Netherlands
+31 33 465 9987 phone
+31 33 465 9987 fax
http://aduna.biz
Re: Lucene indexing RDF
adasal wrote:

As far as I have researched this, I know that the Gnowsis project uses both RDF and Lucene, but I have not had time to determine their relationship. www.gnowsis.org/

I can tell you a bit about Gnowsis, as we (Aduna) are cooperating with the Gnowsis people on RDF creation, storage and querying in the Aperture project (aperture.sourceforge.net).

Both the latest Gnowsis beta version and various Aduna products use the Sesame framework (openrdf.org) to store and query RDF. One of Sesame's core interfaces is the Sail (Storage And Inference Layer), which provides an abstraction over a specific type of RDF store, e.g. in-memory, file-based, RDBMS-based, ...

We have developed a Sail implementation that combines a file-based RDF store with a Lucene index. The purpose of this Sail is to provide a means to query both document full-text and metadata using an RDF model. The way we realized this is that document metadata is stored in the RDF store, the full-text and other text-like content is indexed in Lucene, and the RDF model is extended with a virtual property connecting a Document resource to a query literal that can be used to query the full-text. The dedicated Sail knows that that property should not be looked up in the RDF store but should instead be evaluated as a Lucene query. If you want, I can send you example code that shows how we did this.

We have some ideas on generalizing this approach, as ideally you would like to be able to query all RDF literals using Lucene's query facilities while making use of the logical RDF structure (which is what you want, if I understand you correctly), even when the structure of the stored models is not known at development time. However, little work has been done on this. I guess that when we start working on this, the code for it will end up in either the Sesame or the Aperture code base.

Chris
--
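[Editor's note: to illustrate the Lucene half of that pairing, here is a minimal sketch in plain Lucene 2.x - not Sesame's actual Sail API. Each document's full-text is indexed together with the URI of the RDF resource it describes, so a full-text hit can be mapped back to the resource whose metadata lives in the RDF store. Field names, the URI and the sample text are placeholders.]

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.RAMDirectory;

    public class RdfFullTextSketch {
        public static void main(String[] args) throws Exception {
            RAMDirectory dir = new RAMDirectory();
            StandardAnalyzer analyzer = new StandardAnalyzer();

            // Index: one Lucene document per RDF resource, holding its URI
            // (stored, not analyzed) and its extracted full-text.
            IndexWriter writer = new IndexWriter(dir, analyzer, true,
                    IndexWriter.MaxFieldLength.UNLIMITED);
            Document doc = new Document();
            doc.add(new Field("uri", "file:///docs/report.doc",        // hypothetical resource
                    Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("text", "quarterly report on merge behaviour",
                    Field.Store.NO, Field.Index.ANALYZED));
            writer.addDocument(doc);
            writer.close();

            // Query: evaluate the "virtual property" as a Lucene query and map
            // the hits back to resource URIs whose metadata lives in the RDF store.
            IndexSearcher searcher = new IndexSearcher(dir);
            TopDocs hits = searcher.search(
                    new QueryParser("text", analyzer).parse("merge"), null, 10);
            for (ScoreDoc hit : hits.scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("uri"));
            }
            searcher.close();
        }
    }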
Re: which way to index pdf,word,excel
Have a look at Aperture: http://aperture.sourceforge.net/

It provides components for crawling and for text and metadata extraction. It's still in alpha stage though. The development code in CVS has already improved a lot over the last official alpha release.

Chris
--

James liu wrote:

i wanna find frame which can index xml,word,excel,pdf,,,not one.

2006/9/6, Doron Cohen <[EMAIL PROTECTED]>:

Lucene FAQ - http://wiki.apache.org/jakarta-lucene/LuceneFAQ - has a few entries just for this:

- How can I index HTML documents?
- How can I index XML documents?
- How can I index OpenOffice.org files?
- How can I index MS-Word documents?
- How can I index MS-Excel documents?
- How can I index MS-Powerpoint documents?
- How can I index Email (from MS-Exchange or another IMAP server)?
- How can I index RTF documents?
- How can I index PDF documents?
- How can I index JSP files?

"James liu" <[EMAIL PROTECTED]> wrote on 05/09/2006 19:14:24:
i find lius many question so i wanna give up and find new.
who recommend ?

Kind regards,

Christiaan Fluit
--
Aduna - Guided Exploration
www.aduna-software.com
Prinses Julianaplein 14-b
3817 CS Amersfoort
The Netherlands
+31-33-4659987 (office)