semi-infinite loop during merging

2009-04-14 Thread Christiaan Fluit

Hello all,

I have a very peculiar problem that is driving me crazy: on some of our 
datasets and at some point in time during indexing, the merge operation 
runs into a (semi-)infinite loop and keeps adding files to the index 
until it runs out of free disk space.


The situation: I have an indexing application that uses Lucene 2.4.1. 
Only one IndexWriter is involved, operating on an FSDirectory and using 
the compound file format. The index is created from scratch. No 
IndexReaders or IndexSearchers are open during indexing (double-checked 
by adding explicit log statements where they are created).
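
For reference, the writer is opened roughly like this (a sketch; the 
analyzer and the path stand in for my actual code):

    Directory dir = FSDirectory.getDirectory("d:\\index");
    IndexWriter writer = new IndexWriter(dir, analyzer,
        true, IndexWriter.MaxFieldLength.UNLIMITED);  // create=true
    writer.setUseCompoundFile(true);   // compound file format
    writer.setInfoStream(System.out);  // produces the output quoted below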


For reasons unrelated to Lucene, the application is compiled with JET, a 
commercial Java Windows compiler. A regular Java build has produced the 
problem only once. The JET build does it every time - unless I keep 
pressing F5 continuously in Windows Explorer on the index dir.


Here is what I see happening in the index dir:

- At first, it builds _0.cfs to _9.cfs without problems. The files vary 
in size between 12 MB and 49 MB and add up to about 250 MB.


- Then, .del files are generated for some of these .cfs files. The 
number of .del files, and which .cfs files they correspond to, differs 
from run to run. I don't understand why these are created, as no 
IndexReaders exist at this time.


- Then, it generates files called _a.fdt, _a.fdx, _a.fnm, _a.frq, 
_a.nrm, _a.prx, _a.tii, _a.tis, _a.tvd, _a.tvf, _a.tvx. Together these 
files add up to 219 MB. I assume this is the start of the merge of the 
10 .cfs files and this is all still correct.


- Then, it generates _b files with those same extensions, then _c, _d, 
etc. It only keeps generating new files; I never see files disappear. 
The original .cfs files are still there.


- This continues until my hard drive is out of free space. In one test I 
was at _8n and the index had grown from 250 MB to 64 GB. Then the 
application just hangs.


Interestingly, after killing the application in this test, there were 
_8k.cfs and _8m.cfs files of 20 MB and 27 MB respectively. No other .cfs 
files existed.


In some older threads on this list (e.g. 
http://marc.info/?l=lucene-user&m=108300530413241&w=2) I read that 
"Win32 seems to sometimes not permit one to delete a file immediately 
after it has been closed". Could this explain the problem? Perhaps the 
JET-compiled app gets to delete the file quicker than when the code is 
running inside a Java VM and therefore runs into this issue? This would 
also explain why pressing F5 during indexing lets the application 
continue: external activity causing some manual delay.


At the end of this mail I have added the output of the InfoStream 
installed on the IndexWriter, showing everything from the start to the 
first few problematic merges. Lines starting with "===" are println's in 
my own code to make sure that indeed only IndexWriters are generated and 
no IndexSearchers/-Readers. The fun starts almost at the bottom, at the 
first line containing "LMP: findMerges: 10 segments". The next lines 
then get repeated over and over again, with different segment names. I 
cannot explain why it mentions "1 deleted docIDs" a couple of lines 
below the first "findMerges: 10", as no IndexWriter.deleteDocuments call 
takes place.


As you can see in this output, I am setting a SerialMergeScheduler to 
rule out concurrency issues and making debugging easier. Both 
SerialMergeScheduler and ConcurrentMergeScheduler give this problem though.
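
In code this is just (using the writer from the sketch above):

    writer.setMergeScheduler(new SerialMergeScheduler());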


I would be grateful if anyone could shed some light on this or advise me 
on what to try next. I even considered hacking the FSDirectory code and 
adding some delay in the deleteFile operation to see if the 
above-mentioned win32 issue is the cause, but unless you know what 
you're doing, such hacks can introduce exactly these problems in the 
first place.
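
Concretely, I had something like this in mind (a sketch against the 
2.4.1 FSDirectory source; the retry count and delay are arbitrary):

    public void deleteFile(String name) throws IOException {
        File file = new File(directory, name);
        for (int attempt = 0; attempt < 10; attempt++) {
            if (file.delete())
                return;
            try {
                // give win32 some time to release the file handle
                Thread.sleep(100);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        throw new IOException("Cannot delete " + file);
    }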



Kind regards,

Chris
--

=== Creating new FSDirectory
=== Opening indexWriter (create=true)
IFD [AWT-EventQueue-0]: setInfoStream 
deletionPolicy=org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy@9fb0688
IW 0 [AWT-EventQueue-0]: setInfoStream: 
dir=org.apache.lucene.store.FSDirectory@d:\index autoCommit=false 
mergePolicy=org.apache.lucene.index.LogByteSizeMergePolicy@a11d0d8 
mergeScheduler=org.apache.lucene.index.ConcurrentMergeScheduler@a3fa9d8 
ramBufferSizeMB=16.0 maxBufferedDocs=-1 maxBuffereDeleteTerms=-1 
maxFieldLength=2147483647 index=
IW 0 [AWT-EventQueue-0]: setMergeScheduler 
org.apache.lucene.index.SerialMergeScheduler@a3d1fe8
IW 0 [CrawlThread]: DW:   RAM: now balance allocations: usedMB=46.015 vs 
trigger=16 allocMB=46.015 vs trigger=16.8 byteBlockFree=0 charBlockFree=0

IW 0 [CrawlThread]: DW: nothing to free; now set bufferIsFull
IW 0 [CrawlThread]: DW: after free: freedMB=0 usedMB=46.015 
allocMB=46.015
IW 0 [CrawlThread]:   flush: segment=_0 docStoreSegment=_0 
docStoreOffset=0 flushDocs=true flushDeletes=false flushDocStores=false 
numDocs=2193 numBufDelTerms=0

IW 0 [CrawlThread]:   index before flush
IW 0

Re: semi-infinite loop during merging

2009-04-16 Thread Christiaan Fluit
I spent a lot of time on getting the stack traces, but JET seems to make 
this impossible. Ctrl-Break, connecting with JConsole, and even a "Dump 
Threads" button in my UI that uses Thread.getAllStackTraces were all 
unable to produce a dump of all threads.


I just got an additional confirmation that the problem also occurs with 
the Java build, but unfortunately the client's data is too sensitive to 
share it with me.


One option is to hack the Lucene 2.4.1 code to print out some additional 
debug info. Do you know of some println's that would help?


Also I suspect that JET-compiled code is able to do Thread.dumpStack 
(but not Thread.getAllStackTraces), so what are good locations for doing 
that? E.g. IndexWriter.merge, etc.



Regards,

Chris
--

Michael McCandless wrote:

Hmmm, very very odd.

First off, your "1 deleted docID" is because one document hit an
exception during indexing, likely in enumerating tokens from the
TokenStream; I see this line:

  IW 0 [CrawlThread]: hit exception adding document

But I think that's fine (certainly should not cause what you are
seeing).

Indeed, it looks like IndexWriter decides it's time to merge the first
10 segments, and it starts that merge, but for some reason before that
merge completes it seems to "recurse" and re-start the merge, to a
different destination segment (_a then _b then _c, etc.).

I.e. the first merge never completes (we don't see lines with
"commitMerge: merge=... index=...").

I don't think "not being able to immediately delete on windows" should
cause this; IW simply retries the delete periodically.  I think
something more sinister is at play...

On Unix, one can "kill -SIGQUIT" to get a thread stack trace dump for
all threads; do you know how to do this on Windows?  If so, can you do
that at the end when IW starts doing this infinite merging?  That
would be very helpful towards understanding why this recursion is
happening (though it is spooky that this is all happening under
JET...)

Mike

Re: MergeException

2009-04-21 Thread Christiaan Fluit
I have experienced similar problems (see the "semi-infinite loop during 
merging" thread - still working out the problem): the merger gets into 
an infinite loop and causes my drive to be filled with temporary files 
that are not deleted, until it runs out of space. Sometimes it exits 
with a MergeException wrapping one of a variety of IOExceptions (e.g. a 
FileNotFoundException), sometimes it just keeps on consuming all 
available CPU time.


I think the "_k0z.fnm" file name indicates that a lot of segments have 
already been created, as it starts iterating with _0, _1, ... I don't 
want to jump to conclusions immediately, but this is consistent with a 
merger gone loose.


Was your drive full as well afterwards?

Regards,

Chris
--

Martine Woudstra wrote:

Hi all,

I'm using Lucene 2.4.1 for building an n-gram index. Indexing works
well until I try to open the index built so far with Luke. A
MergeException is thrown, see below. Opening an index with Luke during
indexing never caused problems with Lucene 2.3. Anyone familiar with
this problem?

Thanks in advance,

Martine van der Heijden


Exception in thread "Lucene Merge Thread #3067"
org.apache.lucene.index.MergePolicy$MergeException:
java.io.FileNotFoundException: D:\indexngram\_k0z.fnm (The system
cannot find the file specified)
	at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:309)
	at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:286)
Caused by: java.io.FileNotFoundException: D:\indexngram\_k0z.fnm (The
system cannot find the file specified)
	at java.io.RandomAccessFile.open(Native Method)
	at java.io.RandomAccessFile.<init>(RandomAccessFile.java:212)
	at org.apache.lucene.store.FSDirectory$FSIndexInput$Descriptor.<init>(FSDirectory.java:552)
	at org.apache.lucene.store.FSDirectory$FSIndexInput.<init>(FSDirectory.java:582)
	at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:488)
	at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:482)
	at org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java:221)
	at org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java:184)
	at org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java:204)
	at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4263)
	at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3884)
	at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:205)
	at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:260)




Re: MergeException

2009-04-21 Thread Christiaan Fluit

Michael McCandless wrote:

On Tue, Apr 21, 2009 at 4:26 PM, Christiaan Fluit
 wrote:

I have experienced similar problems (see the "semi-infinite loop during
merging" thread - still working out the problem): the merger gets into an
infinite loop and causes my drive to be filled with temporary files that are
not deleted, until it runs out of space.


Christiaan, I responded on that thread on where to sprinkle prints...
any update?


I have some new results. I will post them in the original thread.


Regards,

Chris
--




Re: semi-infinite loop during merging

2009-04-21 Thread Christiaan Fluit

Michael McCandless wrote:

One question: are you using IndexWriter.close(false)?  I wonder if
there's some path whereby the merges fail to abort (and simply keep
retrying) if you do that...


No, I don't.


More inlined below...

On Thu, Apr 16, 2009 at 5:42 AM, Christiaan Fluit
 wrote:

I spent a lot of time on getting the stack traces, but JET seems to make this
impossible. Ctrl-Break, connecting with JConsole, and even a "Dump Threads"
button in my UI that uses Thread.getAllStackTraces were all unable to produce
a dump of all threads.


Sigh.


I just got an additional confirmation that the problem also occurs with the
Java build, but unfortunately the client's data is too sensitive to share it
with me.


Could they "kill -QUIT" or  Ctrl-Break when it's happening?


The confirmation was received recently but the problem occurred a while 
ago and can't be reproduced now.



One option is to hack the Lucene 2.4.1 code to print out some additional
debug info. Do you know of some println's that would help?


How about adding this at the top of IndexWriter.handleMergeException:

  if (infoStream != null) {
    message("handleMergeException: merge=" +
        merge.segString(directory) + " exc=" + t);
  }

?

You could also sprinkle prints in between where message("merge:
total"...) occurs and the call to commitMerge, in
IndexWriter.mergeMiddle.  The mystery here is why you see these two
lines in a row:

  IW 0 [CrawlThread]: merge: total 18586 docs
  IW 0 [CrawlThread]: LMP: findMerges: 10 segments

That first line is supposed to be followed by a "commitMerge: ..."
line.  I suspect some exception (maybe a MergeAbortedException) is
being hit, leading to the commit not happening.


Also I suspect that JET-compiled code is able to do Thread.dumpStack (but
not Thread.getAllStackTraces), so what are good locations for doing that?
E.g. IndexWriter.merge, etc.


This would be great -- I would add that in handleMergeException as well.


A test run with the modified code just completed minutes ago.

I modified IndexWriter in two places. IW.handleMergeException starts 
with the following code:


    if (infoStream != null) {
        message("handleMergeException: merge=" + merge.segString(directory)
            + " exc=" + t);

        message("dumping current stack");
        new Exception("Stack trace").printStackTrace(infoStream);

        message("dumping throwable");
        t.printStackTrace(infoStream);
    }

To IW.mergeMiddle I added eight message(String) invocations:

[...]
    try {
        [...]

        message("IW: checkAborted");
        merge.checkAborted(directory);
        message("IW: checkAborted done");

        // This is where all the work happens:
        message("IW: mergedDocCount");
        mergedDocCount = merge.info.docCount = merger.merge(merge.mergeDocStores);
        message("IW: mergedDocCount done");

        assert mergedDocCount == totDocCount;
    } finally {
        // close readers before we attempt to delete
        // now-obsolete segments
        if (merger != null) {
            message("IW: closeReaders");
            merger.closeReaders();
            message("IW: closeReaders done");
        }
    }

    message("IW: commitMerge");
    if (!commitMerge(merge, merger, mergedDocCount))
        // commitMerge will return false if this merge was aborted
        return 0;
    message("IW: commitMerge done");
[...]

This gives the following output:

[...]
IFD [CrawlThread]: delete "_9.fnm"
IFD [CrawlThread]: delete "_9.frq"
IFD [CrawlThread]: delete "_9.prx"
IFD [CrawlThread]: delete "_9.tis"
IFD [CrawlThread]: delete "_9.tii"
IFD [CrawlThread]: delete "_9.nrm"
IFD [CrawlThread]: delete "_9.tvx"
IFD [CrawlThread]: delete "_9.tvf"
IFD [CrawlThread]: delete "_9.tvd"
IFD [CrawlThread]: delete "_9.fdx"
IFD [CrawlThread]: delete "_9.fdt"
IW 1 [CrawlThread]: LMP: findMerges: 10 segments
IW 1 [CrawlThread]: LMP:   level 6.9529195 to 7.7029195: 10 segments
IW 1 [CrawlThread]: LMP: 0 to 10: add this merge
IW 1 [CrawlThread]: add merge to pendingMerges: _0:c2201 _1:c1806 
_2:c1023 _3:c430 _4:c1166 _5:c812 _6:c1737 _7:c2946 _8:c3129 _9:c3429 
[total 1 pending]

IW 1 [CrawlThread]: now merge
  merge=_0:c2201 _1:c1806 _2:c1023 _3:c430 _4:c1166 _5:c812 _6:c1737 
_7:c2946 _8:c3129 _9:c3429 into _a

  merge=org.apache.lucene.index.MergePolicy$OneMerge@14171768
  index=_0:c2201 _1:c1806 _2:c1023 _3:c430 _4:c1166 _5:c812 _6:c1737 
_7:c2946 _8:c3129 _9:c3429
IW 1 [CrawlThread]: merging _0:c2201 _1:c1806 _2:c1023 _3:c430 _4:c1166 
_5:c812 _6:c1737 _7:c2946 _8:c3129 _9:c3429 into _a

IW 1 [CrawlThread]: merge: total 18678 docs
IW 1 [CrawlThread]: IW: checkAborted
IW 1 [CrawlThread]: IW: checkAborted done
IW 1 [CrawlThread]: IW: mergedDo

Re: semi-infinite loop during merging

2009-04-21 Thread Christiaan Fluit

Christiaan Fluit wrote:
It seems that it gets to the point of calling commitMerge, but the "IW: 
commitMerge done" message is never reached.


Furthermore, no exceptions are printed to the output, so 
handleMergeException does not seem to have been invoked.


Should I add more debug statements elsewhere?


I may be on to something already.

I just looked at the commitMerge code and was surprised to see that the 
message near the top of commitMerge was never printed. Then I saw the 
"if (hitOOM) return false;" check that runs before it. I think this can 
only mean that an OOME was encountered at some point in time.
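
For reference, the top of commitMerge looks roughly like this 
(paraphrased from the 2.4.1 source):

    private synchronized boolean commitMerge(MergePolicy.OneMerge merge,
            SegmentMerger merger, int mergedDocCount) throws IOException {
        // once an OOME has been hit, the writer refuses to commit any merge
        if (hitOOM)
            return false;
        if (infoStream != null)
            message("commitMerge: " + merge.segString(directory)
                + " index=" + segString());
        [...]
    }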


Now, the fact is that in my indexing code I do a catch(Throwable) in 
several places. I do this particularly because JET handles OOMEs in a 
very, very nasty way. Often you will just get an error dialog and then 
it quits the entire application. Therefore, my client code catches, logs 
and swallows the OOME before the JET runtime can intercept it. 
*Usually*, the application can then recover gracefully and continue 
processing the rest of the information.
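
The pattern is basically this (a minimal sketch; the logger call stands 
in for my own code):

    try {
        writer.addDocument(document);
    } catch (OutOfMemoryError e) {
        // swallow the OOME before the JET runtime can intercept it and
        // kill the whole application; log it for later diagnosis
        logger.warn("OOME while processing document", e);
    }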


Catching an OOME that results from the operation of a text extraction 
library is one thing (and a fact of life, really), but perhaps there are 
also OOME's that occur during Lucene processing.


I remember seeing those in the past with the original Java code, when 
very large Strings were being tokenized and I got an OOME with a deep 
Lucene stacktrace. I have copied one such saved stacktrace at the end of 
this mail.


I see some caught and swallowed OOME's in my log file but unfortunately 
they are without a stacktrace - probably again a JET issue. I can run 
the normal Java build though to see if such OOMEs occur on this dataset.


Now, I wonder:

- when the IW is in auto-commit mode, can the failed processing of a 
Document due to an OOME have an impact on the processing of subsequent 
Documents or the merge/optimize operations? Can the index(writer) become 
corrupt and result in problems such as these?


- even though the commitMerge returns false, it should probably not get 
into an infinite loop. Is this an internal Lucene problem or is there 
something I can/should do about it myself?


- more generally, what is the recommended behavior when I get an OOME 
during Lucene processing, particularly IW.addDocument? Should the IW be 
able to recover by itself, or is there some sort of rollback I need to 
perform? Again, note that my index is in auto-commit mode (though I had 
hoped to let go of that too; it's only there for historical reasons).



Regards,

Chris
--

java.lang.OutOfMemoryError: Java heap space
	at org.apache.lucene.index.DocumentsWriter.getPostings(DocumentsWriter.java:3069)
	at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.addPosition(DocumentsWriter.java:1696)
	at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.invertField(DocumentsWriter.java:1525)
	at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.processField(DocumentsWriter.java:1412)
	at org.apache.lucene.index.DocumentsWriter$ThreadState.processDocument(DocumentsWriter.java:1121)
	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:2442)
	at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:2424)
	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1464)
	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1442)
	at info.aduna...




Re: semi-infinite loop during merging

2009-04-24 Thread Christiaan Fluit

Michael McCandless wrote:

- even though the commitMerge returns false, it should probably not get into
an infinite loop. Is this an internal Lucene problem or is there something I
can/should do about it myself?


Yes, something is wrong with Lucene's handling of OOME.  It certainly
should not lead to infinite merge attempts.  I'll dig (once back from
vacation) to see if I can find this path.  Likely we need to prevent
launching of new merges after an OOME.  I think you must've happened
to hit OOME when a merge was running.


I have some more info.

I added message(String) invocations in all places where the IW.hitOOM 
flag is set, to see which method turns it on. It turned out to be 
addDocument (twice). These OOME's only happen with the JET build, which 
explains why the Java build does not show the exploding index behavior: 
the hitOOM flag is simply never set and the merge is allowed to proceed 
normally.


The flag is definitely not set while the IW is merging, nor do any 
OOME's appear in my log files during merging. Therefore, there must be a 
problem in how the merge operation responds to the flag being set.


Rollback does not work for me, as my IW is in auto-commit mode. It gives 
an IllegalStateException when I invoke it.


A workaround that does work for me is to close and reopen the 
IndexWriter immediately after an OOME occurs.
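
Sketched (a minimal example; the deprecated auto-commit constructor 
matches my existing setup):

    try {
        writer.addDocument(document);
    } catch (OutOfMemoryError e) {
        // the writer is unusable once its hitOOM flag is set:
        // discard it and continue with a fresh one on the same index
        try {
            writer.close();
        } catch (Exception ignored) {
            // best effort
        }
        writer = new IndexWriter(dir, true, analyzer, false);
    }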


Let me know if I can be of any more help.


Regards,

Chris
--




Re: Exchange/PST/Mail parsing

2007-07-02 Thread Christiaan Fluit

Hello Grant (cc-ing aperture-devel),

I am one of the Aperture admins, so I can tell you a bit more about 
Aperture's mail facilities.


Short intro: Aperture is a framework for crawling and for full-text and 
metadata extraction of a growing number of sources and file formats. We 
try to select the best of breed among the large number of open source 
libraries that tackle a specific source or format (e.g. PDFBox, POI, 
JavaMail) and write some glue code around them so that they can be 
invoked in a uniform way. It's currently used in a number of desktop and 
enterprise search applications, both research and production systems.


At the moment we support a number of mail systems.

We can crawl IMAP mailboxes through JavaMail. In general it seems to 
work well; problems are usually caused by IMAP servers not conforming to 
the IMAP specs.


Some people have used the ImapCrawler to crawl MS Exchange as well. Some 
succeeded, some didn't. I don't really know whether the fault is in 
Aperture's code or in the Exchange configuration, but I would be happy 
to take a look when someone runs into problems.


Outlook can also be crawled by connecting to a running Outlook process 
through jacob.dll. Others on aperture-devel can tell you more about its 
current status. Besides this crawler, I would also be very interested in 
having a crawler that directly processes .pst files, so as to steer 
clear of communicating with other processes outside your own control.


People have been working on crawling Thunderbird mailboxes but I don't 
know what the current status is.


Ultimately, we try to support any major mail system. In practice, effort 
is usually dependent on knowledge and experience as well as customer demand.


We are happy to help you get Aperture working in your domain and to look 
into any problems you may encounter.



Kind regards,

Chris
--

Grant Ingersoll wrote:
Anyone have any recommendations on a decent, open (doesn't have to be 
Apache license, but would prefer non-GPL if possible), extractor for MS 
Exchange and/or PST files?  The Zoe link on the FAQ [1] seems dead.


For mbox, I think mstor will suffice for me, and I think tropo (from the 
FAQ) should work for IMAP.  Does anyone have experience with 
http://aperture.sourceforge.net/ ?


[1] 
http://wiki.apache.org/lucene-java/LuceneFAQ#head-bcba2effabe224d5fb8c1761e4da1fedceb9800e 



Cheers,
Grant





Re: Word files & Build vs. Buy?

2006-02-09 Thread Christiaan Fluit

Hello all,

I'm replying to two threads at once as what I have to say relates to both.

My company recently started an open source project called Aperture 
(http://sourceforge.net/projects/aperture), together with the German 
DFKI institute. The project is still very much in alpha stage, but I do 
believe we already have some code parts that could help people here.


Basically, it's a framework for crawling information sources (file 
systems, mail folders, websites, ...) and extracting as much information 
from it as possible. Besides full-text extraction, we also put a lot of 
effort in extraction and modeling of the metadata occurring in these 
sources and document formats. Both parties have some proprietary code 
lying on the shelf that is being open sourced and ported to the Aperture 
architecture.


Now on to the raised questions:

[EMAIL PROTECTED] wrote:

WordDocument wd = new WordDocument(is);


[EMAIL PROTECTED] wrote:

MS Word - I know that POI exists, but development on the Word portion
seems to have stopped, and there are a lot of nasty looking bugs in
their DB.  Since we're involved in dealing with contracts, many of our
Word files are large and complicated.  How has everyone's experience
with POI's Word parsing been?


My experience is that the WordDocument class crashes on about 25% of the 
documents, i.e. it throws some sort of Exception. I've tested POI 
2.5.1-final as well as the current code in CVS, but both produce this 
result. I even suspect the output to be 100% the same, but I haven't 
verified this.


Another reason I don't like this class is that it operates on an 
InputStream and internally creates a POIFSFileSystem which you cannot 
access, so it becomes hard to also extract document metadata (for which 
you need the POIFSFileSystem) without buffering the entire InputStream. 
The same applies to TextMining's WordExtractor, which also operates on 
top of lower-level POI components.


I've recently committed a WordExtractor to Aperture that uses its own 
code operating on these lower-level POI data structures, which works a 
lot better, failing on only 5% of my 300 test docs. I don't pretend to 
understand all the internals of the POI APIs, but it Works For Me.


When POI throws an exception, the WordExtractor will revert to applying 
a heuristic string extraction algorithm to extract as much 
human-readable text as possible from the binary stream. This works quite 
well on MS Office files, i.e. the output is reasonably good for indexing 
purposes.
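
In code, the fallback amounts to something like this (a sketch; the two 
method names are placeholders for the Aperture internals, and the 
document bytes are assumed to be buffered so they can be read twice):

    String text;
    try {
        // regular structured extraction via POI
        text = extractWithPoi(documentBytes);
    } catch (Exception e) {
        // POI choked on this document: fall back to scanning the raw
        // bytes for human-readable strings
        text = heuristicStringExtraction(documentBytes);
    }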


Be sure to check out Aperture from CVS, as this code isn't part of the 
alpha 1 release. The next official release is expected in a month.


[EMAIL PROTECTED] wrote:

RTF - javax.swing looks fine, we use those classes already.


Swing's RTFEditorKit does indeed work surprisingly well. "Surprisingly" 
because in the past I had many issues with it, typically throwing 
exceptions on 25-50% of my test documents. Recently I haven't seen a 
single one (using Java 1.5.0), so either I am now feeding it a more 
optimal document set or the Swing people have worked on the 
implementation. In the latter case, people using Java 1.4.x may see 
different results.



Word Perfect - There doesn't seem to be any converters for this format?


I'm actively working on this :) We have some proprietary code that will 
become part of Aperture. Right now I cannot say exactly how well it 
performs in practice, although we've never had complaints about our 
proprietary apps.


The code uses a heuristic string extraction algorithm tuned for 
WordPerfect documents. This may be an issue, e.g. when you also want to 
display the extraction results to end users.


If you're interested: one way you can help me get the most out of it is 
by sending me some example WordPerfect documents because I hardly have 
those on my hard drive. Fake documents made with very new or old 
WordPerfect versions are also most welcome.



Regards,

Chris
http://aduna.biz
--




Re: Word files & Build vs. Buy?

2006-02-09 Thread Christiaan Fluit

Nick Burch wrote:
You could try using org.apache.poi.hwpf.HWPFDocument, and getting the 
range, then the paragraphs, and grab the text from each paragraph. If 
there's interest, I could probably commit an extractor that does this to 
poi.


Yes, that's exactly what I'm doing. Having this in POI would benefit me 
a lot though, as I hardly understand the POI basics to be honest (my 
fault, not POI's).


This is my current code (adapted from Aperture code in CVS):

HWPFDocument doc = new HWPFDocument(poiFileSystem);
StringBuffer buffer = new StringBuffer(4096);

Iterator textPieces = doc.getTextTable().getTextPieces().iterator();
while (textPieces.hasNext()) {
    TextPiece piece = (TextPiece) textPieces.next();

    // the following is derived from
    // http://article.gmane.org/gmane.comp.jakarta.poi.devel/7406
    String encoding = "Cp1252";
    if (piece.usesUnicode()) {
        encoding = "UTF-16LE";
    }

    buffer.append(new String(piece.getRawBytes(), encoding));
}

// normalize end-of-line characters and remove any lines
// containing macros
BufferedReader reader = new BufferedReader(
    new StringReader(buffer.toString()));
buffer.setLength(0);

String line;
while ((line = reader.readLine()) != null) {
    if (line.indexOf("DOCPROPERTY") == -1) {
        buffer.append(line);
        buffer.append(END_OF_LINE);
    }
}

// fetch the extracted full-text
String text = buffer.toString();


Regards,

Chris
--




Re: Word files & Build vs. Buy?

2006-02-10 Thread Christiaan Fluit

Dmitry Goldenberg wrote:

Awesome stuff. A few questions: is your Excel extractor somehow
better than POI's? And what do you see as the timeframe for adding
WordPerfect support? Are you considering supporting any other sources,
such as MS Project, Framemaker, etc.?


I just committed a WordPerfectExtractor ;)

It's based on code developed in-house at Aduna and it seems to work 
quite well on my test collection of WordPerfect documents. Occasionally 
words are split in the middle; I'm still looking into that.


The test set has a bias towards older WordPerfect documents though; I'm 
trying to get my hands on a recent copy of WordPerfect to see whether 
the latest format is also supported and to create unit tests for it.


To interactively test the extractor(s) yourselves:

- checkout Aperture from CVS (see 
http://sourceforge.net/cvs/?group_id=150969)

- do "ant release"
- go to build\release\bin and execute fileinspector.bat
- drag any file (WordPerfect or any other format) to see what MIME type 
Aperture thinks it is and to execute the corresponding Extractor, if 
available. The two tabs show the extracted full-text and an RDF dump of 
the metadata. For WordPerfect, only full-text extraction is currently 
supported.


Our ExcelExtractor is basically nothing more than glue code between POI 
and the rest of our framework, meaning that an application using the 
framework can request an Extractor implementation for 
"application/vnd.ms-excel", feed it an InputStream and get the text and 
metadata back.


The only advantage of our ExcelExtractor over direct use of POI is that, 
when POI throws an Exception on a particular document, it reverts to a 
heuristic string extraction algorithm which is often able to extract 
full-text from a document with reasonable quality, i.e. suited for indexing.


We are surely considering supporting more formats. Which ones we will 
work on depends on a number of factors, e.g. availability of open source 
libs for that format, complexity of the file format (we did WordPerfect 
by ourselves), customer demand, code contributions from others, etc. In 
any case, if you need support for format XYZ, you can always send me 
some example files and I'll take a look at how hard it is to add support 
for it.



Chris
--




Aperture 2006.1 alpha 2 released

2006-03-09 Thread Christiaan Fluit
A little while ago I announced the existence of the Aperture project, 
founded by my company together with the DFKI institute.


We just released Aperture 2006.1 alpha 2, which may be of interest to 
all Lucene users dealing with crawling and text extraction.


The project page is located at:

http://sourceforge.net/projects/aperture

To summarize, Aperture now has code for the following tasks:

- Crawling of file systems, websites and IMAP folders. An Outlook 
mailbox crawler is also in the works, any help is welcome.


- Text and metadata extraction of a large and growing number of document 
formats, e.g. MS Office files, MS Works, OpenOffice, OpenDocument, RTF, 
PDF, WordPerfect, Quattro, Presentations, HTML, XML, plain text...


- A robust magic number-based MIME type identifier, a must for choosing 
the right extractor for a given document.


- Security-related classes for handling self-signed certificates when 
communicating using SSL.


Most of the code is already in good shape. The reason that it is still 
labeled as "alpha" is that we only recently started applying Aperture in 
our own software, which may still lead to certain (probably minor) API 
changes.


Future plans include continuously extending the set of extractors, e.g. 
by including extractors for mp3, images, videos, etc., adding support 
for Thunderbird and other mail clients, support for expanding and 
crawling archives, address books, ...


Furthermore, we are working on metadata storage facilities that build 
upon Lucene and Sesame, an RDF storage and query engine (see 
www.openrdf.org). This should combine the expressiveness of RDF and the 
performance and scalability of Sesame with Lucene's full-text indexing 
capabilities.


For questions please consider joining the aperture-devel mailing list.


Regards,

Christiaan Fluit.

--
[EMAIL PROTECTED]

Aduna
Prinses Julianaplein 14-b
3817 CS Amersfoort
The Netherlands

+31 33 465 9987 phone
+31 33 465 9987 fax

http://aduna.biz




Re: Lucene indexing RDF

2006-06-28 Thread Christiaan Fluit

adasal wrote:

As far as I have researched this, I know that the Gnowsis project uses 
both RDF and Lucene, but I have not had time to determine their 
relationship. www.gnowsis.org/


I can tell you a bit about Gnowsis, as we (Aduna) are cooperating with 
the Gnowsis people on RDF creation, storage and querying in the Aperture 
project (aperture.sourceforge.net).


Both the latest Gnowsis beta version and various Aduna products use the 
Sesame framework (openrdf.org) to store and query RDF.


One of Sesame's core interfaces is the Sail (Storage And Inference 
Layer), which provides an abstraction over a specific type of RDF store, 
e.g. in-memory, file-based, RDBMS-based, ...


We have developed a Sail implementation that combines a file-based RDF 
storage with a Lucene index. The purpose of this Sail is to provide a 
means to query both document full-text and metadata using an RDF model.


The way we realized this: document metadata is stored in the RDF store, 
the full-text and other text-like content is indexed in Lucene, and the 
RDF model is extended with a virtual property connecting a Document 
resource to a query literal that can be used to query the full-text. The 
dedicated Sail knows that this property should not be looked up in the 
RDF store but should instead be evaluated as a Lucene query. If you 
want, I can send you example code that shows how we did this.


We have some ideas on generalizing this approach, as ideally you would 
like to be able to query all RDF literals using Lucene's query 
facilities while making use of the logical RDF structure (which is what 
you want, if I understand you correctly), even when the structure of the 
stored models is not known at development time. However, little work has 
been done on this. I guess that when we would start working on this, the 
code for it would end up in either the Sesame or Aperture code base.



Chris
--





Re: which way to index pdf,word,excel

2006-09-06 Thread Christiaan Fluit

Have a look at Aperture: http://aperture.sourceforge.net/
It provides components for crawling and text and metadata extraction. 
It's still in alpha stage though. The development code in CVS has 
already improved a lot over the last official alpha release.


Chris
--

James liu wrote:

I want to find a framework that can index XML, Word, Excel, and PDF - 
not just one format.


2006/9/6, Doron Cohen <[EMAIL PROTECTED]>:


Lucene FAQ - http://wiki.apache.org/jakarta-lucene/LuceneFAQ - has a few
entries just for this:

  How can I index HTML documents?
  How can I index XML documents?
  How can I index OpenOffice.org files?
  How can I index MS-Word documents?
  How can I index MS-Excel documents?
  How can I index MS-Powerpoint documents?
  How can I index Email (from MS-Exchange or another IMAP server) ?
  How can I index RTF documents?
  How can I index PDF documents?
  How can I index JSP files?


"James liu" <[EMAIL PROTECTED]> wrote on 05/09/2006 19:14:24:

> I have run into many problems with LIUS, so I want to give up on it
> and find something new.
>
> Who can recommend an alternative?









Kind regards,

Christiaan Fluit
--
Aduna - Guided Exploration
www.aduna-software.com

Prinses Julianaplein 14-b
3817 CS Amersfoort
The Netherlands
+31-33-4659987 (office)
