Re: index word files ( doc )

2007-03-28 Thread John Haxby
Daniel Noll wrote: The only screenshots I can see look like plain text to me, and I'm currently working on something which needs to convert Word to HTML, which is why I ask. wvWare, which I mentioned earlier, can convert Word to HTML and does a pretty good job of maintaining formatting. abiwor

Re: index word files ( doc )

2007-03-26 Thread Daniel Noll
Ryan Ackley wrote: >> Any comments on this are appreciated. One thing I thought of would be >> to continue to offer the text extraction as open source but add html >> conversion with hit highlighting for a variety of file formats as a >> commercial add on. Is this something anyone would pay for? W

Re: index word files ( doc )

2007-03-26 Thread Antony Bowesman
Ryan Ackley wrote: The 512 byte thing is a limitation of POIFS I think. I could be wrong though. Have you tried opening the file with just POIFS? It was some time ago, but it looks like I used both org.apache.poi.hwpf.extractor.WordExtractor org.apache.poi.hdf.extractor.WordDocument with the

Re: index word files ( doc )

2007-03-26 Thread John Haxby
John Haxby wrote: Sami Siren wrote: There's also antiword [1] which can convert your .doc to plain text or PS, not sure how good it is. antiword isn't very good. I use wvWare (http://wvware.sourceforge.net/) directly, but you may find that using abiword is better for you (abiword is an edi

Re: index word files ( doc )

2007-03-26 Thread John Haxby
Sami Siren wrote: There's also antiword [1] which can convert your .doc to plain text or PS, not sure how good it is. antiword isn't very good. I use wvWare (http://wvware.sourceforge.net/) directly, but you may find that using abiword is better for you (abiword is an editor, but it also do

Re: index word files ( doc )

2007-03-26 Thread Ryan Ackley
The 512 byte thing is a limitation of POIFS I think. I could be wrong though. Have you tried opening the file with just POIFS? On 3/26/07, Antony Bowesman <[EMAIL PROTECTED]> wrote: Ryan Ackley wrote: > Yes I do have plans for adding fast save support and support for more > file formats. The tim
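The snippet does not say what the "512 byte thing" actually is, so the following is only a library-free diagnostic sketch, not a description of POIFS internals. It assumes the file is an OLE2 compound document (the container format that POIFS reads), checks the 8-byte signature, and reads the sector size from the header. The class and method names (`Ole2Header`, `looksLikeOle2`, `sectorSize`) are hypothetical and are not part of POI.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical helper, not a POI class: inspects the raw OLE2 header bytes.
final class Ole2Header {
    // Signature found at the start of every OLE2 compound file.
    private static final byte[] MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Offset of the little-endian 16-bit "sector shift" field in the header.
    private static final int SECTOR_SHIFT_OFFSET = 30;

    private Ole2Header() {}

    /** Returns true if the bytes start with the OLE2 signature. */
    static boolean looksLikeOle2(byte[] header) {
        if (header.length < MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(header, MAGIC.length), MAGIC);
    }

    /** Returns the sector size in bytes (2^shift), e.g. 512 for a shift of 9. */
    static int sectorSize(byte[] header) {
        if (header.length < SECTOR_SHIFT_OFFSET + 2 || !looksLikeOle2(header)) {
            throw new IllegalArgumentException("not an OLE2 header");
        }
        int shift = ByteBuffer.wrap(header, SECTOR_SHIFT_OFFSET, 2)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getShort() & 0xFFFF;
        if (shift > 30) {
            throw new IllegalArgumentException("implausible sector shift: " + shift);
        }
        return 1 << shift;
    }
}
```

Reading the first 512 bytes of a problem file and passing them to `sectorSize` shows whether the container itself is intact before any Word-specific parser is involved.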

Re: index word files ( doc )

2007-03-26 Thread Ryan Ackley
g Ryan's textmining in preference to the POI as internally TM uses POI and the Word6 extractor so handles a greater variety of files. Ryan, thanks for fixing your site. Do you have any plans/ideas on how to parse the 'fast-saved' fil

Re: index word files ( doc )

2007-03-26 Thread jafarim
anks for fixing your site. Do you have any plans/ideas on how to parse the 'fast-saved' files and any ideas on Word files older than the Word 6 format? Regards Antony Ryan Ackley wrote: > As the author of both Word POI and textmining.org, I recomme

Re: index word files ( doc )

2007-03-25 Thread Antony Bowesman
Ryan Ackley wrote: Yes I do have plans for adding fast save support and support for more file formats. The time frame for this happening is the next couple of months. That would be good when it comes. It would be nice if it could handle a 'brute force' mode where in the event of problems, it

Re: index word files ( doc )

2007-03-25 Thread Daniel Noll
Ryan Ackley wrote: As the author of both Word POI and textmining.org, I recommend using textmining.org. POI is for general purpose manipulation of Word documents. textmining's only purpose is extracting text. I wish the two would collaborate though. It's true that POI contains code for writin

Re: index word files ( doc )

2007-03-25 Thread Ryan Ackley
so handles a greater variety of files. Ryan, thanks for fixing your site. Do you have any plans/ideas on how to parse the 'fast-saved' files and any ideas on Word files older than the Word 6 format? Regards Antony Ryan Ackley wrote: > As the author of both Word POI and

Re: index word files ( doc )

2007-03-25 Thread Antony Bowesman
I've been using Ryan's textmining in preference to the POI as internally TM uses POI and the Word6 extractor so handles a greater variety of files. Ryan, thanks for fixing your site. Do you have any plans/ideas on how to parse the 'fast-saved' files and any ideas on Word
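The idea of combining two extractors so that more files get through can be sketched as a simple fallback chain. This is a minimal illustration, not how textmining.org is actually implemented: `TextExtractor` and `FallbackExtractor` are hypothetical names, and a real setup would wrap the POI and Word6 extractors behind the interface.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical interface standing in for any real extractor (POI-based, Word6, ...).
interface TextExtractor {
    String extract(byte[] document) throws Exception;
}

// Tries each extractor in order and returns the first successful result.
final class FallbackExtractor {
    private final List<TextExtractor> extractors;

    FallbackExtractor(List<TextExtractor> extractors) {
        // Defensive copy so later changes to the caller's list do not affect the chain.
        this.extractors = new ArrayList<>(extractors);
    }

    /** Returns the extracted text, or empty if every extractor failed. */
    Optional<String> extract(byte[] document) {
        for (TextExtractor extractor : extractors) {
            try {
                String text = extractor.extract(document);
                if (text != null) {
                    return Optional.of(text);
                }
            } catch (Exception e) {
                // This extractor could not handle the file; fall through to the next one.
            }
        }
        return Optional.empty();
    }
}
```

Returning an empty `Optional` rather than throwing lets an indexing loop skip an unreadable document and carry on with the rest of the batch.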

Re: index word files ( doc )

2007-03-24 Thread Ryan Ackley
As the author of both Word POI and textmining.org, I recommend using textmining.org. POI is for general purpose manipulation of Word documents. textmining's only purpose is extracting text. Also, people recommend using POI for text extraction but the only place I've seen an actual how-to on this

Re: index word files ( doc )

2007-03-24 Thread jafarim
Can anyone make a comparison between the two, namely POI API and the one from textmining.org? On 3/24/07, Ryan Ackley <[EMAIL PROTECTED]> wrote: The site is down but you can download the word extractor library direct here: http://www.textmining.org/textmining.zip Going to fix the site this we

Re: index word files ( doc )

2007-03-24 Thread Ryan Ackley
The site is down but you can download the word extractor library directly here: http://www.textmining.org/textmining.zip Going to fix the site this weekend. On 3/24/07, Sami Siren <[EMAIL PROTECTED]> wrote: Antony Bowesman wrote: >> Are there other solutions? There's also antiword [1] which c

Re: index word files ( doc )

2007-03-23 Thread Sami Siren
Antony Bowesman wrote: >> Are there other solutions? There's also antiword [1] which can convert your .doc to plain text or PS, not sure how good it is. -- Sami Siren [1] http://www.winfield.demon.nl/

Re: index word files ( doc )

2007-03-23 Thread Antony Bowesman
www.textmining.org, but the site is no longer accessible. Check Nutch which has a Word parser - it seems to be the original textmining.org Word6+POI parser. Pre-word6 and "fast-saved" files will not work. I've not found a solution for those Antony [EMAIL PROTECTED] wrote: Thank you, Are

Re: index word files ( doc )

2007-03-23 Thread Otis Gospodnetic
: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Friday, March 23, 2007 5:03:32 PM Subject: RE: index word files ( doc ) Thank you, Are there other solutions? From: jafarim [mailto:[EMAIL PROTECTED] Sent: Fri 23-

RE: index word files ( doc )

2007-03-23 Thread e.j.w.vanbloem
Thank you, Are there other solutions? From: jafarim [mailto:[EMAIL PROTECTED] Sent: Fri 23-3-2007 18:55 To: java-user@lucene.apache.org Subject: Re: index word files ( doc ) Hi My experience has not been very satisfactory. It breaks very easily on many

Re: index word files ( doc )

2007-03-23 Thread jafarim
Hi My experience has not been very satisfactory. It breaks very easily on many files. On 3/23/07, [EMAIL PROTECTED] < [EMAIL PROTECTED]> wrote: Hello, I am planning to index Word 2003 files. I read I have to use Jakarta Apache POI, but I also read on the POI site that their work with .doc files is at an

index word files ( doc )

2007-03-23 Thread e.j.w.vanbloem
Hello, I am planning to index Word 2003 files. I read I have to use Jakarta Apache POI, but I also read on the POI site that their work with .doc files is at an early stage. Is POI advisable? Or are there better alternatives? Please give some advice. Regards, Erik

Re: Word files & Build vs. Buy?

2006-02-14 Thread Nick Burch
On Thu, 9 Feb 2006, Christiaan Fluit wrote: Yes, that's exactly what I'm doing. Having this in POI would benefit me a lot though, as I hardly understand the POI basics to be honest (my fault, not POI's). OK, that's now in POI (you'll need a scratchpad build from late yesterday or today, see h

Re: Word files & Build vs. Buy?

2006-02-10 Thread Christiaan Fluit
Dmitry Goldenberg wrote: Awesome stuff. A few questions: is your Excel extractor somehow better than POI's? and, what do you see as the timeframe for adding WordPerfect support? Are you considering supporting any other sources such as MS Project, Framemaker, etc? I just committed a WordPerfectE

Re: Word files

2006-02-09 Thread Otis Gospodnetic
EMAIL PROTECTED] To: java-user@lucene.apache.org Sent: Thu 09 Feb 2006 01:36:47 PM EST Subject: Word files Hello, I use the POI API to parse MS Word files in order to index the content to enable Lucene search. For that I downloaded the latest jars from POI (including the scratchpad one) and use

RE: Word files & Build vs. Buy?

2006-02-09 Thread Dmitry Goldenberg
mitry From: Christiaan Fluit [mailto:[EMAIL PROTECTED] Sent: Thu 2/9/2006 4:09 AM To: java-user@lucene.apache.org Subject: Re: Word files & Build vs. Buy? Hello all, I'm replying to two threads at once as what I have to say relates to both. My company recently started an open

Re: Word files & Build vs. Buy?

2006-02-09 Thread Christiaan Fluit
Nick Burch wrote: You could try using org.apache.poi.hwpf.HWPFDocument, and getting the range, then the paragraphs, and grab the text from each paragraph. If there's interest, I could probably commit an extractor that does this to poi. Yes, that's exactly what I'm doing. Having this in POI wo

Re: Word files & Build vs. Buy?

2006-02-09 Thread Nick Burch
On Thu, 9 Feb 2006, Christiaan Fluit wrote: My experience is that the WordDocument class crashes on about 25% of the documents, i.e. it throws some sort of Exception. I've tested POI 2.5.1-final as well as the current code in CVS, but both produce this result. I even suspect the output to be 10

Re: Word files & Build vs. Buy?

2006-02-09 Thread Christiaan Fluit
t wd = new WordDocument(is); [EMAIL PROTECTED] wrote: MS Word - I know that POI exists, but development on the Word portion seems to have stopped, and there are a lot of nasty looking bugs in their DB. Since we're involved in dealing with contracts, many of our Word files are large and co

Word files

2006-02-09 Thread arnaudbuffet
Hello, I use the POI API to parse MS Word files in order to index the content to enable Lucene search. For that I downloaded the latest jars from POI (including the scratchpad one) and use the parser from lucenebook called POIWordDocHandler. It works quite well, but for some files the parser does